
More Deliberate With Category Mapping

I'm optimizing my blog post categorization system by grouping common keywords into parent categories. I'm learning about YAML dictionaries and sub-dictionaries to create a nested structure for my categories. I'm also making sure to keep the list of categories as short as possible by using the lemmatized version of words. Finally, I'm working on a custom object shape to keep the categorization system running smoothly.

Exploring YAML Dicts to Optimize Category Mapping

By Michael Levin

Saturday, May 6, 2023

I like the categories that my system is auto-selecting to sort my blog posts under. I’m itching to get to the particular linear sub-sequence of pages project that I alluded to in previous posts, but before I do that another light-touch 80/20-rule advancement is nagging at me. It’s a sort of clumping or grouping within the category mapping, where the keyword extraction being done by OpenAI isn’t bad, but it’s leaning towards single words, which are often just too common to be category labels. I find myself filtering more and more of these.

I’ve filtered a lot of good words from the categories because the terms are too broad for a site category, not appropriate for my site, or redundant with another term. Well, instead of just filtering them all out, I can roll them up into parent categories to optimize the category mapping. And I can use the same _config.yml file that I use to do this category filtering.

Hmm, I need to know about yaml dicts. That’s not something I thought I’d be saying. So, Google tell me a little about yaml dicts. Oh wait, Google doesn’t have conversational search yet (Bard doesn’t count). Bing: tell me about yaml dicts.

Bing: In YAML, dictionaries are represented as mappings. A mapping is a collection of key-value pairs, where each key is mapped to a value ¹.

Here is an example of a dictionary in YAML:

name: John
age: 30
city: New York

I hope this helps. Let me know if you have any other questions.

Source: Conversation with Bing, 5/6/2023

(1) How to represent dictionaries in YAML - Educative: Interactive Courses .... https://www.educative.io/answers/how-to-represent-dictionaries-in-yaml.
(2) What is YAML and How To Use It - Salt. https://docs.saltproject.io/en/latest/topics/yaml/index.html.
(3) Learn YAML tutorials with examples - W3schools. https://www.w3schools.io/file/yaml-arrays/.
(4) YAML Syntax — Ansible Documentation. https://docs.ansible.com/ansible/latest/reference_appendices/YAMLSyntax.html.
(5) Building an array of dictionary items in YAML? - Stack Overflow. https://stackoverflow.com/questions/30221348/building-an-array-of-dictionary-items-in-yaml.
(6) Expressions - Azure Pipelines | Microsoft Learn. https://learn.microsoft.com/en-us/azure/devops/pipelines/process/expressions?view=azure-devops.

Me: Yes, I know that much Bing, but thank you. Tell me about yaml sub-dicts.

Bing: In YAML, sub-dictionaries are represented as nested mappings. Here is an example of a sub-dictionary in YAML:

  name: John
  age: 30
  address:
    street: 123 Main St.
    city: New York
    state: NY
    zip: 10001

I hope this helps. Let me know if you have any other questions.

Yes, why that is exactly what I needed to know. Thank you Bing.

Now I can roll up the categories into parent categories. But I want to make sure I handle this correctly. Because I have an already existing “filter” list, there’s going to be colliding concepts. If I’m moving to a more nested or hierarchical dictionary structure… then… then… oh, of course!

One of the keys becomes the filter bucket. I still have a filter, it’s just that it’s sort of like the nil or null device, or whatever. It’s also one of my dictionary keys, and anything I sort under that key gets filtered. All the other keys have the effect of mapping the synonym or shorter version of the category into the desired category label.
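Here is a minimal sketch of that filter-bucket idea, not the actual yamlchop code. The parent names, the sample keywords, and the helpers build_lookup and resolve are all hypothetical; the only thing taken from the plan above is that one key, filter, acts as the null device and every other key is a desired category label.

```python
# Hypothetical shape of the categories data after loading _config.yml.
# Only the "filter" key comes from the plan above; the rest is made up.
categories = {
    "filter": ["thing", "way", "lot"],
    "linux": ["ubuntu", "debian"],
    "python": ["pip", "jupyter"],
}

FILTER_KEY = "filter"


def build_lookup(categories):
    """Invert {parent: [keywords]} into {keyword: parent}."""
    lookup = {}
    for parent, keywords in categories.items():
        lookup[parent] = parent  # a parent label maps to itself
        for keyword in keywords:
            lookup[keyword] = parent
    return lookup


def resolve(keyword, lookup):
    """Return the category label for a keyword, or None if it is filtered."""
    parent = lookup.get(keyword, keyword)  # unmapped keywords pass through
    return None if parent == FILTER_KEY else parent


lookup = build_lookup(categories)
print(resolve("ubuntu", lookup))
print(resolve("thing", lookup))
print(resolve("vim", lookup))
```

Inverting the structure once keeps each lookup a single dictionary access, and the filter bucket needs no special data structure of its own.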

Hmmm, another conflict occurs to me. My top_cats() function returns the list of top categories automatically through a sort of quick site interrogation. But this could move it to a potentially more pre-processed implementation. How would that look?

There’s a rather inclusive approach by which I output all the categories to a file and do a one-time import of them into the _config.yml file. Then I have a sort of rolling-up task which, because YAML is such a simple format, amounts to copy/pasting them around, plus some indenting. Yeah, let’s do that. I don’t even need to think through the implementation yet, as I can just plop all 5,000 or so that are going to be produced into a deep-sixed key, then move them out of there as desired. 1, 2, 3… 1?

One thing to keep in mind is that I only want to ever deal with the lemmatized version of the words to keep the list as short as possible.

Okay, so I turn off git pushing the site temporarily.

Okay, I move the check that I added recently to prevent numeric categories to a post-processing location so that it doesn’t interfere with keyword collection. I actually want the years like 2020 and Ubuntu versions like 20.04 to be mapped into parent categories, so the early-phase filtering of these numbers is no longer desirable. Keep the code in there and working the same way, but move it. Okay done and done. Next?
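A rough sketch of why the order matters, under made-up assumptions: the parent_of mapping, the sample keywords, and the drop_numeric helper are hypothetical stand-ins, not the real code. Mapping runs first, so 2020 and 20.04 get a chance to land in a parent category, and only the numbers left over afterwards are dropped.

```python
import re

# Matches labels made only of digits and dots, like "2020" or "20.04".
NUMERIC = re.compile(r"\d+(\.\d+)*")


def drop_numeric(labels):
    """Post-processing step: remove labels that are still purely numeric."""
    return [label for label in labels if not NUMERIC.fullmatch(label)]


# Hypothetical keyword-to-parent mapping, for illustration only.
parent_of = {"2020": "year", "20.04": "ubuntu"}

keywords = ["2020", "20.04", "python", "42"]
mapped = [parent_of.get(k, k) for k in keywords]  # mapping happens first
final = drop_numeric(mapped)  # numeric check happens last
print(final)
```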

Look at where the lemmatized keywords are being collected. It’s in word_list (duh). Okay, so figure out the path to where you’d like a raw version of this list saved. Obviously in the _data directory. So, _data/categories.yml is where I’ll save it. Make that path.

cat_file = f"{REPO_DATA}categories.yml"

And now to dump the word_list (which isn’t really a list) into the file. I don’t need the whole dictionary, just the keys:

with open(cat_file, "w") as fh:
    yaml.dump(list(word_list.keys()), fh)

Nice. Now dump that output into a key called all_categories in the _config.yml file and do site generate. Nice, everything’s still working.

Next is a biggie. Instead of flat key-value pairs, I want to have a nested dictionary structure. The top level shall be “categories”, under which I’ll have an all key and a filter key. I’ll move the existing filter under that key and alter the code to use the filter from there. Then I’ll re-generate the site and see whether the chosen top categories changed by looking at its include file (no actual GitHub Pages release necessary).

Wow, working exactly as intended. Now think! Previously the whole category thing was all automatic and quite easy for anyone taking up the yamlchop system, but now it’s starting to require a custom object shape residing inside of _config.yml. Keep an eye on that, and maybe just turn off the whole categorization system if that custom shape isn’t present.

Hmmm, what I need is in cdict. This is where having the IPython engine of Jupyter or VSCode would be nice, but I can just pickle it too and then load it into Jupyter:

with open(f"{REPO_DATA}cdict.pkl", "wb") as f:
    pickle.dump(cdict, f)

Nice, now I can get all the keywords in descending order of frequency in Jupyter:

import pickle

with open("/home/ubuntu/repos/MikeLev.in/_data/cdict.pkl", "rb") as f:
    cdict = pickle.load(f)
[(x, cdict[x]["count"]) for x in cdict]

Now that I’m satisfied they’re in descending order of frequency, I can make a new version in Jupyter that pre-formats it to be copy/paste ready to go under the all key in _config.yml:

import pickle

with open("/home/ubuntu/repos/MikeLev.in/_data/cdict.pkl", "rb") as f:
    cdict = pickle.load(f)
for k in cdict:
    print(f"    - {k}")

And generate the site again (without git push)… Oops. Not good yaml. You can’t just go pasting unencoded values into a yaml file. Let’s go through proper encoding by making a yaml object and then dumping it:

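import pickle
import yaml

with open("/home/ubuntu/repos/MikeLev.in/_data/cdict.pkl", "rb") as f:
    cdict = pickle.load(f)
all_cats = [k for k in cdict]
# Dump through yaml so special values get quoted; list order is preserved
print(yaml.dump(all_cats))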

Much better. I can do the 4-space indenting across a few thousand rows in NeoVim just as easily as in Jupyter. Most importantly, the encoding is there and the descending by keyword frequency order is preserved.
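To illustrate why the encoding step matters, here is a small sketch assuming PyYAML, which is what the yaml import above refers to. The sample keywords are invented, and I am not claiming these are the exact values that broke my file. They just show how a pasted plain scalar can come back as a different type, while a dumped one round-trips.

```python
import yaml  # PyYAML

# Pasting raw values: YAML resolves plain scalars to non-string types.
pasted = yaml.safe_load("- 2020\n- no\n- python\n")
print(pasted)

# Dumping lets the library add quotes wherever they are needed.
keywords = ["2020", "no", "python"]
encoded = yaml.safe_dump(keywords)
print(encoded)

# The dumped text loads back as the original list of strings, in order.
round_trip = yaml.safe_load(encoded)
print(round_trip == keywords)
```

The takeaway is the round trip: whatever the dumper emits, the loader reads back as the same strings in the same order, which hand-pasting cannot guarantee.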

And I’m not showing it here, but I did one more step: I removed every keyword already on the existing filter list from the all list. I’ll be doing a sort of list-depletion task on all, mapping keywords into other sub-keys, and I don’t want to deal with duplicates. Now I can process the ones under the filter key and the ones under the all key and never see the same keyword twice.
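That step could look something like the following sketch. The deplete function and the sample data are hypothetical; the real lists come from cdict and the filter key. The point is to remove anything already filtered while keeping the descending-frequency order of what remains.

```python
def deplete(all_keywords, filtered):
    """Return all_keywords minus anything in filtered, keeping order
    and dropping repeats so no keyword is seen twice."""
    seen = set(filtered)
    remaining = []
    for keyword in all_keywords:
        if keyword not in seen:
            remaining.append(keyword)
            seen.add(keyword)
    return remaining


# Made-up sample data, already in descending order of frequency.
all_keywords = ["python", "thing", "linux", "python", "way"]
filtered = ["thing", "way"]
print(deplete(all_keywords, filtered))
```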

Okay, I think I can get the best of both again. Because I always load the _config.yml file during site generation, I can just check for the presence of the categories key and if it’s not there, I can just skip the explicit categorization approach and use the implicit one that’s already there. It should basically be working that way now. Let’s see… yup. Nice.
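As a sketch of that fallback, assuming the config has already been loaded into a plain dictionary: pick_categories is a hypothetical helper, and implicit_top_cats stands in for the existing automatic approach (the top_cats() function mentioned earlier), whose real signature I am not reproducing here.

```python
def pick_categories(config, implicit_top_cats):
    """Use the explicit categories shape when present, else fall back."""
    cats = config.get("categories")
    if not isinstance(cats, dict):
        # Custom shape missing: use the automatic site interrogation.
        return implicit_top_cats()
    # Every key except the filter bucket is a category label.
    return [key for key in cats if key != "filter"]


# Hypothetical stand-in for the existing automatic category finder.
def fake_top_cats():
    return ["auto1", "auto2"]


with_shape = {"categories": {"filter": ["thing"], "linux": ["ubuntu"]}}
without_shape = {"title": "My Site"}

print(pick_categories(with_shape, fake_top_cats))
print(pick_categories(without_shape, fake_top_cats))
```

Passing the fallback in as a function keeps the explicit path from ever running the site interrogation when it is not needed.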

Hmm. So _config.yml is starting to fulfill a lot of purposes. And as much as I don’t like introducing another global variable, it would be nice to have something like the site configuration always available, and not just loaded in the category finder. Think! Yup. The local _config variable is now the global CONFIG variable. That opens quite a few possibilities.

Okay, now work on the structure within the categories key. Put the logic in to defend against not having enough categories to support the category grid. Okay, that’s done. Now move 100 of the most frequent keywords into top-level categories. Okay, done. Do a real site gen letting it go out to GitHub. Check the category grid.

It worked. And the same old crappy topics are back, of the sort I’d start filtering. And that makes it a sort of new starting point. Instead of throwing things into a filter, I start sorting them into parent categories. That will start doing filtering again, but it’s not really filtering I want at this point. But things are working at least as well as the old way. So set up some test data! Do some of the mapping. Then change how you sort posts into categories using the new data structure.

I eliminated the all key under categories and moved everything under filter. They functionally were doing the same thing. Now I just sort things from filter into their best matching parent category.

I also see that rule-based matching of categories is on the table to alleviate even this manual grouping. But I do have a good amount of sample data in there now for better categorization. Find the location in the code where the new data would be used.

I’m reaching the point of diminishing returns. Cut the entry here and tackle the last step tomorrow morning. Sleep on it.
