MIKE LEVIN LPVG SEO

Future-proof your tech-skills with Linux, Python, vim & git as I share with you the most timeless and love-worthy tools in tech — and on staying valuable while machines learn... and beyond.

Keyword Histograms and Clusters

by Mike Levin

Tuesday, January 17, 2023

I got through making keyword histograms against content yesterday. Now I want to make keyword clusters.

Ugh, so much to work through for keyword clustering again. I’ve done it before, but my current day-job work is not calling for clustering at the moment. It may be calling for keyword-extraction histograms, so I got that done using YAKE’s KeywordExtractor. If I need to explain it, it’s on this page:

https://liaad.github.io/yake/

When work gets started today I should be ready to demonstrate running extractions over the entire dataset, on fields such as URLs and titles. That’s really in the bag already based on the work I did yesterday and over the weekend.
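To make the histogram idea concrete, here is a minimal sketch that counts how many titles each candidate keyword appears in. It is not the YAKE-based extraction described above: candidate_keywords is a hypothetical stand-in that just produces lowercase one- and two-word phrases, and the titles list is made-up sample data.

```python
import re
from collections import Counter


def candidate_keywords(text, max_words=2):
    """Hypothetical stand-in extractor: lowercase 1..max_words word phrases."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    phrases = []
    for n in range(1, max_words + 1):
        for i in range(len(words) - n + 1):
            phrases.append(" ".join(words[i:i + n]))
    return phrases


# Made-up sample data standing in for the title column of a dataset.
titles = [
    "Keyword Histograms and Clusters",
    "Keyword Clusters in Python",
    "Python Keyword Extraction",
]

histogram = Counter()
for title in titles:
    # set() counts each keyword once per title, so the tally is
    # "number of titles containing it", not raw occurrences.
    histogram.update(set(candidate_keywords(title)))

for keyword, count in histogram.most_common(5):
    print(f"{count:>3}  {keyword}")
```

No stop words are filtered here, so words like "and" and "in" are counted too. A real extractor would be swapped in where candidate_keywords is called.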

Be ready to discuss.

There are several steps for processing keywords, and it’s important to note that they each need slightly different text preparation. For example, if you’re looking for commonly used or important keyword themes, the text directly extracted from an HTML page’s “body copy” alone might not suffice. Some of the most important keywords are in the title tag, which lives in the head and does not automatically occur in the body element, so calling get_text() on the body alone (for example soup.body.get_text()) is particularly vulnerable to missing important page themes.
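As a rough illustration of preparing title and body text separately, here is a dependency-free sketch built on Python’s standard-library html.parser rather than BeautifulSoup. The class name TitleAndBodyText, the helper page_text, and the sample HTML are all assumptions made up for this example.

```python
from html.parser import HTMLParser


class TitleAndBodyText(HTMLParser):
    """Hypothetical helper: collect <title> text and <body> text separately."""

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.body_parts = []
        self.in_title = False
        self.in_body = False
        self.skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        elif tag == "body":
            self.in_body = True
        elif tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False
        elif tag == "body":
            self.in_body = False
        elif tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text or self.skip_depth:
            return
        if self.in_title:
            self.title_parts.append(text)
        elif self.in_body:
            self.body_parts.append(text)


def page_text(html):
    """Return (title, body) as two space-joined strings."""
    parser = TitleAndBodyText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.title_parts), " ".join(parser.body_parts)


# Made-up sample page.
html = """<html><head><title>Keyword Clusters in Python</title></head>
<body><h1>Grouping terms</h1><p>Some body copy.</p></body></html>"""

title, body = page_text(html)
combined = f"{title}. {body}"  # title themes now travel with the body copy
print(combined)
```

Keeping the two strings separate until the last step leaves room to weight the title differently from the body, or to run extraction on each field on its own.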

It’s also worth noting that soup.get_text() adds no separator by default, so words from adjacent elements get appended together as the tags are stripped out. It’s better to control the insertion of a space between every tag-stripped element yourself.
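A minimal sketch of that idea follows, using the standard-library html.parser so it runs without extra packages. TextChunks and strip_tags are hypothetical names for this example, not the lambda functions from the original work. With BeautifulSoup itself, get_text() accepts separator and strip arguments that serve the same purpose, as in soup.get_text(separator=" ", strip=True).

```python
from html.parser import HTMLParser


class TextChunks(HTMLParser):
    """Hypothetical helper: collect each run of text between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def strip_tags(html, separator=" "):
    """Strip tags and join the remaining text chunks with `separator`."""
    parser = TextChunks()
    parser.feed(html)
    parser.close()
    return separator.join(parser.chunks)


# Made-up sample markup.
html = "<ul><li>linux</li><li>python</li><li>vim</li></ul>"

glued = strip_tags(html, separator="")    # words run together
spaced = strip_tags(html, separator=" ")  # one space per stripped boundary
print(repr(glued))
print(repr(spaced))
```

One trade-off of always inserting a space: inline markup in the middle of a word, such as <b>key</b>word, comes out as two tokens. For keyword work that is usually the lesser evil compared with gluing separate list items or headings together.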

I’ve done really good work with these lambda functions that look just like applied text-processing.