Future-proof your skills with Linux, Python, vim & git as I share with you the most timeless and love-worthy tools in tech through my two great projects that work great together.

Keyword Histograms and Clusters

I've already done the hard work of extracting keywords using the Yake KeywordExtractor, and now I'm ready to discuss extracting them from the entire dataset. I've also used lambda functions to control the insertion of spaces between each tag-stripped element, and I'm eager to share the steps I've taken for keyword processing.

Ready to Share My Steps for Keyword Processing with Yake KeywordExtractor!

By Michael Levin

Tuesday, January 17, 2023

I got through making keyword histograms against content yesterday. Now I want to make keyword clusters.

Ugh, so much to work through for keyword clustering again. I’ve done it before, but my current day-job work is not calling for clustering at the moment. It may be calling for histogram keyword extraction, so I got that done using Yake KeywordExtractor. If I need to explain it, it’s on this page:


When work gets started today I should be ready to demonstrate performing extractions over the entire dataset, on fields such as URLs and titles. That’s really in the bag already based on the work I did yesterday and over the weekend.

Be ready to discuss.

There are several steps for processing keywords, and it’s important to note that each needs slightly different text preparation. For example, if you’re looking for commonly used or important keyword themes, the text extracted from an HTML page’s “body copy” alone might not suffice. Some of the most important keywords are in the title tag, which sits in the head rather than in the body element, so a get_text() call on the body alone is particularly vulnerable to missing important page themes.
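To make the title-versus-body point concrete, here is a minimal sketch. It uses Python's standard-library html.parser rather than BeautifulSoup so that it runs with no installs, and the class name TitleAndBody and the sample HTML are made up for illustration. It's a sketch under those assumptions, not the code I actually used.

```python
from html.parser import HTMLParser


class TitleAndBody(HTMLParser):
    """Collect <title> text and <body> text into separate lists.

    Hypothetical helper for illustration. It does not skip <script> or
    <style> content, which a real pipeline would want to drop.
    """

    def __init__(self):
        super().__init__()
        self.current = None  # "title", "body", or None
        self.title = []
        self.body = []

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "body"):
            self.current = tag

    def handle_endtag(self, tag):
        if tag == self.current:
            self.current = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.current == "title":
            self.title.append(text)
        elif self.current == "body":
            self.body.append(text)


# Made-up sample page
html = (
    "<html><head><title>Keyword Clusters</title></head>"
    "<body><h1>Histograms</h1><p>Yake extraction</p></body></html>"
)

parser = TitleAndBody()
parser.feed(html)
parser.close()

body_only = " ".join(parser.body)
with_title = " ".join(parser.title + parser.body)

print(body_only)   # Histograms Yake extraction
print(with_title)  # Keyword Clusters Histograms Yake extraction
```

The body-only text never sees the word "Clusters", which is exactly the kind of page theme that goes missing if the title is left out of the text handed to the keyword extractor.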

It’s also worth noting that soup.get_text(), with its default empty separator, can run words together as it strips out tags, because the text from adjacent elements gets joined with nothing in between. It’s better to control the insertion of a space between every tag-stripped element yourself.
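Here is a small sketch of the glued-words problem and the fix. Again this uses the standard-library html.parser as a stand-in so it runs anywhere; TextChunks, text_chunks and the sample markup are hypothetical names for illustration only.

```python
from html.parser import HTMLParser


class TextChunks(HTMLParser):
    """Collect each non-empty run of text between tags as its own chunk."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def text_chunks(html):
    """Return the tag-stripped text pieces of an HTML string, in order."""
    parser = TextChunks()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Made-up sample markup
html = "<ul><li>keyword</li><li>histograms</li></ul><p>and <b>clusters</b></p>"
chunks = text_chunks(html)

glued = "".join(chunks)    # no separator: words run together
spaced = " ".join(chunks)  # one space between every tag-stripped element

print(glued)   # keywordhistogramsandclusters
print(spaced)  # keyword histograms and clusters
```

With BeautifulSoup itself, the same control is available through get_text's own arguments, as in soup.get_text(separator=" ", strip=True).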

I’ve done really good work with these lambda functions that look just like applied text-processing.
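As a rough illustration of what small lambdas as text-processing steps can look like, here is a sketch. The particular steps (lowercasing, dropping punctuation, collapsing whitespace), the prepare function and the sample titles are all assumptions made up for this example and are not the lambdas from my own notebook.

```python
import re

# Hypothetical cleanup steps, applied in order. Each one takes a string
# and returns a string, so they chain together cleanly.
steps = [
    lambda s: s.lower(),                      # normalize case
    lambda s: re.sub(r"[^\w\s]", "", s),      # drop punctuation
    lambda s: re.sub(r"\s+", " ", s).strip()  # collapse runs of whitespace
]


def prepare(text):
    """Run a piece of text through every step in order."""
    for step in steps:
        text = step(text)
    return text


# Made-up sample titles
titles = ["  Keyword   Histograms, and Clusters! ", "Yake\nKeywordExtractor"]
cleaned = [prepare(t) for t in titles]

print(cleaned)  # ['keyword histograms and clusters', 'yake keywordextractor']
```

Keeping each step tiny makes it easy to reorder them or swap one out when a different stage of keyword processing needs slightly different text preparation.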