Stripping HTML from Text and Markdown for Readability

NOTE: This is an epic post that builds up Python code bit by bit for an HTML2Markdown tool heavy on fancy Regular Expressions, then ends with fixing my git-to-WordPress publishing system to handle making this post. Unless you're an über-geek, turn back now!

View Page Source = Vomit Illegible Code

Have you ever used a readability tool like Instapaper, Palm or Safari's built-in Read Later? I recently had to drop some captured website text into a database for easy direct reading. This is different from simply stripping out HTML tags and hoping for the best, as with PHP's strip_tags. This is a series of progressive sweeps that actually plugs in simple markdown code, which, for those unfamiliar, is an easy-to-read plain-text format that still manages to preserve headlines and such. It's a lot like what web-readability tools do. If you ever wanted to learn useful nuances of RegEx in Python, this is the post for you!

No Ready-made Solutions (Even in PHP or jQuery)

Googling the topic produced a host of unacceptable solutions: dumb HTML tag-stripping, mostly, and nothing that converted to markdown. So I decided to spin my own with more of a smart-algorithm feel to it. There is extreme order-sensitivity in work like this, which you wouldn't think at first, and it makes a lot of difference in the finished product.

Also, this is a watershed post for me for several reasons:

  • My git-to-WordPress publisher gracefully handled code samples.
  • Code syntax highlighting added in WordPress for your reading pleasure.
  • I really captured the build-up "process" here.
  • Probably good enough to be my first GitHub contribution.

The Challenge

I need to grab the “About-text” off of about-pages. This is where I begin thinking through the problem. Instapaper does this well. I think it's a regular expression job; you don't want to deal with choking, unsupported HTML parsers. And no matter how much the people who like to dis' Regular Expressions guide you towards XML parsers, those parsers will always let you down. RegEx is forever.

An Argument For Regular Expressions

It's just another example of a lesser tool becoming the dominant winner because it was first and good enough. There are better pattern-matching systems out there, but they too are external libraries, unnecessary dependencies, and require developing know-how that won't carry over to other systems the way RegEx does. Just Google SNOBOL and SnoPy to see. RegEx is the devil you know, and the way to cope with all the arguments RegEx detractors make is just to know your devil very well.

I already wrote some code to figure out the URL of the about-page and can fetch the HTML. That was my prior project. Now, the challenge is how to get just the useful readable text off of the page in preparation for dropping it into a database.

Coming To Grips With Disappointing XML/HTML Parsers

My brief googling of how to do this in Python leads you rapidly to HTML2Text by the recently and sadly deceased Aaron Swartz, on GitHub. I did a pip install to test it out, but immediately got what looked like HTML parsing errors. Further googling leads you to a variety of approaches, everything from simple regular expression matching to the Python Natural Language Toolkit (NLTK) and the usual suspects of XML parsing engines like BeautifulSoup.

I've played around enough with this sort of stuff in the past to believe XML parsers are to be avoided due to their fragility and tendency to become deprecated and unsupported. BeautifulSoup has gone this route, and even if you stay close to the mainstream supported stuff, you still have to choose between ElementTree, lxml, Expat, minidom, etc. Much of this is external to Python, so you're creating just another dependency to something that will inevitably have issues and nuance upon nuance. Conversely, Regular Expressions are built-in and forever. So, unless I really, really want to be able to grab text with XPath, I tend towards the RegEx solutions.

The Python RegEx Implementation Is A Secret Gift

Now, Regular Expressions have their own issues. Aside from being pretty complex and confusing in the first place, the Python API to its built-in RegEx engine is not exactly intuitive. The situation isn't nearly as bad as it is with urllib2 in Python (bad API), but it is in that neighborhood. The saving grace is that, like RegEx itself, the Python API is incredibly powerful, and once you understand its nuances, you can do amazing things. And unlike the fly-by-night XML parsers, where it isn't worth investing in learning the nuances, it IS WORTH IT with RegEx in Python. It's not going away, and it will become one of your most valuable tools.

Choosing Test Cases

Okay, so let's get started. I'll be using my ever-present deleteme.py file for this project. I could use the about page from pretty much any website, and the sort of text-scrubbing the program will have to do will vary just as wildly. So, I'm going to get a list of some diverse about pages to test this on.

  • http://mikelev.in/about/
  • http://perezhilton.com/about
  • http://krebsonsecurity.com/about/
  • http://galadarling.com/static/about-gala
  • http://www.codinghorror.com/blog/2004/02/about-me.html

Okay, what about my Python imports? I think it’s going to be:

import requests, re, htmlentitydefs

Yeah, This Tutorial Is Still In Python 2.x

I'm jumping ahead a little bit, but from dealing with this stuff quite a bit in the past, I'm thinking ahead. There will be a point, EVEN AFTER all the HTML code is scrubbed out, when I will want to convert HTML entities back into their normal glyph representation, so long as the text is being held in a unicode string. The function that does the trick is located at the link below, and it's the reason I'm importing htmlentitydefs already. Think ahead! Oh, and since we're thinking ahead: when we move to Python 3.x, this import becomes html.entities.

http://stackoverflow.com/questions/1197981/convert-html-entities-to-ascii-in-python
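
Thinking even further ahead, here's what this looks like on the Python 3 side; a minimal sketch for reference only, since this tutorial is Python 2.x (html.unescape arrived in Python 3.4 and handles named, decimal, and hex entities in one call):

# Python 3 only: html.entities replaces htmlentitydefs, and
# html.unescape() decodes named, decimal, and hex entities at once.
import html

print(html.unescape('Fish &amp; Chips at the caf&eacute; &#233; &#xE9;'))
# Fish & Chips at the café é é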

Grabbing HTML with Python Requests Module

But back to Regular Expressions. I'll build the program up slowly to make some points. The first version just grabs the about page off my site. This code uses the previously installed Python Requests module, which makes the things you would normally do with urllib or urllib2 a lot easier.

The __name__ == '__main__' Exception To Pythonic Thinking

And for any Python newbies among you, this program sets forth the extremely "PYTHONIC" pattern for starting a new program once you introduce functions. Namely, that odd stanza at the bottom that starts executing the "main" function. No one will outright tell you this, but for all Python's awesomeness, it is order-sensitive in how it executes code: a name has to be defined by the time it's called. So you have to do something like this to get your main program flow at the top of your text file when it refers to functions that top-to-bottom parsing hasn't gotten to yet. By the time main actually executes, the interpreter has already seen everything else in the file. Get it? "def main" goes at the top, and the point-of-entry to main goes at the bottom. Do this and be happy.

import requests

def main():
  tests = ['http://mikelev.in/about/',
    'http://perezhilton.com/about',
    'http://krebsonsecurity.com/about/',
    'http://galadarling.com/static/about-gala',
    'http://www.codinghorror.com/blog/2004/02/about-me.html']
  text = getHTML(tests[0])
  print [text]

def getHTML(url):
  '''Returns the HTML for URL'''
  try:
    response = requests.get(url)
  except:
    return ''
  return response.text

if __name__ == '__main__':
  main()

There’s Always Strange Encoding Issues When Scraping

For some reason (almost certainly an encoding mismatch with the terminal), the print command can't show the contents of the text variable directly, so I put list notation around text. When you do that, Python prints the list's repr, escaping the string into a form that displays fine in a (roughly) ASCII environment. It's pretty much the same trick as is done with JSON, so that the text representation of objects can be sent around the Internet, surviving all the systems that are not Unicode-friendly. Python object notation and JavaScript object notation are uncannily similar.
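
Here's a tiny Python 2 illustration of the trick, using a made-up string rather than output from the program above:

s = u'caf\xe9\nsecond line'
print [s]     # [u'caf\xe9\nsecond line'] -- the escaped repr, safe anywhere
# print s     # may raise UnicodeEncodeError on a non-Unicode terminal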

This problem indicates to me that there’s probably encoding being misinterpreted somewhere along the way – which just happens when you scrape data off the Web. Yes, it’s unfortunate that there’s going to be some data-loss, but this makes me resort to some of the heavy-handed “thunking” tricks at my disposal.

Chopping Off Everything Not Inside Body Tags

Okay, so let's start talking about regular expressions. This is something where building it up in a series of small steps will help understanding immensely, because Regular Expressions quickly become seemingly impossible to read. This next iteration of the program simply returns everything between the body tags.

Python Regular Expression Named Groups To Grab Sub-matches!

How? Really smart RegEx, of course! We're going to use the search method, but with a named group, so we can grab just a slice of the overall match. It's sorta like non-capturing lookahead, but better, because it hands you the captured slice back by name. This is precisely where Python's RegEx implementation looks so much harder than JavaScript's or Ruby's, but it's for a very good reason. Observe and learn the wonders of Python RegEx named groups for sub-matching.
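
In miniature, the named-group idea looks like this (a made-up snippet, not one of the test pages):

import re

# The named group captures just a slice of the overall match:
mat = re.search(r"<title>(?P<capture>.*?)</title>", "<title>Hi there</title>")
print mat.group('capture')   # Hi there
print mat.group(0)           # <title>Hi there</title> -- the whole match

And here's the full program: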

import requests, re

def main():
  tests = ['http://mikelev.in/about/',
    'http://perezhilton.com/about',
    'http://krebsonsecurity.com/about/',
    'http://galadarling.com/static/about-gala',
    'http://www.codinghorror.com/blog/2004/02/about-me.html']
  text = getHTML(tests[0])
  print [justBody(text)]

def getHTML(url):
  '''Returns the HTML for URL'''
  try:
    response = requests.get(url)
  except:
    return ''
  return response.text

def justBody(text):
  pattern = r"<\s*body\s*.*?>(?P<capture>.*)<\s*/body\s*.*?>"
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  bodymat = pat.search(text)
  return bodymat.group('capture')

if __name__ == '__main__':
  main()

Deleting Script Tags And Everything In Between

Now we have the general look of things for sequential selective scrubbing, getting back only what comes BETWEEN the tags. The temptation now is to strip out all the HTML tags, but that's still a little premature. We want to pick off the biggest offenders first, replacing entire tag enclosures, and the script tag is probably the biggest offender of all. So now we switch from the regex engine's search method to sub, and just yank out everything in script tags.

Regular Expression Default Behavior "Consumes" Entire Match

It is worth noting that this is the easiest and most straight-forward use of Regular Expressions, and what everyone starts out with before thinking about the capturing-vs-non-capturing parts of the match. Everything captures! We're matching on the open-script tag, the close-script tag, and everything in-between. Once we have the match, we replace it with the empty string! This is how most people think of and use RegEx, without all the non-capturing lookahead, named group, and backreference stuff. Sometimes it's actually what you want, as with chopping JavaScript out of HTML. (One caveat: the greedy .* between the tags will span from the first open-script tag to the LAST close-script tag on the page; the final version of this code tightens it to a non-greedy .*? for exactly that reason.)

import requests, re

def main():
  tests = ['http://mikelev.in/about/',
    'http://perezhilton.com/about',
    'http://krebsonsecurity.com/about/',
    'http://galadarling.com/static/about-gala',
    'http://www.codinghorror.com/blog/2004/02/about-me.html']
  text = getHTML(tests[0])
  text = justBody(text)
  text = stripScripts(text)
  print [text]

def getHTML(url):
  '''Returns the HTML for URL'''
  try:
    response = requests.get(url)
  except:
    return ''
  return response.text

def justBody(text):
  pattern = r"<\s*body\s*.*?>(?P<capture>.*)<\s*/body\s*>"
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  bodymat = pat.search(text)
  return bodymat.group('capture')

def stripScripts(text):
  pattern = r"<\s*script\s*.*?>.*<\s*/script\s*>"
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  sanscript = pat.sub('', text)
  return sanscript

if __name__ == '__main__':
  main()

Deleting HTML Tag Parameters But Keeping The Tags

Okay, now it gets really interesting! There's tons of garbage in there, including navigation, that needs to be torn out. But our thinking has to get a bit more sophisticated. Namely, everything needs to become easier to look at, and the biggest offender now is not the tags themselves, but their parameters (HTML attributes). So before stripping out more tags, I'm actually going to strip all tag parameters. Huh? Why would I do this seemingly extra step?

In The Beginning, HTML Was Easy To Read

Well, let me tell ya something. HTML in its original form, without all those class definitions and endless parameters, is a hell of a lot easier to look at than the modern stuff. You can see document structure at a glance. If there's bad or weird stuff going on, you will know instantly when you're looking at just the naked HTML tags. It's the perfect "set-up" for our next step, which is adding the markdown code. To know I'm doing things correctly, I want to start looking at the naked HTML before I start stripping some things out and converting others to markdown.

A Non-Capturing Group With A Backreference?!?!

Something about the RegEx to point out here: instead of using named groups, I'm using a plain capturing group for the tag opening and a non-capturing group for the closing pointy bracket, so the bracket is not part of the capture. Then, in a bout of insidious RegEx awesomeness, I use a backreference (notice the \1 in the sub statement) to plug the opening part of the HTML tag back into the substitution, followed by an immediate close. BAM! Perfectly readable HTML code.
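
In miniature, the backreference trick looks like this (a made-up tag, not from the test pages):

import re

# \1 plugs the captured tag opening back in; everything between the tag
# name and the closing pointy bracket is consumed and discarded:
print re.sub(r"(<\s*[a-zA-Z0-9]+).*?>", r"\1>", '<a href="/x" class="fancy">')
# <a>

Here's the program with stripParams added: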

import requests, re

def main():
  tests = ['http://mikelev.in/about/',
    'http://perezhilton.com/about',
    'http://krebsonsecurity.com/about/',
    'http://galadarling.com/static/about-gala',
    'http://www.codinghorror.com/blog/2004/02/about-me.html']
  text = getHTML(tests[0])
  text = justBody(text)
  text = stripScripts(text)
  text = stripParams(text)
  print [text]

def getHTML(url):
  '''Returns the HTML for URL'''
  try:
    response = requests.get(url)
  except:
    return ''
  return response.text

def justBody(text):
  pattern = r"<\s*body\s*.*?>(?P<capture>.*)<\s*/body\s*>"
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  bodymat = pat.search(text)
  return bodymat.group('capture')

def stripScripts(text):
  pattern = r"<\s*script\s*.*?>.*<\s*/script\s*>"
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  sanscript = pat.sub('', text)
  return sanscript

def stripParams(text):
  pattern = r"(<\s*[a-zA-Z0-9]+).*?(?:>)"
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  nopram = pat.sub(r'\1>', text)
  return nopram

if __name__ == '__main__':
  main()

Deleting Navigational Elements From HTML The Easy Way

Okay, now for some serious cleaning up of the navigation. No respectable navigation scheme this day and age gets built without CSS-styled list elements. You could write a college dissertation on how to identify and separate a page's navigational components from its main text content. You could look at the semantics of the tags and classnames, or employ natural language parsing techniques.

A List By Any Other Name… Is Probably Navigation

But this isn't about scalpel work. This is about carving pumpkins with chainsaws. I'll write a future post about how to identify the rabbit and NOT chase him down into the hole. This is a rabbit. We're going for a solution that gives you 80% of the benefit you want with 20% of the work. It's called the 80/20 rule, and it's an exercise in respecting your own time and pleasing your employers. Don't chase the rabbit! Just delete the list elements, which are almost always used for navigation these days. And so, the occasional legitimate list that appears on an about page is indeed going to be a casualty of this next cleanup thunk. That's a particular shame, since I could have easily styled lists with markdown in the following step. But alas, instead of lamenting a tiny bit of lost data, I'll forge on and make everyone happy.

Regular Expression Greedy Versus Non-Greedy

First we go for all the ol's and ul's and everything in-between. But don't be greedy! When I say "don't be greedy", I'm talking about the question mark after the dot-asterisk between the open and close list tags in the listNuker patterns below. Otherwise, with a multiline match, the pattern would grab from the first open list element to the last close list element, chopping out maybe the entire page (i.e. greedy). So, we need to insert the non-greedy modifier character.

If we missed any stray li's, we get them on the rebound. And 90% of the navigation out there is gone!
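
To make greedy versus non-greedy concrete, here's a tiny standalone demonstration (made-up HTML, not one of the test pages):

import re

snippet = "<ul><li>Home</li></ul> keep this <ul><li>About</li></ul>"
print re.sub(r"<ul>.*</ul>", "", snippet)    # greedy: eats ' keep this ' too
print re.sub(r"<ul>.*?</ul>", "", snippet)   # non-greedy: ' keep this ' survives

Here's the build with listNuker added: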

import requests, re

def main():
  tests = ['http://mikelev.in/about/',
    'http://perezhilton.com/about',
    'http://krebsonsecurity.com/about/',
    'http://galadarling.com/static/about-gala',
    'http://www.codinghorror.com/blog/2004/02/about-me.html']
  text = getHTML(tests[0])
  text = justBody(text)
  text = stripScripts(text)
  text = stripParams(text)
  text = listNuker(text)
  print [text]

def getHTML(url):
  '''Returns the HTML for URL'''
  try:
    response = requests.get(url)
  except:
    return ''
  return response.text

def justBody(text):
  pattern = r"<\s*body\s*.*?>(?P<capture>.*)<\s*/body\s*>"
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  bodymat = pat.search(text)
  return bodymat.group('capture')

def stripScripts(text):
  pattern = r"<\s*script\s*.*?>.*<\s*/script\s*>"
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  sanscript = pat.sub('', text)
  return sanscript

def stripParams(text):
  pattern = r"(<\s*[a-zA-Z0-9]+).*?(?:>)"
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  nopram = pat.sub(r'\1>', text)
  return nopram

def listNuker(text):
  pattern = r"<\s*(ol|ul)\s*.*?>.*?<\s*/(ol|ul)\s*>"
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  listless = pat.sub('', text)
  pattern = r"<\s*li\s*.*?>.*?<\s*/li\s*>"
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  listless = pat.sub('', listless)
  return listless

if __name__ == '__main__':
  main()

Taming Of The Tabs, Carriage Returns and Line Feeds

Next, we have to take care of repetitive line returns and unnecessary tabs and carriage returns. After that, things get a bit tricky: there's stray garbage that has survived the previous sweeps, but it has a distinct signature. Oops, I also ought to throw in a form-stripper function much like stripScripts.

When Code Repetition Is Really Code Repetition

Notice how stripScripts and stripForms look very similar? Well, this is asking to become a general function, but not yet. I'll know when the time is right. Don't waste time generalizing every bit of code for reuse until it's actually going to improve your program or SUBSTANTIALLY cut down code repetition. Don't do it just to collapse 2 functions into one; it's not worth the extra complexity. One of the advantages of languages like Python and Ruby is that code is terse and a pleasure to write, so a little code repetition is not a chore. Once you get up to a THIRD repetition, the time may be right to make it a parameterized function.

Okay, done. This next code sample did a few things at once, but you can handle it. Ah! If we strip the divs out (but not their contents) BEFORE we dedupe line returns and such, the deduping of line returns will be more effective!

import requests, re

def main():
  tests = ['http://mikelev.in/about/',
    'http://perezhilton.com/about',
    'http://krebsonsecurity.com/about/',
    'http://galadarling.com/static/about-gala',
    'http://www.codinghorror.com/blog/2004/02/about-me.html']
  text = getHTML(tests[0])
  text = justBody(text)
  text = stripScripts(text)
  text = stripForms(text)
  text = stripParams(text)
  text = listNuker(text)
  text = noDivs(text)
  text = singleizer(text)
  print [text]

def getHTML(url):
  '''Returns the HTML for URL'''
  try:
    response = requests.get(url)
  except:
    return ''
  return response.text

def justBody(text):
  pattern = r"<\s*body\s*.*?>(?P<capture>.*)<\s*/body\s*>"
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  bodymat = pat.search(text)
  return bodymat.group('capture')

def stripScripts(text):
  pattern = r"<\s*script\s*.*?>.*<\s*/script\s*>"
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  sanscript = pat.sub('', text)
  return sanscript

def stripForms(text):
  pattern = r"<\s*form\s*.*?>.*<\s*/form\s*>"
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  formless = pat.sub('', text)
  return formless

def stripParams(text):
  pattern = r"(<\s*[a-zA-Z0-9]+).*?(?:>)"
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  nopram = pat.sub(r'\1>', text)
  return nopram

def listNuker(text):
  pattern = r"<\s*(ol|ul)\s*.*?>.*?<\s*/(ol|ul)\s*>"
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  listless = pat.sub('', text)
  pattern = r"<\s*li\s*.*?>.*?<\s*/li\s*>"
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  listless = pat.sub('', listless)
  pattern = r"(<\s*(li|ol|ul)\s*.*?>)|(<\s*/(li|ol|ul)\s*>)"
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  listless = pat.sub('', listless)
  return listless

def noDivs(text):
  pattern = r"(<\s*div\s*.*?>)|(<\s*/div\s*>)"
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  divless = pat.sub('', text)
  return divless

def singleizer(text):
  pattern = r"\t|\r"
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  singled = pat.sub('', text)
  pattern = r"(\n{2,})"
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  singled = pat.sub(r'\n', singled)
  pattern = r"^\n|\n$"
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  singled = pat.sub('', singled)
  return singled

if __name__ == '__main__':
  main()

Eliminating No Longer Meaningful Tags

Pshwew! We're getting so close. At this point, it no longer makes sense to keep anchor tags. There are certain repeating patterns emerging: deleting tags without touching their interior text, and deleting tags including their interior text. So I'll do a quick cleanup to make it easier to do either of these things. I also broke a justText function out of main:

import requests, re

def main():
  tests = ['http://mikelev.in/about/',
    'http://perezhilton.com/about',
    'http://krebsonsecurity.com/about/',
    'http://galadarling.com/static/about-gala',
    'http://www.codinghorror.com/blog/2004/02/about-me.html']
  text = justText(tests[0])
  print [text]

def justText(url):
  text = getHTML(url)
  text = justBody(text)
  text = noTagBlock(text, "script")
  text = noTagBlock(text, "form")
  text = stripParams(text)
  text = listNuker(text)
  text = noTag(text, "div")
  text = noTag(text, "span")
  text = noTag(text, "img")
  text = noTag(text, "a")
  text = singleizer(text)
  return text

def getHTML(url):
  try:
    response = requests.get(url)
  except:
    return ''
  return response.text

def justBody(text):
  pattern = r"<\s*body\s*.*?>(?P<capture>.*)<\s*/body\s*>"
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  bodymat = pat.search(text)
  return bodymat.group('capture')

def stripParams(text):
  pattern = r"(<\s*[a-zA-Z0-9]+).*?(?:>)"
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  nopram = pat.sub(r'\1>', text)
  return nopram

def listNuker(text):
  pattern = r"<\s*(ol|ul)\s*.*?>.*?<\s*/(ol|ul)\s*>"
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  listless = pat.sub('', text)
  pattern = r"<\s*li\s*.*?>.*?<\s*/li\s*>"
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  listless = pat.sub('', listless)
  pattern = r"(<\s*(li|ol|ul)\s*.*?>)|(<\s*/(li|ol|ul)\s*>)"
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  listless = pat.sub('', listless)
  return listless

def noTag(text, tag):
  pattern = r"(<\s*%s\s*.*?>)|(<\s*/%s\s*>)" % (tag, tag)
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  tagless = pat.sub('', text)
  return tagless

def noTagBlock(text, tag):
  pattern = r"<\s*%s\s*.*?>.*<\s*/%s\s*>" % (tag, tag)
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  byeblock = pat.sub('', text)
  return byeblock

def singleizer(text):
  pattern = r"\t|\r"
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  singled = pat.sub('', text)
  pattern = r"(\n{2,})"
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  singled = pat.sub(r'\n', singled)
  pattern = r"^\n|\n$"
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  singled = pat.sub('', singled)
  return singled

if __name__ == '__main__':
  main()

Replacing HTML Entities And Marking Up With Markdown

We are so close, I can taste it. There are just two more sorts of replacement to do at this point. First, I'm going to replace those HTML entities with the nifty code located at:

http://stackoverflow.com/questions/1197981/convert-html-entities-to-ascii-in-python

Next, I’m going to venture into the world of markdown, per:

http://daringfireball.net/projects/markdown/

Almost everything I’m going to use markdown for is going to put things on their own line, and sometimes put a token before it.

If I get just these, I should be in pretty good shape:

<p> --> \n
<blockquote> --> \n>
<hr> --> \n---\n
<h1> --> \n#
<h2> --> \n##
<h3> --> \n###
<h4> --> \n####

Okay, it's 5:00 PM. Let's see if I can't get this markdown done ASAP, and maybe even have this whole thing incorporated into the main code. Hmmmm. Well, I want to get away from regex; I have regex fatigue. And so long as the matches are reliable, I can just use Python's built-in string replace method. To that end, I have to make sure the tags I'm replacing have been properly lowercased. Okay, that's done. Now… By Jove, I think I've got it:

import requests, re, htmlentitydefs

def main():
  tests = ['http://mikelev.in/about/',
    'http://perezhilton.com/about',
    'http://krebsonsecurity.com/about/',
    'http://galadarling.com/static/about-gala',
    'http://www.codinghorror.com/blog/2004/02/about-me.html']
  text = justText(tests[0])
  print [text]

def justText(url):
  text = getHTML(url)
  text = justBody(text)
  text = noTagBlock(text, "script")
  text = noTagBlock(text, "form")
  text = stripParams(text)
  text = lowercaseTags(text)
  text = listNuker(text)
  for nuketag in ['div', 'span', 'img', 'a', 'b', 'i']:
    text = noTag(text, nuketag)
  text = singleizer(text)
  text = convert_html_entities(text)
  text = markdown(text)
  return text

def getHTML(url):
  try:
    response = requests.get(url)
  except:
    return ''
  return response.text

def justBody(text):
  pattern = r"<\s*body\s*.*?>(?P<capture>.*)<\s*/body\s*>"
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  bodymat = pat.search(text)
  return bodymat.group('capture')

def markdown(text):
  text = text.replace('<p>', "\n")
  text = text.replace('</p>', "\n")
  text = text.replace('<hr>', "\n---\n")
  text = text.replace('<blockquote>', "\n> ")
  text = text.replace('</blockquote>', "\n")
  text = text.replace('<h1>', "\n# ")
  text = text.replace('<h2>', "\n## ")
  text = text.replace('<h3>', "\n### ")
  text = text.replace('<h4>', "\n#### ")
  text = text.replace('</h1>', "\n")
  text = text.replace('</h2>', "\n")
  text = text.replace('</h3>', "\n")
  text = text.replace('</h4>', "\n")
  text = text.strip()
  return text

def lowercaseTags(text):
  pattern = r"<(/?[a-zA-Z0-9]+)>"
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  # Calling .lower() on the template string '<\1>' would do nothing, so
  # a function replacement lowercases the captured tag name itself.
  lowered = pat.sub(lambda mat: '<' + mat.group(1).lower() + '>', text)
  return lowered

def stripParams(text):
  pattern = r"(<\s*[a-zA-Z0-9]+).*?(?:>)"
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  nopram = pat.sub(r'\1>', text)
  return nopram

def listNuker(text):
  pattern = r"<\s*(ol|ul)\s*.*?>.*?<\s*/(ol|ul)\s*>"
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  listless = pat.sub('', text)
  pattern = r"<\s*li\s*.*?>.*?<\s*/li\s*>"
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  listless = pat.sub('', listless)
  pattern = r"(<\s*(li|ol|ul)\s*.*?>)|(<\s*/(li|ol|ul)\s*>)"
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  listless = pat.sub('', listless)
  return listless

def noTag(text, tag):
  pattern = r"(<\s*%s\s*.*?>)|(<\s*/%s\s*>)" % (tag, tag)
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  tagless = pat.sub('', text)
  return tagless

def noTagBlock(text, tag):
  pattern = r"<\s*%s\s*.*?>.*<\s*/%s\s*>" % (tag, tag)
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  byeblock = pat.sub('', text)
  return byeblock

def singleizer(text):
  pattern = r"\t|\r"
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  singled = pat.sub('', text)
  pattern = r"(\n{2,})"
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  singled = pat.sub(r'\n', singled)
  pattern = r"^\n|\n$"
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  singled = pat.sub('', singled)
  return singled

def convert_html_entities(s):
  matches = re.findall("&#\d+;", s)
  if len(matches) > 0:
    hits = set(matches)
    for hit in hits:
      name = hit[2:-1]
      try:
        entnum = int(name)
        s = s.replace(hit, unichr(entnum))
      except ValueError:
        pass

  matches = re.findall("&#[xX][0-9a-fA-F]+;", s)
  if len(matches) > 0:
    hits = set(matches)
    for hit in hits:
      hex = hit[3:-1]
      try:
        entnum = int(hex, 16)
        s = s.replace(hit, unichr(entnum))
      except ValueError:
        pass

  matches = re.findall("&\w+;", s)
  hits = set(matches)
  amp = "&"
  if amp in hits:
    hits.remove(amp)
  for hit in hits:
    name = hit[1:-1]
    if htmlentitydefs.name2codepoint.has_key(name):
      s = s.replace(hit, unichr(htmlentitydefs.name2codepoint[name]))
  s = s.replace(amp, "&")
  return s

if __name__ == '__main__':
  main()

Close To Final Form

Okay, I've got just under a half-hour to see if I can't successfully put this into the main code. Copy and paste, or make it a module? It's special enough that I think I should make it a separate file. This is the final form as of today. The last thing I did was truncate the returned value at 5000 characters, since its ultimate destination is a Google Spreadsheet cell. I may adjust this in the future, but it seems like a good number of characters for now. My final steps tomorrow will be to copy this file, get rid of the main function call, and make it a library intended to be imported. I'm also feeling this may be my first GitHub contribution.

import requests, re, htmlentitydefs

def main():
  tests = ['http://mikelev.in/about/',
    'http://perezhilton.com/about',
    'http://krebsonsecurity.com/about/',
    'http://galadarling.com/static/about-gala',
    'http://www.codinghorror.com/blog/2004/02/about-me.html']
  text = justText(tests[3])
  print text

def justText(url):
  text = getHTML(url)
  for nuketagblock in ['title', 'head']:
    text = noTagBlock(text, nuketagblock)
  text = justBody(text)
  text = noComments(text)
  for nuketagblock in ['script', 'noscript', 'form', 'object', 'embed',
    'select']:
    text = noTagBlock(text, nuketagblock)
  text = stripParams(text)
  text = lowercaseTags(text)
  text = listNuker(text)
  for nuketag in ['div', 'span', 'img', 'a', 'b', 'i', 'param', 'table',
    'td', 'tr', 'font', 'title', 'head', 'meta', 'strong', 'em', 'iframe']:
    text = noTag(text, nuketag)
  text = singleizer(text)
  text = convert_html_entities(text)
  text = markdown(text)
  text = just2LR(text)
  text = text[:5000]
  return text

def getHTML(url):
  # Catch only request-level failures (DNS errors, timeouts, bad URLs)
  # and hand back an empty string rather than crashing the whole run.
  try:
    response = requests.get(url)
  except requests.exceptions.RequestException:
    return ''
  return response.text

def justBody(text):
  pattern = r"<\s*body\s*.*?>(?P<capture>.*)<\s*/body\s*>"
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  bodymat = pat.search(text)
  if bodymat:
    return bodymat.group('capture')
  else:
    return ''

def markdown(text):
  text = text.replace('<p>', "\n")
  text = text.replace('</p>', "")
  text = text.replace('<hr>', "\n---\n")
  text = text.replace('<blockquote>', "\n> ")
  text = text.replace('</blockquote>', "")
  text = text.replace('<h1>', "\n# ")
  text = text.replace('<h2>', "\n## ")
  text = text.replace('<h3>', "\n### ")
  text = text.replace('<h4>', "\n#### ")
  text = text.replace('</h1>', "")
  text = text.replace('</h2>', "")
  text = text.replace('</h3>', "")
  text = text.replace('</h4>', "")
  text = text.strip()
  return text

def just2LR(text):
  pattern = r"\n{2,}"
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  less = pat.sub(r'\n\n', text)
  pattern = " {2,}"
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  less = pat.sub(' ', less)
  return less

def lowercaseTags(text):
  pattern = r"<(/?[a-zA-Z0-9]+)>"
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  # Calling .lower() on the template string '<\1>' would do nothing, so
  # a function replacement lowercases the captured tag name itself.
  lowered = pat.sub(lambda mat: '<' + mat.group(1).lower() + '>', text)
  return lowered

def stripParams(text):
  pattern = r"(<\s*[a-zA-Z0-9]+).*?(?:>)"
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  nopram = pat.sub(r'\1>', text)
  return nopram

def listNuker(text):
  pattern = r"<\s*(ol|ul)\s*.*?>.*?<\s*/(ol|ul)\s*>"
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  listless = pat.sub('', text)
  pattern = r"<\s*li\s*.*?>.*?<\s*/li\s*>"
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  listless = pat.sub('', listless)
  pattern = r"(<\s*(li|ol|ul)\s*.*?>)|(<\s*/(li|ol|ul)\s*>)"
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  listless = pat.sub('', listless)
  return listless

def noTag(text, tag):
  pattern = r"(<\s*%s\s*.*?>)|(<\s*/%s\s*>)" % (tag, tag)
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  tagless = pat.sub('', text)
  return tagless

def noTagBlock(text, tag):
  pattern = r"<\s*%s\s*.*?>.*?<\s*/%s\s*>" % (tag, tag)
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  byeblock = pat.sub('', text)
  return byeblock

def noComments(text):
  pattern = r"<!--.*?-->"
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  byeblock = pat.sub('', text)
  return byeblock

def singleizer(text):
  pattern = r"\t|\r"
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  singled = pat.sub('', text)
  pattern = r"^.{,30}$"
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL | re.MULTILINE)
  singled = pat.sub('', singled)
  pattern = r"\n{2,}"
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  singled = pat.sub(r'\n', singled)
  pattern = r"\n.{,10}\n"
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  singled = pat.sub(r'\n', singled)
  pattern = r"^\n|\n$"
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  singled = pat.sub('', singled)
  return singled

def convert_html_entities(s):
  # Decimal entities, e.g. &#8212;
  matches = re.findall(r"&#\d+;", s)
  if len(matches) > 0:
    hits = set(matches)
    for hit in hits:
      name = hit[2:-1]
      try:
        entnum = int(name)
        s = s.replace(hit, unichr(entnum))
      except ValueError:
        pass

  # Hexadecimal entities, e.g. &#x2014;
  matches = re.findall(r"&#[xX][0-9a-fA-F]+;", s)
  if len(matches) > 0:
    hits = set(matches)
    for hit in hits:
      hexnum = hit[3:-1]
      try:
        entnum = int(hexnum, 16)
        s = s.replace(hit, unichr(entnum))
      except ValueError:
        pass

  # Named entities, e.g. &mdash;
  # Save &amp; for last so freshly decoded ampersands aren't re-processed.
  matches = re.findall(r"&\w+;", s)
  hits = set(matches)
  amp = "&amp;"
  if amp in hits:
    hits.remove(amp)
  for hit in hits:
    name = hit[1:-1]
    if name in htmlentitydefs.name2codepoint:
      s = s.replace(hit, unichr(htmlentitydefs.name2codepoint[name]))
  s = s.replace(amp, "&")
  return s

if __name__ == '__main__':
  main()

Fixing Line Wrapping in Code Samples When Publishing From Git to WordPress

Okay, this will be a real test of the git publishing system I just created. There have been multiple commits and tons of line-wrapping issues. I hard-wrap my lines as I type, per my default vim configuration and preferred coding environment, but to make content flow well into a WordPress blog, I strip the hard returns back out using sed (the stream editor built into the Unix command set). It would be a pleasure to push this post into WordPress as a draft without having to worry about text wrapping, so I have to fix this, and this is the post to fix it on.

Right now, my sed command is this:

sed 's/^+//'

Sed is simply stripping off the leading pluses. I'm piping that into the fmt command to do the line re-wrapping:

fmt -w 2500
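
Chained together, the relevant leg of the pipeline is something like this (post.txt is just a stand-in for whatever my publishing script pulls out of the git commit):

cat post.txt | sed 's/^+//' | fmt -w 2500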

The line re-wrapping is where the problem occurs, because WordPress preserves the hard returns. Plugging code samples through fmt is a bad idea, because it nukes the line returns inside them. Stripping off the leading pluses with sed is still a good idea, but it may be better to use the power of Python to strip the line-wraps. So: take the piping into fmt out of postit.py, add re to its imports, and identify the condition under which a line break should be stripped.

Paragraphs NEVER have lines that begin with whitespace; by their nature, their lines ALWAYS begin with non-whitespace characters. Thankfully, Python enforces indentation, so inside a function body you will NEVER have a line break followed by a non-whitespace character, which means Python functions can be protected from the regex match. However, stacked lines (consecutive unindented lines, like shell commands) are not so easy. I will have to always contain stacked lines in pre or code tags in order to protect them. That's double protection: not matching the line-break pattern AND always being contained in special formatting tags.

Okay, so this is going to be a regex substitution pattern. What’s the pattern?

You want to strip out a line return that is preceded by an ordinary character (so blank lines, which are consecutive returns, are left alone) and immediately followed by a non-whitespace character, without consuming that following character. That calls for a lookahead plus the negation of a metacharacter. This is a really fancy piece of work here. I couldn't get negative lookahead to do the job, so this is a workaround where I let the first negated matching group consume its character, then insert it back in during the substitution. BAM!

pattern = r"([^\n])\n(?=^[^\s])"
pat = re.compile(pattern, re.IGNORECASE | re.DOTALL | re.MULTILINE)
content = pat.sub(r'\1 ', content)
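
Here's a tiny standalone sanity-check of that pattern (the sample text is made up purely for illustration):

import re

content = ("This is a paragraph that got\n"
           "hard-wrapped by vim.\n"
           "\n"
           "  indented code stays put\n")
pattern = r"([^\n])\n(?=^[^\s])"
pat = re.compile(pattern, re.IGNORECASE | re.DOTALL | re.MULTILINE)
# The wrapped paragraph joins into one line; the blank line and the
# indented line both survive untouched.
print pat.sub(r'\1 ', content)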

This is a tricky bit of work. I should probably update my previous post on this git publishing system now that I've revised it. But this will serve me very, very well: I'm going to be able to push out Python code samples very easily now. It may not be so easy with other languages; a further refinement where code tags "protect" whole blocks would be way too complicated to think through right now. Bank your win and move on!

My last step is to strip out the main function and turn this into an importable library so I can incorporate it into my main work. I will probably update this post with that final step. I hope your head didn't explode from this post.

Okay, here’s the final form of the program. It can be imported with the statement:

from markdown import *

…then you can just call:

myMarkdown = markdown(url)

import requests, re, htmlentitydefs

def markdown(url):
  text = getHTML(url)
  for nuketagblock in ['title', 'head']:
    text = noTagBlock(text, nuketagblock)
  text = justBody(text)
  text = noComments(text)
  for nuketagblock in ['script', 'style', 'noscript', 'form',
    'object', 'embed', 'select']:
    text = noTagBlock(text, nuketagblock)
  text = stripParams(text)
  text = lowercaseTags(text)
  text = listNuker(text)
  for nuketag in ['div', 'span', 'img', 'a', 'b', 'i', 'param', 'table',
    'td', 'tr', 'font', 'title', 'head', 'meta', 'strong', 'em', 'iframe']:
    text = noTag(text, nuketag)
  text = singleizer(text)
  text = convert_html_entities(text)
  text = addmarkdown(text)
  text = just2LR(text)
  if len(text) < 5000:
    return text
  else:
    return text[:5000]+'...'

def getHTML(url):
  try:
    response = requests.get(url)
  except requests.RequestException:
    # Treat any requests-level failure as an empty page.
    return ''
  return response.text

def justBody(text):
  pattern = r"<\s*body\s*.*?>(?P<capture>.*)<\s*/body\s*>"
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  bodymat = pat.search(text)
  if bodymat:
    return bodymat.group('capture')
  else:
    return ''

def addmarkdown(text):
  text = text.replace('<p>', "\n")
  text = text.replace('</p>', "")
  text = text.replace('<hr>', "\n---\n")
  text = text.replace('<blockquote>', "\n> ")
  text = text.replace('</blockquote>', "")
  text = text.replace('<h1>', "\n# ")
  text = text.replace('<h2>', "\n## ")
  text = text.replace('<h3>', "\n### ")
  text = text.replace('<h4>', "\n#### ")
  text = text.replace('</h1>', "")
  text = text.replace('</h2>', "")
  text = text.replace('</h3>', "")
  text = text.replace('</h4>', "")
  text = text.strip()
  return text

def just2LR(text):
  pattern = r"\n{2,}"
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  less = pat.sub(r'\n\n', text)
  pattern = " {2,}"
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  less = pat.sub(' ', less)
  return less

def lowercaseTags(text):
  pattern = r"<(/?[a-zA-Z0-9]+)>"
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  # The replacement must be a function; calling .lower() on the template
  # string '<\1>' would do nothing to the captured tag name.
  lowered = pat.sub(lambda m: '<' + m.group(1).lower() + '>', text)
  return lowered

def stripParams(text):
  pattern = r"(<\s*[a-zA-Z0-9]+).*?(?:>)"
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  nopram = pat.sub(r'\1>', text)
  return nopram

def listNuker(text):
  pattern = r"<\s*(ol|ul)\s*.*?>.*?<\s*/(ol|ul)\s*>"
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  listless = pat.sub('', text)
  pattern = r"<\s*li\s*.*?>.*?<\s*/li\s*>"
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  listless = pat.sub('', listless)
  pattern = r"(<\s*(li|ol|ul)\s*.*?>)|(<\s*/(li|ol|ul)\s*>)"
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  listless = pat.sub('', listless)
  return listless

def noTag(text, tag):
  pattern = r"(<\s*%s\s*.*?>)|(<\s*/%s\s*>)" % (tag, tag)
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  tagless = pat.sub('', text)
  return tagless

def noTagBlock(text, tag):
  pattern = r"<\s*%s\s*.*?>.*?<\s*/%s\s*>" % (tag, tag)
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  byeblock = pat.sub('', text)
  return byeblock

def noComments(text):
  pattern = r"<!--.*?-->"
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  byeblock = pat.sub('', text)
  return byeblock

def singleizer(text):
  pattern = r"\t|\r"
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  singled = pat.sub('', text)
  pattern = r"^.{,30}$"
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL | re.MULTILINE)
  singled = pat.sub('', singled)
  pattern = r"\n{2,}"
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  singled = pat.sub(r'\n', singled)
  pattern = r"\n.{,10}\n"
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  singled = pat.sub(r'\n', singled)
  pattern = r"^\n|\n$"
  pat = re.compile(pattern, re.IGNORECASE | re.DOTALL)
  singled = pat.sub('', singled)
  return singled

def convert_html_entities(s):
  # Decimal entities, e.g. &#8212;
  matches = re.findall(r"&#\d+;", s)
  if len(matches) > 0:
    hits = set(matches)
    for hit in hits:
      name = hit[2:-1]
      try:
        entnum = int(name)
        s = s.replace(hit, unichr(entnum))
      except ValueError:
        pass

  # Hexadecimal entities, e.g. &#x2014;
  matches = re.findall(r"&#[xX][0-9a-fA-F]+;", s)
  if len(matches) > 0:
    hits = set(matches)
    for hit in hits:
      hexnum = hit[3:-1]
      try:
        entnum = int(hexnum, 16)
        s = s.replace(hit, unichr(entnum))
      except ValueError:
        pass

  # Named entities, e.g. &mdash;
  # Save &amp; for last so freshly decoded ampersands aren't re-processed.
  matches = re.findall(r"&\w+;", s)
  hits = set(matches)
  amp = "&amp;"
  if amp in hits:
    hits.remove(amp)
  for hit in hits:
    name = hit[1:-1]
    if name in htmlentitydefs.name2codepoint:
      s = s.replace(hit, unichr(htmlentitydefs.name2codepoint[name]))
  s = s.replace(amp, "&")
  return s

The End Result

So, just plop that code into a file named markdown.py, make sure you have the Python Requests module installed, and you can write code like this:

from markdown import *
myMarkdown = markdown("http://mikelev.in/")
print myMarkdown 
