With a bit of Python, lynx, and tidy, I was able to pull very clean plain text versions of my WordPress posts. The sparse HTML can be found at http://tokyogringo.myjp.net and the markdown text version can be found on my gopher site at gopher://sdf.org:70/0/users/tokyogringo/
How did I do it? This site has full text RSS for everyone’s enjoyment. No one has to actually visit https://www.prjorgensen.com in order to consume the high value content I generate. The feed contains everything needed for this plain text life. How to make use of it?
I fumbled through my first Python script in a long time, relying heavily on the very powerful feedparser module.
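Here is a minimal sketch of the feedparser part. It is not my actual script, just the general shape. It assumes a full text feed where feedparser exposes the post body as `entry["content"][0]["value"]`, falling back to `summary`. The names `slugify`, `entry_to_html`, and `save_posts` are made up for illustration.

```python
import html
import re


def slugify(title):
    """Turn a post title into a safe file name stem."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return slug or "untitled"


def entry_to_html(entry):
    """Build a sparse HTML page from one feedparser-style entry."""
    # Full text feeds usually carry the post body in entry["content"];
    # fall back to the summary when that key is missing or empty.
    if entry.get("content"):
        body = entry["content"][0]["value"]
    else:
        body = entry.get("summary", "")
    title = html.escape(entry.get("title", "Untitled"))
    return (
        "<html><head><title>{0}</title></head>"
        "<body><h1>{0}</h1>{1}</body></html>".format(title, body)
    )


def save_posts(feed_url):
    """Fetch the feed and write one HTML file per post (hypothetical helper)."""
    # Third-party module: pip install feedparser.
    # Imported here so the helpers above work without it installed.
    import feedparser

    for entry in feedparser.parse(feed_url).entries:
        path = slugify(entry.get("title", "")) + ".html"
        with open(path, "w", encoding="utf-8") as fh:
            fh.write(entry_to_html(entry))
```

Calling `save_posts()` with the site's feed URL would drop one file per post into the current directory. Where the feed lives on a given WordPress install is something to check rather than assume.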
This Just In: Python’s documentation is terse almost to the point of incomprehensibility. While accurate, the documentation does not help beginning (and maybe middling) Python coders get to solving problems. Oddly, Reddit and the StackExchange sites are also of limited utility, as the answers there often point back to or copy the documentation.
Anyway, taking a very Unix approach, I decided not to do everything in Python. I know tidy for making valid HTML. I know lynx for terminal-based web browsing, and its ‘-dump’ option writes a formatted plain text (markdown-like) version of a web page.
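The hand-off from Python to tidy and lynx can be sketched roughly like this. It assumes both programs are on the PATH and that piping through stdin is acceptable. The flag choices are mine for illustration (tidy's `-q -asxhtml -utf8`, lynx's `-dump -stdin -force_html -nolist`), and `run_filter` and `html_to_text` are made-up names.

```python
import subprocess

# Assumed command lines; adjust the flags to taste.
TIDY_CMD = ["tidy", "-q", "-asxhtml", "-utf8", "--show-warnings", "no"]
LYNX_CMD = ["lynx", "-dump", "-stdin", "-force_html", "-nolist"]


def run_filter(cmd, text, ok_codes=(0,)):
    """Pipe text through an external command and return its stdout."""
    result = subprocess.run(
        cmd, input=text, capture_output=True, encoding="utf-8"
    )
    if result.returncode not in ok_codes:
        raise RuntimeError("{} failed: {}".format(cmd[0], result.stderr.strip()))
    return result.stdout


def html_to_text(raw_html):
    """Return (clean_html, plain_text) for one post."""
    # tidy exits with status 1 when it only has warnings, so accept that too.
    clean_html = run_filter(TIDY_CMD, raw_html, ok_codes=(0, 1))
    plain_text = run_filter(LYNX_CMD, clean_html)
    return clean_html, plain_text
```

Keeping the external tools behind one small `run_filter` function means either one can be swapped out later without touching the rest of the script.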
Once the script was delivering the website data in a reliable and parseable way, I turned to getting all of my posts.
I cranked the RSS feed of prjorgensen.com up to 20,000 items to make sure the feed briefly included all of my posts. I moved my parsing script to my MacBook Pro because I didn’t want to choke the sdf.org servers with my madness. I installed modules and localized the script to run on the MBP.
I ran the script. I checked my email. I then got up to … hmmm. The script finished in under two minutes. Suddenly I had all of my posts back to 2011 in both very clean HTML and in plain text. I synced them to their proper home. I reset my website feed back to a more reasonable number.
There are any number of improvements I can make:
- My script does not grab images
- I capture categories and tags from WordPress but don’t do anything useful with them
- The script needs to update my gophermap and my index.html (as appropriate)
- A full text RSS feed of the plain HTML site
- A full text RSS feed of the gopher site
- Maybe use a static web site generator like Jekyll for the plain HTML site
- Maybe use this for tokyogringo.com and PVCSec.com? If so, then I need to handle …
- Media enclosures
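For the gophermap item on that list, a rough sketch of what generating entries might look like. It assumes the common tab-separated gophermap line (item type, display text, selector, host, port) with type `0` for text files, and it takes the host, port, and selector base from the gopher URL above. Whether the server wants exactly this form is something I would need to verify. `gophermap_line` and `build_gophermap` are hypothetical names.

```python
GOPHER_HOST = "sdf.org"
GOPHER_PORT = 70
SELECTOR_BASE = "/users/tokyogringo/"  # taken from the gopher URL above


def gophermap_line(title, filename):
    """Format one text-file entry (item type 0) for a gophermap."""
    # Tabs separate the fields, so collapse any whitespace in the title.
    display = " ".join(title.split())
    return "0{}\t{}{}\t{}\t{}".format(
        display, SELECTOR_BASE, filename, GOPHER_HOST, GOPHER_PORT
    )


def build_gophermap(posts):
    """posts is an iterable of (title, filename) pairs."""
    return "\n".join(gophermap_line(t, f) for t, f in posts) + "\n"
```

For example, `build_gophermap([("My Post", "my-post.txt")])` yields a single line that points at `/users/tokyogringo/my-post.txt`.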
Watch this space for the link to my script on GitHub. Which is here!