Blog books

Updated: Feb 20, 2018 by Pradeep Gowda.

Update (2014/09/24): I completed the automatic conversion. The scripts and the finished artifact can be found in the blogbooks Bitbucket repository.

OBJECTIVE: To make a blog book out of http://normaldeviate.wordpress.com/.

Get a list of the archive pages:

Use pup to extract the URLs:

curl http://normaldeviate.wordpress.com/ | pup li#archives-2 'a[href]' attr{href} > archivepages.txt
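
archivepages.txt should now contain one monthly archive URL per line, along these lines (the months shown are illustrative):

http://normaldeviate.wordpress.com/2012/06/
http://normaldeviate.wordpress.com/2012/07/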

Inside each of these archive pages, the post URLs are available under h3.entry-title > a:

curl normaldeviate.wordpress.com/2012/07/ | pup h3.entry-title 'a[href]' attr{href}

We will use GNU parallel to fetch every archive page and extract all the post links:

cat archivepages.txt | parallel "curl {} | pup h3.entry-title 'a[href]' attr{href}" > all-links.txt
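
GNU parallel does not keep the output in input order by default; if the links should stay in archive order, its -k (--keep-order) flag takes care of that:

cat archivepages.txt | parallel -k "curl {} | pup h3.entry-title 'a[href]' attr{href}" > all-links.txt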

Use a small Python script to generate a bash script, which will then be used to download the HTML files:

# proc1.py
with open('all-links.txt', 'r') as f:
    for line in f:
        url = line.strip()
        # turn the post URL into a flat file name, e.g.
        # http://normaldeviate.wordpress.com/2012/07/some-post/ -> 2012-07-some-post.1
        fname = url.strip('/').replace('http://normaldeviate.wordpress.com/', '')
        fname = fname.replace('/', '-')
        fname = '%s.1' % (fname,)
        print "wget %s -O %s" % (url, fname)
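
Each line of the generated download.sh should look like this (the post slug is illustrative):

wget http://normaldeviate.wordpress.com/2012/07/example-post/ -O 2012-07-example-post.1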

Run this script:

python proc1.py > download.sh

And then run the shell script:

bash download.sh

Inside individual posts:

Comments are inside the element with id="comments-list". Each comment has an id of the form id="comment-nn".
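
A pup one-liner along these lines should dump the comment text from a downloaded post (the file name is illustrative; the selector follows the structure noted above):

cat 2012-07-example-post.1 | pup '#comments-list text{}'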

Challenges: