Extracting useful stuff from a SnipSnap wiki

The MSL lab at CMU (where I work) uses wiki software called SnipSnap. SnipSnap is Java-based, apparently standalone (you don't need a webserver), and as far as I can tell stores all its files in a database somewhere. We had a scare this spring when the wikis went down and we didn't know how to get them back up. Thankfully, the guy who installed them was still around and he fixed things, but the experience prompted me to back up our important files in a format that doesn't require a working copy of SnipSnap to use.

SnipSnap allows you to back up the wiki to XML, but for some reason this wasn't working on our site. I could've gone through each page, right-clicked each attachment, and copy-pasted all the page text into text files (or, even lazier, just saved all the HTML somewhere), but this didn't sound like fun. Instead, I wrote a Python script to do all of that for me. It converts a SnipSnap wiki into a directory hierarchy of text files, comments, and attachments that anyone with a computer can look at.

snipsnap_extract.py uses the excellent Beautiful Soup for screen scraping, so you'll need to download that and put it somewhere Python can import it. You give the script the web address of the site and a file with the names of all the pages you'd like to download, and it sucks them all down, maintaining the page hierarchy that's in the wiki. For each page, you get a directory containing all the attachments to that page, as well as the files snipsnap_$pagename.txt with the body text and snipsnap_comments.txt with the comments listed by poster and date.
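
To make that layout concrete, here is a rough sketch of how a page name could map onto the output files. This is not the code from snipsnap_extract.py. The helper names page_paths and read_page_names are made up for illustration, and it assumes nested pages are written with "/" separators and that $pagename means the last component of the page name.

```python
import os


def page_paths(out_root, page_name):
    """Return (page_dir, body_file, comments_file) for one wiki page.

    Hypothetical helper: assumes nested pages look like "projects/robot".
    """
    parts = [p for p in page_name.split("/") if p]
    if not parts:
        raise ValueError("empty page name")
    # One directory per page, nested the same way the wiki nests pages.
    page_dir = os.path.join(out_root, *parts)
    # Assumption: $pagename is the last component of the page name.
    body_file = os.path.join(page_dir, "snipsnap_%s.txt" % parts[-1])
    comments_file = os.path.join(page_dir, "snipsnap_comments.txt")
    return page_dir, body_file, comments_file


def read_page_names(path):
    """Read one page name per line from a listing file, skipping blanks."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]
```

Attachments would then be saved into page_dir next to the two text files.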

 $ ./snipsnap_extract.py {-p,--prefix} <URL prefix> {-n,--name} <page name>
 $ ./snipsnap_extract.py {-p,--prefix} <URL prefix> {-f,--file} <file listing page names>
 $ ./snipsnap_extract.py {-u,--url} <full URL>

 $ ./snipsnap_extract.py -p http://www.foo.com/path/comments -n our_startpage
 $ ./snipsnap_extract.py -p http://www.foo.com/to/space/ -f files_i_want.txt
 $ ./snipsnap_extract.py -u http://www.foo.com/wiki/comments/our_startpage
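
For anyone curious how the three invocation styles fit together, here is a minimal sketch of the option handling using argparse. It is an illustration only, not what snipsnap_extract.py actually does. The functions parse_args and page_url are hypothetical, and joining the prefix and page name with a single "/" is an assumption based on the examples above.

```python
import argparse


def parse_args(argv):
    """Parse the options shown in the usage lines (illustrative sketch)."""
    parser = argparse.ArgumentParser(prog="snipsnap_extract.py")
    parser.add_argument("-p", "--prefix", help="URL prefix of the wiki")
    parser.add_argument("-n", "--name", help="name of a single page")
    parser.add_argument("-f", "--file", help="file listing page names")
    parser.add_argument("-u", "--url", help="full URL of a single page")
    args = parser.parse_args(argv)
    if args.url:
        # Assumption: a full URL stands alone and replaces prefix + name.
        if args.prefix or args.name or args.file:
            parser.error("--url cannot be combined with other options")
    elif not (args.prefix and (args.name or args.file)):
        parser.error("give --url, or --prefix with --name or --file")
    return args


def page_url(prefix, name):
    """Join a URL prefix and a page name with exactly one slash."""
    return prefix.rstrip("/") + "/" + name.lstrip("/")
```

With this sketch, page_url("http://www.foo.com/path/comments", "our_startpage") gives the same address you would pass directly with -u.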

If you find this useful, please drop me an email at katie (thingy) rivard.org.