Skip to content. | Skip to navigation

Personal tools

>>> ''.join(word[:3].lower() for word in 'David Isaac Glick'.split())

‘davisagli’

Navigation

You are here: Home / Blog / Notes on migrating this blog from Wordpress to Plone

Notes on migrating this blog from Wordpress to Plone

by David Glick posted Apr 11, 2010 02:06 AM
Includes example code for exporting blog posts and comments from Wordpress and importing them into Plone.

I'm proud to be now running this blog on Plone 4, having just completed a migration from Wordpress.  Here are a few notes on the migration.

Initial steps that I won't go into detail on, at least for now:

  • Obtained a Linode VPS with enough RAM to feed Plone (it's looking like my small Plone 4 site will need about 100MB to itself ... a greedy baseline, but that does include the database).
  • Created a Plone 4 buildout with ZEO and one Zope instance being run under supervisord.
  • Installed plone.app.caching, the new caching framework Martin Aspeli and Ric Newbery have been working on.  This still requires a number of svn checkouts at this point, but it's coming along nicely.
  • Installed Scrawl so that I can manage blog entries as a separate content type.
  • Installed plone.app.discussion, the new comment and discussion add-on created by Timo Stollenwerk, and its recaptcha add-on.  Made sure comments were enabled for the Blog Entry type, and that comment moderation was not turned on for the duration of the migration.
  • Turned off much of Plone's default HTML filtering.  Security is not a huge concern since I'm going to be the only one editing the site, and I tend to want to use fancy stuff in posts sometimes.

Data Export

Moving the posts and comments from my old blog's MySQL database was easier than I feared, though I did have to do a bit of coding.

I decided up front to do this by way of dumping the data to CSV files, rather than writing import code that read directly from MySQL. That was mostly a visceral reaction to a memory of a hard time getting MySQL-python installed and working properly once previously, and may have been an irrational fear.  But dumping the data from MySQL to CSV was easy enough, with the following two queries that grabbed just the data I needed:

SELECT ID, post_date, CONVERT(post_content USING latin1), post_title
INTO OUTFILE '/tmp/musings.csv'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' ESCAPED BY '\\' LINES TERMINATED BY '\n'
FROM wp_posts, wp_post2cat
WHERE wp_posts.ID=post_id AND category_id=48
AND post_status='publish';

SELECT comment_post_ID, comment_author, comment_author_email, comment_author_url, comment_date, CONVERT(comment_content USING latin1), user_id
INTO OUTFILE '/tmp/comments.csv'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' ESCAPED BY '\\' LINES TERMINATED BY '\n'
FROM wp_comments
WHERE comment_approved = '1' AND comment_type='';

This grabs a bunch of fields and dumps them into a CSV file with the given CSV dialect parameters.  The only tricky bit here is the call to CONVERT, which was needed because my raw data in MySQL had been improperly encoded.  A normal connection to MySQL defaults to the latin1 encoding (which is what MySQL calls windows-1252).  But Wordpress had been sending it utf8-encoded data, and the MySQL table had been configured to store things as utf8.  So when I stored data, MySQL was decoding the utf8 input as windows-1252, and then re-encoding as utf8.  On retrieval via a normal connection the reverse transformation was applied and it didn't matter, but SELECT INTO OUTFILE just copies the raw data from the table, which was effectively gobbledygook.  So I had to explicitly make MySQL convert the stored value to latin1 (read: decode it as utf8 and then encode as windows-1252) in order to end up with the utf8 I wanted.  This would have been needed for the other fields as well, except I wasn't using non-ASCII characters in them.

The category restriction on the first query makes sure that I only got the Plone-related posts from the old blog (I had been using it for personal blogging as well).  The comment type restriction on the second query excludes pingbacks.

Data Import

I ended up writing this custom External Method to import the CSV data into Plone.  (I looked at transmogrifier, csvreplicata, and ArcheCSV, but for all of these it looked like I would have ended up writing a significant amount of code in the end anyway to get them to do what I wanted. And I knew I could do that "from scratch" in just a page or two of Python...)

import csv
import re
from DateTime import DateTime
from zope.component import queryUtility, createObject
from plone.i18n.normalizer.interfaces import IIDNormalizer
from plone.app.discussion.interfaces import IConversation

PRE_RE = re.compile(r'(<pre>.*?</pre>)', re.IGNORECASE | re.DOTALL)

def cleanup_wordpress_text(text):
    text = PRE_RE.sub(lambda x: x.group(1).replace('\r\n\r\n', '\n\n'), text)
    return text.replace('\r\n\r\n', '<p>').replace('\r\n','\n').decode('utf-8')

def importmusings(self):
    context = self
    reader = csv.reader(open('/tmp/musings.csv'), delimiter=',', quotechar='"', doublequote=False, escapechar='\\')
    posts = {}
    for row in reader:
        id, date, text, title = row
        short = queryUtility(IIDNormalizer).normalize(title)
        if short in context:
            del context[short]
        post = context[context.invokeFactory('Blog Entry', short)]
        post.setCreators(['davisagli'])
        post.setEffectiveDate(DateTime(date))
        post.setTitle(title)
        text = cleanup_wordpress_text(text)
        post.setText(text, mimetype='text/html')
        post.reindexObject()
        context.portal_workflow.doActionFor(post, 'publish')
        posts[id] = post

    reader = csv.reader(open('/tmp/comments.csv'), delimiter=',', quotechar='"', doublequote=False, escapechar='\\')
    for row in reader:
        post_id, author, email, url, date, text, uid = row
        try:
            post = posts[post_id]
        except KeyError:
            continue
        conversation = IConversation(post)
        comment = createObject('plone.Comment')
        comment.text = cleanup_wordpress_text(text)
        if uid == '1':
            comment.creator = comment.author_username = 'davisagli'
            comment.author_name = 'David Glick'
            comment.author_email = 'dglick@gmail.com'
        else:
            comment.creator = None
            comment.author_name = author
            comment.author_email = email
        date = DateTime(date).asdatetime()
        comment.creation_date = comment.modification_date = date
        conversation.addComment(comment)
 
    return 'Done.'

That cleanup_wordpress_text function turns double newlines from Wordpress into proper paragraph tags -- unless they're within a PRE tag.  The rest of the code is pretty readable -- yay, Python.

Syntax Highlighting

You probably noticed one of the new site features -- syntax highlighting for blocks of code.  This is provided by the Pygments module, applied as a transformation to the entire response just before the Zope publisher returns it.  I achieved that via a plugin (called collective.pygmentstransform, available in the collective, and not released so far or probably ever) for Martin's plone.transformchain (also unreleased so far).  It's imperfect (not least because it guesses the language heuristically), but good enough for now I think.  Yes, I should probably be doing this as WSGI middleware, but I haven't spent the time to figure out how to run Zope 2.12 under WSGI yet.

Filed under: , ,
Anonymous says:
Apr 11, 2010 05:50 AM
Very nicely done, and a great writeup! FWIW, my 256MB Rackspace Cloud VPS burns over 100MB running Wordpress + MySQL + Apache, and I've had to tweak back its Apache settings quite a bit to keep it from running out of memory under load. (That said, I'm a lousy sysadmin, and a smarter one could probably get it tuned better.)
davisagli says:
Apr 11, 2010 06:01 AM
Right, I think the main economy of scale that bulk PHP hosts get is by sharing mysql and apache processes between accounts.

I've heard apache's "worker" MPM uses less RAM than the "prefork one", but it looks like ubuntu's PHP package only works with prefork.
Anonymous says:
Apr 11, 2010 12:39 PM
Yep, I realized as I was fiddling last night that I needed to move from worker to prefork, but haven't yet investigated whether it's possible. Looks like you have once again saved me some work, thanks! I did manage to get things in control by ramping down the number of idle servers. But, let's put another nail in the coffin of the myth that Plone is unusually RAM-hungry.
Anonymous says:
Apr 11, 2010 09:02 AM
i'm having some similar thoughts about migrating back my blog to plone (from zine) and finally become Plone user :P

I'm just sorry you didn't use transmogrifier since i saw transmogrify.sqlalchemy which is exactly what could be use in your case (and will be in mine as well).

for commenting i'm still not sure. i'm thinking everythign to be done by external services like disqus... well will see
Anonymous says:
Apr 11, 2010 11:20 AM
Hi David,

Looks good. I like the regular expression. ;-) It's great to have core devs run so many cutting edge things on their own sites, to help us improve them. Do let us know how p.a.caching is working (which reminds me, we need to release that).

I would say next time, take a look at Transmogrifier. It is actually really nice, and takes very little time to set up and use. I'd say it's probably comparable in code/config length to what you did above, but less likely to go wrong. :)

Martin
Manuel says:
Jul 10, 2011 09:27 AM
Hey David, I based the migration of my site (http://neon-cathedral.net/) on your code and it went really well. Just made a few modifications to fit my personal taste. :) Great work! Manuel
zedr says:
Mar 30, 2012 08:03 AM
I've expanded davisagli's work and created a standalone script here:

    https://github.com/zedr/wordpress2plone

Thanks David! You saved me a lot of work ;)
qfundxtx says:
Sep 04, 2013 01:45 PM
Klein believes that "the youth market is an untapped wellspring of new revenue" (Klein, 43) and further suggests that "as educational institutions surrender to the manic march of branding, a new language is emerging" (Klein, 46). Use of this gel enhances the bloodstream flow by dilating the bloodstream ships. <a href=http://www.shillitoe.co.uk>louis vuitton uk outlet</a> It is a story we are all too familiar with by now; destruction of habitat, dispersal ofweakenedpopulation, and restriction of carnal desires to do unspeakable things to people. Financial institutions in Europe, primarily Austria, Liechtenstein and Switzerland are famous for having restricted gold bullion bars counters where gold ingots can be acquired and sold without having to pay a dollar worth of tax..
<a href=http://www.linguisticoservices.co.uk>tiffany jewelry sale</a> If your child comes home with a bad grade, don get mad. On the first night we just drove up and down the main drag a little and then picked a place called AJ Spurs to eat at.
<a href=http://www.racersedgemx.co.uk>hollister scarves</a> Pantyhose also have a medical benefit and support tights are commonly worn by older people or those suffering from circulatory problems in the legs. <a href=http://www.katiehildebrand.co.uk>swarovski jewelry</a>
Too much choice can lead us to make a wrong choice simply because we are overwhelmed by the sheer numbers. The fashion hijab is changing at a faster rate with demand for the hijab on the rise.
<a href=http://www.trentishoedoodles.co.uk>belstaff uk</a> Large cardboard mannequins wearing the designer's skirts vertically stick out from the three walls.
Navigation