Using LINQ to XML to migrate blog posts

During my recent dalliance with other blog hosts, I ended up playing around with LINQ to XML and was quite impressed with how much nicer it was than the pre-LINQ .NET classes for XML parsing. Here’s a quick overview of how I migrated all my Blogger posts to Community Server (before deciding to stick with Blogger :)), with a focus on the LINQ to XML bits.

First my compulsory disclaimers: this was a rush job. This was the most disgusting, hacky code I have ever written. I have written C64 Basic programs with better separation of concerns. If you use any of this code for anything at all you’d have to be insane! I did write the code test-first, but definitely skipped the all-important "design" part of TDD. Ugly ugly ugly. With that said, let’s try and get something useful out of the spaghetti.

Getting Blogger posts in XML format

First step was to retrieve all the posts in XML. You can get this directly from a Blogger url:

http://(your_blog).blogspot.com/atom.xml?redirect=false&start-index=1&max-results=500

If you have more than 500 posts, adjust the querystring or use a couple of batches to get the results. You can download this using .NET (a System.Net.WebClient and its DownloadFile(url, feedFile) method), or use the highly technical method of downloading from your browser :-) The resulting XML looks something like this:
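If you'd rather script the download, here's a minimal sketch using System.Net.WebClient. The class and method names here (BloggerFeed, FeedUrl, Download) are my own, not part of any API:

```csharp
using System;
using System.Net;

static class BloggerFeed {
    // Builds the Atom feed URL for a given Blogger blog name.
    public static string FeedUrl(string blog, int startIndex, int maxResults) {
        return String.Format(
            "http://{0}.blogspot.com/atom.xml?redirect=false&start-index={1}&max-results={2}",
            blog, startIndex, maxResults);
    }

    // Downloads the feed to a local file.
    public static void Download(string blog, string feedFile) {
        using (WebClient webClient = new WebClient()) {
            webClient.DownloadFile(FeedUrl(blog, 1, 500), feedFile);
        }
    }
}
```

For more than 500 posts you'd call Download in batches, bumping start-index each time.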

<?xml version='1.0' encoding='UTF-8'?>
<?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?>
<feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/'>
  <!-- Some nodes relating to your blog here. -->
  <entry>
    <id>...</id>
    <published>...</published>
    <updated>...</updated>
    <category scheme='http://www.blogger.com/atom/ns#' term='daves drivel'/>
    <category scheme='http://www.blogger.com/atom/ns#' term='shameless self promotion'/>    
    <title type='text'>...</title>
    <content type='html'>(blog post content here)</content>
    <link rel='alternate' type='text/html' href='http://davesquared.net/2008/02/sample-post.html' title='Sample post'/>
    ...
  </entry>
  <!-- Lots more <entry /> nodes -->
</feed>    
    

Basic XML parsing using LINQ to XML

As you can see from the XML above, the posts are described by <entry /> nodes. Normally we'd be breaking out the XmlDocument or XmlTextReader or similar and working some XPath magic. Or we could just use LINQ to XML. Here's a simple example:

XDocument xml = XDocument.Load(new StringReader(feed));
XNamespace xmlns = "http://www.w3.org/2005/Atom";
var entries = from feedElements in xml.Descendants(xmlns + "entry")
              select new {
                Title = feedElements.Element(xmlns + "title").Value,
                Content = feedElements.Element(xmlns + "content").Value
              };                  
 

The first thing we do is load the XML into an XDocument (in System.Xml.Linq). Looking at the XML feed, the XML namespace used is "http://www.w3.org/2005/Atom". I haven't found an XmlNamespaceManager-style approach to handling namespaces in LINQ to XML, so I just put this in an xmlns variable and append it as I go. If you know how to specify a default namespace for XDocuments, please let me know :)

The next step is selecting the title and content elements from each <entry/> node. The xml.Descendants(XName) method returns an IEnumerable<XElement> containing only the "entry" nodes. The entries variable will now contain an IEnumerable<> of an anonymous type. Each item in the enumeration will have a Title and Content property representing the blog title and content parsed from the XML. We can iterate over this, use entries.ToArray() to perform our query, or use other LINQ goodness like entries.Take(5) (cue jazz) to further filter our results.
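To see the lazy evaluation and Take(5) in action, here's a self-contained sketch run against a cut-down feed (the sample entries and the Titles helper are made up for illustration; the query itself is the one from above):

```csharp
using System;
using System.IO;
using System.Linq;
using System.Xml.Linq;

static class EntryQueryDemo {
    static readonly XNamespace xmlns = "http://www.w3.org/2005/Atom";

    // Runs the same query as above and returns at most five titles.
    public static string[] Titles(string feed) {
        XDocument xml = XDocument.Load(new StringReader(feed));
        var entries = from feedElements in xml.Descendants(xmlns + "entry")
                      select new {
                        Title = feedElements.Element(xmlns + "title").Value,
                        Content = feedElements.Element(xmlns + "content").Value
                      };
        // The query runs lazily; Take(n) trims it and ToArray() forces it.
        return entries.Take(5).Select(entry => entry.Title).ToArray();
    }

    static void Main() {
        string feed = @"<feed xmlns='http://www.w3.org/2005/Atom'>
            <entry><title>First post</title><content>Hello</content></entry>
            <entry><title>Second post</title><content>World</content></entry>
          </feed>";
        Console.WriteLine(String.Join(", ", Titles(feed)));  // First post, Second post
    }
}
```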

Notice how we specify all the node names as strings? They are actually of type XName (or XNamespace in the case of the xmlns variable). There is no public constructor to create an XName, but instead an implicit conversion from String is defined. This gives us the ease of using strings to specify node names (which is quite natural when working with XML), with the benefits of having strong typing around the name to access properties like LocalName and Namespace. In our query we have to prefix the XName, like entry, with the namespace xmlns, to make sure our nodes resolve properly, hence all the xmlns + "entry" style code.
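The conversions described above are easy to see in isolation. This small sketch (names are mine) shows the implicit String-to-XNamespace conversion, the XNamespace + String operator producing an XName, and the strongly typed properties we get back:

```csharp
using System;
using System.Xml.Linq;

static class XNameDemo {
    static void Main() {
        // Implicit conversion: String -> XNamespace.
        XNamespace xmlns = "http://www.w3.org/2005/Atom";
        // XNamespace + String -> XName.
        XName entryName = xmlns + "entry";
        Console.WriteLine(entryName.LocalName);   // entry
        Console.WriteLine(entryName.Namespace);   // http://www.w3.org/2005/Atom
        Console.WriteLine(entryName);             // {http://www.w3.org/2005/Atom}entry
    }
}
```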

Getting all the post data using LINQ to XML

Now let’s get strongly typed objects from our XML feed.

public class BlogEntry {
  public String Title;
  public String Content;
  public String[] Categories;
  public String OriginalLink;
  public String Published;
  public String Updated;
}

I’ve been lazy here and am parsing the published and updated dates as simple strings (see, I told you this was hacky!). There are two field declarations of interest here (the rest have one-to-one relationships with XML elements). The first is the Categories array. These are specified in the XML as children of <entry/> nodes, with the term attribute holding the pertinent information:

<entry>
  ...
  <category scheme='http://www.blogger.com/atom/ns#' term='daves drivel'/>
  <category scheme='http://www.blogger.com/atom/ns#' term='shameless self promotion'/>    
  ... 
 

The other is the OriginalLink field. I wanted to put a link back to the original post from the new blog to be clear about the source and so that people could see any comments they made (I would have taken the comments over as well, but only had the MetaWeblog API to work with). So I needed the original post link, which I could parse out of one of the <link /> nodes that has a rel attribute value of “alternate”:

<entry>
  ...
  <link rel='alternate' type='text/html' 
      href='http://davesquared.net/2008/02/sample-post.html' title='Sample post'/>
  ...
 

Armed with this knowledge, let’s tackle the new LINQ to XML query:

var entries = 
  from feedElements in xml.Descendants(xmlns + "entry")
  select new BlogEntry() {
    Title = feedElements.Element(xmlns + "title").Value,
    Content = feedElements.Element(xmlns + "content").Value,
    OriginalLink = feedElements.Elements(xmlns + "link")
             .Where(link => link.Attribute("rel").Value == "alternate")
             .Select(link => link.Attribute("href").Value)
             .First(),
    Published = feedElements.Element(xmlns + "published").Value,
    Updated = feedElements.Element(xmlns + "updated").Value,
    Categories = feedElements.Elements(xmlns + "category")
                   .Select(category => category.Attribute("term").Value)
                   .ToArray()
  };
 

The main changes from our original query: first up, we are now working with strongly typed BlogEntry objects, rather than anonymous types. The entries variable is now an IEnumerable<BlogEntry>, which we can actually return from our parser method (vars only work locally).

We are also using nested queries to drag out the Categories and OriginalLink (you can do this using the sugary “from … in … select” style as well, but I found it easier to use the methods explicitly in this case). For categories we are simply selecting the term attributes from all the <category/> nodes in our entry. For the original link, we use .Where() to filter all the <link/> nodes to only include ones with a rel attribute equal to “alternate”, select the value of the href attribute, and take the .First() (there should only be one).

Finishing up

The final steps in the migration were doing some regex-ing of the content to get the post id (slug for Wordpressors), and translating links to my own articles to point to the new site (so clicking on internal links on the new blog kept readers in the new blog). The last bit in particular was a bit tricky, as it needed a first pass to parse everything into a dictionary so I could look up the new urls as required. Here’s the horribly hacky code if you’re interested:

BloggerParser parser = new BloggerParser();
IDictionary<String, BlogEntry> entries = parser.ParseFeed(feed).ToDictionary(entry => entry.GetSlug());
foreach (BlogEntry entry in entries.Values) {
  entry.Content =
    Regex.Replace(entry.Content,
    @"http://davesquared.net/\d{4}/\d{2}/(?<slug>.*?)\.html",
    delegate(Match match) {
      return entries[match.Groups["slug"].Value].GetNewLink();
    },
    RegexOptions.IgnoreCase | RegexOptions.Singleline);
}
    

After parsing into a dictionary (parser.ParseFeed() returns our IEnumerable<BlogEntry> from our earlier LINQ to XML adventures), we try and replace any internal links in each post’s content with the link to the new post, using the slug as a unique index to look up BlogEntry objects. Ugly but effective.
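The GetSlug() and GetNewLink() methods aren't shown above. Here's a rough sketch of how they might look, assuming the slug is the last path segment of the old Blogger URL and assuming a hypothetical URL scheme for the new site (both assumptions are mine, not from the original migration):

```csharp
using System;
using System.Text.RegularExpressions;

public class BlogEntry {
    public String OriginalLink;

    // Sketch only: pulls the slug out of the old Blogger post URL,
    // e.g. ".../2008/02/sample-post.html" -> "sample-post".
    public String GetSlug() {
        Match match = Regex.Match(OriginalLink,
            @"/(?<slug>[^/]+)\.html$", RegexOptions.IgnoreCase);
        return match.Groups["slug"].Value;
    }

    // Sketch only: the new host's URL scheme here is an assumption.
    public String GetNewLink() {
        return "http://newblog.example.com/posts/" + GetSlug();
    }
}
```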

The final step in all of this was to post all the transformed posts to the new site using the MetaWeblog API, which, as far as I can tell, went remarkably well. :-)

So there you go. This was my first real experience working with LINQ to XML, and I found it a fair bit easier than XmlDocument tweaking and mumbling various XPath incantations. Hope this helps. :-)

Comments