April 03, 2006

Turning RSS Feeds into Movable Type Entries

A while ago, I created the site PsychicProgrammer.com as a place to gather programming-related stories from various corners of the Internet. I whipped up some code to fetch RSS feeds from various programming-related websites and turn their stories into Movable Type postings. I've had a few people ask how I did this, so I thought I'd post the code with some explanation of what's going on.
#!/usr/bin/perl

use strict;
use XML::RSS::Parser;
use RPC::XML;
use RPC::XML::Client;
use Date::Manip;
use LWP::Simple;
use DBI;
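The script is written in Perl and uses a few nifty modules available from CPAN:
XML::RSS::Parser - a great module for dealing with RSS feeds.
RPC::XML - used to communicate with Movable Type via XML-RPC.
LWP::Simple - used to retrieve the RSS feed document.
Date::Manip - used to reformat the dates found in the feeds.
DBI - used to talk to the database that records which articles have been seen.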
my @info=({'url'      => "http://www.oreillynet.com/pub/feed/16?format=rss2",
           'name'     => "Perl.com",
           'category' => "Perl.com",
           'datetype' => 2},
          {'url'      => "http://www.digg.com/rss/indexprogramming.xml",
           'name'     => "Digg.com",
           'category' => "Digg.com",
           'datetype' => 1},
          {'url'      => "http://www.oreillynet.com/pub/feed/20?format=rss2",
           'name'     => "Xml.com",
           'category' => "Xml.com",
           'datetype' => 2},
          {'url'      => "http://www.dotnetjunkies.com/WebLog/saasheim/rss.aspx",
           'name'     => "Steinar Aasheim's Blog",
           'category' => "Steinar Aasheim's Blog",
           'datetype' => 1},
          {'url'      => "http://tomcopeland.blogs.com/juniordeveloper/rss.xml",
           'name'     => "Junior Developer",
           'category' => "Junior Developer",
           'datetype' => 1},
          {'url'      => "http://programming.newsforge.com/programming.rss",
           'name'     => "Newsforge.com",
           'category' => "Newsforge.com",
           'datetype' => 2});
Here, I set up a structure containing the RSS feeds that I'm going to retrieve. Of course, to scale this, it would be better to store this information in a SQL table.
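As a rough sketch of that SQL-table idea (not part of the original script), the feed list could be read with DBI instead of being hard-coded. The table name feeds and the helper load_feeds are assumptions made up for illustration; the column names simply mirror the hash keys above.

```perl
use strict;
use warnings;

# Hypothetical table, not in the original script:
#   feeds(url text, name text, category text, datetype integer)
# Returns a list of hash references shaped like the entries of @info.
sub load_feeds {
    my ($dbh) = @_;
    # With { Slice => {} }, DBI returns one hash reference per row,
    # keyed by column name.
    my $rows = $dbh->selectall_arrayref(
        "SELECT url, name, category, datetype FROM feeds ORDER BY name",
        { Slice => {} },
    );
    return @$rows;
}

# Usage with an open DBI handle: my @info = load_feeds($dbh);
```

Adding a feed then becomes an INSERT into that table, with no edit to the script.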
my $username='user';
my $password='password';
my %category;
my $i;
my $dbh;
my $sth;
my $q;
my $seencount;
my $feed;
my $site;
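my $xmldoc;
my $DEBUG=0;   # set to 1 for progress output
Some variable definitions. Set $DEBUG to 1 to get progress output while the script runs.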
# Set up database connection
$dbh=DBI->connect("dbi:Pg:dbname=p","psy","password") or die "Can't open database: $DBI::errstr";

# Set up XML-RPC interface
my $cli=RPC::XML::Client->new('http://www.psychicprogrammer.com/mt/mt-xmlrpc.cgi');

# Set up XML parser
my $p=XML::RSS::Parser->new;

# Get category list
my $req=RPC::XML::request->new('mt.getCategoryList','1',$username,$password);
my $resp=$cli->simple_request($req);
foreach $i (@$resp)
{
  $category{$i->{categoryName}}=$i->{categoryId};
}
Here, we set up a database connection. The database is used to record which articles have been seen before, so that we don't have any duplicates. Next, we set up the RPC interface to the Movable Type blog. Make sure you put your correct URL in here for your blog. Then, we talk to MT to get a list of the categories. We do this because we need to map the RSS feed name to the appropriate MT category.
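The script never shows the seen table itself, so here is one possible definition, offered as a guess. The column names index and source come from the queries later in the script; the column types, the PRIMARY KEY, and the helper name seen_table_sql are assumptions.

```perl
use strict;
use warnings;

# One possible definition of the "seen" table (an assumption: the original
# schema is not shown). "index" holds the article URL and "source" the feed
# name; the PRIMARY KEY keeps a URL from being recorded twice.
sub seen_table_sql {
    return <<'SQL';
CREATE TABLE seen (
    index  text PRIMARY KEY,
    source text
)
SQL
}

# Usage with an open DBI handle: $dbh->do(seen_table_sql());
```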
if($DEBUG)
{
  foreach $i (keys %category)
  {
    printf("%s %s\n",$category{$i},$i);
  }
}
Some debugging code to dump the category information. This is a good check that the RPC interface is working. The script assumes that a category already exists for each RSS feed; use the MT interface to create any that are missing.
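Since a missing category would otherwise only show up when a post is being categorized, a small check before the main loop can catch it early. This is a sketch, not part of the original script; feeds_without_category is a hypothetical helper that takes references to the @info list and the %category map.

```perl
use strict;
use warnings;

# Return the names of feeds whose category has no ID in the
# name-to-ID map fetched from Movable Type.
sub feeds_without_category {
    my ($feeds, $category) = @_;
    return map  { $_->{name} }
           grep { !defined $category->{ $_->{category} } } @$feeds;
}

# Usage in the script, after the category list has been fetched:
#   my @missing = feeds_without_category(\@info, \%category);
#   die "No MT category for: @missing\n" if @missing;
```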
foreach $site(@info)
{
  printf("*** Processing for site %s\n\n",$site->{'name'}) if $DEBUG;

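  $xmldoc=get $site->{'url'};
  next unless defined $xmldoc;   # skip this feed if the download failed
  $feed=$p->parse($xmldoc);
This starts a loop over each RSS feed defined above. Each feed is downloaded and parsed, and a feed that fails to download is skipped.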
  foreach my $i ( $feed->query('//item') )
  {
    my $datenode;
    my $date;
    my $titlenode = $i->query('title');
    my $linknode = $i->query('link');
    my $descnode = $i->query('description');

    if(($site->{'datetype'})==1)
    {
      $datenode = $i->query('pubDate');
      $date=UnixDate($datenode->text_content,"%Y-%m-%dT%H:%M:%S");
    }

    if(($site->{'datetype'})==2)
    {
      $datenode = $i->query('dc:date');
      $date=UnixDate($datenode->text_content,"%Y-%m-%dT%H:%M:%S");
    }

    my $dd = $descnode->text_content .
      "<br>Link: <a href=\"" . $linknode->text_content . "\">" .
      $linknode->text_content . "</a>";
For each entry in the RSS file, we start pulling out the information we're interested in. I came across a problem with recording the date: I found both pubDate and dc:date tags, depending on the feed. I use a variable called datetype to determine which one to look for in the RSS feed. I also add a line of HTML to the end of the article content that links back to the original article.
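One alternative to the per-feed datetype setting is to try each tag in turn and use the first one that has any text. The sketch below is not what the script does; first_date_text is a hypothetical helper, and it takes a lookup callback so that it does not depend on a particular parser.

```perl
use strict;
use warnings;

# Return the first non-empty date string among the tags we know about.
# $lookup is a code reference: given a tag name, it returns that tag's
# text, or undef if the item has no such tag.
sub first_date_text {
    my ($lookup) = @_;
    foreach my $tag ('pubDate', 'dc:date') {
        my $text = $lookup->($tag);
        return $text if defined $text && $text =~ /\S/;
    }
    return;
}

# Possible use inside the item loop, assuming query() returns undef in
# scalar context when nothing matches:
#   my $raw = first_date_text(sub {
#       my $node = $i->query($_[0]);
#       return $node ? $node->text_content : undef;
#   });
#   $date = UnixDate($raw, "%Y-%m-%dT%H:%M:%S") if defined $raw;
```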
    # Check to see if we've seen this one yet
    $q="SELECT count(source) FROM seen WHERE index=" . $dbh->quote($linknode->text_content);
    $sth=$dbh->prepare($q);
    $sth->execute();
    ($seencount)=$sth->fetchrow();
    $sth->finish();
    if($seencount==0)
    {
This is where we check our SQL table to see if we've seen this article URL before.
      # Post article
      printf("Posting %s\n",$titlenode->text_content) if $DEBUG;
      my $req=RPC::XML::request->new('metaWeblog.newPost',
                                     '1',
                                     $username,
                                     $password,
                                     RPC::XML::struct->new(
                                       'title' => RPC::XML::string->new($titlenode->text_content),
                                       'description' => RPC::XML::string->new($dd),
                                       'dateCreated' => RPC::XML::string->new($date),
                                       'mt_tb_ping_urls' => RPC::XML::array->new(
                                         $linknode->text_content)
                                     ),
                                     RPC::XML::boolean->new(1)
                                    );
      my $resp=$cli->simple_request($req);
This executes the RPC call to MT to actually post the article. Note that we attempt a trackback ping to the original article. This array can also be populated with other tracking sites, such as Technorati.
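The script uses the result of simple_request without checking it. As I understand the RPC::XML::Client documentation, simple_request returns undef on a transport error (leaving the message in $RPC::XML::ERROR) and returns a hash with faultCode and faultString when the server reports a fault. A hedged sketch of a check, where rpc_error is a hypothetical helper and not part of the original script:

```perl
use strict;
use warnings;

# Describe why a simple_request() result indicates failure, or return
# undef if the result looks like a normal value. Needs Perl 5.10 or
# later for the // operator.
sub rpc_error {
    my ($resp) = @_;
    if (!defined $resp) {
        no warnings 'once';
        return "transport error: " . ($RPC::XML::ERROR // 'unknown');
    }
    if (ref $resp eq 'HASH' && exists $resp->{faultCode}) {
        return "fault $resp->{faultCode}: $resp->{faultString}";
    }
    return;
}

# Possible use right after posting:
#   if (my $err = rpc_error($resp)) { warn "Post failed: $err\n"; next; }
```

Skipping to the next item on failure also keeps the article out of the seen table, so it would be retried on the next run.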
      # Change category
      $req=RPC::XML::request->new('mt.setPostCategories',
                                  $resp,
                                  $username,
                                  $password,
                                  RPC::XML::array->new(
                                    RPC::XML::struct->new(
                                      'categoryId' => $category{$site->{'category'}},
                                      'isPrimary' => RPC::XML::boolean->new(1)
                                    )
                                  )
                                 );
      $resp=$cli->simple_request($req);
Now we set the article's category to the one named in the feed's category field, which here is the same as the feed's name.
      $q="INSERT INTO seen (index, source) VALUES (" . $dbh->quote($linknode->text_content) .
         ", " . $dbh->quote($site->{'name'}) . ")";
      $sth=$dbh->prepare($q);
      $sth->execute();
      $sth->finish();
    }
  }
}

# Close database
$dbh->disconnect();
We write a line into the database to say that we've seen this article before. We loop back for the rest of the articles, for the rest of the feeds. Finally, we close the database connection.

That's it! I run this code from a crontab entry every hour or so. As soon as new articles are discovered in the RSS feeds, they will be magically turned into postings on a Movable Type blog, thanks to the wonders of XML-RPC.

I'd appreciate any comments or feedback if you decide to use this code in your own projects. Have fun!

Posted by Ian at 02:21 PM | Comments (2) | TrackBack

August 30, 2005

Aggregated Programming News

I'm a programmer by trade. As such, I find myself visiting several different web sites just to get my daily fill of programming-related news stories. To help me out (and to experiment with Movable Type's XML-RPC interfaces), I put together PsychicProgrammer.com. It uses RSS feeds from popular programming news sites to populate a Movable Type blog using XML-RPC. Feel free to try it out.

Posted by Ian at 10:46 AM | Comments (0) | TrackBack

July 13, 2003

Sharecropping

Slashdot had an interesting article that likens developing for a closed-source platform to sharecropping. The article rang true with several things about Microsoft that I have come across recently. One of these is Microsoft's recent purchase of GeCad Software. Microsoft specifically mentions that it's a purchase of intellectual property, and that they have no plans to continue to develop GeCad's products. This is rather unfortunate, because GeCad's RAV antivirus software is a wonderful Linux-based antivirus mail scanner product. I wonder if the prospect of destroying a Linux product had anything to do with Microsoft's decision. If I were Symantec right now, I'd be very worried about Microsoft muscling in on territory that's been fair game for years.

Posted by Ian at 03:05 PM | Comments (0) | TrackBack

June 10, 2003

Emulators

Programming. That's what I do for a living. Well, that's what I try to do for a living. Most of the time I'm just managing people. But at least my job description says I'm a programmer. It's funny that programming is what I want to do when I'm trying to escape.

Several years ago, I put together a 6502 8-bit CPU emulator core. I used this core inside my Compukit emulator.

That was a few years ago. I've barely touched it since. But recently, I've had thoughts of putting together an Atari emulator. I know there are plenty of Atari emulators out there, so this isn't new ground I'm covering. This time, it's not going to be on a Windows platform. I'm using this as an opportunity to get my feet wet in X programming under Linux. I'm playing with Glade as a UI builder. I'll let you know how I get on.

Posted by Ian at 10:34 PM | Comments (0) | TrackBack