Screen Scrape an RSS feed with PHP: A Guide

I was looking for a guide to making an RSS feed for a site that didn’t have one* and found the internet very much lacking. It turns out there is not as much to choose from as a Google search would lead you to believe. Dennis Pallett seems to be one of the only people to have written one, and it is featured on many, many different sites. While it was very helpful, I still thought it was lacking in its explanation.

So I’m going to use Dennis’ code, along with my alterations, and go over what I did and what things mean, hopefully painting a clearer picture of how to turn the world into an RSS feed.

Screen scraping an RSS feed is based on some simple concepts: grab each individual post into an array, then pull the title, permalink, and full text out of each one and throw them into their respective RSS tags.

<?php

$url = "http://www.gailgauthier.com/blogger.html";

$data = getUrlTEXT($url);

// Get content items
preg_match_all("/<div class=\"posts\">([^`]*?)<\/div>/", $data, $matches);

getUrlTEXT is a quick function, defined at the bottom of the script, that uses cURL to get the full text of any URL.
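The full script isn’t reproduced here, but a minimal sketch of what a cURL-based getUrlTEXT might look like (the exact options are my assumption, not necessarily Dennis’ code):

<?php
// Sketch only: fetch a URL with cURL and return the page as a string.
function getUrlTEXT($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the response instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow any redirects
    $data = curl_exec($ch);
    curl_close($ch);
    return $data;
}
?>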

preg_match_all and preg_match are both functions that I don’t fully understand, but they use strings of characters called ‘regular expressions’ that tell PHP what text to include and not include (regular expressions guide here).

While I don’t completely understand it, I can point some things out. preg_match takes three arguments: the first is the pattern you are looking for, the second is the full text you are searching, and the third is an array that all the found strings are put in.

The first argument is wrapped in double quotes and delimited by forward slashes (ex. “/REGULAR EXPRESSION HERE/”), with backslashes escaping any special characters. You also see the start and end of a div as part of the expression. This tells preg_match what to look between to find the text you are after. The pattern between the div tags, ([^`]*?), definitely has a meaning, but you’ll have to read up to figure it out. Needless to say, it works at finding whatever is in between whatever you put on either side of it.
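If it helps, here is a tiny standalone illustration (not part of the scraper itself) of how the capture group between the tags ends up in the match array:

<?php
// Standalone example: whatever sits between the two div markers
// lands in $temp[1] ($temp[0] holds the whole match).
$html = '<div class="posts">Hello world</div>';
preg_match("/<div class=\"posts\">([^`]*?)<\/div>/", $html, $temp);
echo $temp[1]; // prints "Hello world"
?>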

Next you can just set up the RSS header information.

// Begin feed
header("Content-Type: text/xml; charset=ISO-8859-1");
echo "<?xml version=\"1.0\" encoding=\"ISO-8859-1\" ?>\n";
?>

<rss version="2.0"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:admin="http://webns.net/mvcb/"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<channel>
<title>Original Content—A Gail Gauthier Blog—Latest Content</title>
<description>The latest content from Gail Gauthier (http://www.gailgauthier.com/blogger.html), screen scraped! </description>
<link>http://www.gailgauthier.com/blogger.html</link>
<language>en-us</language>

Nothing hard here, just change the title, description and link. The rest of the information is standard for the RSS file.

Next we loop through the ‘matches’ array we made to extract the title, permalink, full text and author name.

<?php

// Loop through each content item
foreach ($matches[0] as $match) {

// First, get title
preg_match("/<h3>([^`]*?)<\/h3>/", $match, $temp);
$title = $temp[1];
$title = strip_tags($title);
$title = trim($title);

// Second, get url
preg_match("/<span class=\"byline\">posted by gail at <a href=\"([^`]*?)\">/", $match, $temp);
$url = $temp[1];
$url = trim($url);

// Third, get text
preg_match("/<\/h3>([^`]*?)<span class=\"byline\">/", $match, $temp);
$text = $temp[1];
$text = trim($text);
$text = str_replace('<br />', '<br />', $text);

// Fourth, and finally, get author
preg_match("/<span class=\"byline\">By ([^`]*?)<\/span>/", $match, $temp);
$author = $temp[1];
$author = trim($author);

As you can see, getting the title is simple enough; it’s the only thing inside of <h3> tags. The permalink was much harder, though, since it was not between any specific tags. Instead I matched against the string ‘<span class="byline">posted by gail at <a href="’. This is obviously a very bad hack to get what you want, but not all sites will be nicely set up for you to turn into an RSS feed.

if (!($title == '') && !($text == ''))
{
// Echo RSS XML
echo "<item>\n";
echo "\t\t\t<title>" . strip_tags($title) . "</title>\n";
echo "\t\t\t<link>" . strip_tags($url) . "</link>\n";
echo "\t\t\t<description>" . strip_tags($text) . "</description>\n";
echo "\t\t\t<content:encoded><![CDATA[\n";
echo $text . "\n";
echo "]]></content:encoded>\n";
echo "\t\t\t<dc:creator>" . "Gail Gauthier" . "</dc:creator>\n";
echo "\t\t</item>\n";
}//end if
}//end foreach

?>

After we find all our information we just need to print it out in RSS format. First I do a quick check that the post has actual information in it, and then it is just output into the tags.

That’s about it for this method of screen scraping an RSS feed. I’ve heard Kottke mention ways of using the DOM from PHP 5, but I am still working on learning the DOM in general. You can download the full PHP script here.
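For what it’s worth, here is a rough sketch of what that DOM approach might look like; the class names and page structure are assumptions carried over from the patterns above, not tested code:

<?php
// Rough sketch of a PHP 5 DOM approach (an assumption, not the script above).
$doc = new DOMDocument();
// loadHTMLFile needs allow_url_fopen; you could also feed loadHTML() the string from getUrlTEXT().
@$doc->loadHTMLFile("http://www.gailgauthier.com/blogger.html"); // @ quiets warnings about sloppy HTML
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//div[@class="posts"]') as $post) {
    $titles = $post->getElementsByTagName('h3');
    if ($titles->length > 0) {
        echo trim($titles->item(0)->textContent) . "\n"; // the post title
    }
}
?>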

  • It turns out the pages Blogger was FTPing to Gail Gauthier’s site just weren’t listing the Atom feed in the HTML. I later found the feed and had to just be happy that I learned something.

Flawed but fun

The amount of literature that has been published about Revenge of the Sith really should not surprise me. With every Star Wars release I read less and less of it, as I care less and less about why people hate the movie and how it destroys the original trilogy.

Perhaps it’s just because I’m only reading people who agree with me, but it seems there is a large outpouring of love for this movie, and nothing makes me happier.

Perhaps now that it is over, reconciliation can happen between the movies and the people who had such large issues with them. In another 30 years I hope these movies are still in our minds, like the originals always will be.

I'm just not interested in trashing the dialogue and ticking off all of this movie's weaknesses. I give extra credit for good intentions, and I like hanging out with my friends even when they are total clowns. Star Wars has been a cool friend to hang out with, and I'll miss it. -Mr. Sun

Stefan’s symbol signs

AIGA has a great list of symbol signs that are free for everyone, but I don’t think most people take advantage of it. While a lifesaver for anyone making a map, they are useful in other situations as well.

Working on a site that has to do with recycling, I thought there was no better place to look for a quick dingbat than the symbol signs AIGA provides. Much to my surprise, they did not have the recycling symbol. In fact, while the list is a great resource, there seem to be a lot of useful symbols missing, such as the biohazard, nuclear, and handicapped symbols.

In an effort to help other designers not have to spend time designing little dingbats, here is my list of useful ones:

Recycle Symbol eps
Biohazard Symbol eps

Blogging abound

A blogging milestone happened in the art department today as I helped Jen, Katie, Christina, and even Mattyo set up WordPress blogs through DreamHost, as you can see from the links bar on the right. If this community of TCNJ bloggers stays active, it will be a great resource for both current and future art graduates from TCNJ.

Blogging will only get bigger, and hopefully this is the first step toward the vibrant, living community at TCNJ that I hope to cultivate and build.

FCKeditor – review and help

I’ve wanted to clean up the College Union Board website for a long time and only now got around to it by making it a part of my senior portfolio. I not only gave it a redesign with a much stronger hierarchy but also added a lot of back-end functionality needed to keep the site running without anyone overly knowledgeable about HTML.

After wrestling with FCKeditor between the hours of 12 and 5 am, I finally got it working. FCKeditor is a WYSIWYG editor that can be implemented in a number of ways; it comes in flavors such as JavaScript, ASP, PHP, ColdFusion, and Perl. The documentation was a bit sparse and only included an install walkthrough for JavaScript. Luckily that was what I needed, since what I wanted to do was replace a TEXTAREA with a WYSIWYG editor, and the JavaScript flavor seemed to be the only one that did this (it was also the only one documented).

Though it took me a long time to pinpoint, the main problem I was having was with the path to the .js file that was supposed to be running the show. The linked JavaScript file ended up being src="FCKeditor/fckeditor.js" while the base path ended up being oFCKeditor.BasePath = './FCKeditor/';
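For anyone hitting the same wall, this is roughly what the working page looked like; the instance name "content" and the textarea are placeholders, not the actual CUB site code:

<script type="text/javascript" src="FCKeditor/fckeditor.js"></script>
<textarea id="content" name="content">Existing page text goes here</textarea>
<script type="text/javascript">
// Placeholder sketch: swap the textarea above for the WYSIWYG editor.
var oFCKeditor = new FCKeditor('content');
oFCKeditor.BasePath = './FCKeditor/'; // note the leading ./ that made it work
oFCKeditor.ReplaceTextarea();
</script>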

I don’t know if this is a quirk or the way it is supposed to act. The JavaScript documentation was very short, with no troubleshooting section, and the SourceForge forum had no problem that seemed similar to mine.

Another issue I had was that the options listed were simply held in an array in the config file, and if you did not want an option you had to delete it from the array instead of turning it true or false, which would make it easier to get options back later if you wanted.
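To give an idea of what I mean, the toolbar options in fckconfig.js look something like this (the button names below are just a trimmed-down example, not the config I shipped):

// fckconfig.js (sketch)
FCKConfig.ToolbarSets["Default"] = [
    ['Bold','Italic','Underline'],
    ['OrderedList','UnorderedList'],
    ['Link','Unlink']
];
// To drop a button you delete it from this array; there is no per-option true/false switch.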

Now that it’s all working I couldn’t be happier, and I know how much easier it will make it for my wonderful event-planning organization to update the website.