Saturday, August 4, 2012

Building your first web scraper with PHP

Web scraping is the process of extracting information from websites or web pages. It is related to crawlers, bots, and web indexing; the difference is that in web scraping the data is separated from the HTML tags, JavaScript, and so on, and then stored in a database, a spreadsheet, or a document.

Many web scraping techniques exist. The most basic one is copy and paste done by a human. Another, which is probably the most used, is regular expression matching. Yet another is to use nothing but the basic PHP built-in functions.

In this post I will use PHP built-in functions (such as explode) and loops (for, while...). I will cover regular expressions in a future post.
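
To make the approach concrete, here is a minimal sketch of what explode does with a delimiter. The sample string is made up for illustration and does not come from any real page.

```php
<?php
// Hypothetical sample string, made up for this illustration.
$sample = 'before<b>middle</b>after';

// explode() splits a string on a delimiter and returns an array of the pieces.
// $parts[0] is everything before '<b>', $parts[1] is everything after it.
$parts = explode('<b>', $sample);

// Splitting the second piece on the closing tag isolates the text in between.
$inner = explode('</b>', $parts[1]);

echo $parts[0] . "\n"; // before
echo $inner[0] . "\n"; // middle
echo $inner[1] . "\n"; // after
```

The same two-step split (opening marker first, closing marker second) is all the scraper in this post relies on.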

In the example below, I'm going to scrape a page from Yahoo News. Before doing that, please note that this example is only meant to show how to build a scraper. Using the code below in an abusive way, or with the intention of stealing, copying, or republishing an article, is strictly prohibited.

I'm going to try to scrape the latest page published on Yahoo News at the time of this post. The URL is:

First, let's find some patterns by looking at the HTML source code of the page.
For the title of the article, you will notice that it's wrapped in a level 1 heading (h1):
<h1 class="headline">Title goes here</h1>

That's great, we can extract the title now. Let's look for the content.
The content starts with:
<p class="first">
and runs up to the next closing </p> tag.
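
Before relying on a marker, it can help to check how often it occurs, since explode splits on every occurrence. The following is a minimal sketch, assuming a made-up HTML fragment in place of the real page source; the variable names are only illustrative.

```php
<?php
// Hypothetical fragment standing in for the real page source.
$html = '<h1 class="headline">Title goes here</h1>'
      . '<p class="first">First paragraph.</p>'
      . '<p>Second paragraph.</p>';

// substr_count() returns how many times a substring occurs in a string.
$headlineCount  = substr_count($html, '<h1 class="headline">');
$firstParaCount = substr_count($html, '<p class="first">');

// A marker that occurs exactly once is unambiguous to split on.
echo "headline markers: $headlineCount\n";
echo "first-paragraph markers: $firstParaCount\n";
```

A closing tag such as </p> will usually occur many times on a page. That is fine as long as only the piece before its first occurrence is kept.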

Let the coding begin. For this example, I will just display the content on the screen using "echo"; you can always save it in a database, spreadsheet, document, and so on. You might also want to scrape the image and save it too. If you want to save it in a WordPress installation, this post will definitely provide some help:


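<?php
// URL of the article to scrape (fill this in before running)
$url = '';

// get the HTML source code
$webSource = file_get_contents($url);

// file_get_contents() returns false when the page cannot be fetched
if ($webSource === false) {
    exit('Could not fetch ' . $url);
}

// get the title
$title = getTitle($webSource);

// get the content of the article
$content = getContent($webSource);

// display the scraped data
echo $url . '<br />';
//echo $webSource . '<br />';
echo $title . '<br />';
echo $content . '<br />';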

function getTitle($webSource)
{
    // search for the <h1 class="headline">
    $split = explode('<h1 class="headline">', $webSource);

    /* $split[0] will contain all the text before '<h1 class="headline">', and $split[1] will contain everything after it */

    // if <h1 class="headline"> is found
    if (count($split) > 1) {
        // find the end of the title </h1>
        $end = explode('</h1>', $split[1]);

        /* $end[0] will contain the text between '<h1 class="headline">' and '</h1>', and $end[1] will contain everything after it */

        // the title should be in the first part
        $title = $end[0];
    } else {
        // marker not found, so there is no title to return
        return "";
    }

    return trim($title);
}

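function getContent($webSource)
{
    $content = '';

    // search for the opening paragraph marker
    $split = explode('<p class="first">', $webSource);

    // if <p class="first"> is found
    if (count($split) > 1) {
        // the content ends at the first </p> after the marker
        $end = explode('</p>', $split[1]);
        $content = $end[0];
    } else {
        // marker not found, so there is no content to return
        return "";
    }

    return trim($content);
}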

Once you run the above code, you will see the URL, the title, and the content displayed.

How cool is that?
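
If you would rather save the result than echo it, one option is a CSV file that opens in any spreadsheet. The following is only a sketch under stated assumptions: the three values are placeholders standing in for the scraped $url, $title and $content, and the file name scraped_articles.csv is made up for this example.

```php
<?php
// Hypothetical values standing in for the scraped $url, $title and $content.
$url     = 'http://example.com/article';
$title   = 'Title goes here';
$content = 'First paragraph.';

// Hypothetical output file, placed in the system temp directory for this sketch.
$csvPath = sys_get_temp_dir() . '/scraped_articles.csv';

// Open in append mode so that each run adds one row.
$handle = fopen($csvPath, 'a');
if ($handle === false) {
    exit("Could not open $csvPath for writing\n");
}

// fputcsv() quotes fields as needed; the separator, enclosure and escape
// characters are passed explicitly here.
fputcsv($handle, array($url, $title, $content), ',', '"', '\\');
fclose($handle);

echo "Saved one row to $csvPath\n";
```

Append mode keeps earlier rows, so running the scraper on several articles builds up one file with a row per article.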

If you have any ideas or suggestions, please post your comment below!