Many web scraping techniques exist. The most basic one is a simple copy/paste done by a human. Another, which is probably the most widely used, is regular expression matching. Yet another is to rely on PHP's basic built-in functions.
In this post I will use PHP's built-in functions (such as explode) and loops (for, while…). I will cover regular expressions in a future post.
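To give an idea of what explode and a loop do together, here is a tiny sketch. The sample string is made up for illustration and has nothing to do with any real page.

```php
<?php
// Made-up sample string, used only to show how explode() behaves.
$text = 'apple|banana|cherry';

// explode() splits a string on a delimiter and returns an array of pieces.
$pieces = explode('|', $text);

// A for loop walks over the pieces by index and prints each one.
for ($i = 0; $i < count($pieces); $i++) {
    echo $i . ': ' . $pieces[$i] . "\n";
}
```

The same idea applies to HTML: instead of splitting on a pipe character, you split on a tag that marks where the interesting part begins or ends.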
In the example below, I’m going to scrape a page from Yahoo News. Before doing that, please note that this example is only meant to show how to build a scraper. Using the code below in an abusive way, or with the intention of stealing, copying or republishing an article, is strictly prohibited.
I’m going to try to scrape the latest page published on Yahoo News at the time of this post. The URL is:
First, let’s find some patterns by looking at the HTML source code of the page.
You will notice that the title of the article is wrapped in a level 1 heading (h1):
<h1 class="headline">Title goes here</h1>
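With that pattern, the title can be pulled out with two calls to explode. This is only a rough sketch: the $html string below is a stand-in I made up so the code runs on its own, while on a real page you would load the source first (for example with file_get_contents).

```php
<?php
// Hypothetical sample markup standing in for the real page source.
$html = '<html><body><h1 class="headline">Title goes here</h1><p>Body</p></body></html>';

// Split on the opening tag: index 1 holds everything that comes after it.
$parts = explode('<h1 class="headline">', $html);

$title = '';
if (count($parts) > 1) {
    // Split the remainder on the closing tag: index 0 is the title text.
    $rest  = explode('</h1>', $parts[1]);
    $title = trim($rest[0]);
}

echo $title . "\n";
```

The count check matters: if the page layout changes and the opening tag is no longer there, explode returns a single piece and $title simply stays empty instead of raising a notice.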
That’s great, we can extract the title now. Next, let’s look for the content.
The content starts with:
Let the coding begin. For this example, I will just display the content on the screen by using “echo”, but you can always save it in a database, spreadsheet, document and so on. You might also want to scrape the image and save it too. If you want to save it in a WordPress installation, this post will definitely provide some help: http://www.tech-and-dev.com/2012/07/uploading-picture-in-wordpress-using-xmlrpc.html
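Here is a minimal sketch of how such a scraper could look. Treat it as an illustration under a few assumptions rather than a definitive implementation: the helper text_between, the placeholder URL and the article-body marker are all hypothetical, and the sample markup stands in for the real page so the code runs offline.

```php
<?php
// Hypothetical helper: returns the text between $start and $end,
// or an empty string when $start is not found.
function text_between($html, $start, $end) {
    $parts = explode($start, $html);
    if (count($parts) < 2) {
        return '';
    }
    $rest = explode($end, $parts[1]);
    return $rest[0];
}

// Placeholder URL, not the article's real address.
$url = 'http://example.com/article.html';

// Stand-in markup so the sketch runs without a network connection.
// On a live page you would load the source instead, for example:
// $html = file_get_contents($url);
$html = '<h1 class="headline">Title goes here</h1>'
      . '<div class="article-body">' // assumed content marker
      . '<p>First paragraph.</p><p>Second paragraph.</p>'
      . '</div>';

$title = trim(text_between($html, '<h1 class="headline">', '</h1>'));
$body  = text_between($html, '<div class="article-body">', '</div>');

// Split the body on <p>: every piece after the first starts with one paragraph.
$chunks     = explode('<p>', $body);
$paragraphs = array();
for ($i = 1; $i < count($chunks); $i++) {
    $piece        = explode('</p>', $chunks[$i]);
    $paragraphs[] = trim(strip_tags($piece[0]));
}

echo $url . "\n";
echo $title . "\n";
foreach ($paragraphs as $paragraph) {
    echo $paragraph . "\n";
}
```

Keep in mind that splitting on a closing tag like this stops at the first match, so a content block with nested tags of the same kind would be cut short. For a real page you would pick start and end markers that are unique in the source.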
Once you run the above code, you will see the URL, the title and the content displayed.
How cool is that?
If you have any ideas or suggestions, please post your comment below!