Tuesday, October 5, 2010

HTML Parsing and Screen Scraping with the Simple HTML DOM Library

If you've ever needed to do some screen scraping or HTML parsing, you know it can be a pain to code everything on your own.

If you're a PHP developer, you need to get your hands on this tutorial for the Simple HTML DOM Library (and here's a download link to the source code: it's an open source, easily learned parser).

Once you've downloaded the simpleHTMLdom library you can jump right in and start using it! The tutorial will show you how to get started but the great thing is you can load HTML from a string, file or URL so you're spoilt for choice.

There are so many real-world use cases for this library, but if you're short on ideas you can follow the tutorial, which includes a real-world example.

Scraping other websites for content can be a useful way to automate tasks or present information in a more useful way. You could even scrape multiple websites and collect the information in one central place. The thing to keep in mind here is that we still live in the real world, and you could get into legal trouble or worse if you engage in less than ethical activities.

How does screen scraping work?
The trick to screen scraping is to grab a webpage and read the code, then do something (useful) with that data. If you want to get stuck right into it you probably don't want to have to code the whole thing yourself.

To use the Simple HTML DOM Library, you include it in your PHP file. You then create an instance of simple_html_dom and load the URL you're interested in.
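
To make those steps concrete, here's a minimal sketch of including the library and loading some HTML. It assumes the downloaded simple_html_dom.php file sits in the same folder as your script. The markup and the example.com URL are made up for illustration, and the URL call is commented out so the sketch runs without touching the network.

```php
<?php
// Assumption: simple_html_dom.php from the library download is in this folder.
require_once 'simple_html_dom.php';

// Hypothetical markup, used only for illustration.
$markup = '<div class="preview"><a href="/post/1">First post</a></div>';

// Option 1: create an instance of simple_html_dom and load HTML from a string.
$html = new simple_html_dom();
$html->load($markup);

// Option 2: the helper function that parses a string in one call.
$fromString = str_get_html($markup);

// Option 3: load from a file path or URL (placeholder URL, not a real target).
// $fromUrl = file_get_html('http://example.com/');

// Grab the first <a> element and read its href attribute and visible text.
$href = $html->find('a', 0)->href;
$text = $fromString->find('a', 0)->plaintext;

echo $href, "\n";
echo $text, "\n";
```

Whichever way you load the page, you end up with the same kind of object, so the rest of your scraping code doesn't need to care where the HTML came from.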

Now comes the fun part. You can use handy code such as:
$html->find('div[class=preview]'); or
$html->children(2)->outertext; and get back any bit of data you want.
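
Here's a rough sketch of how those two calls might look in a full script. The markup, the posts id and the variable names are all invented for illustration, and it again assumes simple_html_dom.php is in the same folder. Note that in this sketch children() is called on an element returned by find(), rather than directly on $html.

```php
<?php
// Assumption: simple_html_dom.php from the library download is in this folder.
require_once 'simple_html_dom.php';

// Hypothetical markup, used only for illustration.
$markup = '<div id="posts">'
        . '<div class="preview"><h2>One</h2><p>First teaser</p></div>'
        . '<div class="preview"><h2>Two</h2><p>Second teaser</p></div>'
        . '<div class="footer">End</div>'
        . '</div>';

$html = str_get_html($markup);

// With only a selector, find() returns an array of every matching element.
$titles = array();
foreach ($html->find('div[class=preview]') as $preview) {
    // With an index as the second argument, find() returns a single element.
    $titles[] = $preview->find('h2', 0)->plaintext;
}

// children(2) is the third child element (the index is zero-based);
// outertext is that element's HTML including its own tags.
$container = $html->find('div[id=posts]', 0);
$third = $container->children(2)->outertext;

print_r($titles);
echo $third, "\n";
```

The plaintext property gives you just the visible text, while outertext gives you the element's markup, so pick whichever suits what you plan to do with the data.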

For ease of use and quick setup, this gets the thumbs up!





1 comment:

  1. Thanks for the info. I usually get search results using an online tool that has some kind of Google HTML scraping system that parses the URLs.

    Now that I've read your post, I think that tool must use something similar to the library that you talk about.
