Michael Friis' Blog


.Net/Firefox Screen Scraping

Need to scrape a website? I have two links for you:

Solvent from the MIT SIMILE project. In combination with Piggy Bank it’s a scraper in its own right, but I only use it for its superb XPath generator. Just activate the sprayer and click on an element you want, and Solvent will generate an intelligent XPath expression to get at it. Solvent will also highlight the other elements on the page that the query would return, and you can even edit the expression dynamically to narrow or broaden the result. It’s all visible right there in the browser window.

Now switch to Visual Studio and Html Agility Pack, a great project that lets you parse HTML documents and query them using XPath just as if they were XML. Solvent’s and Html Agility Pack’s (HAP) perceptions of the DOM may sometimes differ slightly but, with a bit of tweaking, the visually generated XPath from Solvent works just great with your HAP code.
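To make the "bit of tweaking" concrete, here is a minimal sketch of what the HAP side might look like. The markup, the table id and the XPath are all made up for illustration, and the tbody difference shown is just one common way a browser's view of a page can differ from what HAP parses (browsers add a tbody element to tables that lack one, while HAP keeps the source as written). It assumes the HtmlAgilityPack package is referenced in the project.

```csharp
using System;
using HtmlAgilityPack;

class ScrapeSketch
{
    static void Main()
    {
        // Hypothetical markup standing in for a downloaded page.
        const string html = @"
            <html><body>
              <table id='prices'>
                <tr><td class='name'>Widget</td><td class='price'>9.95</td></tr>
                <tr><td class='name'>Gadget</td><td class='price'>19.50</td></tr>
              </table>
            </body></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // An XPath built against the browser DOM might look like this,
        // because the browser inserts <tbody> into the table:
        //   //table[@id='prices']/tbody/tr/td[@class='price']
        // Tweaked for HAP, which sees no <tbody> in this source:
        const string xpath = "//table[@id='prices']/tr/td[@class='price']";

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection cells = doc.DocumentNode.SelectNodes(xpath);
        if (cells == null)
        {
            Console.WriteLine("No matches; the XPath probably needs tweaking.");
            return;
        }

        // Print the text of each matched cell.
        foreach (HtmlNode cell in cells)
        {
            Console.WriteLine(cell.InnerText.Trim());
        }
    }
}
```

The null check is worth keeping: when an expression copied from the browser silently matches nothing in HAP, that is usually the first sign the two disagree about the document structure. For a live page, HtmlWeb.Load can fetch and parse the URL in one step instead of LoadHtml.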

Truly a match made in heaven…
