Games For Windows Support

Thursday, 28 February 2013

Starter's guide to web scraping with the HTML Agility Pack for .NET

Posted on 11:11 by Unknown
I recently wanted to get a rough average MPG for each car available on the website fuelly.com, yet unfortunately there was no API for me to access the values, so I turned to Google and came across the NuGet package HTML Agility Pack. This post will get you up to speed on using HTML Agility Pack, basic XPath and some LINQ.

Before we start, please check the terms and conditions and any copyright terms that apply to the data you are retrieving. You should be able to find these on the website itself, although they may vary from country to country. Please also keep in mind that a scraper requests pages far faster than a person browsing would, so it is sensible to save each downloaded page to your local disk for later use and to add a delay between page downloads.
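To make the caching and delay advice concrete, here is a minimal sketch of one way to do it. PoliteDownloader, GetPage, the car.html cache file name and the two second delay are all names and values I have made up for illustration; none of them come from the HTML Agility Pack or from fuelly. The download step is passed in as a delegate so the helper can be tried without touching the network.

```csharp
using System;
using System.IO;
using System.Net;
using System.Threading;

static class PoliteDownloader
{
    // Hypothetical helper: returns the cached copy if one exists on disk,
    // otherwise downloads the page, saves it, and pauses before returning.
    public static string GetPage(string url, string cacheFile,
                                 Func<string, string> download, TimeSpan delay)
    {
        if (File.Exists(cacheFile))
        {
            // Already downloaded on an earlier run: no request is made.
            return File.ReadAllText(cacheFile);
        }

        string html = download(url);
        File.WriteAllText(cacheFile, html);

        // Wait after every real download so consecutive requests are spaced out.
        Thread.Sleep(delay);
        return html;
    }

    static void Main()
    {
        using (var webClient = new WebClient())
        {
            // The cache file name and the delay are assumptions; pick values
            // that suit the site you are working with.
            string html = GetPage("http://www.fuelly.com/car/", "car.html",
                                  webClient.DownloadString, TimeSpan.FromSeconds(2));
            Console.WriteLine("Got {0} characters", html.Length);
        }
    }
}
```

Running this twice only hits the website once; the second run reads car.html from disk, which also makes it painless to experiment with your parsing code.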

Getting Started

Identifying the data

The first thing you need to do is find where in the HTML the data you want to download lives. Let's try going to fuelly and browsing all the cars. As you can see, there is a large variety of cars available, and if you click on one of the models it takes you to a page that displays the year of the model and the average MPG for it. For my personal project, I wanted to obtain four values: Manufacturer, Model, Year and Average MPG. With this data I can then run queries such as "which vehicles between 2003 and 2008 give an MPG figure above 50?" However, for the purpose of this blog post and for simplicity, let's just retrieve a list of some models from a manufacturer.
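To give an idea of where this is heading, the kind of query mentioned above could look roughly like the sketch below once the four values have been collected. The CarRecord class, the MpgQueries.EfficientCars method and every row of data are placeholders invented for this example; they are not real Fuelly figures.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical container for the four values we want to scrape.
class CarRecord
{
    public string Manufacturer { get; set; }
    public string Model { get; set; }
    public int Year { get; set; }
    public double AverageMpg { get; set; }
}

static class MpgQueries
{
    // Returns cars built between fromYear and toYear (inclusive) whose
    // average MPG is above minMpg, most economical first.
    public static List<CarRecord> EfficientCars(IEnumerable<CarRecord> cars,
                                                int fromYear, int toYear, double minMpg)
    {
        return (from car in cars
                where car.Year >= fromYear && car.Year <= toYear
                where car.AverageMpg > minMpg
                orderby car.AverageMpg descending
                select car).ToList();
    }

    static void Main()
    {
        // Placeholder rows only, made up to exercise the query.
        var cars = new List<CarRecord>
        {
            new CarRecord { Manufacturer = "Make A", Model = "Model 1", Year = 2005, AverageMpg = 55.0 },
            new CarRecord { Manufacturer = "Make B", Model = "Model 2", Year = 2001, AverageMpg = 60.0 },
            new CarRecord { Manufacturer = "Make C", Model = "Model 3", Year = 2007, AverageMpg = 42.0 },
            new CarRecord { Manufacturer = "Make D", Model = "Model 4", Year = 2008, AverageMpg = 52.5 }
        };

        foreach (CarRecord car in EfficientCars(cars, 2003, 2008, 50))
        {
            Console.WriteLine("{0} {1} ({2}): {3} MPG",
                              car.Manufacturer, car.Model, car.Year, car.AverageMpg);
        }
    }
}
```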

So, let's start on the browse all cars page and view the page source. If we look at the first manufacturer header on the page, we see Abarth, followed by AC, etc.

Open the page source (for Google Chrome: Settings > Tools > View Source), and find "Abarth":


You will see that the Manufacturer is wrapped in an <h3> tag. Below this is a <div> that contains each model under the Abarth name: 500, Grande Punto and Punto Evo. Let's try to get hold of these models, but first we need to set up the project.

Setting up the project

Create a new project in Visual Studio; a simple console project should suffice for this blog post. Add the HTML Agility Pack to the project via NuGet and add the following code:

class Program
{
    static void Main(string[] args)
    {
        const string WEBSITE_LOCATION = @"http://www.fuelly.com/car/";
        var htmlDocument = new HtmlAgilityPack.HtmlDocument();

        // Download the page and parse the response stream into the document.
        using (var webClient = new System.Net.WebClient())
        {
            using (var stream = webClient.OpenRead(WEBSITE_LOCATION))
            {
                htmlDocument.Load(stream);
            }
        }
    }
}

This simply loads the HTML page into an HtmlDocument instance so that we can run XPath queries against it and eventually get the values we are looking for.

Navigating Nodes with XPath

So, as noted earlier, we want to get a load of models from a manufacturer. Each <div> tag has a specific ID which we can use in our query (see "inline-list" below).

<h3><a href="/car/abarth" style="text-decoration:none;color:#000;">Abarth</a></h3>
<div id="inline-list">
  <ul>
    <li><nobr><a href="/car/abarth/500">500</a> <span class="smallcopy">(34)</span> &nbsp; </nobr></li>
    <li><nobr><a href="/car/abarth/grande punto">Grande Punto</a> <span class="smallcopy">(1)</span> &nbsp;</nobr></li>
    <li><nobr><a href="/car/abarth/punto evo">Punto EVO</a> <span class="smallcopy">(3)</span> &nbsp; </nobr></li>
  </ul>
</div>

We can use this specific hook to get the values we want. So let's add the code:

HtmlAgilityPack.HtmlNodeCollection divTags = htmlDocument.DocumentNode.SelectNodes("//div[@id='inline-list']");

The above code returns a collection of <div> tags, one for each Manufacturer listed on the page, each holding that manufacturer's models.

Let's break up the XPath syntax to make sense of it:
// - Selects matching nodes anywhere in the document, no matter how deeply they are nested.
div - The element name of the nodes we are interested in.
[@id] - Predicate that keeps only the nodes that have an id attribute.
[@id='inline-list'] - Predicate that keeps only the nodes whose id attribute has the value 'inline-list'.
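If you want to see each part of the query narrowing the result, the sketch below runs the three stages against a small made-up fragment rather than the live site. XPathSteps, Count and SampleHtml are names invented for this example, and the fragment is a cut-down imitation of the fuelly markup with two extra <div> tags added so the predicates have something to filter out. It uses LoadHtml, which parses a string instead of a stream.

```csharp
using System;
using HtmlAgilityPack;

static class XPathSteps
{
    // Made-up fragment: one div with the id we want, one with a different id,
    // and one with no id at all.
    const string SampleHtml = @"
<h3><a href=""/car/abarth"">Abarth</a></h3>
<div id=""inline-list""><ul><li><a href=""/car/abarth/500"">500</a></li></ul></div>
<div id=""footer"">Footer</div>
<div>No id here</div>";

    // Counts the nodes that an XPath query matches in the sample fragment.
    public static int Count(string xpath)
    {
        var document = new HtmlDocument();
        document.LoadHtml(SampleHtml);

        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection nodes = document.DocumentNode.SelectNodes(xpath);
        return nodes == null ? 0 : nodes.Count;
    }

    static void Main()
    {
        Console.WriteLine(Count("//div"));                    // every div: 3
        Console.WriteLine(Count("//div[@id]"));               // divs that have an id: 2
        Console.WriteLine(Count("//div[@id='inline-list']")); // divs with that exact id: 1
    }
}
```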

We can now dig deeper into each div tag, using a little more XPath and some LINQ to get the values we want.

Accessing the data using LINQ

OK, so within each of the <div> tags we still have a load of rubbish we don't really need. All we want is the car models. Well, we know each car model sits between <a> (hyperlink) tags that have an href attribute. So using XPath and a little LINQ we can extract the data we need:


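1: HtmlAgilityPack.HtmlNode firstDiv = divTags.FirstOrDefault();
2: var modelList = from hyperlink in firstDiv.SelectNodes(".//a[@href]")
3:                 where hyperlink != null
4:                 select hyperlink.InnerText;

Line 1: get the <div> for the first manufacturer in the list (FirstOrDefault needs a using System.Linq; directive at the top of the file).
Line 2: select every <a> tag inside that div that has an href attribute. The leading "." makes the query relative to the div instead of the whole document.
Line 3: keep only the hyperlinks that aren't null. This is a defensive check, as the nodes SelectNodes returns are never null themselves.
Line 4: select the text inside of the hyperlink, which is the model name.

Note that SelectNodes returns null rather than an empty collection when nothing matches, so anything beyond a quick experiment should check divTags and the result of SelectNodes for null before using them.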
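Going one step further than the first manufacturer, the sketch below loops over every matching <div> and pairs it with its manufacturer heading. ModelScraper and ModelsByManufacturer are names I have made up, and the code assumes the layout seen in the page source earlier: each <div id="inline-list"> is preceded, at the same level, by the <h3> holding the manufacturer name. If the real page nests things differently, the heading lookup will need adjusting.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class ModelScraper
{
    // Maps each manufacturer name to the list of model names found in its div.
    public static Dictionary<string, List<string>> ModelsByManufacturer(HtmlDocument document)
    {
        var result = new Dictionary<string, List<string>>();

        HtmlNodeCollection divTags = document.DocumentNode.SelectNodes("//div[@id='inline-list']");
        if (divTags == null)
        {
            return result; // nothing matched
        }

        foreach (HtmlNode div in divTags)
        {
            // Assumption: walking backwards through the siblings, the first
            // <h3> we meet is this div's manufacturer heading.
            HtmlNode heading = div.PreviousSibling;
            while (heading != null && heading.Name != "h3")
            {
                heading = heading.PreviousSibling;
            }
            if (heading == null)
            {
                continue; // no heading found, skip this div
            }

            HtmlNodeCollection links = div.SelectNodes(".//a[@href]");
            List<string> models = links == null
                ? new List<string>()
                : links.Select(a => HtmlEntity.DeEntitize(a.InnerText).Trim()).ToList();

            result[HtmlEntity.DeEntitize(heading.InnerText).Trim()] = models;
        }

        return result;
    }

    static void Main()
    {
        // Cut-down copy of the markup from the page source shown earlier.
        var document = new HtmlDocument();
        document.LoadHtml(
            "<h3><a href=\"/car/abarth\">Abarth</a></h3>" +
            "<div id=\"inline-list\"><ul>" +
            "<li><a href=\"/car/abarth/500\">500</a></li>" +
            "<li><a href=\"/car/abarth/grande punto\">Grande Punto</a></li>" +
            "</ul></div>");

        foreach (var entry in ModelsByManufacturer(document))
        {
            Console.WriteLine("{0}: {1}", entry.Key, string.Join(", ", entry.Value));
        }
    }
}
```

A dictionary keeps the example short; if you later add Year and Average MPG you would probably want a small class per car instead.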

Conclusion

So, there you have it. You learned how to get specific values out of a piece of HTML, and picked up a little XPath and some LINQ along the way. Although the end data in this example probably isn't too useful, hopefully you can now see how you would expand upon this to find specific values and URLs to build a more complex system.

You can download the source here.
