Games For Windows Support


Thursday, 28 February 2013

Starter's guide to web scraping with the HTML Agility Pack for .NET

Posted on 11:11 by Unknown
I recently wanted to get a rough average MPG for each car available on the website fuelly.com, yet unfortunately there was no API for me to access the values, so I turned to Google and came across the NuGet package HTML Agility Pack. This post will get you up to speed on using HTML Agility Pack, basic XPath and some LINQ.

Before we start, please check the website's terms and conditions, and any copyright terms that may apply to the data you are retrieving. These are usually published on the website itself, although they may vary from country to country. Also keep in mind that a scraper requests pages far faster than a human visitor would, so it is sensible to save each downloaded page to your local disk for later use and to add a delay between page downloads.
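As a rough illustration of the caching and delay advice above, here is a minimal sketch. The class name PoliteDownloader, the "cache" folder and the two second delay are all made up for this example, so adjust them to suit the site you are working with.

```csharp
using System;
using System.IO;
using System.Net;
using System.Threading;

static class PoliteDownloader
{
    // Hypothetical settings: neither value comes from fuelly.com or the HTML Agility Pack.
    const string CacheFolder = "cache";
    static readonly TimeSpan DelayBetweenRequests = TimeSpan.FromSeconds(2);

    // Maps a URL to a file name that is safe to use on disk.
    public static string CachePathFor(string url)
    {
        string fileName = Uri.EscapeDataString(url) + ".html";
        return Path.Combine(CacheFolder, fileName);
    }

    // Returns the page HTML, downloading it only if no local copy exists yet.
    public static string GetPage(string url)
    {
        string path = CachePathFor(url);
        if (File.Exists(path))
        {
            return File.ReadAllText(path); // served from disk, no request made
        }

        Directory.CreateDirectory(CacheFolder);
        using (var webClient = new WebClient())
        {
            string html = webClient.DownloadString(url);
            File.WriteAllText(path, html);
            Thread.Sleep(DelayBetweenRequests); // pause before the caller asks for the next page
            return html;
        }
    }
}
```

Calling PoliteDownloader.GetPage twice with the same URL only hits the website once; the second call reads the saved copy, which also makes it painless to re-run your parsing code while you experiment.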

Getting Started

Identifying the data

The first thing you need to do is find where the data you want sits in the HTML. Let's go to fuelly and browse all the cars. As you can see, there is a large variety of cars available, and if you click on one of the models it takes you to a page that displays the model year and its average MPG. For my personal project, I wanted to obtain four values: Manufacturer, Model, Year and Average MPG. With this data I can then run queries such as "which vehicles between 2003 and 2008 give an MPG figure above 50?" However, for the purpose of this blog post, and for simplicity, let's just retrieve a list of some models from a manufacturer.

So, let's start on the browse all cars page and view the page source. If we look at the first manufacturer header on the page, we see Abarth, followed by AC, etc.

Open the page source (for Google Chrome: Settings > Tools > View Source), and find "Abarth":


You will see that the Manufacturer is wrapped in an <h3> tag. Below this is a <div> that contains each model under the Abarth name: 500, Grande Punto and Punto Evo. Let's try to get hold of these models, but first we need to set up the project.

Setting up the project

Create a new project in Visual Studio; a simple console project should suffice for this blog post. Add the HTML Agility Pack to the project via NuGet and add the following code:
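using System.Linq; // used by the LINQ query later in this post

class Program
{
    static void Main(string[] args)
    {
        // Page that lists every manufacturer and its models.
        const string WEBSITE_LOCATION = @"http://www.fuelly.com/car/";
        var htmlDocument = new HtmlAgilityPack.HtmlDocument();

        // WebClient and the response stream are both disposable, hence the using blocks.
        using (var webClient = new System.Net.WebClient())
        using (var stream = webClient.OpenRead(WEBSITE_LOCATION))
        {
            // Parse the downloaded HTML into a document we can query.
            htmlDocument.Load(stream);
        }
    }
}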

This simply loads the HTML page into the HtmlDocument type so that we can run XPath queries against it and eventually get the values we are looking for.

Navigating Nodes with XPath

So, as noted earlier, we want to get a load of models from a manufacturer. Each <div> tag has a specific ID which we can use in our query (see "inline-list" below).
<h3><a href="/car/abarth" style="text-decoration:none;color:#000;">Abarth</a></h3>
<div id="inline-list">
  <ul>
    <li><nobr><a href="/car/abarth/500">500</a> <span class="smallcopy">(34)</span> &nbsp; </nobr></li>
    <li><nobr><a href="/car/abarth/grande punto">Grande Punto</a> <span class="smallcopy">(1)</span> &nbsp;</nobr></li>
    <li><nobr><a href="/car/abarth/punto evo">Punto EVO</a> <span class="smallcopy">(3)</span> &nbsp; </nobr></li>
  </ul>
</div>

We can use this specific hook to get the values we want. So let's add the code:
HtmlAgilityPack.HtmlNodeCollection divTags = htmlDocument.DocumentNode.SelectNodes("//div[@id='inline-list']");

The above code returns a collection of <div> tags that represent each Manufacturer listed on the page. 

Let's break up the XPath syntax to make sense of it:
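// - At the start of an expression, selects matching nodes anywhere in the document, no matter how deeply they are nested. (Prefix it with a dot, as in .//, to search only beneath the current node.)
div - The element name of the nodes we are interested in.
[@id] - Predicate that keeps only the nodes that have an id attribute.
[@id='inline-list'] - Predicate that keeps only the nodes whose id attribute has the value 'inline-list'.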
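To see each part of the expression narrowing the result, here is a small sketch that runs the pieces against a trimmed copy of the markup above. The XPathDemo class, its CountMatches helper and the extra footer <div> are made up for this illustration; LoadHtml and SelectNodes are the HTML Agility Pack calls doing the work.

```csharp
using System;
using HtmlAgilityPack;

static class XPathDemo
{
    // Trimmed copy of the markup shown earlier, plus one unrelated <div> without an id.
    const string SampleHtml =
        "<h3><a href=\"/car/abarth\">Abarth</a></h3>" +
        "<div id=\"inline-list\"><ul>" +
        "<li><a href=\"/car/abarth/500\">500</a></li>" +
        "<li><a href=\"/car/abarth/punto evo\">Punto EVO</a></li>" +
        "</ul></div>" +
        "<div class=\"footer\">No id here</div>";

    // Counts the nodes an XPath expression matches.
    // SelectNodes returns null, not an empty collection, when nothing matches.
    public static int CountMatches(string xpath)
    {
        var document = new HtmlDocument();
        document.LoadHtml(SampleHtml);
        HtmlNodeCollection nodes = document.DocumentNode.SelectNodes(xpath);
        return nodes == null ? 0 : nodes.Count;
    }

    static void Main()
    {
        Console.WriteLine(CountMatches("//div"));                    // 2: every <div>
        Console.WriteLine(CountMatches("//div[@id]"));               // 1: only the <div> with an id
        Console.WriteLine(CountMatches("//div[@id='inline-list']")); // 1: id must equal 'inline-list'
        Console.WriteLine(CountMatches("//div[@id='missing']"));     // 0: SelectNodes returned null
    }
}
```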

We can now dig deeper into each div tag, using a little more XPath and some LINQ to get the values we want.

Accessing the data using LINQ

OK, so within each of the <div> tags we still have a load of rubbish we don't really need. All we want is the car models. We know each car model sits between <a> (hyperlink) tags with an href value, so using XPath and a little LINQ we can extract the data we need:


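1: HtmlAgilityPack.HtmlNode firstDiv = divTags.FirstOrDefault();
2: var modelNames = from hyperlink in firstDiv.SelectNodes(".//a[@href]")
3:                  where hyperlink != null
4:                  select hyperlink.InnerText;

Line 1: Get the <div> belonging to the first manufacturer in the list.
Line 2: Within that <div>, select every <a> tag that has an href attribute, at any depth.
Line 3: Skip any hyperlink that is null. This is only a defensive check, because the [@href] predicate already guarantees that every selected <a> has an href attribute.
Line 4: Select the text inside the hyperlink, which is the model name.

FirstOrDefault and the query syntax both need using System.Linq; at the top of the file. Also note that SelectNodes returns null rather than an empty collection when nothing matches, so lines 1 and 2 will throw if the page layout changes. Check for null first in anything more serious than a demo.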
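If you want the models from every manufacturer rather than just the first, one possible approach is sketched below. The ModelExtractor class and its GetModelNames method are made up for this example, and it takes the HTML as a string so you can try it without hitting the website. It also guards against SelectNodes returning null when nothing matches.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class ModelExtractor
{
    // Returns the link text of every <a href> inside every <div id="inline-list">.
    public static List<string> GetModelNames(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        HtmlNodeCollection divTags = document.DocumentNode.SelectNodes("//div[@id='inline-list']");
        if (divTags == null)
        {
            return new List<string>(); // no matching <div> on the page
        }

        return divTags
            // A <div> without links yields null from SelectNodes, so substitute an empty sequence.
            .SelectMany(div => div.SelectNodes(".//a[@href]") ?? Enumerable.Empty<HtmlNode>())
            // Decode entities such as &amp; and trim stray whitespace around the name.
            .Select(link => HtmlEntity.DeEntitize(link.InnerText).Trim())
            .ToList();
    }
}
```

Passing in the Abarth markup from earlier should give you "500", "Grande Punto" and "Punto EVO", while a page with no inline-list <div> gives an empty list instead of an exception.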

Conclusion

So, there you have it. You learned how to get specific values within a piece of HTML, a little XPath and some LINQ. Although the end data in this example probably isn't too useful, hopefully you can now see how you would expand upon this to find specific values and URLs to build a more complex system.

You can download the source here.

Posted in .net, agility, c#, developer, html, htmlagilitypack, linq, pack, programming | No comments