Parsing HTML Documents with the Html Agility Pack
By Scott Mitchell
Introduction
Screen scraping is the process of programmatically accessing and processing information from an external website. For example, a price comparison website might screen scrape a variety of online retailers to build a database of products and what various retailers are selling them for. Typically, screen scraping is performed by mimicking the behavior of a browser - namely, by making an HTTP request from code and then parsing and analyzing the returned HTML.
The .NET Framework offers classes for accessing data from a remote website, namely the WebClient class and the HttpWebRequest class. These classes are useful for making an HTTP request to a remote website and pulling down the markup from a particular URL, but they offer no assistance in parsing the returned HTML. Instead, developers commonly rely on string parsing methods like String.IndexOf, String.Substring, and the like, or on regular expressions.
Another option for parsing HTML documents is to use the Html Agility Pack, a free, open-source library designed to simplify reading from and writing to HTML documents. The Html Agility Pack constructs a Document Object Model (DOM) view of the HTML document being parsed. With a few lines of code, developers can walk through the DOM, moving from a node to its children, or vice versa. Also, the Html Agility Pack can return specific nodes in the DOM through the use of XPath expressions. (The Html Agility Pack also includes a class for downloading an HTML document from a remote website; this means you can both download and parse an external web page using the Html Agility Pack.)
This article shows how to get started using the Html Agility Pack and includes a number of real-world examples that illustrate this library's utility. A complete, working demo is available for download at the end of this article. Read on to learn more!
Getting Started: Downloading and Using the Html Agility Pack
The Html Agility Pack is a free, open-source library that parses an HTML document and constructs a Document Object Model (DOM) that can be traversed manually or by using XPath expressions. (To use the Html Agility Pack you must be using ASP.NET version 3.5 or later.) In a nutshell, the Html Agility Pack makes it easy to examine an HTML document for particular content, and to extract or modify that markup.
The Html Agility Pack is wrapped inside a single assembly, HtmlAgilityPack.dll. To use the Html Agility Pack from your website you'll need to copy this
assembly into your website's Bin folder. You can download the latest version of HtmlAgilityPack.dll from the
Html Agility Pack project page; alternatively, you can download the demo available at the end of this article,
which includes HtmlAgilityPack.dll version 1.4.0 in the Bin folder.
With the Html Agility Pack assembly in the Bin folder you're ready to start downloading and parsing HTML documents. This article shows how to use the
Html Agility Pack to perform three different HTML parsing tasks.
Listing the Meta Tags on a Remote Web Page
Screen scraping usually involves downloading the HTML for a specific web page and picking out particular pieces of information. This first demo shows how to use the Html Agility Pack to download a remote web page and enumerate the
<meta> tags, displaying those <meta> tags that contain both
a name and content attribute.
The Html Agility Pack contains a number of classes, all in the HtmlAgilityPack namespace. Therefore, start by adding a using statement (or
Imports statement if you are using VB) to the top of your code-behind class:
using HtmlAgilityPack;
To download a web page from a remote server, use the HtmlWeb class's Load method, passing in the URL to download.
var webGet = new HtmlWeb();
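A minimal sketch of the download step described above might look like the following. The class name, the method names, and the url parameter are placeholders of mine rather than names from the demo; LoadFromString uses the HtmlDocument class's LoadHtml method, which parses markup that is already in memory and is handy for experimenting without making an HTTP request.

```csharp
using HtmlAgilityPack;

public static class DownloadSketch
{
    // Downloads and parses a remote page; "url" is a placeholder supplied by the caller.
    public static HtmlDocument LoadFromWeb(string url)
    {
        var webGet = new HtmlWeb();
        var document = webGet.Load(url);   // Load returns an HtmlDocument
        return document;
    }

    // Parses markup that is already in memory (no HTTP request is made).
    public static HtmlDocument LoadFromString(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);
        return document;
    }
}
```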
The Load method returns an HtmlDocument object. In the above code snippet we've assigned this returned object to the local variable
document. The HtmlDocument class represents a complete HTML document and contains a DocumentNode property, which returns
an HtmlNode object that represents the root node of the document.
The HtmlNode class has several germane properties worth noting. There are properties for traversing the DOM, including:
- ParentNode, ChildNodes, NextSibling, and PreviousSibling
There are also properties that describe the node itself:
- Name - gets or sets the node's name. For HTML elements this property returns (or assigns) the name of the tag - "body" for the <body> tag, "p" for a <p> tag, and so on.
- Attributes - returns the collection of attributes for this element, if any.
- InnerHtml - gets or sets the HTML content within the node.
- InnerText - returns the text within the node.
- NodeType - indicates the type of the node. Can be Document, Element, Comment, or Text.
The HtmlNode class also has methods for retrieving related nodes. The Ancestors method returns a collection of all ancestor nodes, and the SelectNodes method returns a collection of nodes that match a specified XPath expression.
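To see a few of these properties working together, here is a small sketch that walks the DOM through ChildNodes and records the Name of every node whose NodeType is Element. The class and method names are hypothetical, and the markup is parsed from a string so that the example does not depend on a remote site.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class TraversalSketch
{
    // Recursively collects the names of element nodes, indented two spaces per level.
    public static void CollectElementNames(HtmlNode node, int depth, List<string> output)
    {
        foreach (HtmlNode child in node.ChildNodes)
        {
            // Skip text and comment nodes; only elements are listed.
            if (child.NodeType != HtmlNodeType.Element) continue;

            output.Add(new string(' ', depth * 2) + child.Name);
            CollectElementNames(child, depth + 1, output);
        }
    }

    // Parses the supplied markup and returns an indented outline of its elements.
    public static List<string> Outline(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        var output = new List<string>();
        CollectElementNames(document.DocumentNode, 0, output);
        return output;
    }
}
```

Calling TraversalSketch.Outline on a simple page yields one entry per element, with nesting shown by indentation.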
Given all of these methods and properties, there are a variety of ways you could get a list of all <meta> tags in the HTML document. For this demo I
decided to use the SelectNodes method. The statement below calls the SelectNodes method of the document object's
DocumentNode property, using the XPath expression "//meta", which returns all of the <meta> tags in the document.
var metaTags = document.DocumentNode.SelectNodes("//meta");
If there are no <meta> tags in the document then, at this point, metaTags will be null. But if there are one or more
<meta> tags then metaTags will be a collection of matching HtmlNode objects. We can enumerate these matching nodes
and display their attributes.
| For More On XPath... |
|---|
| If you are not familiar with XPath then the syntax - //meta - may look a little Greek. XPath is a special syntax used to navigate through elements and attributes in an XML document. The statement "//meta" says, in English, "Give me any nodes in the document from the current node (DocumentNode) that have the name 'meta', no matter where they appear in the DOM." The XPath tutorial at w3schools.com offers a good overview of the XPath standard. If you are new to XPath or a bit rusty, you'll find the XPath Syntax tutorial invaluable. |
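To make the XPath syntax a little more concrete, the sketch below counts how many nodes match a given expression. The helper name CountMatches is hypothetical; the only library call it relies on is SelectNodes, which returns null when nothing matches.

```csharp
using HtmlAgilityPack;

public static class XPathSketch
{
    // Returns the number of nodes matching the XPath expression (0 when SelectNodes returns null).
    public static int CountMatches(HtmlDocument document, string xpath)
    {
        var nodes = document.DocumentNode.SelectNodes(xpath);
        return nodes == null ? 0 : nodes.Count;
    }
}
```

With this helper, "//meta" counts every <meta> tag, "//meta[@name and @content]" counts only those carrying both attributes, and "//head/meta" counts <meta> tags that are direct children of a <head> element.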
The following foreach loop enumerates the items in metaTags (if it's not null) and checks that each tag has both a name and a content attribute. Presuming these attributes exist, the <meta> tag information is emitted.
(Note how the value of an attribute is accessed using the syntax tag.Attributes["attributeName"].Value.)
if (metaTags != null)
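A minimal sketch of the loop just described might look like this. It collects the results into a list rather than writing them to the page, and the class and method names are placeholders rather than names taken from the demo.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class MetaTagSketch
{
    // Returns "name: content" for every <meta> tag that has both attributes.
    public static List<string> ListMetaTags(HtmlDocument document)
    {
        var results = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var metaTags = document.DocumentNode.SelectNodes("//meta");
        if (metaTags != null)
        {
            foreach (var tag in metaTags)
            {
                // The Attributes indexer returns null for a missing attribute.
                if (tag.Attributes["name"] != null && tag.Attributes["content"] != null)
                {
                    results.Add(tag.Attributes["name"].Value + ": " + tag.Attributes["content"].Value);
                }
            }
        }

        return results;
    }
}
```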
And that's all there is to it! No messy regular expressions, no tangle of string parsing method calls, but rather a concise, readable syntax for accessing the HTML document's contents.
The following screen shot shows the above code snippet in action. Here, the user enters a URL into the textbox and clicks the Get Meta Tags button. Clicking this button
causes a postback and on postback the code we examined above is executed. Namely, the Html Agility Pack is used to download the content from the specified URL and
the SelectNodes method is used to get back all <meta> tags. Those <meta> tags with name and
content attributes are displayed in a bulleted list.
Listing the Links on a Remote Web Page
The previous demo showed how to use the SelectNodes method and an XPath expression to search the document for a particular set of nodes. Another approach is to use LINQ. The HtmlNode class's methods that return a collection of nodes, such as Ancestors and Descendants, return that collection as an IEnumerable<HtmlNode>. If you are familiar with LINQ you are aware that LINQ is set up to work with any object of type IEnumerable<T>. Consequently, we can use LINQ to query an HTML document's nodes.
To demonstrate accessing node information using LINQ, I created a demo that retrieves the text and href values for all hyperlinks (<a> tags)
on a page. The code starts out the same way as the previous demo - create an HtmlWeb object and call its Load method:
var webGet = new HtmlWeb();
But then it uses the document object's Descendants method and LINQ's query syntax to get all of the hyperlinks on the page. More specifically,
it gets all <a> tags on the page that have an href attribute and contain something other than white-space for their inner text and returns
a new, anonymous type that has two properties: Url and Text.
var linksOnPage = from lnks in document.DocumentNode.Descendants()
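A sketch of the query described above, under the assumption that it filters on the node name, the presence of an href attribute, and non-blank inner text, might read as follows. Because an anonymous type cannot be returned from a method directly, this hypothetical helper flattens each result to a string; the demo itself binds linksOnPage to a control instead.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class LinkSketch
{
    // Returns "text -> url" for every <a> tag that has an href and non-blank text.
    public static List<string> DescribeLinks(HtmlDocument document)
    {
        var linksOnPage = from lnks in document.DocumentNode.Descendants()
                          where lnks.Name == "a" &&
                                lnks.Attributes["href"] != null &&
                                lnks.InnerText.Trim().Length > 0
                          select new
                          {
                              Url = lnks.Attributes["href"].Value,
                              Text = lnks.InnerText
                          };

        // Flatten the anonymous type so the results can leave this method.
        return linksOnPage.Select(l => l.Text + " -> " + l.Url).ToList();
    }
}
```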
At this point you can enumerate over linksOnPage to see all of the links on the specified web page. In the demo available for download, I displayed this
information by binding linksOnPage to a ListView control named lvLinks:
lvLinks.DataSource = linksOnPage;
The ListView's template is simple enough - it displays each item in a bulleted list:
<asp:ListView ID="lvLinks" runat="server">
The screen shot below shows the output when run on the 4GuysFromRolla.com homepage.
Modifying and Saving an HTML Document
The previous two demos illustrated how the Html Agility Pack takes HTML from a remote website and constructs a DOM that can be read from, but it's also possible to modify the DOM and save the updated DOM to disk (or to any stream, for that matter). This third and final demo starts like the other two - the user is prompted to enter a URL and that HTML document is downloaded. Once downloaded, it is modified in two ways:
- A new element is constructed programmatically and added as the first child of the <body> element, and
- All of the hyperlinks in the page are updated so that, when clicked, they are opened in another window. This is accomplished by setting each link's target attribute to _blank.
This demo starts the same way as the previous two, by creating an HtmlWeb object and calling its Load method:
var webGet = new HtmlWeb();
Next, the <body> element is accessed. This is done using LINQ but this time using the extension methods (rather than the query syntax). The below line
of code says, in English, "From all of the descendants of the document node, give me the first node whose name equals 'body'. If no such node exists, give me back
the value null."
var body = document.DocumentNode.Descendants()
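One way to express that lookup with LINQ's extension methods is sketched below. The wrapper class and method are hypothetical; Where and FirstOrDefault are the standard System.Linq operators, and FirstOrDefault yields null when no node is named "body".

```csharp
using System.Linq;
using HtmlAgilityPack;

public static class BodySketch
{
    // Returns the first node named "body", or null if the document has none.
    public static HtmlNode FindBody(HtmlDocument document)
    {
        var body = document.DocumentNode.Descendants()
                           .Where(n => n.Name == "body")
                           .FirstOrDefault();
        return body;
    }
}
```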
If there is a <body> element then we next need to create an HTML element and add it as the first child element of the <body>.
The following code creates a new HTML element node (messageElement), adds a style attribute, specifies the new element's name ("div"), and
then assigns its inner HTML. After this, the new element is inserted at the beginning of body's ChildNodes collection.
if (body != null)
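The steps above can be sketched roughly as follows. Note the assumptions: this version creates the element with the document's CreateElement method and adds it with PrependChild, which has the same effect as inserting at index 0 of ChildNodes but is not necessarily how the demo does it. The helper name and the shortened style value are also mine.

```csharp
using System.Linq;
using HtmlAgilityPack;

public static class BannerSketch
{
    // Adds a <div> as the first child of <body>; returns false if there is no <body>.
    public static bool AddBanner(HtmlDocument document)
    {
        var body = document.DocumentNode.Descendants()
                           .FirstOrDefault(n => n.Name == "body");
        if (body == null) return false;

        // Create the new element, give it a style attribute, and set its inner HTML.
        var messageElement = document.CreateElement("div");
        messageElement.SetAttributeValue("style", "border:solid black 2px;text-align:center");
        messageElement.InnerHtml = "<p>Hello! This page was modified by the Html Agility Pack!</p>";

        // Make it the first child of <body>.
        body.PrependChild(messageElement);
        return true;
    }
}
```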
Next, the SelectNodes method is used to retrieve all <a> tags that have an href attribute specified. Presuming any such tags
were found, they are enumerated. For each link a check is performed to see if there is already a target attribute defined. If not, the target
attribute is added with a value of _blank. If the target attribute already exists it is set to _blank.
var linksThatDoNotOpenInNewWindow = document.DocumentNode.SelectNodes("//a[@href]");
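A sketch of the loop described above, assuming the two-branch check on the target attribute, could look like this. The helper name and its return value (the number of links updated) are additions of mine for illustration.

```csharp
using HtmlAgilityPack;

public static class LinkTargetSketch
{
    // Sets target="_blank" on every <a> tag that has an href; returns how many were updated.
    public static int OpenLinksInNewWindow(HtmlDocument document)
    {
        var links = document.DocumentNode.SelectNodes("//a[@href]");
        if (links == null) return 0;   // no matching links on the page

        foreach (var link in links)
        {
            if (link.Attributes["target"] == null)
            {
                // No target attribute yet, so add one.
                link.Attributes.Add("target", "_blank");
            }
            else
            {
                // Overwrite whatever target was already specified.
                link.Attributes["target"].Value = "_blank";
            }
        }

        return links.Count;
    }
}
```

The HtmlNode class's SetAttributeValue method covers both branches in a single call, so the if/else could be collapsed if you prefer.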
At this point the document has been modified, but all of these modifications have occurred in memory. To save the modified document we call the document
object's Save method, passing in the file name. In this demo I place the modified markup in the ~/ModifiedPages folder using a file name of the form
guid.htm where guid is a globally unique identifier (e.g., a value like 02cdb8d8-3a01-4076-baaa-f7a8bd6b22ea).
var fileName = string.Format("~/ModifiedPages/{0}.htm", Guid.NewGuid().ToString());
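The save step might be sketched as follows. Here the destination folder is passed in as a physical path; in an ASP.NET page you would presumably first translate a virtual path such as ~/ModifiedPages with Server.MapPath, but that part is an assumption on my side, as are the class and method names.

```csharp
using System;
using System.IO;
using HtmlAgilityPack;

public static class SaveSketch
{
    // Saves the document as {guid}.htm in the given folder and returns the full file path.
    public static string SaveToFolder(HtmlDocument document, string folderPath)
    {
        var fileName = Path.Combine(folderPath,
                                    string.Format("{0}.htm", Guid.NewGuid().ToString()));

        // Writes the (possibly modified) in-memory DOM to disk.
        document.Save(fileName);
        return fileName;
    }
}
```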
The following screen shot shows the contents of the saved, modified 4GuysFromRolla.com homepage. The big block of text at the top is the HTML element we added at the
start of the <body>, and clicking on any link in the page opens the link in a new window. (The modified version, when viewed through a browser, has
many broken images and styling issues because the 4Guys homepage, like many other sites, uses relative paths for images and external resources. Because I didn't also
download the associated images and external resources, these are not found when viewing the modified page.)
If you do a View/Source on the modified web page you'll see that the HTML content we added and modified is reflected there. Here is the markup emitted by the
messageElement node we added:
<div style="width:95%;border:solid black 2px;background-color:#ffc;font-size:xx-large;text-align:center"><p>Hello! This page was modified by the Html Agility Pack!</p><p>Click on a link below... it should open in a new window!</p></div>
And here is the markup of one of the many links on the page. Note the presence of the target="_blank" attribute - this isn't found in the original markup.
<a href="http://www.4guysfromrolla.com/articles/122910-1.aspx" class="headlines" target="_blank">2010's Most Popular Articles</a>
Happy Programming!