How to scrape a website

May 11, 2015

If you have ever wanted to build an application that compares information from other websites, web scraping is one way to do it.

To achieve this, you can use the NuGet package HtmlAgilityPack, which parses HTML so that you can retrieve the data you have in mind. Before you start, make sure the sites you want to pull information from allow scraping by reading their terms of use.

In my example, I will use a form with a WebBrowser control to display the website and a button that starts the scraping.

Once you have the package installed, you can reference the library by adding its namespace to the class that needs it. It is also worth adding the XML namespaces so you can accurately retrieve the needed elements.

Once that is referenced, make sure your WebBrowser control displays the correct site by setting its Url property.

First, to scrape a page, you have to give your application an HtmlDocument to work with. This document represents the website you are retrieving information from.
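A minimal sketch of this step, assuming the page's HTML is already available as a string (in a form like the one described here, it could come from the WebBrowser control's DocumentText property). The class name, method name and sample markup are illustrative and not part of the original example.

```csharp
using System;
using HtmlAgilityPack;

static class LoadExample
{
    // Parses raw HTML text into an HtmlAgilityPack document.
    public static HtmlDocument Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc;
    }

    static void Main()
    {
        // Stand-in markup; in a WinForms app this could be webBrowser1.DocumentText
        // (webBrowser1 is a hypothetical control name).
        string html = "<html><body><table id='prices'>" +
                      "<tr><td>Widget</td><td>9.99</td></tr>" +
                      "</table></body></html>";

        HtmlDocument doc = Parse(html);

        // DocumentNode is the root of the parsed tree.
        Console.WriteLine(doc.DocumentNode.InnerText);
    }
}
```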

After you have passed in the document, work out the absolute path to the HTML element that you want to work with and save it in a string. The application will use this path, an XPath expression, to extract the information we want. If you have worked with Google Chrome and its element inspector, you can right-click the element and choose Copy XPath.
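As a rough illustration, the path can be stored in a string and handed to SelectSingleNode. The XPath, the table id and the sample markup below are hypothetical; substitute the path copied from your own page.

```csharp
using System;
using HtmlAgilityPack;

static class XPathExample
{
    // Hypothetical XPath; replace it with the one copied from the inspector.
    public const string TablePath = "//table[@id='prices']";

    // Returns the first node matching the XPath, or null when nothing matches.
    public static HtmlNode Find(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc.DocumentNode.SelectSingleNode(xpath);
    }

    static void Main()
    {
        string html = "<html><body><table id='prices'>" +
                      "<tr><td>Widget</td><td>9.99</td></tr>" +
                      "</table></body></html>";

        HtmlNode table = Find(html, TablePath);
        Console.WriteLine(table == null
            ? "No match; check the XPath."
            : "Found <" + table.Name + ">");
    }
}
```

Paths copied from a browser often include a tbody step that the browser added while rendering. If the page source has no tbody element, such a path may match nothing, so a null check like the one above is worth keeping.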

Once you have the element you are looking for, you can start working with it. As we are working with a table, we must set our code up to iterate over its rows. To do this, we create an HtmlNodeCollection object.
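A sketch of populating the collection, assuming the rows can be reached with the hypothetical XPath shown in the code. SelectNodes returns null rather than an empty collection when nothing matches, so the result is checked before use.

```csharp
using System;
using HtmlAgilityPack;

static class RowsExample
{
    // Returns every node matching rowPath, or null when there is no match.
    public static HtmlNodeCollection SelectRows(HtmlDocument doc, string rowPath)
    {
        return doc.DocumentNode.SelectNodes(rowPath);
    }

    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<table id='prices'>" +
                     "<tr><td>Widget</td><td>9.99</td></tr>" +
                     "<tr><td>Gadget</td><td>4.50</td></tr>" +
                     "</table>");

        // Hypothetical path: every tr anywhere under the table with id 'prices'.
        HtmlNodeCollection rows = SelectRows(doc, "//table[@id='prices']//tr");

        if (rows == null)
            Console.WriteLine("No rows matched.");
        else
            Console.WriteLine("Rows found: " + rows.Count);
    }
}
```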

Once the collection is populated, we can write a loop that iterates through it and grabs each individual record.

The loop condition is set to the number of items in the collection so that we iterate through all of them. Inside the loop, we create an HtmlNode object to represent each row in the collection.

Inside each table row we have td elements, so once again we set an XPath string to the element we want. This is td because we want the information from the columns. In my case there are 5 columns, so I create an HtmlNode array with a size of 5 to hold the value of each column.

Then create a collection to represent the columns in a table row. These can be retrieved through the ChildNodes property of the HtmlNode object called TR (the one representing the table row), using the index number of each column. Indexing starts at 0, so the first column is number 0.
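The loop described above might look roughly like the following. This is a sketch rather than the original code: the markup, the row XPath and the two-column layout are assumptions, and it selects the td cells with a relative XPath instead of indexing ChildNodes directly.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class TableLoopExample
{
    // Reads the text of the first columnCount cells of every matching row.
    public static List<string[]> ReadRows(string html, string rowPath, int columnCount)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<string[]>();
        HtmlNodeCollection rows = doc.DocumentNode.SelectNodes(rowPath);
        if (rows == null) return result; // no rows matched the XPath

        for (int i = 0; i < rows.Count; i++)
        {
            HtmlNode tr = rows[i];

            // "td" is relative to the current row, so only its own cells match.
            HtmlNodeCollection cells = tr.SelectNodes("td");
            if (cells == null || cells.Count < columnCount) continue; // e.g. header rows

            var values = new string[columnCount];
            for (int c = 0; c < columnCount; c++)
            {
                // Index 0 is the first column; decode entities such as &amp;.
                values[c] = HtmlEntity.DeEntitize(cells[c].InnerText).Trim();
            }
            result.Add(values);
        }
        return result;
    }

    static void Main()
    {
        string html = "<table id='prices'>" +
                      "<tr><td>Widget</td><td>9.99</td></tr>" +
                      "<tr><td>Gadget</td><td>4.50</td></tr>" +
                      "</table>";

        foreach (string[] row in ReadRows(html, "//table[@id='prices']//tr", 2))
        {
            Console.WriteLine(string.Join(" | ", row));
        }
    }
}
```

One reason to select td explicitly: when the markup has line breaks or spaces between cells, ChildNodes also contains whitespace text nodes, so an index into ChildNodes may not line up with the column number.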

Once you have these fields captured, you can decide what you want to do with the individual fields.


About the Author:

Auret Swanepoel

As a recent addition to the New Horizons team, Auret is a highly skilled and qualified IT technical trainer. He has been a Microsoft Certified Trainer (MCT) since 2008 and has since also become a Microsoft Certified Professional (MCP), a Microsoft Certified Technology Specialist (MCTS) and a Microsoft Certified Information Technology Professional (MCITP). With his international experience as a trainer in South Africa, Auret is able to adapt his teaching style to different audiences in the classroom and ensure that students are learning in a positive and collaborative environment.
