Web page scraping, the easy way

Sunday, 07 January 2018 00:46 Stefano Tommesani

There are many ways to extract data elements from web pages, almost all of them prettier and cooler than the method proposed here, but as we are in a hurry, let's get that data quickly, OK? Suppose we have to extract the profile names from a well-known site:

[Image: Scrape1]

The first step is identifying the element in the HTML page that contains the profile name. In Chrome, select the name of the profile, right-click on it and choose Inspect in the popup menu. The developer tools open with the matching element highlighted:

[Image: Scrape2]

We now have all the bits of information we need: the name of the profile is hosted in an h1 element whose class attribute includes the pv-top-card-section__name class. The troubling point is that the class attribute lists more than one class, and we are not interested in specifying the other ones when searching for the right element in the page. Time to write some C# code. Assuming that the HTML code of the page has already been loaded into an HtmlDocument by the HtmlAgilityPack package, the following function will search for the correct element inside the page and return its inner text:

// Requires: using System; using System.Linq; using HtmlAgilityPack;
private string GetItemText(HtmlDocument htmlDoc, string itemType, string classValue)
{
    if (htmlDoc?.DocumentNode == null)
        return String.Empty;

    // First element of the requested type whose class attribute contains classValue
    // (a substring match, so the other classes do not need to be specified)
    var match = htmlDoc.DocumentNode
        .Descendants(itemType)
        .FirstOrDefault(d =>
            d.Attributes.Contains("class")
            &&
            d.Attributes["class"].Value.Contains(classValue)
        );

    return match != null ? CleanUpItem(match.InnerText) : String.Empty;
}

The function will search for the first element of a specific type (in this case, h1) whose class attribute contains the given value (in this case, pv-top-card-section__name), so this call will return the name of the profile:

parsedProfile.Name = GetItemText(htmlDoc, "h1", "pv-top-card-section__name");
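One caveat: the Contains check in GetItemText is a plain substring match, so it would also accept a longer class name that merely starts with or embeds the one we want. If that ever becomes a problem, a stricter test can split the class attribute on whitespace and compare whole entries. The sketch below is only an illustration; ClassAttributeMatcher and HasClassToken are hypothetical names, not part of HtmlAgilityPack, and the class strings in the comments are made up.

```csharp
using System;

public static class ClassAttributeMatcher
{
    // Hypothetical helper: returns true only when `token` appears as a complete,
    // whitespace-separated entry in the raw value of a class attribute.
    public static bool HasClassToken(string classAttribute, string token)
    {
        if (string.IsNullOrWhiteSpace(classAttribute) || string.IsNullOrEmpty(token))
            return false;

        // Class names in an HTML class attribute are separated by whitespace
        string[] tokens = classAttribute.Split(
            new[] { ' ', '\t', '\r', '\n', '\f' },
            StringSplitOptions.RemoveEmptyEntries
        );

        // Exact, case-sensitive comparison against each entry
        return Array.IndexOf(tokens, token) >= 0;
    }
}

// HasClassToken("pv-top-card-section__name other-class", "pv-top-card-section__name") is true
// HasClassToken("pv-top-card-section__name-badge", "pv-top-card-section__name") is false
```

Inside GetItemText, the predicate could then become HasClassToken(d.GetAttributeValue("class", ""), classValue), at the cost of a few more lines than the quick substring check.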

What does the CleanUpItem call do, you say? Just some cleaning of the inner text of the element:

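// Requires: using System; using System.Text;
private string CleanUpItem(string item)
{
    // Split on every line-break flavour, including a literal backslash followed by 'n'
    string[] lines = item.Split(
        new[] { "\r\n", "\r", "\n", "\\n" },
        StringSplitOptions.None
    );
    StringBuilder sb = new StringBuilder();
    foreach (var line in lines)
    {
        // Skip blank lines; trim the rest and append them with no separator
        if (!String.IsNullOrWhiteSpace(line))
            sb.Append(line.Trim());
    }

    return sb.ToString();
}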
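Note that CleanUpItem glues the surviving lines together with no separator, so inner text that spans two lines, say "John" and "Doe", comes back as "JohnDoe". That is fine for a single-line name, but if an element wraps its text over several lines, a variant along these lines may be preferable. This is a sketch, not a replacement for the function above; TextCleaner and CleanUpItemWithSpaces are hypothetical names.

```csharp
using System;
using System.Linq;

public static class TextCleaner
{
    // Hypothetical variant of CleanUpItem: same splitting and trimming,
    // but the non-blank lines are joined with a single space.
    public static string CleanUpItemWithSpaces(string item)
    {
        if (string.IsNullOrEmpty(item))
            return string.Empty;

        var parts = item
            .Split(new[] { "\r\n", "\r", "\n", "\\n" }, StringSplitOptions.None)
            .Select(line => line.Trim())
            .Where(line => line.Length > 0);

        return string.Join(" ", parts);
    }
}

// CleanUpItemWithSpaces("\n  John\n  Doe\n") returns "John Doe"
```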

So with a quick inspection in Chrome and a short C# code fragment, we can scrape the information we need from the page. Job done!

© 2019 - Stefano Tommesani


Last Updated on Sunday, 07 January 2018 01:12