Introduction To Web Scraping



Introduction to web scraping

Web scraping is the process of extracting data from websites.

Web scraping is one of the tools at a developer's disposal when looking to gather data from the internet. While consuming data via an API has become commonplace, most of the websites online don't have an API for delivering data to consumers. In order to access the data they're looking for, web scrapers and crawlers read a website's pages and feeds, analyzing the site's structure and markup language for clues. Generally speaking, information collected from scraping is fed into other programs for validation, cleaning, and input into a datastore, or it's fed into other processes such as natural language processing (NLP) toolchains or machine learning (ML) models. There are a few Python packages we could use to illustrate these ideas, but we'll focus on Scrapy for these examples. Scrapy makes it very easy for us to quickly prototype and develop web scrapers with Python.


Scrapy vs. Selenium and Beautiful Soup

If you’re interested in getting into Python’s other packages for web scraping, we’ve laid it out here:

Scrapy concepts

Before we start looking at specific examples and use cases, let’s brush up a bit on Scrapy and how it works.

Spiders: Scrapy uses Spiders to define how a site (or a bunch of sites) should be scraped for information. Scrapy lets us determine how we want the spider to crawl, what information we want to extract, and how we can extract it. Specifically, Spiders are Python classes where we’ll put all of our custom logic and behavior.

Selectors: Selectors are Scrapy's mechanisms for finding data within the website's pages. They're called selectors because they provide an interface for “selecting” certain parts of the HTML page, and these selectors can be either CSS or XPath expressions.
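As a quick sketch of what that looks like in practice (the markup targeted here is hypothetical and not tied to any site in this post), the same elements can be selected with either flavor:

```
import scrapy

class SelectorDemoSpider(scrapy.Spider):
    name = 'selector_demo'
    # Placeholder URL; any page with <h2 class="title"> headings would do
    start_urls = ['https://example.com/']

    def parse(self, response):
        # CSS expression: the text of every <h2> element with class "title"
        titles_css = response.css('h2.title::text').getall()
        # The equivalent XPath expression
        titles_xpath = response.xpath('//h2[@class="title"]/text()').getall()
        yield {'css': titles_css, 'xpath': titles_xpath}
```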

Items: Items are the data extracted from selectors, collected into a common data model. Since our goal is a structured result from unstructured inputs, Scrapy provides an Item class which we can use to define how our scraped data should be structured and what fields it should have.
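For instance, a minimal Item for image posts might declare one Field per piece of data we want in the result (the class and field names here are our own illustration):

```
import scrapy

class ImageItem(scrapy.Item):
    # One declared Field per attribute in our structured result
    url = scrapy.Field()
    title = scrapy.Field()
```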

Reddit-less front page

Suppose we love the images posted to Reddit, but don’t want any of the comments or self posts. We can use Scrapy to make a Reddit Spider that will fetch all the photos from the front page and put them on our own HTML page which we can then browse instead of Reddit.

To start, we'll create a RedditSpider which we can use to traverse the front page and handle custom behavior.
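A minimal sketch of that starting point, reconstructed from the description that follows (the exact front-page URL is a placeholder):

```
import scrapy

class RedditSpider(scrapy.Spider):
    name = 'reddit'
    start_urls = ['https://www.reddit.com/']

    def parse(self, response):
        # Custom parsing logic will go here
        pass
```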

Above, we’ve defined a RedditSpider, inheriting Scrapy’s Spider. We’ve named it reddit and have populated the class’ start_urls attribute with a URL to Reddit from which we’ll extract the images.

At this point, we'll need to begin defining our parsing logic. We need to figure out an expression that the RedditSpider can use to determine whether it's found an image. If we look at Reddit's robots.txt file, we can see that our spider can't crawl any comment pages without violating it, so we'll need to grab our image URLs without following through to the comment pages.

By looking at Reddit, we can see that external links are included on the homepage directly next to the post’s title. We’ll update RedditSpider to include a parser to grab this URL. Reddit includes the external URL as a link on the page, so we should be able to just loop through the links on the page and find URLs that are for images.
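A sketch of that parser, matching the walkthrough in the next two paragraphs (the image-extension list and the yielded dictionary are illustrative choices on our part):

```
import scrapy

class RedditSpider(scrapy.Spider):
    name = 'reddit'
    start_urls = ['https://www.reddit.com/']

    def parse(self, response):
        # Grab the href attribute of every link on the page with a basic XPath selector
        links = response.xpath('//a/@href')
        for link in links:
            # .get() returns the link destination as a plain string
            url = link.get()
            # Check the URL against a short, non-exhaustive list of image extensions
            if any(extension in url for extension in ['.jpg', '.jpeg', '.png', '.gif']):
                yield {'image_url': url}
```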

In a parse method on our RedditSpider class, I’ve started to define how we’ll be parsing our response for results. To start, we grab all of the href attributes from the page’s links using a basic XPath selector. Now that we’re enumerating the page’s links, we can start to analyze the links for images.

To actually access the text information from the link’s href attribute, we use Scrapy’s .get() function which will return the link destination as a string. Next, we check to see if the URL contains an image file extension. We use Python’s any() built-in function for this. This isn’t all-encompassing for all image file extensions, but it’s a start. From here we can push our images into a local HTML file for viewing.

To start, we begin collecting the HTML file contents as a string which will be written to a file called frontpage.html at the end of the process. You'll notice that instead of pulling the image location from '//a/@href', we've updated our links selector to use the image's src attribute: '//img/@src'. This will give us more consistent results, and select only images.

As our RedditSpider’s parser finds images it builds a link with a preview image and dumps the string to our html variable. Once we’ve collected all of the images and generated the HTML, we open the local HTML file (or create it) and overwrite it with our new HTML content before closing the file again with page.close(). If we run scrapy runspider reddit.py, we can see that this file is built properly and contains images from Reddit’s front page.
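Put together, a sketch of that version of the spider might look like the following (the preview markup and image sizing are our own choices):

```
import scrapy

class RedditSpider(scrapy.Spider):
    name = 'reddit'
    start_urls = ['https://www.reddit.com/']

    def parse(self, response):
        # Collect the generated markup in a single string
        html = ''
        # Select image sources directly for more consistent results
        for link in response.xpath('//img/@src'):
            url = link.get()
            if any(extension in url for extension in ['.jpg', '.jpeg', '.png', '.gif']):
                # Build a link wrapping a preview image and append it to our HTML string
                html += '<a href="{url}"><img src="{url}" width="33%" /></a>'.format(url=url)

        # Open (or create) the local HTML file, overwrite it with the new content, and close it
        page = open('frontpage.html', 'w')
        page.write(html)
        page.close()
```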

But, it looks like it contains all of the images from Reddit’s front page – not just user-posted content. Let’s update our parse command a bit to blacklist certain domains from our results.

If we look at frontpage.html, we can see that most of Reddit’s assets come from redditstatic.com and redditmedia.com. We’ll just filter those results out and retain everything else. With these updates, our RedditSpider class now looks like the below:
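A sketch with that exclusion in place, again reconstructed from the description (the extension list is the same illustrative one as before):

```
import scrapy

class RedditSpider(scrapy.Spider):
    name = 'reddit'
    start_urls = ['https://www.reddit.com/']

    # Domains that serve Reddit's own assets, which we want to filter out
    exclude = ['redditstatic.com', 'redditmedia.com']

    def parse(self, response):
        html = ''
        for link in response.xpath('//img/@src'):
            url = link.get()
            is_image = any(extension in url for extension in ['.jpg', '.jpeg', '.png', '.gif'])
            is_excluded = any(domain in url for domain in self.exclude)
            if is_image and not is_excluded:
                html += '<a href="{url}"><img src="{url}" width="33%" /></a>'.format(url=url)

        page = open('frontpage.html', 'w')
        page.write(html)
        page.close()
```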

We're simply adding our list of excluded domains to an exclusionary any() expression. These statements could be tweaked to read from a separate configuration file, local database, or cache, if need be.


Extracting Amazon price data

If you're running an ecommerce website, intelligence is key. With Scrapy we can easily automate the process of collecting information about our competitors, our market, or our listings.

Scraping

For this task, we’ll extract pricing data from search listings on Amazon and use the results to provide some basic insights. If we visit Amazon’s search results page and inspect it, we notice that Amazon stores the price in a series of divs, most notably using a class called .a-offscreen. We can formulate a CSS selector that extracts the price off the page:
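Inside a Scrapy callback, that selector might look something like this (the ::text pseudo-element pulls out just the price string):

```
# Select the text of every element carrying Amazon's .a-offscreen price class
prices = response.css('.a-offscreen::text').getall()
```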

With this CSS selector in mind, let’s build our AmazonSpider.
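Here's a sketch of that spider, reconstructed from the notes below (the search URL is a placeholder, and the regular expression is one reasonable way to strip currency formatting without matching a '$'):

```
import re
from decimal import Decimal

import scrapy


def convert_money(money):
    # Strip everything except digits and the decimal point from a string like '$45.67',
    # then cast it to a Decimal; no '$' appears anywhere in the regular expression
    return Decimal(re.sub(r'[^0-9.]', '', money))


class AmazonSpider(scrapy.Spider):
    name = 'amazon'
    # Placeholder search URL; any Amazon results page would work
    start_urls = ['https://www.amazon.com/s?k=paint']

    def parse(self, response):
        # .getall() returns every matching price string as a list
        prices = response.css('.a-offscreen::text').getall()
        for price in prices:
            yield {'price': convert_money(price)}
```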

A few things to note about our AmazonSpider class:

  • convert_money(): This helper simply takes strings formatted like '$45.67' and casts them to a Python Decimal type which can be used for computations, and it avoids issues with locale by not including a '$' anywhere in the regular expression.
  • getall(): The .getall() function is a Scrapy function that works similarly to the .get() function we used before, but it returns all the extracted values as a list which we can work with.

Running the command scrapy runspider amazon.py in the project folder will log the scraped price values to the console.

It’s easy to imagine building a dashboard that allows you to store scraped values in a datastore and visualize data as you see fit.

Considerations at scale

As you build more web crawlers and continue to follow more advanced scraping workflows, you'll likely notice a few things:

  1. Sites change, now more than ever.
  2. Getting consistent results across thousands of pages is tricky.
  3. Performance considerations can be crucial.

Sites change, now more than ever

On occasion, AliExpress, for example, will return a login page rather than search listings. Sometimes Amazon will decide to raise a Captcha, or Twitter will return an error. While these errors can sometimes be transient flickers, others will require a complete re-architecture of your web scrapers. Nowadays, modern front-end frameworks are often pre-compiled for the browser, which can mangle class names and ID strings, and sometimes a designer or developer will change an HTML class name during a redesign. It's important that our Scrapy crawlers are resilient, but keep in mind that changes will occur over time.

Getting consistent results across thousands of pages is tricky

Slight variations of user-inputted text can really add up. Think of all of the different spellings and capitalizations you may encounter in just usernames. Pre-processing, normalizing, and standardizing text before performing an action or storing the value is best practice ahead of most NLP or ML processes, and it leads to better results.

Performance considerations can be crucial

You'll want to make sure you're operating at least moderately efficiently before attempting to process 10,000 websites from your laptop one night. As your dataset grows, it becomes more and more costly to manipulate in terms of memory or processing power. In a similar regard, you may want to extract the text from one news article at a time, rather than downloading all 10,000 articles at once. As we've seen in this tutorial, performing advanced scraping operations is actually quite easy using Scrapy's framework. Some advanced next steps might include loading selectors from a database and scraping using very generic Spider classes, or using proxies or modified user-agents to see if the HTML changes based on location or device type. Scraping in the real world becomes complicated because of all the edge cases; Scrapy provides an easy way to build this logic in Python.
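As one small illustration of that last point, a spider can override its user agent through Scrapy's per-spider custom_settings; the URL and user-agent string below are placeholders, not recommendations:

```
import scrapy

class MobileSpider(scrapy.Spider):
    name = 'mobile'
    # Placeholder URL for illustration
    start_urls = ['https://example.com/']

    # Per-spider settings override the project settings; here we present
    # ourselves as a mobile browser to see whether the returned HTML changes.
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (iPhone; CPU iPhone OS 14_0 like Mac OS X) AppleWebKit/605.1.15',
    }

    def parse(self, response):
        yield {'title': response.css('title::text').get()}
```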

This post is a part of Kite’s new series on Python. You can check out the code from this and other posts on our GitHub repository.


Our Beginner's Guide to Web Scraping

The internet has become such a powerful tool because there is so much information out there. Many marketers, web developers, investors and data scientists use web scraping to collect online data to help them make valuable decisions.

But if you're not sure how to use a web scraper tool, it can be intimidating and discouraging. The goal of this beginner's guide is to help introduce web scraping to people who are new to it or who don't know exactly where to start.

We'll even go through an example together to give you a basic understanding of it. So we recommend downloading our free web scraping tool so you can follow along.

So, let’s get into it.


Introduction to Web scraping

First, it's important to discuss what web scraping is and what you can do with it. Whether this is your first time hearing about web scraping, or you've heard of it but aren't sure what it is, this beginner's guide will help you discover what web scraping is capable of!

What is Web Scraping?

Web scraping, also known as web harvesting, is a powerful tool that can help you collect data online and transfer the information into an Excel, CSV or JSON file to help you better understand the information you've gathered.

Although web scraping can be done manually, this can be a long and tedious process. That's why using data extraction tools is preferred when scraping online data, as they can be more accurate and more efficient.

Web scraping is incredibly common and can be used to create APIs out of almost any website.

How do web scrapers work?


Automatic web scraping can be simple but also complex at the same time. Once you understand it and get the hang of it, it becomes a lot easier; like anything, it just takes practice.

The web scraper will be given one or more URLs to load before scraping. The scraper then loads the entire HTML code for the page in question. More advanced scrapers will render the entire website, including CSS and JavaScript elements.

Then the scraper will either extract all the data on the page or specific data selected by the user before the project is run.

Ideally, you want to go through the process of selecting which data you want to collect from the page. This can be text, images, prices, ratings, ASINs, addresses, URLs, etc.

Once you have everything you want to extract selected, you can then place it in an Excel/CSV file to analyze all of the data. Some advanced web scrapers can convert the data into a JSON file, which can be used as an API.
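If you're curious what that load, extract, and export cycle looks like in code, here is a minimal Python sketch using the requests and BeautifulSoup libraries (separate tools, not part of ParseHub; the URL and CSS classes are placeholders):

```
import csv

import requests
from bs4 import BeautifulSoup

# 1. Load the page's HTML (placeholder URL)
response = requests.get('https://example.com/products')
soup = BeautifulSoup(response.text, 'html.parser')

# 2. Extract only the data we care about (placeholder CSS classes)
rows = []
for product in soup.select('.product'):
    rows.append({
        'name': product.select_one('.name').get_text(strip=True),
        'price': product.select_one('.price').get_text(strip=True),
    })

# 3. Export the structured results to a CSV file for analysis
with open('products.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['name', 'price'])
    writer.writeheader()
    writer.writerows(rows)
```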

If you want to learn more, you can read our guide on What is Web Scraping and what it’s used for

Is Web Scraping Legal?


With the ability to extract public information from competitors or other websites, is web scraping legal?

Any publicly available data that can be accessed by everyone on the internet can be legally extracted.

The data has to meet these 3 criteria for it to be legally extracted:


  • User has made the data public
  • No account required for access
  • Not blocked by robots.txt file

As long as it follows these 3 rules, it's legal!

You can learn more about the rules of web scraping here: Is web scraping legal?

Web scraping for beginners

Now that we understand what web scraping is and how it works, let's put it into action to get the hang of it!

For this example, we are going to extract all of the blog posts ParseHub has created, how long they take to read, who wrote them, and their URLs. We're not sure what you'll use this information for, but we just want to show you what you can do with web scraping and how easy it can be!

First, download our free web scraping tool.

You’ll need to set up ParseHub on your desktop so here’s the guide to help you: Downloading and getting started.

Once ParseHub is ready, we can now begin scraping data.

If it’s your first time using ParseHub, we recommend following the tutorial just to give you an idea of how it works.

But let’s scrape an actual website like our Blog.


For this example, we want to extract all of the blogs we have written, the URL of the blog, who wrote the blog, and how long it takes to read.

Your first web scraping project

1. Open up ParseHub and create a new project by selecting “New Project”

2. Copy this URL: https://www.parsehub.com/blog/ and place it in the text box on the left-hand side and then click on the “Start project on this URL” button.

3. Once the page is loaded on ParseHub there will be 3 sections:

  • Command Section
  • The web page you're extracting from
  • Preview of what the data will look like

The command section is where you will tell the software what you want to do, whether that is a click, a selection, or one of the advanced features ParseHub offers.

4. To begin extracting data, you will need to click on what exactly you want to extract, in this case, the blog title. Click on the first blog title you see.

Once clicked, the selection you made will turn green. ParseHub will then make suggestions of what it thinks you want to extract.

The suggested data will be in a yellow container. Click on a title that is in a yellow container and all blog titles will be selected. Scroll down a bit to make sure no blog title is missing.

Now that you have some data, you can see a preview of what it will look like when it's exported.

5. Let's rename our selection to something that will help us keep our data organized. To do this, just double click on the selection; the name will be highlighted and you can rename it. In this case, we are going to name it “blog_name”.

Quick note: when renaming your selections or data, use no spaces, i.e. “Blog names” won't work but “blog_names” will.

Now that all blog titles are selected, we also want to extract who wrote them, and how long they take to read. We will need to make a relative selection.

6. On the left sidebar, click the PLUS (+) sign next to the blog name selection and choose the Relative Select command.

7. Using the Relative Select command, click on the first blog name and then the author. You will see an arrow connect the two selections. You should see something like this:

Let’s rename the relative selection to blog_author

Since we don't need the image URL, let's get rid of it. To do this, click on the expand button on the “relative blog_author” selection.

Now select the trash can beside “extract blog_author”

8. Repeat steps 6 and 7 to get the length of the blog; you won't need to delete the URL since we are extracting text. Let's name this selection “blog_length”.

It should look like this.

Since our blog is a scrolling page (scroll to load more) we will need to tell the software to scroll to get all the content.

If you were to run the project now you would only get the first few blogs extracted.


9. To do this, click on the PLUS (+) sign beside the page selection and choose the Select command. You will need to select the page's main element; in this case, it will look like this.

10. Once you have the main div selected, you can add the scroll function. To do this, on the left sidebar, click the PLUS (+) sign next to the main selection, click on Advanced, then select the Scroll function.

You will need to tell the software how many times to scroll; depending on how big the blog is, you may need a bigger number. But for now, let's set it to 5 times and make sure it's aligned to the bottom.

If you still need help with the scroll option you can click here to learn more.

We will need to move the main scroll option above blog names, it should look like this now:

11. Now that we have everything we want extracted, we can let ParseHub do its magic. Click on the “Get data” button.

12. You’ll be taken to this page.

You can test your extraction to make sure it’s working properly. For bigger projects, we recommend doing a test run first. But for this project let's press “run” so ParseHub can extract the online data.

13. This project shouldn't take too long, but once ParseHub is done extracting the data, you can download and export it as a CSV/Excel file, JSON, or via an API. We just need a CSV/Excel file for this project.

And there you have it! You’ve completed your first web scraping project. Pretty simple huh? But ParseHub can do so much more!

What else can you do with web scraping?

Now that we scraped our blog and movie titles (if you did the tutorial), you can try to implement web scraping in more of a business-related setting. Our mission is to help you make better decisions and to make better decisions you need data.

ParseHub can help you make valuable decisions by doing efficient competitor research, brand monitoring and management, lead generation, finding investment opportunities and many more!

Whatever you choose to do with web scraping, ParseHub can help!

Check out our other blog posts on how you can use ParseHub to help grow your business. We’ve split our blog posts into different categories depending on what kind of information you're trying to extract and the purpose of your scraping.

Ecommerce website / Competitor Analysis / Brand reputation

Lead Generation

Brand Monitoring and Investing Opportunities

Closing Thoughts

There are many ways web scraping can help your business, and every day companies are finding creative ways to use ParseHub to grow! Web scraping is a great way to collect the data you need, but it can be a bit intimidating at first if you don't know what you're doing. That's why we wanted to create this beginner's guide to web scraping, to help you gain a better understanding of what it is, how it works, and how you can use it for your business!

If you have any trouble with anything, you can visit our help center or blog to help you navigate ParseHub, or contact support with any inquiries.

Learn more about web scraping

If you want to learn more about web scraping and elevate your skills, you can check out our free web scraping course! Once completed, you'll get a certification to show off your new skills and knowledge.


Happy Scraping!