Scraping Multiple Pages with Pagination Using Python

Web scraping is the process of extracting data from websites. It is a valuable technique for many data-driven projects such as web indexing, data mining, and data analysis. In this article, we will learn how to scrape multiple pages with pagination using Python.

Introduction

Websites often split large data sets across multiple pages, with pagination links or page numbers to move between them. Scraping such a site by hand, one page at a time, is tedious and time-consuming. With a few lines of Python, we can loop over the pages and collect the data automatically.

Getting Started

Before we start, we need to install two libraries: requests, to make HTTP requests, and BeautifulSoup (the beautifulsoup4 package), to parse the HTML content. We can install both with pip. In a Jupyter notebook, prefix the command with an exclamation mark (!pip install ...).

pip install requests beautifulsoup4

Now that we have the necessary libraries installed, we can begin with the simplest case.

Scraping a Single Page

We will use the requests library to make a GET request to the website and get the HTML content. We will then use BeautifulSoup to parse the HTML content and extract the data we need.

import requests
from bs4 import BeautifulSoup

url = "https://www.example.com/page/1"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

# extract data here

In the code above, we made a GET request to the URL https://www.example.com/page/1 and passed the response body to BeautifulSoup. BeautifulSoup parses the HTML and returns an object, stored in the soup variable, that we can search for tags.

Now we can extract the data we need from the HTML content. Let's say we want to extract the title of the page. We can do this using the title tag.

title = soup.find("title").text
print(title)

This prints the text of the page's title tag. If the page has no title tag, soup.find returns None and accessing .text raises an AttributeError.
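
The example above assumes that the request succeeds and that the page has a title tag. A slightly more defensive version might look like the sketch below. The function names fetch_html and extract_title and the 10-second timeout are illustrative choices, not part of the original example.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page and return its HTML text, raising on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return response.text


def extract_title(html):
    """Return the text of the <title> tag, or None if the page has none."""
    soup = BeautifulSoup(html, "html.parser")
    title_tag = soup.find("title")
    if title_tag is None:
        return None
    return title_tag.get_text(strip=True)
```

With these helpers, extract_title(fetch_html("https://www.example.com/page/1")) does the same job as the snippet above, but a missing title gives None instead of an exception.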

Scraping Multiple Pages

Now let's move on to scraping multiple pages with pagination. We will use a loop to iterate over the pages and scrape the data from each page.

import requests
from bs4 import BeautifulSoup

for i in range(1, 11):
    url = f"https://www.example.com/page/{i}"
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")

    # extract data here

In the code above, range(1, 11) iterates over page numbers 1 through 10 (the upper bound is exclusive). For each page number, we build the URL, make a GET request, and parse the HTML content using BeautifulSoup.

Now we can extract the data we need from each page. Let's say we want to extract the titles of all the pages. We can do this using the title tag and a list.

import requests
from bs4 import BeautifulSoup

titles = []

for i in range(1, 11):
    url = f"https://www.example.com/page/{i}"
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")

    title = soup.find("title").text
    titles.append(title)

print(titles)

The output of this code will be a list of the titles of all the pages.
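
In practice, one failed request should not stop the whole run, and it is polite to pause between requests. The sketch below shows one way to do this. The scrape_titles name, the optional fetch argument (which lets us swap in a different downloader), and the one-second delay are assumptions made for illustration.

```python
import time

import requests
from bs4 import BeautifulSoup


def scrape_titles(page_urls, fetch=None, delay=1.0):
    """Collect the <title> text of each URL, skipping pages that fail to load."""
    session = requests.Session()  # reuses the connection across requests

    def default_fetch(url):
        response = session.get(url, timeout=10)
        response.raise_for_status()
        return response.text

    fetch = fetch or default_fetch
    titles = []
    for url in page_urls:
        try:
            html = fetch(url)
        except requests.RequestException as error:
            # Report the failure and move on to the next page
            print(f"Skipping {url}: {error}")
            continue
        title_tag = BeautifulSoup(html, "html.parser").find("title")
        if title_tag is not None:
            titles.append(title_tag.get_text(strip=True))
        time.sleep(delay)  # pause after each successful request
    return titles
```

To scrape the same ten pages as before, we could call scrape_titles([f"https://www.example.com/page/{i}" for i in range(1, 11)]).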

Handling Pagination

Sometimes, pagination is trickier to handle. The URLs of the pages may not be sequential, or they may carry query parameters that need to change from one request to the next. In both cases the approach is the same: work out how the site builds the URL of the next page, then reproduce that in the loop.
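
For the first case, where page URLs are not sequential, one option is to follow the link that each page provides to the next one. The following is a minimal sketch, assuming the site marks that link with rel="next". Some sites do, but this must be checked against the real HTML. The names find_next_url and crawl, the fetch argument, and the max_pages limit are hypothetical.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def find_next_url(html, current_url):
    """Return the absolute URL of the rel="next" link, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    link = soup.find("a", rel="next")  # assumption: the site uses rel="next"
    if link is None or not link.get("href"):
        return None
    # Resolve relative links such as "/list?cursor=abc" against the current page
    return urljoin(current_url, link["href"])


def crawl(start_url, fetch, max_pages=50):
    """Follow "next" links from start_url, yielding (url, html) for each page."""
    url = start_url
    seen = set()  # guards against pages that link back to each other
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        html = fetch(url)
        yield url, html
        url = find_next_url(html, url)
```

Here fetch is any function that takes a URL and returns HTML text, for example lambda url: requests.get(url, timeout=10).text.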

Let's say we have a website that uses query parameters to handle pagination. The URL of the first page may look like this:

https://www.example.com/page?limit=10&page=1

The URL of the second page may look like this:

https://www.example.com/page?limit=10&page=2

To scrape multiple pages with pagination in this case, we need to modify the query parameter page in each request.

import requests
from bs4 import BeautifulSoup

limit = 10

for i in range(1, 11):
    url = f"https://www.example.com/page?limit={limit}&page={i}"
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")

    # extract data here

In the code above, the limit parameter stays fixed at 10 while the page parameter changes on each iteration, so each request asks for the next page of results.
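
Rather than formatting the query string by hand, we can pass the parameters to requests as a dictionary. We can also stop when a page comes back empty instead of hard-coding the number of pages. The sketch below rests on several assumptions: the .item CSS selector is a placeholder for whatever marks a result on the real site, the site is assumed to return a page with no results past the last page, and the names build_page_url and scrape_all are hypothetical.

```python
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://www.example.com/page"


def build_page_url(page, limit=10):
    """Return the URL that requests would send for the given page number."""
    request = requests.Request("GET", BASE_URL, params={"limit": limit, "page": page})
    return request.prepare().url  # requests encodes the query string for us


def scrape_all(fetch, item_selector=".item", limit=10, max_pages=100):
    """Collect item text page by page until a page contains no items."""
    items = []
    for page in range(1, max_pages + 1):
        html = fetch(build_page_url(page, limit))
        soup = BeautifulSoup(html, "html.parser")
        found = [tag.get_text(strip=True) for tag in soup.select(item_selector)]
        if not found:
            break  # assumption: an empty page means we are past the last page
        items.extend(found)
    return items
```

Calling requests.get(BASE_URL, params={"limit": 10, "page": 2}) sends the same URL that build_page_url(2) returns.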

Conclusion

In this article, we have learned how to scrape multiple pages with pagination using Python. We used the requests library to make HTTP requests and BeautifulSoup to parse the HTML content, looped over sequential page URLs, and handled pagination driven by query parameters.

If you want to learn more about web scraping, I recommend checking out the documentation for requests and BeautifulSoup, along with other online resources. Happy scraping!