[Introduction to Scraping] Retrieving Table Data from Websites with Python


In recent years, data analysis has attracted significant attention, and scraping is one of its fundamental techniques: retrieving specific data from websites. The name comes from the verb "scrape," meaning "to gather together."

In this article, we'll use the Python programming language to automatically retrieve table information from a page.


Preparation


If Python is not installed, download and install the package for your OS from the official site: https://www.python.org/downloads/.

Once that's done, install the "BeautifulSoup4" library. (The "html.parser" used below is part of Python's standard library, so it needs no separate installation.)

In this article, I used a Windows environment, so open Command Prompt by typing "cmd" in the search bar, or press [Win]+[R], type "cmd", and press Enter. Then run the following command to start the installation:


pip install beautifulsoup4

This time, I will automatically retrieve information such as the domains and IP addresses used in Office 365 from the following Microsoft page and write it out to a CSV file. (Doing this manually can be quite tedious.)

"Office 365 URLs and IP address ranges" > "Microsoft 365 Common and Office Online"


Operating Environment and Full Code


OS Used: Microsoft Windows 10 Pro

Python Version: 3.10

from bs4 import BeautifulSoup
import csv
from urllib.request import urlopen, Request

headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0"}

req = Request("https://docs.microsoft.com/ja-jp/microsoft-365/enterprise/urls-and-ip-address-ranges?view=o365-worldwide", headers=headers)
html = urlopen(req)
bsObj = BeautifulSoup(html, "html.parser")

table = bsObj.findAll("table")[4]
rows = table.findAll("tr")

with open(r"C:\Users\python\Desktop\python\2022\microsoft.csv", "w", encoding="cp932", newline="") as file:
    writer = csv.writer(file)
    for row in rows:
        csvRow = []
        for cell in row.findAll(["td", "th"]):
            csvRow.append(cell.get_text())
        writer.writerow(csvRow)

Explanation


from bs4 import BeautifulSoup
import csv
from urllib.request import urlopen, Request

Import the necessary libraries.


headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0"}
req = Request("https://docs.microsoft.com/ja-jp/microsoft-365/enterprise/urls-and-ip-address-ranges?view=o365-worldwide", headers=headers)
html = urlopen(req)
bsObj = BeautifulSoup(html, "html.parser")

Attach a User-Agent header (Firefox in this example) to the request, open the target page with urlopen, and hand the response to BeautifulSoup for parsing.
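A header dictionary only takes effect if it is actually attached to the request object. A minimal, self-contained sketch of that step, using a made-up URL (https://example.com/page) purely for illustration:

```python
from urllib.request import Request

# Hypothetical URL for illustration only; substitute the page you want to scrape
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0"}
req = Request("https://example.com/page", headers=headers)

# The header is now stored on the request object (urllib capitalizes the key)
print(req.full_url)                  # the URL the request will fetch
print(req.has_header("User-agent"))  # confirms the header is attached
```

Passing this `req` object to urlopen sends the header with the HTTP request, whereas calling urlopen with a bare URL string would ignore the `headers` dictionary entirely.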


table = bsObj.findAll("table")[4]

rows = table.findAll("tr")

From the structure of the HTML (inspected with the browser's Web Developer Tools), index [4] selects the fifth table on the page (indexing starts at 0). Then use findAll to collect its "tr" (table row) tags.
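Since the right index depends on the page layout (and can change when Microsoft updates the page), it helps to enumerate the tables first. A small sketch using a made-up two-table document standing in for the real page:

```python
from bs4 import BeautifulSoup

# Made-up HTML with two tables, standing in for the real page
html = """
<table aria-label="First"><tr><th>ID</th></tr></table>
<table aria-label="Second"><tr><td>31</td><td>Optimize</td></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")
tables = soup.findAll("table")

# Print each table's index and aria-label so you can pick the right one
for i, t in enumerate(tables):
    print(i, t.get("aria-label"))

# Index 1 is the second table (counting starts at 0)
rows = tables[1].findAll("tr")
cells = [cell.get_text() for cell in rows[0].findAll(["td", "th"])]
print(cells)  # ['31', 'Optimize']
```

On the Microsoft page, the same enumeration loop lets you confirm that index [4] still points at the table you want before writing anything out.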


<table aria-label="SharePoint Online and OneDrive for Business" class="table table-sm">
  <thead>
    <tr>
      <th>ID</th>
      <th>Category</th>
      <th>ER</th>
      <th>Address</th>
      <th>Port</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>31</td>
      <td>
        "Optimization"
        <br>
        "Required"
      </td>
      <td>Yes</td>
      <td> == </td>
      <td>
        <code>&lt;tenant&gt;.sharepoint.com, &lt;tenant&gt;-my.sharepoint.com</code>
        <br>
        <code>
          "13.107.136.0/22, 40.108.128.0/17, 52.104.0.0/14,
          104.146.128.0/17, 150.171.40.0/22, 2603:1061:1300::/40,
          2620:1ec:c8f8::/46, 2620:1ec:908::/46, 2a01:111:f402::/48"
        </code>
      </td>
      <td>...</td>
    </tr>
    <tr>...</tr>
  </tbody>
</table>
with open(r"C:\Users\python\Desktop\python\2022\microsoft.csv", "w", encoding="cp932", newline="") as file:
    writer = csv.writer(file)
    for row in rows:
        csvRow = []
        for cell in row.findAll(["td", "th"]):
            csvRow.append(cell.get_text())
        writer.writerow(csvRow)

Specify the path, file name, and desired encoding (if the file does not exist, it is created at that path).

Mode "w" opens the file for writing, and newline="" prevents the csv module from inserting extra blank lines between rows on Windows.

For each row ("tr" tag) found earlier, loop over its "td" and "th" cells, collect their text with get_text(), and write the resulting list to the CSV file as one row.
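The write loop can be checked without touching the real file system by writing into an in-memory buffer instead of an open file; the row lists below are invented sample values shaped like the scraped cell text:

```python
import csv
import io

# Invented sample rows, shaped like the [th/td] text lists from the scrape
rows = [["ID", "Category", "Address"],
        ["31", "Optimize Required", "<tenant>.sharepoint.com"]]

buf = io.StringIO()      # stands in for the opened CSV file
writer = csv.writer(buf)
for row in rows:
    writer.writerow(row)

print(buf.getvalue().splitlines())
# ['ID,Category,Address', '31,Optimize Required,<tenant>.sharepoint.com']
```

Swapping `buf` for the `open(...)` file object gives exactly the behavior in the script above; csv.writer also handles quoting automatically if a cell ever contains a comma.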


The result


Although only a small amount of information was retrieved this time, the more data you need to collect, the more worthwhile this kind of automation becomes.

Thank you for reading!


This blog post is translated from a blog post written by Kawa Ken on our Japanese website, Beyond Co.

