2: Principles and Practice of Python Web Scraping

Chapter 1: Crawler Basics and Core Principles Revisited

Before we start writing code, we need a solid grasp of how a crawler works under the hood.

What Is a Web Crawler?

A web crawler is, in essence, an automated program. Its core task is to imitate how a human visits web pages, but at a speed and scale far beyond any human.

  • A human visits a page: open a browser -> type a URL -> press Enter -> the browser sends a request -> the server returns HTML code -> the browser renders that code into the richly formatted page we see.
  • A crawler visits a page: run the program -> specify a URL -> the program sends a request -> the server returns HTML code -> instead of rendering it, the program parses the raw HTML text directly to find and extract data.

The key difference is rendering versus parsing: a browser presents the page beautifully, while a crawler extracts data efficiently.
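
To see the "no rendering" part for yourself, here is a minimal sketch; it assumes the requests library introduced in Chapter 2 and uses https://example.com purely as a demo URL:

import requests

# Send the same request a browser would send, but skip the rendering step entirely
response = requests.get('https://example.com')
print(response.status_code)   # 200 means the server answered normally
print(response.text[:200])    # the first 200 characters of raw, unrendered HTML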

HTTP: The Language Crawlers Use to Talk to Websites

Your crawler talks to the web server over HTTP. The two most common kinds of requests are GET and POST.

  • GET request: the same thing that happens when you type a URL into the address bar and press Enter. It is the most common request type and is used to fetch (GET) page data. The vast majority of our scraping tasks start with a GET request.
  • POST request: typically used to submit (POST) data to the server, for example filling in a login form or submitting a search keyword. Pages whose content only appears after logging in or searching usually require simulating a POST request. The sketch just after this list shows both request types in code.
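
A small illustration of both verbs with the requests library; httpbin.org is a public echo service used here only for demonstration, and the form field names are made up:

import requests

# GET: fetch a resource; query parameters travel in the URL
r = requests.get('https://httpbin.org/get', params={'q': 'python'})
print(r.status_code, r.url)

# POST: submit data; the form fields travel in the request body
r = requests.post('https://httpbin.org/post', data={'username': 'demo', 'keyword': 'crawler'})
print(r.json()['form'])   # httpbin echoes back whatever form data it received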

The Skeleton of a Web Page: A Quick Look at HTML

What a crawler receives is a pile of HTML (HyperText Markup Language) code. You need to understand its basic structure to know where to look for data.

<!DOCTYPE html>
<html>
<head>
    <title>Page Title</title>
</head>
<body>
    <div class="content">
        <h1>This is a main heading</h1>
        <p id="intro">This is a paragraph.</p>
        <a href="/about.html">About Us</a>
    </div>
</body>
</html>
  • Tag: enclosed in angle brackets, e.g. <html>, <h1>, <p>, <a>. Tags define the type and structure of the content.
  • Attribute: lives inside a tag and provides extra information, e.g. class="content", id="intro", href="/about.html". Attributes are the key clues we use to locate data; the sketch after this list shows how.
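
As a quick, hedged illustration of how tags and attributes become the "addresses" of data, this sketch parses the snippet above with BeautifulSoup, which is introduced properly in Chapter 2:

from bs4 import BeautifulSoup

html = '''
<div class="content">
    <h1>This is a main heading</h1>
    <p id="intro">This is a paragraph.</p>
    <a href="/about.html">About Us</a>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
print(soup.find('h1').text)                           # locate by tag name
print(soup.find('p', id='intro').text)                # locate by the id attribute
print(soup.find('div', class_='content').a['href'])   # locate by class, then read an attribute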

Regular Expressions: A Powerful Text-Matching Tool

A regular expression (regex for short) is a pattern language for matching text. It plays an important role in scraping, especially when the HTML structure is irregular or when you need to extract content in a specific format.

Basic Syntax

Symbol   Meaning                                              Example
.        Matches any single character                         a.c matches "abc", "adc", "a2c", etc.
*        Matches the preceding expression 0 or more times     a*b matches "b", "ab", "aab", etc.
+        Matches the preceding expression 1 or more times     a+b matches "ab", "aab", etc., but not "b"
?        Matches the preceding expression 0 or 1 times        a?b matches "b" or "ab"
[]       Matches any one character inside the brackets        [abc] matches "a", "b", or "c"
[^]      Matches any character not inside the brackets        [^abc] matches anything except "a", "b", or "c"
()       Creates a capturing group                            (abc) captures "abc" as one group
\d       Matches any digit                                    \d{4} matches four digits, e.g. "2025"
\w       Matches a letter, digit, or underscore               \w+ matches a word
\s       Matches a whitespace character (space, tab, etc.)    a\sb matches "a b"

Using Regular Expressions in Python

Python provides regular-expression support through the re module:

import re

# Example: extract an email address
text = "Contact me: lyubh22@gmail.com, or via WeChat: 13120040612"

# Match the email address
email_pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
emails = re.findall(email_pattern, text)
print("Email:", emails)  # Output: ['lyubh22@gmail.com']

# Match the phone number
phone_pattern = r'\d{11}'
phones = re.findall(phone_pattern, text)
print("Phone:", phones)  # Output: ['13120040612']
Email: ['lyubh22@gmail.com']
Phone: ['13120040612']

An Example from Scraping

When BeautifulSoup cannot pinpoint an element precisely, regular expressions come in handy:

import requests
import re
from bs4 import BeautifulSoup

# Suppose we want to extract every publication year from the page
url = 'https://lyubh.cn'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Use a regex to match every year-month in the form "20XX-XX"
date_pattern = r'20\d{2}-\d{2}'
all_text = soup.get_text()
publication_dates = re.findall(date_pattern, all_text)

print("Publication dates found:", publication_dates)
Publication dates found: []

Regular expressions are powerful, but overly complex patterns quickly become hard to maintain. In real scraping work, reach for a structured parser such as BeautifulSoup first and fall back to regular expressions only when the structured tools cannot do the job. The two also combine well, as the sketch below shows.
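
For example, BeautifulSoup accepts compiled regex objects as filters, which often keeps both the pattern and the parsing code short. A minimal sketch (the HTML string is invented for illustration):

import re
from bs4 import BeautifulSoup

html = '<a href="/paper-2024.html">Paper A</a><a href="/slides.html">Slides</a><a href="/paper-2025.html">Paper B</a>'
soup = BeautifulSoup(html, 'html.parser')

# Pass a compiled regex as an attribute filter:
# keep only links whose href looks like /paper-YYYY.html
for a in soup.find_all('a', href=re.compile(r'^/paper-20\d{2}\.html$')):
    print(a['href'], a.text)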

Chapter 2: Setting Up and Writing Your First Static Crawler

Now, let's roll up our sleeves and get to work!

2.1 Setting Up Your Scraping Workshop

  1. Install Python: make sure Python 3 is installed on your machine.

  2. Install the essential libraries: open your terminal or command prompt and install the two core libraries we are about to use.

    pip install requests
    pip install beautifulsoup4
    • requests: sends HTTP requests and fetches the HTML from a website.
    • beautifulsoup4: parses the HTML so we can extract data with ease. A quick sanity check follows this list.
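
An optional sanity check that both libraries import and can fetch a page (https://example.com is just a placeholder target):

import requests
from bs4 import BeautifulSoup

response = requests.get('https://example.com')
soup = BeautifulSoup(response.text, 'html.parser')
print(response.status_code)   # expect 200 if the request succeeded
print(soup.title.text)        # the page's <title>, e.g. "Example Domain"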

2.2 Hands-On: Scraping a Scholar's Homepage

Goal: automatically scrape a scholar's homepage (e.g. https://lyubh.cn) for the name, introduction, and news items, as well as the title, authors, and venue of every paper.

Step 1: Analyze the Target Page (Reconnaissance)

Open the target page in a browser, right-click the content you want to scrape, and choose "Inspect" (or "Inspect Element"). You will see an HTML structure like this:

<!-- Name -->
<h1>Bohan Lyu</h1>

<!-- Introduction -->
<p>
  I am an undergraduate at Tsinghua University. I'm interested in ML and NLP topics. My works are published in ICML and ACL.
</p>

<!-- News -->
<h2 id="News">News</h2>
<ul>
  <li>2025-07 Goedel-Prover is accepted to COLM 2025!</li>
  ...
</ul>

<!-- Paper entry: title, authors, and venue all sit in the same <tr> -->
<tr class="paper-row">
  <td>...</td>
  <td>
    <div>
      <h3>Goedel-Prover: A Frontier Model for Open-Source Automated Theorem Proving</h3>
      <br>
      Yong Lin*, Shange Tang*, <b>Bohan Lyu</b>, Jiayun Wu, ...
      <br>
      <em>COLM 2025</em>
      ...
    </div>
  </td>
</tr>

Reconnaissance Findings

  • The name is in the <h1> tag.
  • The introduction is in the first <p> tag.
  • The news items are in the <ul> that follows <h2 id="News">.
  • Paper information lives in <tr class="paper-row">: the title is the <h3>, the authors are all the text below the <h3>, and the venue is inside <em>.

Step 2: Write the Python Code

import requests
from bs4 import BeautifulSoup, NavigableString, Tag

url = 'https://lyubh.cn'

try:
    response = requests.get(url)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')

    # Name
    name = soup.find('h1').text.strip() if soup.find('h1') else 'Name not found'

    # Introduction
    intro_section = soup.find('p')
    intro_text = intro_section.text.strip() if intro_section else ''

    # News
    news_items = []
    news_heading = soup.find('h2', id='News')
    if news_heading:
        news_ul = news_heading.find_next('ul')
        if news_ul:
            for li in news_ul.find_all('li'):
                news_items.append(li.text.strip())

    # Publications with full authors and venue
    publications = []
    for tr in soup.find_all('tr', class_='paper-row'):
        h3 = tr.find('h3')
        if h3:
            title = h3.text.strip()

            # Collect every author node below the <h3> (including <b> tags, plain text, etc.) until we reach an <em> or <p>
            authors = []
            node = h3.next_sibling
            while node:
                if isinstance(node, Tag) and node.name in ['em', 'p']:
                    break
                # Skip newlines and other whitespace-only strings
                if isinstance(node, NavigableString):
                    t = node.strip()
                    if t:
                        authors.append(t)
                elif isinstance(node, Tag):
                    t = node.get_text(strip=True)
                    if t:
                        authors.append(t)
                node = node.next_sibling

            authors_str = ' '.join(authors).replace(' ,', ',').replace('  ', ' ').replace('\n', '').strip()
            # venue
            venue_elem = tr.find('em')
            venue = venue_elem.text.strip() if venue_elem else ''
            publications.append({
                'title': title,
                'authors': authors_str,
                'venue': venue
            })

    # Output the results
    print(f"Name: {name}")
    print(f"\nIntroduction: {intro_text}")

    print("\nRecent News:")
    for item in news_items[:3]:
        print(f"- {item}")

    print("\nSelected Publications:")
    for i, pub in enumerate(publications[:3], 1):
        print(f"{i}. {pub['title']}")
        print(f"    Authors: {pub['authors']}")
        print(f"    Venue: {pub['venue']}\n")

except requests.exceptions.RequestException as e:
    print(f"Error: Could not connect to website {url}. Reason: {e}")
Name: redirect

Introduction: pls visit bohanlyu.com

Recent News:

Selected Publications:

(At the time of this run, lyubh.cn had apparently been replaced by a short redirect notice pointing to bohanlyu.com, which is why the news and publication lists came back empty; point the script at the scholar's current homepage to get real data.)

Chapter 3: More Advanced Scraping Techniques

Scraping a single page is only the beginning; the real power lies in handling lists and multiple pages.

3.1 Scraping a List Page

Goal: scrape the title and price of every book on the first page of a fictional bookstore (http://books.toscrape.com, a real website built specifically for scraping practice).

Reconnaissance: open the site and inspect one of the books with the developer tools. You will find that each book's information sits inside an <article class="product_pod"> tag, the title is in the <a> inside the <h3>, and the price is in a <p class="price_color"> tag.

Implementation

import requests
from bs4 import BeautifulSoup
import csv # the csv module, for saving the data

# Target URL
url = 'http://books.toscrape.com/'

response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# 1. Locate every article tag that contains book information
# find_all() returns a list of all matching tags
books = soup.find_all('article', class_='product_pod')

book_data = [] # a list to collect every book record

# 2. Loop over each book
for book in books:
    # Keep searching for the title and price inside each book tag
    # Note the navigation path used here
    title = book.h3.a['title']
    price = book.find('p', class_='price_color').text
    
    # Print it so we can check the result
    print(f"Title: {title}, Price: {price}")
    
    # Store the extracted data in a dict and append it to the list
    book_data.append({'title': title, 'price': price})

print(book_data)
Title: A Light in the Attic, Price: £51.77
Title: Tipping the Velvet, Price: £53.74
Title: Soumission, Price: £50.10
Title: Sharp Objects, Price: £47.82
Title: Sapiens: A Brief History of Humankind, Price: £54.23
Title: The Requiem Red, Price: £22.65
Title: The Dirty Little Secrets of Getting Your Dream Job, Price: £33.34
Title: The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull, Price: £17.93
Title: The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics, Price: £22.60
Title: The Black Maria, Price: £52.15
Title: Starving Hearts (Triangular Trade Trilogy, #1), Price: £13.99
Title: Shakespeare's Sonnets, Price: £20.66
Title: Set Me Free, Price: £17.46
Title: Scott Pilgrim's Precious Little Life (Scott Pilgrim #1), Price: £52.29
Title: Rip it Up and Start Again, Price: £35.02
Title: Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991, Price: £57.25
Title: Olio, Price: £23.88
Title: Mesaerion: The Best Science Fiction Stories 1800-1849, Price: £37.59
Title: Libertarianism for Beginners, Price: £51.33
Title: It's Only the Himalayas, Price: £45.17
[{'title': 'A Light in the Attic', 'price': '£51.77'}, {'title': 'Tipping the Velvet', 'price': '£53.74'}, {'title': 'Soumission', 'price': '£50.10'}, {'title': 'Sharp Objects', 'price': '£47.82'}, {'title': 'Sapiens: A Brief History of Humankind', 'price': '£54.23'}, {'title': 'The Requiem Red', 'price': '£22.65'}, {'title': 'The Dirty Little Secrets of Getting Your Dream Job', 'price': '£33.34'}, {'title': 'The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull', 'price': '£17.93'}, {'title': 'The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics', 'price': '£22.60'}, {'title': 'The Black Maria', 'price': '£52.15'}, {'title': 'Starving Hearts (Triangular Trade Trilogy, #1)', 'price': '£13.99'}, {'title': "Shakespeare's Sonnets", 'price': '£20.66'}, {'title': 'Set Me Free', 'price': '£17.46'}, {'title': "Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)", 'price': '£52.29'}, {'title': 'Rip it Up and Start Again', 'price': '£35.02'}, {'title': 'Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991', 'price': '£57.25'}, {'title': 'Olio', 'price': '£23.88'}, {'title': 'Mesaerion: The Best Science Fiction Stories 1800-1849', 'price': '£37.59'}, {'title': 'Libertarianism for Beginners', 'price': '£51.33'}, {'title': "It's Only the Himalayas", 'price': '£45.17'}]
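
Since csv is already imported, here is a short sketch of saving book_data to a file; the filename books.csv is arbitrary:

import csv

# Write the scraped records to a CSV file with a header row
with open('books.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['title', 'price'])
    writer.writeheader()
    writer.writerows(book_data)

print(f"Saved {len(book_data)} rows to books.csv")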

3.2 Crawling Multiple Pages in a Loop (Pagination)

Goal: scrape the information of every book on the first 3 pages of books.toscrape.com.

Reconnaissance: at the bottom of page one, find the "next" button. Inspect its HTML and you will see an <a> tag whose href attribute points to the next page (catalogue/page-2.html). That reveals the pattern for building each page's URL.

Implementation

import requests
from bs4 import BeautifulSoup
import time # the time module, used to pause between requests

base_url = 'http://books.toscrape.com/catalogue/'
all_book_data = []

# Crawl the first 3 pages
for i in range(1, 4): # pages 1 through 3
    # Build the full URL for each page
    url = f"{base_url}page-{i}.html"
    print(f"Fetching page: {url}")
    
    response = requests.get(url)
    
    # If the page does not exist, stop the loop
    if response.status_code != 200:
        print(f"Page {url} does not exist; stopping.")
        break
        
    soup = BeautifulSoup(response.text, 'html.parser')
    books = soup.find_all('article', class_='product_pod')

    for book in books:
        title = book.h3.a['title']
        price = book.find('p', class_='price_color').text
        all_book_data.append({'title': title, 'price': price})
    
    # Be a polite crawler: pause briefly after every request
    print(f"Page {i} done, pausing for 1 second...")
    time.sleep(1) 

print(f"\nFinished scraping {len(all_book_data)} books in total!")

print(all_book_data)
Fetching page: http://books.toscrape.com/catalogue/page-1.html
Page 1 done, pausing for 1 second...
Fetching page: http://books.toscrape.com/catalogue/page-2.html
Page 2 done, pausing for 1 second...
Fetching page: http://books.toscrape.com/catalogue/page-3.html
Page 3 done, pausing for 1 second...

Finished scraping 60 books in total!
[{'title': 'A Light in the Attic', 'price': '£51.77'}, {'title': 'Tipping the Velvet', 'price': '£53.74'}, {'title': 'Soumission', 'price': '£50.10'}, {'title': 'Sharp Objects', 'price': '£47.82'}, {'title': 'Sapiens: A Brief History of Humankind', 'price': '£54.23'}, {'title': 'The Requiem Red', 'price': '£22.65'}, {'title': 'The Dirty Little Secrets of Getting Your Dream Job', 'price': '£33.34'}, {'title': 'The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull', 'price': '£17.93'}, {'title': 'The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics', 'price': '£22.60'}, {'title': 'The Black Maria', 'price': '£52.15'}, {'title': 'Starving Hearts (Triangular Trade Trilogy, #1)', 'price': '£13.99'}, {'title': "Shakespeare's Sonnets", 'price': '£20.66'}, {'title': 'Set Me Free', 'price': '£17.46'}, {'title': "Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)", 'price': '£52.29'}, {'title': 'Rip it Up and Start Again', 'price': '£35.02'}, {'title': 'Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991', 'price': '£57.25'}, {'title': 'Olio', 'price': '£23.88'}, {'title': 'Mesaerion: The Best Science Fiction Stories 1800-1849', 'price': '£37.59'}, {'title': 'Libertarianism for Beginners', 'price': '£51.33'}, {'title': "It's Only the Himalayas", 'price': '£45.17'}, {'title': 'In Her Wake', 'price': '£12.84'}, {'title': 'How Music Works', 'price': '£37.32'}, {'title': 'Foolproof Preserving: A Guide to Small Batch Jams, Jellies, Pickles, Condiments, and More: A Foolproof Guide to Making Small Batch Jams, Jellies, Pickles, Condiments, and More', 'price': '£30.52'}, {'title': 'Chase Me (Paris Nights #2)', 'price': '£25.27'}, {'title': 'Black Dust', 'price': '£34.53'}, {'title': 'Birdsong: A Story in Pictures', 'price': '£54.64'}, {'title': "America's Cradle of Quarterbacks: Western Pennsylvania's Football Factory from Johnny Unitas to Joe Montana", 'price': '£22.50'}, {'title': 'Aladdin and His Wonderful Lamp', 'price': '£53.13'}, {'title': 'Worlds Elsewhere: Journeys Around Shakespeareâ\x80\x99s Globe', 'price': '£40.30'}, {'title': 'Wall and Piece', 'price': '£44.18'}, {'title': 'The Four Agreements: A Practical Guide to Personal Freedom', 'price': '£17.66'}, {'title': 'The Five Love Languages: How to Express Heartfelt Commitment to Your Mate', 'price': '£31.05'}, {'title': 'The Elephant Tree', 'price': '£23.82'}, {'title': 'The Bear and the Piano', 'price': '£36.89'}, {'title': "Sophie's World", 'price': '£15.94'}, {'title': 'Penny Maybe', 'price': '£33.29'}, {'title': 'Maude (1883-1993):She Grew Up with the country', 'price': '£18.02'}, {'title': 'In a Dark, Dark Wood', 'price': '£19.63'}, {'title': 'Behind Closed Doors', 'price': '£52.22'}, {'title': "You can't bury them all: Poems", 'price': '£33.63'}, {'title': 'Slow States of Collapse: Poems', 'price': '£57.31'}, {'title': 'Reasons to Stay Alive', 'price': '£26.41'}, {'title': 'Private Paris (Private #10)', 'price': '£47.61'}, {'title': '#HigherSelfie: Wake Up Your Life. Free Your Soul. 
Find Your Tribe.', 'price': '£23.11'}, {'title': 'Without Borders (Wanderlove #1)', 'price': '£45.07'}, {'title': 'When We Collided', 'price': '£31.77'}, {'title': 'We Love You, Charlie Freeman', 'price': '£50.27'}, {'title': 'Untitled Collection: Sabbath Poems 2014', 'price': '£14.27'}, {'title': 'Unseen City: The Majesty of Pigeons, the Discreet Charm of Snails & Other Wonders of the Urban Wilderness', 'price': '£44.18'}, {'title': 'Unicorn Tracks', 'price': '£18.78'}, {'title': 'Unbound: How Eight Technologies Made Us Human, Transformed Society, and Brought Our World to the Brink', 'price': '£25.52'}, {'title': 'Tsubasa: WoRLD CHRoNiCLE 2 (Tsubasa WoRLD CHRoNiCLE #2)', 'price': '£16.28'}, {'title': 'Throwing Rocks at the Google Bus: How Growth Became the Enemy of Prosperity', 'price': '£31.12'}, {'title': 'This One Summer', 'price': '£19.49'}, {'title': 'Thirst', 'price': '£17.27'}, {'title': 'The Torch Is Passed: A Harding Family Story', 'price': '£19.09'}, {'title': 'The Secret of Dreadwillow Carse', 'price': '£56.13'}, {'title': 'The Pioneer Woman Cooks: Dinnertime: Comfort Classics, Freezer Food, 16-Minute Meals, and Other Delicious Ways to Solve Supper!', 'price': '£56.41'}, {'title': 'The Past Never Ends', 'price': '£56.50'}, {'title': 'The Natural History of Us (The Fine Art of Pretending #2)', 'price': '£45.22'}]
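
Hard-coding page-{i}.html works for this site, but a more robust pattern is to keep following the "next" link until it disappears. A hedged sketch, assuming the next button sits inside <li class="next"> as seen during reconnaissance (note that it walks the entire catalogue, not just three pages):

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import time

url = 'http://books.toscrape.com/catalogue/page-1.html'
all_book_data = []

while url:
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    for book in soup.find_all('article', class_='product_pod'):
        all_book_data.append({'title': book.h3.a['title'],
                              'price': book.find('p', class_='price_color').text})

    # Follow the "next" link if there is one; otherwise stop
    next_link = soup.select_one('li.next a')
    url = urljoin(url, next_link['href']) if next_link else None
    time.sleep(1)   # stay polite

print(f"Scraped {len(all_book_data)} books by following 'next' links.")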

Chapter 4: Conquering Dynamic Websites

The challenge: many modern websites use JavaScript to load data dynamically after the page itself has loaded. Request such a page directly with requests and all you get is an empty HTML shell; the data simply isn't there.

Examples: a page with live-updating stock prices, or a social feed with infinite scrolling.

The solution: use a browser-automation tool such as Selenium.

Selenium drives a real browser (such as Chrome or Firefox) to load the page. It waits for all the JavaScript to finish executing and the final page to be rendered, and only then do we extract data from that complete page.

4.1 Installing and Configuring Selenium

  1. Install the Selenium library:

    pip install selenium
  2. Download a WebDriver: Selenium needs a driver program called a WebDriver to control the browser, and the driver you download must exactly match your browser version.

    • Chrome users: search for "ChromeDriver" and download it.
    • Firefox users: search for "GeckoDriver" and download it.
    • Put the downloaded chromedriver.exe (or geckodriver.exe) in the folder that contains your Python script, or somewhere on your system PATH. (Recent Selenium releases, 4.6 and later, bundle Selenium Manager, which can usually download a matching driver for you automatically.)

4.2 Hands-On: Scraping Dynamically Loaded Data

Goal: scrape every quote on the first page of http://quotes.toscrape.com/js/, a practice site that loads its quotes with JavaScript.

Reconnaissance: request this URL directly with requests and you will find that the returned HTML contains no quote data at all, yet the quotes are plainly visible when the page is opened in a browser. That is the tell-tale evidence of JavaScript-driven dynamic loading.

Implementation

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import time

# --- Selenium setup ---
# If the WebDriver is not on your system PATH, point Selenium at it explicitly
# (Selenium 4 uses a Service object instead of the old executable_path argument):
# from selenium.webdriver.chrome.service import Service
# driver = webdriver.Chrome(service=Service('path/to/your/chromedriver.exe'))
driver = webdriver.Chrome() # assumes chromedriver is on the PATH (or Selenium Manager can fetch it)

url = 'http://quotes.toscrape.com/js/'
driver.get(url)

try:
    # --- Wait for the dynamic content to load ---
    # Wait at most 10 seconds,
    # until an element with class 'quote' appears
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, "quote"))
    )

    # By now the page in the browser has fully loaded
    # Grab the rendered page source
    html_content = driver.page_source
    
    # --- Parse with BeautifulSoup ---
    soup = BeautifulSoup(html_content, 'html.parser')
    
    quotes = soup.find_all('div', class_='quote')
    
    for quote in quotes:
        text = quote.find('span', class_='text').text
        author = quote.find('small', class_='author').text
        print(f"'{text}' - {author}")
        
finally:
    # Always close the browser when done, to release its resources
    time.sleep(2) # a brief pause so you can see the result
    driver.quit()

Dynamic Scraping in a Nutshell

  • Pros: it can capture almost anything you can see in a browser ("what you see is what you get"), which makes it the ultimate fallback.
  • Cons: it is slow (a full browser has to load and render the page) and resource-hungry.

A more efficient approach to dynamic content (advanced): in the developer tools, open the "Network" tab and filter for XHR or Fetch requests. You can often find the backend API endpoint that the JavaScript calls to fetch its data. Once you have that endpoint, you can request it directly with requests and get JSON back, which is far faster than launching a whole browser with Selenium! A hedged sketch of the idea follows.
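
A minimal sketch of the idea. The endpoint /api/quotes?page=1 and the JSON field names below are assumptions for illustration only; substitute whatever your own Network-tab reconnaissance turns up:

import requests

# Hypothetical endpoint discovered in the Network tab (adjust to what you actually find)
api_url = 'http://quotes.toscrape.com/api/quotes?page=1'

response = requests.get(api_url)
data = response.json()   # the API answers with JSON, so no HTML parsing is needed

# The field names below are assumptions about the response shape, not guaranteed
for quote in data.get('quotes', []):
    print(quote.get('text'), '-', quote.get('author', {}).get('name'))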