2: Principles and Practice of Python Web Scraping

Chapter 1: Crawler Basics and Core Principles Revisited

Before we start writing code, we need a solid grasp of how a crawler works under the hood.

What Is a Web Crawler?

A web crawler is, in essence, an automated program. Its core task is to imitate how a human visits web pages, but at a speed and scale far beyond any human.

  • A human visits a page: open a browser -> type a URL -> press Enter -> the browser sends a request -> the server returns HTML code -> the browser renders that code into the richly formatted page we see.
  • A crawler visits a page: run the program -> specify a URL -> the program sends a request -> the server returns HTML code -> instead of rendering it, the program parses the raw HTML text directly to find and extract data.

The key difference is rendering versus parsing: a browser presents the page beautifully, while a crawler extracts data efficiently.
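
To see the "no rendering" part for yourself, here is a minimal sketch; it assumes the requests library introduced in Chapter 2 and uses https://example.com purely as a demo URL:

import requests

# Send the same request a browser would send, but skip the rendering step entirely
response = requests.get('https://example.com')
print(response.status_code)   # 200 means the server answered normally
print(response.text[:200])    # the first 200 characters of raw, unrendered HTML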

HTTP: The Language Crawlers Use to Talk to Websites

Your crawler talks to the web server over HTTP. The two most common kinds of requests are GET and POST.

  • GET request: the same thing that happens when you type a URL into the address bar and press Enter. It is the most common request type and is used to fetch (GET) page data. The vast majority of our scraping tasks start with a GET request.
  • POST request: typically used to submit (POST) data to the server, for example filling in a login form or submitting a search keyword. Pages whose content only appears after logging in or searching usually require simulating a POST request. The sketch just after this list shows both request types in code.
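
A small illustration of both verbs with the requests library; httpbin.org is a public echo service used here only for demonstration, and the form field names are made up:

import requests

# GET: fetch a resource; query parameters travel in the URL
r = requests.get('https://httpbin.org/get', params={'q': 'python'})
print(r.status_code, r.url)

# POST: submit data; the form fields travel in the request body
r = requests.post('https://httpbin.org/post', data={'username': 'demo', 'keyword': 'crawler'})
print(r.json()['form'])   # httpbin echoes back whatever form data it received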

The Skeleton of a Web Page: A Quick Look at HTML

What a crawler receives is a pile of HTML (HyperText Markup Language) code. You need to understand its basic structure to know where to look for data.

<!DOCTYPE html>
<html>
<head>
    <title>Page Title</title>
</head>
<body>
    <div class="content">
        <h1>This is a main heading</h1>
        <p id="intro">This is a paragraph.</p>
        <a href="/about.html">About Us</a>
    </div>
</body>
</html>
  • Tag: enclosed in angle brackets, e.g. <html>, <h1>, <p>, <a>. Tags define the type and structure of the content.
  • Attribute: lives inside a tag and provides extra information, e.g. class="content", id="intro", href="/about.html". Attributes are the key clues we use to locate data; the sketch after this list shows how.
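
As a quick, hedged illustration of how tags and attributes become the "addresses" of data, this sketch parses the snippet above with BeautifulSoup, which is introduced properly in Chapter 2:

from bs4 import BeautifulSoup

html = '''
<div class="content">
    <h1>This is a main heading</h1>
    <p id="intro">This is a paragraph.</p>
    <a href="/about.html">About Us</a>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
print(soup.find('h1').text)                           # locate by tag name
print(soup.find('p', id='intro').text)                # locate by the id attribute
print(soup.find('div', class_='content').a['href'])   # locate by class, then read an attribute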

Regular Expressions: A Powerful Text-Matching Tool

A regular expression (regex for short) is a pattern language for matching text. It plays an important role in scraping, especially when the HTML structure is irregular or when you need to extract content in a specific format.

Basic Syntax

Symbol   Meaning                                              Example
.        Matches any single character                         a.c matches "abc", "adc", "a2c", etc.
*        Matches the preceding expression 0 or more times     a*b matches "b", "ab", "aab", etc.
+        Matches the preceding expression 1 or more times     a+b matches "ab", "aab", etc., but not "b"
?        Matches the preceding expression 0 or 1 times        a?b matches "b" or "ab"
[]       Matches any one character inside the brackets        [abc] matches "a", "b", or "c"
[^]      Matches any character not inside the brackets        [^abc] matches anything except "a", "b", or "c"
()       Creates a capturing group                            (abc) captures "abc" as one group
\d       Matches any digit                                    \d{4} matches four digits, e.g. "2025"
\w       Matches a letter, digit, or underscore               \w+ matches a word
\s       Matches a whitespace character (space, tab, etc.)    a\sb matches "a b"

Using Regular Expressions in Python

Python provides regular-expression support through the re module:

import re

# Example: extract an email address
text = "Contact me: lyubh22@gmail.com, or via WeChat: 13120040612"

# Match the email address
email_pattern = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
emails = re.findall(email_pattern, text)
print("Email:", emails)  # Output: ['lyubh22@gmail.com']

# Match the phone number
phone_pattern = r'\d{11}'
phones = re.findall(phone_pattern, text)
print("Phone:", phones)  # Output: ['13120040612']
Email: ['lyubh22@gmail.com']
Phone: ['13120040612']

An Example from Scraping

When BeautifulSoup cannot pinpoint an element precisely, regular expressions come in handy:

import requests
import re
from bs4 import BeautifulSoup

# Suppose we want to extract every publication year from the page
url = 'https://lyubh.cn'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Use a regex to match every year-month in the form "20XX-XX"
date_pattern = r'20\d{2}-\d{2}'
all_text = soup.get_text()
publication_dates = re.findall(date_pattern, all_text)

print("Publication dates found:", publication_dates)
Publication dates found: []

Regular expressions are powerful, but overly complex patterns quickly become hard to maintain. In real scraping work, reach for a structured parser such as BeautifulSoup first and fall back to regular expressions only when the structured tools cannot do the job. The two also combine well, as the sketch below shows.
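
For example, BeautifulSoup accepts compiled regex objects as filters, which often keeps both the pattern and the parsing code short. A minimal sketch (the HTML string is invented for illustration):

import re
from bs4 import BeautifulSoup

html = '<a href="/paper-2024.html">Paper A</a><a href="/slides.html">Slides</a><a href="/paper-2025.html">Paper B</a>'
soup = BeautifulSoup(html, 'html.parser')

# Pass a compiled regex as an attribute filter:
# keep only links whose href looks like /paper-YYYY.html
for a in soup.find_all('a', href=re.compile(r'^/paper-20\d{2}\.html$')):
    print(a['href'], a.text)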

Chapter 2: Setting Up and Writing Your First Static Crawler

Now, let's roll up our sleeves and get to work!

2.1 Setting Up Your Scraping Workshop

  1. Install Python: make sure Python 3 is installed on your machine.

  2. Install the essential libraries: open your terminal or command prompt and install the two core libraries we are about to use.

    pip install requests
    pip install beautifulsoup4
    • requests: sends HTTP requests and fetches the HTML from a website.
    • beautifulsoup4: parses the HTML so we can extract data with ease. A quick sanity check follows this list.
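
An optional sanity check that both libraries import and can fetch a page (https://example.com is just a placeholder target):

import requests
from bs4 import BeautifulSoup

response = requests.get('https://example.com')
soup = BeautifulSoup(response.text, 'html.parser')
print(response.status_code)   # expect 200 if the request succeeded
print(soup.title.text)        # the page's <title>, e.g. "Example Domain"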

2.2 Hands-On: Scraping a Scholar's Homepage

Goal: automatically scrape a scholar's homepage (e.g. https://lyubh.cn) for the name, introduction, and news items, as well as the title, authors, and venue of every paper.

Step 1: Analyze the Target Page (Reconnaissance)

Open the target page in a browser, right-click the content you want to scrape, and choose "Inspect" (or "Inspect Element"). You will see an HTML structure like this:

<!-- Name -->
<h1>Bohan Lyu</h1>

<!-- Introduction -->
<p>
  I am an undergraduate at Tsinghua University. I'm interested in ML and NLP topics. My works are published in ICML and ACL.
</p>

<!-- News -->
<h2 id="News">News</h2>
<ul>
  <li>2025-07 Goedel-Prover is accepted to COLM 2025!</li>
  ...
</ul>

<!-- Paper entry: title, authors, and venue all sit in the same <tr> -->
<tr class="paper-row">
  <td>...</td>
  <td>
    <div>
      <h3>Goedel-Prover: A Frontier Model for Open-Source Automated Theorem Proving</h3>
      <br>
      Yong Lin*, Shange Tang*, <b>Bohan Lyu</b>, Jiayun Wu, ...
      <br>
      <em>COLM 2025</em>
      ...
    </div>
  </td>
</tr>

Reconnaissance Findings

  • The name is in the <h1> tag.
  • The introduction is in the first <p> tag.
  • The news items are in the <ul> that follows <h2 id="News">.
  • Paper information lives in <tr class="paper-row">: the title is the <h3>, the authors are all the text below the <h3>, and the venue is inside <em>.

Step 2: Write the Python Code

import requests
from bs4 import BeautifulSoup, NavigableString, Tag

url = 'https://lyubh.cn'

try:
    response = requests.get(url)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')

    # Name
    name = soup.find('h1').text.strip() if soup.find('h1') else 'Name not found'

    # Introduction
    intro_section = soup.find('p')
    intro_text = intro_section.text.strip() if intro_section else ''

    # News
    news_items = []
    news_heading = soup.find('h2', id='News')
    if news_heading:
        news_ul = news_heading.find_next('ul')
        if news_ul:
            for li in news_ul.find_all('li'):
                news_items.append(li.text.strip())

    # Publications with full authors and venue
    publications = []
    for tr in soup.find_all('tr', class_='paper-row'):
        h3 = tr.find('h3')
        if h3:
            title = h3.text.strip()

            # Collect every author node below the <h3> (including <b> tags, plain text, etc.) until we reach an <em> or <p>
            authors = []
            node = h3.next_sibling
            while node:
                if isinstance(node, Tag) and node.name in ['em', 'p']:
                    break
                # Skip newlines and other whitespace-only strings
                if isinstance(node, NavigableString):
                    t = node.strip()
                    if t:
                        authors.append(t)
                elif isinstance(node, Tag):
                    t = node.get_text(strip=True)
                    if t:
                        authors.append(t)
                node = node.next_sibling

            authors_str = ' '.join(authors).replace(' ,', ',').replace('  ', ' ').replace('\n', '').strip()
            # venue
            venue_elem = tr.find('em')
            venue = venue_elem.text.strip() if venue_elem else ''
            publications.append({
                'title': title,
                'authors': authors_str,
                'venue': venue
            })

    # Output the results
    print(f"Name: {name}")
    print(f"\nIntroduction: {intro_text}")

    print("\nRecent News:")
    for item in news_items[:3]:
        print(f"- {item}")

    print("\nSelected Publications:")
    for i, pub in enumerate(publications[:3], 1):
        print(f"{i}. {pub['title']}")
        print(f"    Authors: {pub['authors']}")
        print(f"    Venue: {pub['venue']}\n")

except requests.exceptions.RequestException as e:
    print(f"Error: Could not connect to website {url}. Reason: {e}")
Name: redirect

Introduction: pls visit bohanlyu.com

Recent News:

Selected Publications:

(At the time of this run, lyubh.cn had apparently been replaced by a short redirect notice pointing to bohanlyu.com, which is why the news and publication lists came back empty; point the script at the scholar's current homepage to get real data.)

Chapter 3: More Advanced Scraping Techniques

Scraping a single page is only the beginning; the real power lies in handling lists and multiple pages.

3.1 Scraping a List Page

Goal: scrape the title and price of every book on the first page of a fictional bookstore (http://books.toscrape.com, a real website built specifically for scraping practice).

Reconnaissance: open the site and inspect one of the books with the developer tools. You will find that each book's information sits inside an <article class="product_pod"> tag, the title is in the <a> inside the <h3>, and the price is in a <p class="price_color"> tag.

Implementation

import requests
from bs4 import BeautifulSoup
import csv # the csv module, for saving the data

# Target URL
url = 'http://books.toscrape.com/'

response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# 1. Locate every article tag that contains book information
# find_all() returns a list of all matching tags
books = soup.find_all('article', class_='product_pod')

book_data = [] # a list to collect every book record

# 2. Loop over each book
for book in books:
    # Keep searching for the title and price inside each book tag
    # Note the navigation path used here
    title = book.h3.a['title']
    price = book.find('p', class_='price_color').text
    
    # Print it so we can check the result
    print(f"Title: {title}, Price: {price}")
    
    # Store the extracted data in a dict and append it to the list
    book_data.append({'title': title, 'price': price})

print(book_data)
Title: A Light in the Attic, Price: £51.77
Title: Tipping the Velvet, Price: £53.74
Title: Soumission, Price: £50.10
Title: Sharp Objects, Price: £47.82
Title: Sapiens: A Brief History of Humankind, Price: £54.23
Title: The Requiem Red, Price: £22.65
Title: The Dirty Little Secrets of Getting Your Dream Job, Price: £33.34
Title: The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull, Price: £17.93
Title: The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics, Price: £22.60
Title: The Black Maria, Price: £52.15
Title: Starving Hearts (Triangular Trade Trilogy, #1), Price: £13.99
Title: Shakespeare's Sonnets, Price: £20.66
Title: Set Me Free, Price: £17.46
Title: Scott Pilgrim's Precious Little Life (Scott Pilgrim #1), Price: £52.29
Title: Rip it Up and Start Again, Price: £35.02
Title: Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991, Price: £57.25
Title: Olio, Price: £23.88
Title: Mesaerion: The Best Science Fiction Stories 1800-1849, Price: £37.59
Title: Libertarianism for Beginners, Price: £51.33
Title: It's Only the Himalayas, Price: £45.17
[{'title': 'A Light in the Attic', 'price': '£51.77'}, {'title': 'Tipping the Velvet', 'price': '£53.74'}, {'title': 'Soumission', 'price': '£50.10'}, {'title': 'Sharp Objects', 'price': '£47.82'}, {'title': 'Sapiens: A Brief History of Humankind', 'price': '£54.23'}, {'title': 'The Requiem Red', 'price': '£22.65'}, {'title': 'The Dirty Little Secrets of Getting Your Dream Job', 'price': '£33.34'}, {'title': 'The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull', 'price': '£17.93'}, {'title': 'The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics', 'price': '£22.60'}, {'title': 'The Black Maria', 'price': '£52.15'}, {'title': 'Starving Hearts (Triangular Trade Trilogy, #1)', 'price': '£13.99'}, {'title': "Shakespeare's Sonnets", 'price': '£20.66'}, {'title': 'Set Me Free', 'price': '£17.46'}, {'title': "Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)", 'price': '£52.29'}, {'title': 'Rip it Up and Start Again', 'price': '£35.02'}, {'title': 'Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991', 'price': '£57.25'}, {'title': 'Olio', 'price': '£23.88'}, {'title': 'Mesaerion: The Best Science Fiction Stories 1800-1849', 'price': '£37.59'}, {'title': 'Libertarianism for Beginners', 'price': '£51.33'}, {'title': "It's Only the Himalayas", 'price': '£45.17'}]
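
Since csv is already imported, here is a short sketch of saving book_data to a file; the filename books.csv is arbitrary:

import csv

# Write the scraped records to a CSV file with a header row
with open('books.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['title', 'price'])
    writer.writeheader()
    writer.writerows(book_data)

print(f"Saved {len(book_data)} rows to books.csv")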

3.2 Crawling Multiple Pages in a Loop (Pagination)

Goal: scrape the information of every book on the first 3 pages of books.toscrape.com.

Reconnaissance: at the bottom of page one, find the "next" button. Inspect its HTML and you will see an <a> tag whose href attribute points to the next page (catalogue/page-2.html). That reveals the pattern for building each page's URL.

Implementation

import requests
from bs4 import BeautifulSoup
import time # the time module, used to pause between requests

base_url = 'http://books.toscrape.com/catalogue/'
all_book_data = []

# Crawl the first 3 pages
for i in range(1, 4): # pages 1 through 3
    # Build the full URL for each page
    url = f"{base_url}page-{i}.html"
    print(f"Fetching page: {url}")
    
    response = requests.get(url)
    
    # If the page does not exist, stop the loop
    if response.status_code != 200:
        print(f"Page {url} does not exist; stopping.")
        break
        
    soup = BeautifulSoup(response.text, 'html.parser')
    books = soup.find_all('article', class_='product_pod')

    for book in books:
        title = book.h3.a['title']
        price = book.find('p', class_='price_color').text
        all_book_data.append({'title': title, 'price': price})
    
    # Be a polite crawler: pause briefly after every request
    print(f"Page {i} done, pausing for 1 second...")
    time.sleep(1) 

print(f"\nFinished scraping {len(all_book_data)} books in total!")

print(all_book_data)
Fetching page: http://books.toscrape.com/catalogue/page-1.html
Page 1 done, pausing for 1 second...
Fetching page: http://books.toscrape.com/catalogue/page-2.html
Page 2 done, pausing for 1 second...
Fetching page: http://books.toscrape.com/catalogue/page-3.html
Page 3 done, pausing for 1 second...

Finished scraping 60 books in total!
[{'title': 'A Light in the Attic', 'price': '£51.77'}, {'title': 'Tipping the Velvet', 'price': '£53.74'}, {'title': 'Soumission', 'price': '£50.10'}, {'title': 'Sharp Objects', 'price': '£47.82'}, {'title': 'Sapiens: A Brief History of Humankind', 'price': '£54.23'}, {'title': 'The Requiem Red', 'price': '£22.65'}, {'title': 'The Dirty Little Secrets of Getting Your Dream Job', 'price': '£33.34'}, {'title': 'The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull', 'price': '£17.93'}, {'title': 'The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics', 'price': '£22.60'}, {'title': 'The Black Maria', 'price': '£52.15'}, {'title': 'Starving Hearts (Triangular Trade Trilogy, #1)', 'price': '£13.99'}, {'title': "Shakespeare's Sonnets", 'price': '£20.66'}, {'title': 'Set Me Free', 'price': '£17.46'}, {'title': "Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)", 'price': '£52.29'}, {'title': 'Rip it Up and Start Again', 'price': '£35.02'}, {'title': 'Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991', 'price': '£57.25'}, {'title': 'Olio', 'price': '£23.88'}, {'title': 'Mesaerion: The Best Science Fiction Stories 1800-1849', 'price': '£37.59'}, {'title': 'Libertarianism for Beginners', 'price': '£51.33'}, {'title': "It's Only the Himalayas", 'price': '£45.17'}, {'title': 'In Her Wake', 'price': '£12.84'}, {'title': 'How Music Works', 'price': '£37.32'}, {'title': 'Foolproof Preserving: A Guide to Small Batch Jams, Jellies, Pickles, Condiments, and More: A Foolproof Guide to Making Small Batch Jams, Jellies, Pickles, Condiments, and More', 'price': '£30.52'}, {'title': 'Chase Me (Paris Nights #2)', 'price': '£25.27'}, {'title': 'Black Dust', 'price': '£34.53'}, {'title': 'Birdsong: A Story in Pictures', 'price': '£54.64'}, {'title': "America's Cradle of Quarterbacks: Western Pennsylvania's Football Factory from Johnny Unitas to Joe Montana", 'price': '£22.50'}, {'title': 'Aladdin and His Wonderful Lamp', 'price': '£53.13'}, {'title': 'Worlds Elsewhere: Journeys Around Shakespeareâ\x80\x99s Globe', 'price': '£40.30'}, {'title': 'Wall and Piece', 'price': '£44.18'}, {'title': 'The Four Agreements: A Practical Guide to Personal Freedom', 'price': '£17.66'}, {'title': 'The Five Love Languages: How to Express Heartfelt Commitment to Your Mate', 'price': '£31.05'}, {'title': 'The Elephant Tree', 'price': '£23.82'}, {'title': 'The Bear and the Piano', 'price': '£36.89'}, {'title': "Sophie's World", 'price': '£15.94'}, {'title': 'Penny Maybe', 'price': '£33.29'}, {'title': 'Maude (1883-1993):She Grew Up with the country', 'price': '£18.02'}, {'title': 'In a Dark, Dark Wood', 'price': '£19.63'}, {'title': 'Behind Closed Doors', 'price': '£52.22'}, {'title': "You can't bury them all: Poems", 'price': '£33.63'}, {'title': 'Slow States of Collapse: Poems', 'price': '£57.31'}, {'title': 'Reasons to Stay Alive', 'price': '£26.41'}, {'title': 'Private Paris (Private #10)', 'price': '£47.61'}, {'title': '#HigherSelfie: Wake Up Your Life. Free Your Soul. 
Find Your Tribe.', 'price': '£23.11'}, {'title': 'Without Borders (Wanderlove #1)', 'price': '£45.07'}, {'title': 'When We Collided', 'price': '£31.77'}, {'title': 'We Love You, Charlie Freeman', 'price': '£50.27'}, {'title': 'Untitled Collection: Sabbath Poems 2014', 'price': '£14.27'}, {'title': 'Unseen City: The Majesty of Pigeons, the Discreet Charm of Snails & Other Wonders of the Urban Wilderness', 'price': '£44.18'}, {'title': 'Unicorn Tracks', 'price': '£18.78'}, {'title': 'Unbound: How Eight Technologies Made Us Human, Transformed Society, and Brought Our World to the Brink', 'price': '£25.52'}, {'title': 'Tsubasa: WoRLD CHRoNiCLE 2 (Tsubasa WoRLD CHRoNiCLE #2)', 'price': '£16.28'}, {'title': 'Throwing Rocks at the Google Bus: How Growth Became the Enemy of Prosperity', 'price': '£31.12'}, {'title': 'This One Summer', 'price': '£19.49'}, {'title': 'Thirst', 'price': '£17.27'}, {'title': 'The Torch Is Passed: A Harding Family Story', 'price': '£19.09'}, {'title': 'The Secret of Dreadwillow Carse', 'price': '£56.13'}, {'title': 'The Pioneer Woman Cooks: Dinnertime: Comfort Classics, Freezer Food, 16-Minute Meals, and Other Delicious Ways to Solve Supper!', 'price': '£56.41'}, {'title': 'The Past Never Ends', 'price': '£56.50'}, {'title': 'The Natural History of Us (The Fine Art of Pretending #2)', 'price': '£45.22'}]
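
Hard-coding page-{i}.html works for this site, but a more robust pattern is to keep following the "next" link until it disappears. A hedged sketch, assuming the next button sits inside <li class="next"> as seen during reconnaissance (note that it walks the entire catalogue, not just three pages):

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import time

url = 'http://books.toscrape.com/catalogue/page-1.html'
all_book_data = []

while url:
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    for book in soup.find_all('article', class_='product_pod'):
        all_book_data.append({'title': book.h3.a['title'],
                              'price': book.find('p', class_='price_color').text})

    # Follow the "next" link if there is one; otherwise stop
    next_link = soup.select_one('li.next a')
    url = urljoin(url, next_link['href']) if next_link else None
    time.sleep(1)   # stay polite

print(f"Scraped {len(all_book_data)} books by following 'next' links.")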

Chapter 4: Conquering Dynamic Websites

The challenge: many modern websites use JavaScript to load data dynamically after the page itself has loaded. Request such a page directly with requests and all you get is an empty HTML shell; the data simply isn't there.

Examples: a page with live-updating stock prices, or a social feed with infinite scrolling.

The solution: use a browser-automation tool such as Selenium.

Selenium drives a real browser (such as Chrome or Firefox) to load the page. It waits for all the JavaScript to finish executing and the final page to be rendered, and only then do we extract data from that complete page.

4.1 Installing and Configuring Selenium

  1. Install the Selenium library:

    pip install selenium
  2. Download a WebDriver: Selenium needs a driver program called a WebDriver to control the browser, and the driver you download must exactly match your browser version.

    • Chrome users: search for "ChromeDriver" and download it.
    • Firefox users: search for "GeckoDriver" and download it.
    • Put the downloaded chromedriver.exe (or geckodriver.exe) in the folder that contains your Python script, or somewhere on your system PATH. (Recent Selenium releases, 4.6 and later, bundle Selenium Manager, which can usually download a matching driver for you automatically.)

4.2 Hands-On: Scraping Dynamically Loaded Data

Goal: scrape every quote on the first page of http://quotes.toscrape.com/js/, a practice site that loads its quotes with JavaScript.

Reconnaissance: request this URL directly with requests and you will find that the returned HTML contains no quote data at all, yet the quotes are plainly visible when the page is opened in a browser. That is the tell-tale evidence of JavaScript-driven dynamic loading.

Implementation

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import time

# --- Selenium setup ---
# If the WebDriver is not on your system PATH, point Selenium at it explicitly
# (Selenium 4 uses a Service object instead of the old executable_path argument):
# from selenium.webdriver.chrome.service import Service
# driver = webdriver.Chrome(service=Service('path/to/your/chromedriver.exe'))
driver = webdriver.Chrome() # assumes chromedriver is on the PATH (or Selenium Manager can fetch it)

url = 'http://quotes.toscrape.com/js/'
driver.get(url)

try:
    # --- Wait for the dynamic content to load ---
    # Wait at most 10 seconds,
    # until an element with class 'quote' appears
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, "quote"))
    )

    # By now the page in the browser has fully loaded
    # Grab the rendered page source
    html_content = driver.page_source
    
    # --- Parse with BeautifulSoup ---
    soup = BeautifulSoup(html_content, 'html.parser')
    
    quotes = soup.find_all('div', class_='quote')
    
    for quote in quotes:
        text = quote.find('span', class_='text').text
        author = quote.find('small', class_='author').text
        print(f"'{text}' - {author}")
        
finally:
    # Always close the browser when done, to release its resources
    time.sleep(2) # a brief pause so you can see the result
    driver.quit()

Dynamic Scraping in a Nutshell

  • Pros: it can capture almost anything you can see in a browser ("what you see is what you get"), which makes it the ultimate fallback.
  • Cons: it is slow (a full browser has to load and render the page) and resource-hungry.

A more efficient approach to dynamic content (advanced): in the developer tools, open the "Network" tab and filter for XHR or Fetch requests. You can often find the backend API endpoint that the JavaScript calls to fetch its data. Once you have that endpoint, you can request it directly with requests and get JSON back, which is far faster than launching a whole browser with Selenium! A hedged sketch of the idea follows.
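
A minimal sketch of the idea. The endpoint /api/quotes?page=1 and the JSON field names below are assumptions for illustration only; substitute whatever your own Network-tab reconnaissance turns up:

import requests

# Hypothetical endpoint discovered in the Network tab (adjust to what you actually find)
api_url = 'http://quotes.toscrape.com/api/quotes?page=1'

response = requests.get(api_url)
data = response.json()   # the API answers with JSON, so no HTML parsing is needed

# The field names below are assumptions about the response shape, not guaranteed
for quote in data.get('quotes', []):
    print(quote.get('text'), '-', quote.get('author', {}).get('name'))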