Tech Stack

A detailed guide to using Playwright with Python, covering installation, core features, and example code:

1. Installing Playwright

bash

# Install the Python package
pip install playwright

# Install the browser binaries (Chromium, Firefox, WebKit)
playwright install

2. Basic usage: launching a browser

python

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # Launch Chromium. Headless is the default, so headless=False opens a visible window
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()
    page.goto("https://example.com")
    print(page.title())
    browser.close()

3. Async mode (recommended)

python

import asyncio
from playwright.async_api import async_playwright


async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.goto("https://example.com")
        print(await page.title())
        await browser.close()


asyncio.run(main())
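
One practical reason to prefer the async API is that several pages can load at the same time. The sketch below is only an illustration: `fetch_title`, `fetch_titles`, and the second URL are hypothetical names chosen for this example, and it assumes a single shared browser is acceptable for all pages.

```python
import asyncio


async def fetch_title(browser, url):
    """Open one page, read its title, and always close the page."""
    page = await browser.new_page()
    try:
        await page.goto(url)
        return await page.title()
    finally:
        await page.close()


async def fetch_titles(browser, urls):
    """Load all URLs concurrently; results keep the order of `urls`."""
    return await asyncio.gather(*(fetch_title(browser, u) for u in urls))


async def main():
    # Imported here so the helpers above stay importable without Playwright.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch()
        # The URLs are placeholders for this sketch.
        titles = await fetch_titles(
            browser, ["https://example.com", "https://example.org"]
        )
        print(titles)
        await browser.close()
```

Call `asyncio.run(main())` to run it. The same loop written with the sync API would load the pages one after another.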

4. Page interactions

Navigation and clicking

python

# Click a button
await page.click('button#submit')

# Type text into an input
await page.fill('input#username', 'myuser')

# Submit the form by pressing Enter in the password field
await page.press('input#password', 'Enter')
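
The individual calls above can be combined into one reusable helper. This is a minimal sketch under assumptions: `submit_form` is a hypothetical name, the selectors in the usage note are the ones from the snippet above, and it uses the locator API (`page.locator(...)`) rather than the page-level shortcuts.

```python
async def submit_form(page, fields, submit_selector):
    """Fill each selector -> value pair in order, then click the submit control."""
    for selector, value in fields.items():
        # page.locator() is synchronous; only the action itself is awaited.
        await page.locator(selector).fill(value)
    await page.locator(submit_selector).click()
```

Usage might look like `await submit_form(page, {'input#username': 'myuser'}, 'button#submit')`.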

Extracting data

python

# Get text content
text = await page.inner_text('.result')

# Get an attribute
href = await page.get_attribute('a.link', 'href')

# Get multiple elements
items = await page.query_selector_all('.list-item')
for item in items:
    print(await item.text_content())
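
As an alternative to looping over element handles, the locator API can return every matching text in one call. The helper below is a sketch, not the guide's own method: `extract_items` is a hypothetical name, and trimming whitespace and dropping empty strings is an assumption about what the caller wants.

```python
async def extract_items(page, selector=".list-item"):
    """Return the stripped, non-empty text of every element matching selector."""
    texts = await page.locator(selector).all_text_contents()
    return [t.strip() for t in texts if t.strip()]
```

Usage might look like `items = await extract_items(page)`.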

5. Handling dynamically loaded content

Waiting for an element to appear

python

# Wait for the element to load (up to 10 seconds)
await page.wait_for_selector('.dynamic-content', timeout=10000)

Handling AJAX requests

python

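# Wait for the response triggered by the click
async with page.expect_response("https://api.example.com/data") as response_info:
    await page.click('button.load-data')
# In the async API, .value must be awaited to obtain the Response object
response = await response_info.value
data = await response.json()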
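
When the exact URL is not known in advance (for example because of query parameters), `expect_response` also accepts a predicate instead of a URL string. The following is a sketch under assumptions: `is_data_response` and `click_and_capture` are hypothetical helpers, and the `/data` path is taken from the URL in the snippet above.

```python
from urllib.parse import urlparse


def is_data_response(response, path="/data"):
    """True for a successful response whose URL path ends with `path`."""
    return response.ok and urlparse(response.url).path.endswith(path)


async def click_and_capture(page, button_selector="button.load-data"):
    """Click the button and return the JSON body of the matching response."""
    async with page.expect_response(is_data_response) as response_info:
        await page.click(button_selector)
    response = await response_info.value
    return await response.json()
```

Checking `response.ok` means failed requests do not end the wait. If no successful response arrives, the wait ends with a timeout error.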

6. Screenshots and PDF

python

# Take a full-page screenshot
await page.screenshot(path='screenshot.png', full_page=True)

# Generate a PDF (Chromium only)
await page.pdf(path='page.pdf')

7. Advanced features

Device emulation (mobile)

python

iphone = p.devices['iPhone 13']
context = await browser.new_context(**iphone)
page = await context.new_page()

Intercepting requests

python

async def handle_request(route):
    await route.continue_(headers={...})  # modify the request headers


await page.route('**/*', handle_request)
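
A route handler can also drop requests entirely, which is a common way to speed up scraping. The sketch below is illustrative only: `handle_route`, `BLOCKED_TYPES`, and the `x-demo-header` header are hypothetical choices made for this example, not values from the guide.

```python
BLOCKED_TYPES = {"image", "media", "font"}  # assumption: resources we do not need


async def handle_route(route):
    """Abort heavy resources; forward everything else with one extra header."""
    request = route.request
    if request.resource_type in BLOCKED_TYPES:
        await route.abort()
        return
    # Start from the original headers so nothing the page sent is lost.
    headers = {**request.headers, "x-demo-header": "1"}  # hypothetical header
    await route.continue_(headers=headers)
```

Register it with `await page.route('**/*', handle_route)`. Merging into `request.headers` matters because the headers passed to `continue_` are sent in place of the originals, not added to them.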

8. Cross-browser testing

python

# Run the same steps in each browser engine.
# launch() is a coroutine in the async API, so it must be awaited.
for browser_type in [p.chromium, p.firefox, p.webkit]:
    browser = await browser_type.launch()
    page = await browser.new_page()
    await page.goto("https://example.com")
    await browser.close()

9. Troubleshooting

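Element cannot be clicked:
- Use page.wait_for_selector() to make sure the element has loaded.
- Check whether it is covered by something else (such as a popup or an overlay).
Anti-bot detection:
- Disable headless mode: launch(headless=False)
- Set a custom User-Agent:
  python
```
context = await browser.new_context(user_agent='Mozilla/5.0...')
```
Timeout errors:
- Increase the timeout: page.goto(url, timeout=60000)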
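
For pages that time out only occasionally, raising the timeout can be combined with a retry. This is a minimal sketch under assumptions: `goto_with_retry` is a hypothetical helper, doubling the timeout on each attempt is an arbitrary policy, and `retry_on` defaults to `Exception` only so the function does not depend on Playwright being importable.

```python
async def goto_with_retry(page, url, *, attempts=3, timeout_ms=30_000,
                          retry_on=(Exception,)):
    """Try page.goto up to `attempts` times, doubling the timeout each time."""
    for attempt in range(1, attempts + 1):
        try:
            return await page.goto(url, timeout=timeout_ms)
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            timeout_ms *= 2
```

In real use it is probably better to pass `retry_on=(TimeoutError,)` with `TimeoutError` imported from `playwright.async_api`, so that other failures are not retried.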

10. Integrating with a crawler framework (e.g. Scrapy)

Use Playwright to handle dynamic pages:

python

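# Use Playwright inside a Scrapy downloader middleware
from playwright.async_api import async_playwright
from scrapy.http import HtmlResponse


class PlaywrightMiddleware:
    async def process_request(self, request, spider):
        # Render the page in a real browser, then hand the HTML back to Scrapy
        async with async_playwright() as p:
            browser = await p.chromium.launch()
            page = await browser.new_page()
            await page.goto(request.url)
            html = await page.content()
            await browser.close()
            return HtmlResponse(url=request.url, body=html, encoding='utf-8')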

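With Playwright you can implement browser automation, dynamic data scraping, and end-to-end testing, which makes it a good fit for complex web applications such as single-page apps and JavaScript-rendered pages.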
