Disguising a Python Crawler When Scraping Web Pages

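To make a crawler less conspicuous and reduce the chance that it is blocked or flagged as automated traffic, you can apply a few disguise techniques. They mostly imitate ordinary user behavior so that the crawler looks like a real browser session. Common techniques include: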

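1. Set the User-Agent

Every browser sends a User-Agent header that tells the server which browser and platform it is. By default, HTTP libraries such as requests send a very basic User-Agent (for requests, a string of the form python-requests/<version>), which makes it easy for a site to identify a crawler. Sending a browser-style User-Agent makes the request look like it comes from a real browser.

Code example: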

import requests
from bs4 import BeautifulSoup

# Target URL
url = "https://movie.douban.com/"

# Browser-style User-Agent header
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

# Send the GET request with the custom headers; the timeout avoids hanging forever
response = requests.get(url, headers=headers, timeout=10)

if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')
    # Adjust this selector to the markup of the page you are scraping
    movies = soup.find_all('div', class_='browse-movie-bottom')

    for movie in movies:
        anchor = movie.find('a')
        if anchor is None:  # skip blocks that contain no link
            continue
        title = anchor.get_text()
        link = anchor.get('href')
        print(f"Title: {title}, Link: {link}")
else:
    print(f"Failed to retrieve the webpage. Status code: {response.status_code}")
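To see what a default User-Agent looks like, the sketch below uses only the standard library's urllib, which, like requests, announces itself with a library-specific string unless told otherwise. The helper names default_user_agent and with_browser_user_agent are assumptions made for this illustration, and no request is actually sent.

```python
import urllib.request

# A browser-style User-Agent string, the same one used in the example above.
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
)


def default_user_agent() -> str:
    """Return the User-Agent that urllib sends when none is set."""
    opener = urllib.request.build_opener()
    # addheaders is a list of (name, value) pairs applied to every request.
    return dict(opener.addheaders)["User-agent"]


def with_browser_user_agent(url: str, user_agent: str = BROWSER_UA) -> urllib.request.Request:
    """Build (but do not send) a request that carries a browser-style User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


if __name__ == "__main__":
    # Prints a string that starts with "Python-urllib/".
    print(default_user_agent())
    request = with_browser_user_agent("https://movie.douban.com/")
    # Prints the browser-style string instead of the library default.
    print(request.get_header("User-agent"))
```

A server that filters on the User-Agent header can tell these two strings apart at a glance, which is why overriding the default is usually the first step.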

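2. Use a proxy

A proxy server hides your real IP address, which helps avoid an IP ban from the target site. Both free and paid proxies are available. Prefer a dynamic or rotating proxy so that the same IP is not used for too many requests in a row.

Code example: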

import requests
from bs4 import BeautifulSoup

url = "https://movie.douban.com/"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

# Proxy settings (replace with the IP and port of your own proxy).
# The keys name the scheme of the target URL. An ordinary HTTP proxy is
# addressed with http:// for both; use https:// only if the proxy itself speaks TLS.
proxies = {
    'http': 'http://your_proxy_ip:port',
    'https': 'http://your_proxy_ip:port',
}

# Send the request through the proxy
response = requests.get(url, headers=headers, proxies=proxies, timeout=10)

if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')
    movies = soup.find_all('div', class_='browse-movie-bottom')

    for movie in movies:
        anchor = movie.find('a')
        if anchor is None:  # skip blocks that contain no link
            continue
        title = anchor.get_text()
        link = anchor.get('href')
        print(f"Title: {title}, Link: {link}")
else:
    print(f"Failed to retrieve the webpage. Status code: {response.status_code}")
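Paid proxies often require a username and password. As a minimal sketch, assuming an ordinary HTTP proxy, the hypothetical helper make_proxies below builds the proxies mapping and embeds credentials in the http://user:password@host:port form that requests accepts. The host, port and credentials are placeholders.

```python
from typing import Optional
from urllib.parse import quote


def make_proxies(host: str, port: int,
                 username: Optional[str] = None,
                 password: Optional[str] = None) -> dict:
    """Build a proxies mapping in the shape requests accepts.

    host, port, username and password are placeholders for your own proxy.
    """
    auth = ""
    if username is not None and password is not None:
        # Percent-encode credentials so characters such as '@' or ':' are safe.
        auth = f"{quote(username, safe='')}:{quote(password, safe='')}@"
    proxy_url = f"http://{auth}{host}:{port}"
    # The keys name the scheme of the target URL; the value is the proxy
    # address, which for an ordinary HTTP proxy starts with http:// in both cases.
    return {"http": proxy_url, "https": proxy_url}


if __name__ == "__main__":
    print(make_proxies("127.0.0.1", 8080))
    print(make_proxies("10.0.0.1", 3128, "user", "p@ss"))
```

The returned dictionary can be passed directly as the proxies argument of requests.get().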

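3. Randomize the request interval

Sites watch for unusually frequent requests to detect crawlers. Adding a random pause between requests imitates the pace of a human visitor and lowers the risk of being blocked. Use time.sleep() for the pause and random.uniform() to randomize its length.

Code example: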

import requests
from bs4 import BeautifulSoup
import time
import random

url = "https://movie.douban.com/"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

for i in range(3):  # for example, three requests in a row
    response = requests.get(url, headers=headers, timeout=10)

    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        movies = soup.find_all('div', class_='browse-movie-bottom')

        for movie in movies:
            anchor = movie.find('a')
            if anchor is None:  # skip blocks that contain no link
                continue
            title = anchor.get_text()
            link = anchor.get('href')
            print(f"Title: {title}, Link: {link}")
    else:
        print(f"Failed to retrieve the webpage. Status code: {response.status_code}")

    # Wait a random 1 to 5 seconds before the next request.
    # The pause belongs between requests, not between printed results.
    time.sleep(random.uniform(1, 5))
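The pause only helps if it falls between HTTP requests. As a sketch of that idea, the hypothetical helper crawl_with_pauses below takes any fetch function and sleeps a random time before every request except the first. The fetch and sleep parameters are assumptions introduced so the example can run without network access.

```python
import random
import time
from typing import Callable, Iterable, List


def crawl_with_pauses(urls: Iterable[str],
                      fetch: Callable[[str], str],
                      min_delay: float = 1.0,
                      max_delay: float = 5.0,
                      sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Fetch each URL in turn, pausing a random time between requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Pause before every request except the first one.
            sleep(random.uniform(min_delay, max_delay))
        results.append(fetch(url))
    return results


if __name__ == "__main__":
    # Stand-in fetch function so the demo makes no network calls.
    pages = crawl_with_pauses(
        ["https://example.com/a", "https://example.com/b"],
        fetch=lambda url: f"<html>{url}</html>",
        min_delay=0.1,
        max_delay=0.3,
    )
    print(len(pages))
```

With requests, the fetch argument could be something like lambda u: requests.get(u, headers=headers, timeout=10).text.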

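4. Rotate the User-Agent and proxy

Choosing a different User-Agent and proxy at random for each request disguises the crawler further and makes it harder to identify. Maintain a list of several User-Agent strings and a list of proxies, and pick one of each at random for every request.

Code example: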

import requests
from bs4 import BeautifulSoup
import random
import time

# List of User-Agent strings
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36',
    'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0',
]

# List of proxies (placeholders; an ordinary HTTP proxy uses http:// for both keys)
proxies_list = [
    {'http': 'http://proxy1_ip:port', 'https': 'http://proxy1_ip:port'},
    {'http': 'http://proxy2_ip:port', 'https': 'http://proxy2_ip:port'},
    {'http': 'http://proxy3_ip:port', 'https': 'http://proxy3_ip:port'},
]

url = "https://movie.douban.com/"

for i in range(10):  # for example, 10 requests
    # Pick a random User-Agent
    headers = {
        'User-Agent': random.choice(user_agents)
    }

    # Pick a random proxy
    proxies = random.choice(proxies_list)

    # Send the request with the chosen proxy and User-Agent
    response = requests.get(url, headers=headers, proxies=proxies, timeout=10)

    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        movies = soup.find_all('div', class_='browse-movie-bottom')

        for movie in movies:
            anchor = movie.find('a')
            if anchor is None:  # skip blocks that contain no link
                continue
            title = anchor.get_text()
            link = anchor.get('href')
            print(f"Title: {title}, Link: {link}")
    else:
        print(f"Failed to retrieve the webpage. Status code: {response.status_code}")

    # Wait a random 1 to 5 seconds before the next request, whether or not it succeeded
    time.sleep(random.uniform(1, 5))
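Proxies taken from a list fail fairly often, so in practice rotation is usually paired with a retry. The sketch below is one possible way to do that, not part of the original example: the hypothetical helper fetch_with_rotation picks a fresh User-Agent and proxy on every attempt and gives up after max_attempts. The fetch parameter is an assumption that stands in for the real HTTP call.

```python
import random
from typing import Callable, Optional, Sequence


def fetch_with_rotation(url: str,
                        fetch: Callable[[str, dict, dict], str],
                        user_agents: Sequence[str],
                        proxies_list: Sequence[dict],
                        max_attempts: int = 3) -> Optional[str]:
    """Try up to max_attempts times, picking a fresh User-Agent and proxy each time."""
    for _ in range(max_attempts):
        headers = {"User-Agent": random.choice(user_agents)}
        proxies = random.choice(proxies_list)
        try:
            return fetch(url, headers, proxies)
        except Exception as error:  # narrow this to your HTTP library's exceptions
            print(f"Attempt via {proxies} failed: {error}")
    # Every attempt failed.
    return None


if __name__ == "__main__":
    # Stand-in fetch function so the demo makes no network calls.
    def fake_fetch(url: str, headers: dict, proxies: dict) -> str:
        return f"fetched {url} as {headers['User-Agent']}"

    print(fetch_with_rotation(
        "https://example.com/",
        fake_fetch,
        user_agents=["UA-1", "UA-2"],
        proxies_list=[{"http": "http://proxy1_ip:port"}],
    ))
```

With requests, fetch could be lambda u, h, p: requests.get(u, headers=h, proxies=p, timeout=10).text, and the except clause could be narrowed to requests.RequestException.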

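5. Handle JavaScript-rendered content

Some sites render their content with JavaScript, so the HTML returned to requests does not contain it. To scrape such dynamic content, consider a library such as Selenium, which drives a full browser and therefore executes the page's JavaScript. Selenium can be combined with the techniques above to disguise the crawler.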
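As a minimal sketch, assuming Selenium 4 and a local Chrome installation, the code below loads a page in headless Chrome with a custom User-Agent and returns the rendered HTML. The helper names chrome_arguments and fetch_rendered_html are hypothetical, and the switches shown are Chrome command-line flags rather than Selenium settings.

```python
from typing import List


def chrome_arguments(user_agent: str, headless: bool = True) -> List[str]:
    """Command-line switches passed to Chrome for this sketch."""
    arguments = [f"--user-agent={user_agent}"]
    if headless:
        # Run without a visible window; older Chrome versions use plain --headless.
        arguments.append("--headless=new")
    return arguments


def fetch_rendered_html(url: str, user_agent: str) -> str:
    """Load url in Chrome via Selenium and return the page source after loading."""
    # Imported here so chrome_arguments works without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for argument in chrome_arguments(user_agent):
        options.add_argument(argument)

    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even on errors


if __name__ == "__main__":
    print(chrome_arguments("Mozilla/5.0 (X11; Linux x86_64)"))
```

Pages that fetch their data after the initial load may also need an explicit wait (for example with Selenium's WebDriverWait) before page_source contains the content.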

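Summary:

The main goal of disguising a crawler is to imitate ordinary user behavior and avoid being blocked. Changing the User-Agent, using proxies, randomizing the request interval, and rotating headers and proxies all reduce the risk of detection. In practice, make sure you follow the target site's robots.txt file and respect its crawling policy.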
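Checking robots.txt can be automated with the standard library. The sketch below parses a hypothetical robots.txt (inlined so the example runs offline) and asks whether a given URL may be fetched. The helper name is_allowed and the crawler name MyCrawler are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used so the example runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(is_allowed(SAMPLE_ROBOTS_TXT, "MyCrawler", "https://example.com/public/page"))
    print(is_allowed(SAMPLE_ROBOTS_TXT, "MyCrawler", "https://example.com/private/page"))
```

For a live site, call parser.set_url() with the address of the site's robots.txt and then parser.read() instead of parse().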
