A Python Script That Scrapes Any Website's Titles and Descriptions into a CSV File

The first step in SEO, and usually the most painful one, is keyword collection. If a competitor in your industry is already doing SEO well, congratulations: there is a simple shortcut. Scrape their page titles and descriptions, tidy them up, and you have a ready-made keyword list for your own site's content.

Today I'm sharing a Python script that crawls every page of a target website, extracts each page's Title and meta Description, and stores them in a CSV file. The code is as follows:

import requests
from bs4 import BeautifulSoup
import csv
import time
from collections import deque
from urllib.parse import urljoin

def extract_links(soup, base_url):
    # Resolve each href to an absolute URL (urljoin handles relative paths), dropping fragments
    links = [urljoin(base_url, a['href']).split('#')[0] for a in soup.find_all('a', href=True)]
    # Keep only links that stay on the target site
    return [link for link in links if link.startswith(base_url)]

def extract_title_and_description(soup):
    # Guard against pages with a missing or empty <title> tag
    title = soup.title.string.strip() if soup.title and soup.title.string else None
    description = soup.find('meta', attrs={'name': 'description'})
    # A meta tag may exist without a content attribute, so check before reading it
    description = description['content'].strip() if description and description.has_attr('content') else None
    return title, description

def crawl_website(base_url, output_file):
    visited = set()
    queue = deque([base_url])

    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
    }

    with open(output_file, 'w', newline='', encoding='utf-8') as csvfile:
        csvwriter = csv.writer(csvfile)
        csvwriter.writerow(['URL', 'Title', 'Description'])

        while queue:
            url = queue.popleft()

            if url in visited:
                continue
            visited.add(url)  # mark this URL as visited

            try:
                response = requests.get(url, headers=headers, timeout=10)  # timeout so one dead page can't hang the crawl
                print(f"Fetching {url} ... Status: {response.status_code}")

                if response.status_code == 200:
                    soup = BeautifulSoup(response.content, 'html.parser')
                    title, description = extract_title_and_description(soup)
                    
                    csvwriter.writerow([url, title, description])  # write one row per page
                    csvfile.flush()  # flush so results survive an interrupted crawl
                    print(title)

                    links = extract_links(soup, base_url)
                    queue.extend(links)

                    time.sleep(2)  # throttle requests to stay polite to the server

            except requests.RequestException as e:
                print(f"Error fetching {url}: {e}")

if __name__ == '__main__':
    website_url = 'https://www.***.com'  # change this to your target site
    output_file = 'website_data.csv'
    crawl_website(website_url, output_file)
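
Once the crawl finishes, the CSV is ready for keyword work. Below is a minimal post-processing sketch, assuming pandas is installed; the ' - ' title separator and the keywords.csv output name are my assumptions, so adjust them to the target site:

import pandas as pd

# Load the crawl output written by crawl_website above
df = pd.read_csv('website_data.csv')

# Drop duplicate URLs and pages that had no title
df = df.drop_duplicates(subset='URL').dropna(subset=['Title'])

# Split off the keyword half of a "keyword - brand" style title.
# The ' - ' separator is an assumption -- change it to match the site.
df['Keyword'] = df['Title'].str.split(' - ').str[0].str.strip()

df.to_csv('keywords.csv', index=False)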

While the script runs, you may hit the target server's anti-scraping defenses. The simplest workaround is to keep a VPN on and switch IP addresses from time to time.
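
If you would rather rotate IPs inside the script than toggle a VPN by hand, requests accepts a per-request proxies argument. Here is a minimal sketch, assuming you have your own pool of working HTTP proxies; the addresses below are placeholders:

import random
import requests

# Placeholder proxies -- replace with addresses you actually control
PROXIES = [
    'http://127.0.0.1:8001',
    'http://127.0.0.1:8002',
]

def fetch_with_proxy(url, headers):
    # Route each request through a randomly chosen proxy so that
    # one blocked IP does not stall the whole crawl
    proxy = random.choice(PROXIES)
    return requests.get(url, headers=headers,
                        proxies={'http': proxy, 'https': proxy}, timeout=10)

Swapping this in for the requests.get call inside crawl_website spreads traffic across several IPs; combined with the time.sleep throttle, it makes the crawl less conspicuous.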

If you run into other problems, feel free to contact me (Jia Dingqiang) on WeChat.