The first step in SEO, and usually the most painful one, is keyword collection. If there happens to be a competitor in your industry whose SEO is already strong, congratulations: there is a simple shortcut. Scrape their page titles and descriptions, organize them, and you have a ready-made keyword list for your own site's content.
Today I'm sharing a Python program that crawls every page of a target website, extracts each page's Title and Description, and saves them to a CSV file. The code is as follows:
import requests
from bs4 import BeautifulSoup
import csv
from collections import deque
import re
import time

def extract_links(soup, base_url):
    # Collect internal links (hrefs starting with "/") and turn them into absolute URLs
    links = [a['href'] for a in soup.find_all('a', href=True) if re.match(r'^/', a['href'])]
    links = [base_url + link for link in links]
    return links

def extract_title_and_description(soup):
    # Pull the <title> text and the content of <meta name="description">, if present
    title = soup.title.string if soup.title else None
    description = soup.find('meta', attrs={'name': 'description'})
    description = description['content'] if description else None
    return title, description

def crawl_website(base_url, output_file):
    visited = set()            # URLs already crawled
    queue = deque([base_url])  # breadth-first crawl queue
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
    }
    with open(output_file, 'w', newline='', encoding='utf-8') as csvfile:
        csvwriter = csv.writer(csvfile)
        csvwriter.writerow(['URL', 'Title', 'Description'])
        while queue:
            url = queue.popleft()
            if url in visited:
                continue
            visited.add(url)  # mark this URL as visited
            try:
                response = requests.get(url, headers=headers)
                print(f"Fetching {url} ... Status: {response.status_code}")
                if response.status_code == 200:
                    soup = BeautifulSoup(response.content, 'html.parser')
                    title, description = extract_title_and_description(soup)
                    csvwriter.writerow([url, title, description])  # write one row to the CSV
                    csvfile.flush()
                    print(title)
                    links = extract_links(soup, base_url)
                    queue.extend(links)
                    time.sleep(2)  # pause between requests to avoid hammering the server
            except requests.RequestException as e:
                print(f"Error fetching {url}: {e}")

if __name__ == '__main__':
    website_url = 'https://www.***.com'  # change this to your target website
    output_file = 'website_data.csv'
    crawl_website(website_url, output_file)
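Once the crawl finishes, the CSV still needs the "organize it" step mentioned above. Here is a minimal sketch of one way to do that, assuming the website_data.csv produced by the crawler; the column names match the header it writes, but the cleanup rules (drop rows without a title, deduplicate by title) are just one possible approach, not part of the crawler itself:

import csv

# Minimal sketch: read website_data.csv produced by crawl_website,
# drop rows without a title, deduplicate, and save the result as
# raw material for keyword research.
keywords = {}
with open('website_data.csv', newline='', encoding='utf-8') as csvfile:
    for row in csv.DictReader(csvfile):
        title = (row.get('Title') or '').strip()
        description = (row.get('Description') or '').strip()
        if title and title not in keywords:
            keywords[title] = description

with open('keywords.csv', 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['Title', 'Description'])
    for title, description in keywords.items():
        writer.writerow([title, description])

print(f"{len(keywords)} unique titles collected")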
While the code is running you may hit the target server's anti-scraping measures. The simplest workaround is to keep a VPN on and switch IPs from time to time. If you run into other problems, feel free to contact me on WeChat.
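If you would rather rotate IPs in code, requests can route each request through a proxy. Below is a minimal sketch assuming you have access to a proxy pool of your own; the proxy addresses are placeholders and get_with_proxy is a hypothetical helper, not part of the crawler above:

import random
import requests

# Hypothetical proxy pool -- replace with proxy addresses you actually control.
PROXY_POOL = [
    'http://127.0.0.1:8001',
    'http://127.0.0.1:8002',
]

def get_with_proxy(url, headers=None):
    # Pick a random proxy per request so the target server sees different IPs.
    proxy = random.choice(PROXY_POOL)
    proxies = {'http': proxy, 'https': proxy}
    return requests.get(url, headers=headers, proxies=proxies, timeout=10)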