博客文章
三天前我爬了自己博客的归档记录,但是一个博客最重要的内容是文章,所以今天我在看成果直播的时候顺手写了爬取我博客文章的程序,共爬取26篇博客。
每篇文章都用时间+标题+副标题为文件名,以Markdown格式保存在blogArticle文件夹中。
时间、标题和副标题的获取方式见上篇博客
在原来基础上获得博客的url,然后再通过url爬取博客的内容。
在获得标题的那一步,我们可以看到该篇博客的href信息,通过链接+href的方式来得到博客的url。
<section class="post-preview">
<a class="post-link" href="/2020/03/09/%E7%BD%91%E7%AB%99%E8%AE%BE%E7%BD%AE-%E8%B8%A9%E5%9D%91.html" title="阿里云ECS服务器部署"></a>
<h2 class="post-title">阿里云ECS服务器部署</h2>
<h3 class="post-subtitle">部署&踩坑</h3>
</section>
所以该篇博客的url为:
https://jmbaozi.top/2020/03/09/%E7%BD%91%E7%AB%99%E8%AE%BE%E7%BD%AE-%E8%B8%A9%E5%9D%91.html
然后解析url,找到博客内容所属的标签<article,class_ = "markdown-body">
获得html格式的博客内容。
#获取文章
def get_article():
for each in blog_article_href:
link = url + str(each)
r = requests.get(link,headers = headers)
soup = BeautifulSoup(r.text,'lxml')
result = soup.find_all('article',class_ = 'markdown-body')
blog_article.append(result)
最后将获得的文章信息保存在Markdow文件中:
#写入文章
def write_article():
for i in range(len(blog_article)):
file_name = dirname + '/' + blog_time[i]+'-'+blog_tilte[i]+'---'+blog_subtitle[i]+ '.md'
with open(file_name,'w',encoding = 'utf-8') as file:
file.write(str(blog_article[i]))
- 代码:blogArticle.py
- 结果:blogArticle
代码
"""
Author:JMbaozi
爬取jmbaozi.top博客文章内容
时间:2020.3.13 19:53
"""
import requests
from bs4 import BeautifulSoup
import lxml
url = 'https://jmbaozi.top/'
headers = {
'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.43 Safari/537.36 Edg/81.0.416.28'
}
blog_tilte = []#标题
blog_subtitle = []#副标题
blog_time = []#时间
blog_article_href = []#文章链接
blog_article = []#文章内容
page_number = 5#博客页数
dirname = 'blogArticle'#文章存放文件夹
#获取信息
def get_data():
for i in range(1,page_number+1):
if i==1:
link = url
else:
link = url + 'page' + str(i)
r = requests.get(link,headers = headers)
soup = BeautifulSoup(r.text,'lxml')
title_list = soup.find_all('section',class_='post-preview')#标题&副标题&文章链接
for each in title_list:
title = each.h2.text.strip()
subtitle = each.h3.text.strip()
blog_tilte.append(title)
blog_subtitle.append(subtitle)
href = each.find('a')['href']#文章链接
blog_article_href.append(href)
time_list = soup.find_all('time',class_='post-date')#时间
for each in time_list:
time = each.text.strip()
blog_time.append(time)
#获取文章
def get_article():
for each in blog_article_href:
link = url + str(each)
r = requests.get(link,headers = headers)
soup = BeautifulSoup(r.text,'lxml')
result = soup.find_all('article',class_ = 'markdown-body')
blog_article.append(result)
#写入文章
def write_article():
for i in range(len(blog_article)):
file_name = dirname + '/' + blog_time[i]+'-'+blog_tilte[i]+'---'+blog_subtitle[i]+ '.md'
with open(file_name,'w',encoding = 'utf-8') as file:
file.write(str(blog_article[i]))
if __name__ == '__main__':
get_data()
get_article()
write_article()
print('写入完成!')