Blog Articles

Three days ago I scraped my blog's archive records, but the most important part of a blog is its articles. So today, while watching 成果's livestream, I casually wrote a program to scrape the articles from my blog; it pulled down 26 posts in total.

Each article is saved as a Markdown file in the blogArticle folder, with time + title + subtitle as the file name.

See the previous post for how the time, title, and subtitle are obtained.

Building on that code, we first obtain each post's url, then scrape the post's content through that url.

In the step where we grabbed the title, we can also see the post's href; joining the site link with this href gives us the post's url.

<section class="post-preview">
  <a class="post-link" href="/2020/03/09/%E7%BD%91%E7%AB%99%E8%AE%BE%E7%BD%AE-%E8%B8%A9%E5%9D%91.html" title="阿里云ECS服务器部署"></a>
  <h2 class="post-title">阿里云ECS服务器部署</h2>
  <h3 class="post-subtitle">部署&踩坑</h3>
</section>

So the url of this post is:

https://jmbaozi.top/2020/03/09/%E7%BD%91%E7%AB%99%E8%AE%BE%E7%BD%AE-%E8%B8%A9%E5%9D%91.html
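
As a quick sketch (assuming the post-preview HTML above), the href can be pulled out with BeautifulSoup and joined to the site root; urllib.parse.urljoin avoids the double slash that naive string concatenation would produce, since the base url ends with '/' and the href starts with '/':

from urllib.parse import urljoin
from bs4 import BeautifulSoup

html = '''<section class="post-preview">
  <a class="post-link" href="/2020/03/09/%E7%BD%91%E7%AB%99%E8%AE%BE%E7%BD%AE-%E8%B8%A9%E5%9D%91.html" title="阿里云ECS服务器部署"></a>
</section>'''

soup = BeautifulSoup(html, 'lxml')
href = soup.find('a', class_='post-link')['href']
print(urljoin('https://jmbaozi.top/', href))
# https://jmbaozi.top/2020/03/09/%E7%BD%91%E7%AB%99%E8%AE%BE%E7%BD%AE-%E8%B8%A9%E5%9D%91.html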

Next we request and parse that url, locate the <article class="markdown-body"> tag that wraps the post body, and pull out the blog content as HTML.

# Fetch each article's content
def get_article():
    for each in blog_article_href:
        link = urljoin(url, each)  # join base url and href cleanly (href already starts with '/')
        r = requests.get(link, headers=headers)
        soup = BeautifulSoup(r.text, 'lxml')
        # find() returns the single <article> tag; find_all() would wrap it in a list
        result = soup.find('article', class_='markdown-body')
        blog_article.append(result)

Finally, the scraped articles are written to Markdown files:

# Write the articles to Markdown files
def write_article():
    os.makedirs(dirname, exist_ok=True)  # create the output folder if it doesn't exist
    for i in range(len(blog_article)):
        file_name = dirname + '/' + blog_time[i] + '-' + blog_title[i] + '---' + blog_subtitle[i] + '.md'
        with open(file_name, 'w', encoding='utf-8') as file:
            file.write(str(blog_article[i]))
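
Note that str(blog_article[i]) writes the article's raw HTML into the .md file rather than real Markdown. If actual Markdown is wanted, a converter such as html2text (an extra dependency, not used in the original script) could be dropped in:

import html2text

def html_to_markdown(html_str):
    # Convert an HTML fragment to Markdown text
    return html2text.html2text(html_str)

# in write_article, replace str(blog_article[i]) with:
# html_to_markdown(str(blog_article[i]))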

Code

"""
Author:JMbaozi
爬取jmbaozi.top博客文章内容
时间:2020.3.13  19:53
"""

import os
import requests
from bs4 import BeautifulSoup  # the 'lxml' parser only needs to be installed, not imported
from urllib.parse import urljoin

url = 'https://jmbaozi.top/'
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.43 Safari/537.36 Edg/81.0.416.28'
}
blog_title = []          # titles
blog_subtitle = []       # subtitles
blog_time = []           # dates
blog_article_href = []   # article links
blog_article = []        # article content
page_number = 5          # number of list pages on the blog
dirname = 'blogArticle'  # output folder for the articles

# Fetch metadata from the list pages
def get_data():
    for i in range(1, page_number + 1):
        if i == 1:
            link = url
        else:
            link = url + 'page' + str(i)
        r = requests.get(link, headers=headers)
        soup = BeautifulSoup(r.text, 'lxml')
        title_list = soup.find_all('section', class_='post-preview')  # title & subtitle & article link
        for each in title_list:
            title = each.h2.text.strip()
            subtitle = each.h3.text.strip()
            blog_title.append(title)
            blog_subtitle.append(subtitle)
            href = each.find('a')['href']  # article link
            blog_article_href.append(href)
        time_list = soup.find_all('time', class_='post-date')  # dates
        for each in time_list:
            time = each.text.strip()
            blog_time.append(time)

# Fetch each article's content
def get_article():
    for each in blog_article_href:
        link = urljoin(url, each)  # join base url and href cleanly (href already starts with '/')
        r = requests.get(link, headers=headers)
        soup = BeautifulSoup(r.text, 'lxml')
        # find() returns the single <article> tag; find_all() would wrap it in a list
        result = soup.find('article', class_='markdown-body')
        blog_article.append(result)

# Write the articles to Markdown files
def write_article():
    os.makedirs(dirname, exist_ok=True)  # create the output folder if it doesn't exist
    for i in range(len(blog_article)):
        file_name = dirname + '/' + blog_time[i] + '-' + blog_title[i] + '---' + blog_subtitle[i] + '.md'
        with open(file_name, 'w', encoding='utf-8') as file:
            file.write(str(blog_article[i]))

if __name__ == '__main__':
    get_data()
    get_article()
    write_article()
    print('Done writing!')
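
One last caveat: the file names are built from dates, titles, and subtitles scraped straight off the page, so a stray character such as '/' or ':' would make open() fail, especially on Windows. A small helper along these lines (hypothetical, not part of the original script) could sanitize each component first:

import re

def safe_name(s):
    # Replace characters that are illegal in Windows file names with '_'
    return re.sub(r'[\\/:*?"<>|]', '_', s)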