Scraping Toutiao Image Galleries in Parallel with Requests and BeautifulSoup

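The code is explained block by block. We analyze Ajax requests to crawl Toutiao (今日头条) image galleries, store the results in MongoDB, and use Python's multiprocessing module to crawl in parallel. The code is adapted from a hands-on tutorial so that it matches Toutiao's current request format.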

The complete code is here

MongoDB notes

Requests and BeautifulSoup notes

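First, here is what the page we are going to crawl looks like.

Search for a keyword, then click the Gallery (图集) tab:

[Screenshot: index]

Clicking into one of the detail pages gives:

[Screenshot: detail]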

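The photos are not all rendered on the page at once. Instead, paging through the gallery triggers Ajax requests for the next images.

On the search results page, open Developer Tools -> Network -> XHR and refresh. This shows the request URL and the request parameters:

[Screenshot: index_url]

[Screenshot: index_param]

Write the code that fetches the index page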

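from urllib.parse import urlencode

import requests
from requests.exceptions import RequestException

def get_page_index(offset, keyword):
    '''
    Fetch one page of search results.

    :param offset: result offset; the page uses it to append the next batch of results
    :param keyword: search keyword
    :return: the response body as text, or None on failure
    '''
    # Request parameters
    data = {
        'offset': offset,
        'format': 'json',
        'keyword': keyword,
        'autoload':'true',
        'count': '20',
        'cur_tab': 3,
        'from': 'gallery'
    }
    # urlencode converts the parameter dict into a URL query string
    url = 'https://www.toutiao.com/search_content/?' + urlencode(data)
    try:
        response = requests.get(url)
        # 200 means success
        if response.status_code == 200:
            # print(response.text)
            return response.text
        return None
    except RequestException:
        print('Error requesting index page')
        return None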
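
To see what urlencode produces, here is a minimal sketch that builds the same query string without sending a request. The helper name build_index_url is introduced only for illustration and is not part of the original script.

```python
from urllib.parse import urlencode


def build_index_url(offset, keyword):
    """Build the search URL for one page of gallery results (no network access)."""
    params = {
        'offset': offset,
        'format': 'json',
        'keyword': keyword,
        'autoload': 'true',
        'count': '20',
        'cur_tab': 3,
        'from': 'gallery',
    }
    # urlencode percent-encodes non-ASCII values as UTF-8 by default
    return 'https://www.toutiao.com/search_content/?' + urlencode(params)


index_url = build_index_url(20, '街拍')
print(index_url)
```

The keyword is percent-encoded, so it is safe to pass Chinese search terms directly.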

In the developer tools, the response looks like this:

[Screenshot: index_response]

Write the code that extracts the URL of each detail page

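import json
from json.decoder import JSONDecodeError

def parse_page_index(html):
    # Nothing to parse if the index request failed and returned None
    if not html:
        return
    try:
        # Parse the response text into a Python object
        data = json.loads(html)
        # print(data)
        # Proceed only if there is data and it has a 'data' key
        if data and 'data' in data.keys():
            for item in data.get('data') or []:
                if 'article_url' in item.keys():
                    yield item.get('article_url')
    except JSONDecodeError:
        pass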
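
The parsing logic can be tried offline. The sketch below runs it against a made-up, heavily trimmed payload; the sample values and the name extract_article_urls are assumptions for illustration, and the real response carries many more fields.

```python
import json

# Hypothetical payload that only mimics the shape described above.
sample = json.dumps({
    'data': [
        {'title': 'a', 'article_url': 'http://toutiao.com/group/111/'},
        {'title': 'entry without a url'},
        {'title': 'b', 'article_url': 'http://toutiao.com/group/222/'},
    ]
})


def extract_article_urls(text):
    """Yield every article_url found under the top-level 'data' key."""
    try:
        payload = json.loads(text)
    except (TypeError, json.JSONDecodeError):
        # TypeError covers text=None, i.e. a failed index request
        return
    if not isinstance(payload, dict):
        return
    for item in payload.get('data') or []:
        if 'article_url' in item:
            yield item['article_url']


urls = list(extract_article_urls(sample))
print(urls)
```

Because the function is a generator, nothing is parsed until the caller starts iterating.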

Once we have the URL of each detail page, we request it the same way as the index page

def get_page_detail(url):
    try:
        response = requests.get(url)
        # print(response.text)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        print('Error requesting detail page', url)
        return None

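On each detail page, open Developer Tools -> Network -> Doc to see what the page is about to request:

[Screenshot: detail_url]

The gallery field inside BASE_DATA.galleryInfo records the information for every entry in sub_images, so we extract it with a regular expression and then request the images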

import re

from bs4 import BeautifulSoup

def parse_page_detail(html, url):
    soup = BeautifulSoup(html, 'lxml')
    # Get the page title
    title = soup.select('title')[0].get_text()
    print(title)
    # Match the string passed to JSON.parse in the gallery field
    # (raw string, so the regex escapes are not treated as Python string escapes)
    image_pattern = re.compile(r'gallery: JSON\.parse\("(.*?)"\),', re.S)
    result = re.search(image_pattern, html)
    if result:
        # group(1) returns only the captured group; group() returns the whole match
        data = json.loads(result.group(1).replace('\\', ''))
        # print(data)
        if data and 'sub_images' in data.keys():
            sub_images = data['sub_images']
            # Collect the image URLs
            images = [item.get('url') for item in sub_images]
            # Download the images
            for image in images:
                download_image(image)
            # Return the document to be stored in MongoDB
            return {
                'title' : title,
                'images' : images,
                'url' : url
            }
    else:
        print("failed")
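
The extraction step is easier to follow on a small input. The sketch below applies the same regular expression to a made-up script fragment; the HTML and the example.com URLs are assumptions that only imitate the escaped JSON described above. It also shows a two-step decode as a possible alternative to stripping every backslash.

```python
import json
import re

# Hypothetical fragment imitating the inline script of a detail page.
sample_html = r'''
<script>var BASE_DATA = {
    galleryInfo: {
        gallery: JSON.parse("{\"count\":2,\"sub_images\":[{\"url\":\"http:\\/\\/example.com\\/a\"},{\"url\":\"http:\\/\\/example.com\\/b\"}]}"),
    }
}</script>
'''

pattern = re.compile(r'gallery: JSON\.parse\("(.*?)"\),', re.S)
match = pattern.search(sample_html)
raw = match.group(1)

# Approach used in the post: drop every backslash, then parse
data = json.loads(raw.replace('\\', ''))
image_urls = [item['url'] for item in data['sub_images']]

# Alternative: decode the string literal first, then parse the JSON it contains
data_two_step = json.loads(json.loads('"' + raw + '"'))

print(image_urls)
```

Stripping backslashes is simple, but it would also corrupt any value that legitimately contains one (for example a \u escape), which is why the two-step decode may be the safer choice.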

Downloading an image works much like requesting a page

def download_image(url):
    print('downloading', url, '...')
    try:
        # stream=True defers the body download so it can be written in chunks
        response = requests.get(url, stream=True)
        if response.status_code == 200:
            # print(response.text)
            filename = url.split('/')[-1]
            # Save the image
            save_image(response, filename)
        return None
    except RequestException:
        print('Error requesting image', url)
        return None

Saving the image

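import os

def save_image(response, filename):
    file_path = './jiepai/{0}.{1}'.format(filename, 'jpg')
    # Skip images that have already been saved
    if not os.path.exists(file_path):
        # The with block closes the file automatically
        with open(file_path, 'wb') as f:
            # Write the body chunk by chunk as bytes
            for chunk in response.iter_content(chunk_size=128):
                f.write(chunk)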
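
The save logic can be exercised without any network access. In this sketch a list of byte chunks stands in for response.iter_content, and a temporary directory stands in for ./jiepai; the name save_chunks and the URL are made up for illustration.

```python
import os
import tempfile


def save_chunks(chunks, directory, filename):
    """Write byte chunks to directory/<filename>.jpg unless the file already exists."""
    file_path = os.path.join(directory, '{0}.jpg'.format(filename))
    if os.path.exists(file_path):
        return False
    with open(file_path, 'wb') as f:
        for chunk in chunks:
            f.write(chunk)
    return True


with tempfile.TemporaryDirectory() as tmp:
    url = 'http://example.com/origin/abc123'  # made-up image URL
    name = url.split('/')[-1]                 # last path segment becomes the file name
    first = save_chunks([b'\xff\xd8', b'data'], tmp, name)
    second = save_chunks([b'other'], tmp, name)  # skipped, file already exists
    with open(os.path.join(tmp, name + '.jpg'), 'rb') as f:
        content = f.read()
```

Using the last URL segment as the file name means re-running the crawler does not rewrite images that are already on disk.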

The main function

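def main(offset):
    # Request the index page
    html = get_page_index(offset, '街拍')
    # Parse the index page
    for url in parse_page_index(html):
        # The URL returned by Toutiao no longer opens the page directly and redirects instead,
        # so we rebuild it in the current format ourselves
        url = 'https://www.toutiao.com/a' + url.split('/')[-2]
        # Request the detail page
        html = get_page_detail(url)
        if html:
            # Parse the detail page
            result = parse_page_detail(html, url)
            # Store the result in the database
            if result:
                save_to_mongo(result)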
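
The URL rewrite in main can be isolated as a small helper. This is a sketch under the assumption that article_url has the form http://toutiao.com/group/<id>/ with a trailing slash; the id below and the name to_detail_url are made up.

```python
def to_detail_url(article_url):
    """Rewrite a group-style article_url into the a<id> form used in main."""
    # With a trailing slash the last piece is empty, so the id is second to last
    article_id = article_url.split('/')[-2]
    return 'https://www.toutiao.com/a' + article_id


detail_url = to_detail_url('http://toutiao.com/group/6500000000000000000/')
print(detail_url)
```

If a URL arrives without the trailing slash, index -2 would pick the segment before the id, so this approach depends on the response format staying the same.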

Inserting into the database

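import pymongo

# Connect to MongoDB
client = pymongo.MongoClient(MONGO_URL)
# Select the corresponding database
db = client[MONGO_DB]

def save_to_mongo(result):
    # Insert the document (insert_one replaces Collection.insert, which was removed in PyMongo 4)
    if db[MONGO_TABLE].insert_one(result).inserted_id:
        print('Saved to MongoDB', result)
        return True
    else:
        return False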

where the configuration values are

MONGO_URL = 'localhost'
MONGO_DB = 'toutiao'
MONGO_TABLE = 'toutiao'

GROUP_START = 0
GROUP_END = 20

Crawling different offsets in parallel

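from multiprocessing import Pool

if __name__ == "__main__":

    if not os.path.exists('./jiepai'):
        os.mkdir('./jiepai')
    groups = [x*20 for x in range(GROUP_START, GROUP_END+1)]
    # Create a pool of 4 worker processes (without the argument it defaults to the number of CPU cores)
    pool = Pool(processes=4)
    # Run main once per offset, spread across the worker processes
    pool.map(main, groups)
    # Release the worker processes once all offsets are done
    pool.close()
    pool.join()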
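
The offset list and the pool call can be separated so that each part is easy to check. This is a minimal sketch; build_offsets and crawl_offsets are illustrative names rather than functions from the original script.

```python
from multiprocessing import Pool


def build_offsets(group_start, group_end, page_size=20):
    """Return one offset per page, from group_start to group_end inclusive."""
    return [x * page_size for x in range(group_start, group_end + 1)]


def crawl_offsets(worker, offsets, processes=4):
    """Run worker(offset) for every offset across a pool of processes."""
    # worker must be a top-level function so that it can be pickled
    with Pool(processes=processes) as pool:
        # map blocks until every offset has been processed
        return pool.map(worker, offsets)


offsets = build_offsets(0, 20)
print(offsets[:3], offsets[-1], len(offsets))
```

With GROUP_START = 0 and GROUP_END = 20 the range is inclusive, so 21 pages are requested. In the script, crawl_offsets(main, offsets) would be called under the if __name__ == "__main__" guard.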

Enjoy it !!!

[Screenshot: result]