Crawling My Own CSDN Blog with Scrapy - 白红宇


Published: 2019-06-09


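I recently started working with Scrapy, an open-source crawling framework. After reading some documentation and other people's technical blog posts, I tried imitating them by crawling my own blog.

First, create the project: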

scrapy startproject myblog

Writing items.py:

I plan to scrape each post's title, its URL, and the number of times it has been read.

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class MyBlogItem(scrapy.Item):
    article_name = scrapy.Field()
    article_url = scrapy.Field()
    article_readcount = scrapy.Field()

Writing pipelines.py:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import json
import codecs


class MyBlogPipeline(object):

    def __init__(self):
        # Open the output file as UTF-8 text (Python 3)
        self.file = codecs.open('myblog_data.json', mode='w', encoding='utf-8')

    def process_item(self, item, spider):
        # ensure_ascii=False keeps Chinese characters readable instead of \uXXXX escapes
        line = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(line)
        return item

    def close_spider(self, spider):
        # Flush and close the output file when the crawl ends
        self.file.close()

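Chinese text scraped by Scrapy comes back as Unicode strings, which raises the question of how to write it out as UTF-8. The pipelines.py code above handles this reasonably well.

Writing settings.py: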
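
To make the encoding issue concrete, here is a minimal standalone sketch, assuming Python 3 and only the standard library. It shows that `json.dumps` escapes non-ASCII characters by default, and that passing `ensure_ascii=False` keeps them readable. The item contents are made up for illustration and are not real scraped data.

```python
import json

# Hypothetical item standing in for one scraped blog post
item = {"article_name": ["爬虫入门"], "article_readcount": ["42"]}

# Default behaviour: non-ASCII characters become \uXXXX escapes
escaped = json.dumps(item)

# With ensure_ascii=False the characters are kept as-is, so the output
# stays readable when written to a file opened with encoding='utf-8'
readable = json.dumps(item, ensure_ascii=False)

print(escaped)
print(readable)
```

Both strings decode to the same data; only the on-disk representation differs.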

# -*- coding: utf-8 -*-

# Scrapy settings for myblog project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html

BOT_NAME = 'myblog'

SPIDER_MODULES = ['myblog.spiders']
NEWSPIDER_MODULE = 'myblog.spiders'

COOKIES_ENABLED = False

ITEM_PIPELINES = {
    'myblog.pipelines.MyBlogPipeline': 300
}

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'myblog (+http://www.yourdomain.com)'

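Here COOKIES_ENABLED is set to False, so that sites which use cookies to identify visitors cannot track the crawler, which helps avoid being banned.

ITEM_PIPELINES is a dict that lists the pipelines to enable. Each key is the import path of a pipeline class and each value is its order, conventionally an integer in the range 0-1000; pipelines with lower values run first.

Writing the spider: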
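
As a rough illustration of the ordering rule, the sketch below sorts a pipeline dict by value, lowest first. `myblog.pipelines.DedupPipeline` is a hypothetical second pipeline that does not exist in this project, and the sort only simulates the ordering; it is not Scrapy's internal code.

```python
ITEM_PIPELINES = {
    'myblog.pipelines.MyBlogPipeline': 300,
    'myblog.pipelines.DedupPipeline': 100,  # hypothetical, for illustration only
}


def run_order(pipelines):
    """Return pipeline paths sorted by their order value, lowest first."""
    return [path for path, _ in sorted(pipelines.items(), key=lambda kv: kv[1])]


order = run_order(ITEM_PIPELINES)
print(order)
```

With these values the hypothetical DedupPipeline would see each item before MyBlogPipeline writes it to disk.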

#!/usr/bin/env python
# __author__ = 'root'

import re

from scrapy.spiders import Spider
from scrapy.selector import Selector
from scrapy.http import Request

from myblog.items import MyBlogItem


class MyBlogSpider(Spider):
    name = "myblog"
    download_delay = 1
    allowed_domains = ["blog.csdn.net"]
    start_urls = [
        "http://blog.csdn.net/bnxf00000/article/details/2785136"
    ]

    def parse(self, response):
        sel = Selector(response)
        item = MyBlogItem()
        templist = []

        article_url = str(response.url)
        article_name = sel.xpath('//div[@id="article_details"]/div/h1/span/a/text()').extract()
        article_readcount = sel.xpath('//div[@id="article_details"]/div[2]/span[@class="link_view"]/text()').extract()

        # Keep only the leading digits of each read-count string
        for temp in article_readcount:
            result = re.match(r'(\d+)', temp)
            if result:
                templist.append(result.group(0))

        # Strings are already Unicode in Python 3, so no manual encoding is needed
        item['article_name'] = article_name
        item['article_url'] = article_url
        item['article_readcount'] = templist
        yield item

        # Follow the "next article" link, if there is one
        urls = sel.xpath('//li[@class="next_article"]/a/@href').extract()
        for url in urls:
            # urljoin resolves a relative href against the current page URL
            yield Request(response.urljoin(url), callback=self.parse)

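The idea is to parse each page for its "next article" link and yield a Request object for it, so the crawl moves on to the following post until there are none left.

Run it: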
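
The link-following logic can be sketched without Scrapy, assuming Python 3 and the standard library only. The `PAGES` dict is a made-up stand-in for the blog: each path maps to the href of its "next article" link, or None on the last post. The paths are hypothetical, not real CSDN URLs.

```python
from urllib.parse import urljoin

BASE = "http://blog.csdn.net"

# Hypothetical site map: path -> href of the "next article" link (None at the end)
PAGES = {
    "/demo/article/details/1": "/demo/article/details/2",
    "/demo/article/details/2": "/demo/article/details/3",
    "/demo/article/details/3": None,
}


def crawl(start_path):
    """Follow "next" links from start_path and return the absolute URLs visited."""
    visited = []
    path = start_path
    # Stop when there is no next link, or if a link loops back to a seen page
    while path is not None and urljoin(BASE, path) not in visited:
        visited.append(urljoin(BASE, path))
        path = PAGES.get(path)
    return visited


urls = crawl("/demo/article/details/1")
print(urls)
```

In the real spider, Scrapy's scheduler plays the role of the loop: each yielded Request is queued, downloaded, and passed back to `parse`.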

scrapy crawl myblog

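Screenshot of partial results:

This is my first crawler. I followed other people's code and explanations closely and added my own handling of the read count. Next, I plan to read through the Scrapy source code.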
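
The read-count handling boils down to taking the leading digits of each extracted string. Here is a minimal standalone sketch of that step, assuming Python 3; the sample strings are invented and may not match what the CSDN page actually contains.

```python
import re


def leading_digits(texts):
    """Return the leading run of digits from each string that starts with one."""
    counts = []
    for text in texts:
        match = re.match(r'(\d+)', text)
        if match:
            counts.append(match.group(0))
    return counts


# Hypothetical extracted text nodes
samples = ["4655人阅读", "阅读", "12 views"]
result = leading_digits(samples)
print(result)
```

Because `re.match` anchors at the start of the string, text nodes that do not begin with a digit are skipped.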


Reposted from: https://www.cnblogs.com/hiccup/p/4475631.html
