Crawler Practice: Scrapy

Overview: installing Scrapy and its dependencies on Windows, then building a spider that crawls second-hand housing listings from cd.58.com.

Install

(1) lxml: a Python library for fast, flexible processing of XML (and HTML); Scrapy's selectors rely on it.

Install command:

python -m pip install lxml

Download page:

https://pypi.python.org/pypi/lxml/3.3.1
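
A quick sanity check after installing is to parse a small fragment and run an XPath query against it (the fragment below is only an illustration):

from lxml import etree

# toy document, used only to confirm that lxml imports and XPath works
root = etree.fromstring("<houses><house><title>demo</title></house></houses>")
print(root.xpath("//title/text()"))   # ['demo']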

(2) setuptools: usually already installed; run python -m pip list in a command prompt to check whether it is present.

Install command:

python -m pip install setuptools

Download page:

https://pypi.python.org/packages/2.7/s/setuptools

(3) zope.interface (required by Twisted)

Install command:

python -m pip install zope.interface

Alternatively, use the setuptools from step (2) to install the egg file available at:

https://pypi.python.org/pypi/zope.interface/4.1.0#downloads

(4) Twisted: an event-driven networking engine framework written in Python; Scrapy is built on top of it.

Install command:

python -m pip install Twisted

If the install fails, download the prebuilt wheel Twisted-17.9.0-cp36-cp36m-win_amd64.whl (built for CPython 3.6 on 64-bit Windows):

https://download.lfd.uci.edu/pythonlibs/n1rrk3iq/Twisted-17.9.0-cp36-cp36m-win_amd64.whl
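
Once downloaded, the wheel can be installed locally with pip (assuming the file is in the current directory):

python -m pip install Twisted-17.9.0-cp36-cp36m-win_amd64.whl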

(5) pyOpenSSL: a Python interface to OpenSSL.

Install command:

python -m pip install pyOpenSSL

Download page:

https://launchpad.net/pyopenssl
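
A quick import check (note that the importable module is named OpenSSL, not pyOpenSSL):

python -c "import OpenSSL; print(OpenSSL.__version__)"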

(6) pywin32

Download page:

https://sourceforge.net/projects/pywin32/files/pywin32/Build%20220/

During installation you may hit the error "Python version 3.6-32 required, which was not found in the registry."

Workarounds:

i. Run the script win32-error.py (link: https://github.com/jm199504/Scrapy-demo); a sketch of what such a registration script typically does is given after this list.

ii. Run regedit and search for PythonCore. Create a folder named 3.6-32 as a sibling of the 3.6 folder, and inside it create InstallPath and PythonPath keys whose default values match those of the corresponding keys under 3.6. The installer will then run to completion.
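
For reference, a registration script of the kind mentioned in (i) typically just writes the registry keys that option (ii) creates by hand. The sketch below is an assumption about what such a script does, not the actual win32-error.py; the paths are taken from the running interpreter.

import sys
import winreg

# version string the pywin32 installer looks for (32-bit Python 3.6)
version = "3.6-32"
# e.g. C:\Python36-32 -- the directory of the interpreter running this script
install_path = sys.prefix

base = "Software\\Python\\PythonCore\\" + version

# default value of InstallPath -> the interpreter's install directory
key = winreg.CreateKey(winreg.HKEY_CURRENT_USER, base + r"\InstallPath")
winreg.SetValue(key, "", winreg.REG_SZ, install_path)
winreg.CloseKey(key)

# default value of PythonPath -> the standard library locations
key = winreg.CreateKey(winreg.HKEY_CURRENT_USER, base + r"\PythonPath")
winreg.SetValue(key, "", winreg.REG_SZ, install_path + r"\Lib;" + install_path + r"\DLLs")
winreg.CloseKey(key)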

(7) Scrapy

easy_install scrapy

(python -m pip install Scrapy also works.) Running scrapy with no arguments afterwards prints the usage banner, which confirms the installation:

scrapy

Create the project

scrapy startproject house
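
The command generates a project skeleton; the files edited in the steps below live roughly in this layout (details vary a little between Scrapy versions):

house/
    scrapy.cfg
    house/
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py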

Define the item fields (items.py)

import scrapy

class HouseItem(scrapy.Item):
    # listing title
    htitle = scrapy.Field()
    # room layout
    hlayout = scrapy.Field()
    # floor area
    harea = scrapy.Field()
    # total price
    htprice = scrapy.Field()
    # unit price (per square metre)
    hsprice = scrapy.Field()
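
For reference, a HouseItem behaves much like a dict: fields are read and written with item['field'], and dict(item) converts the populated fields (the pipeline further below relies on this). The value used here is only illustrative:

item = HouseItem()
item['htitle'] = 'example title'   # fields are accessed like dict keys
print(dict(item))                  # {'htitle': 'example title'}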

Set the request headers and the item pipeline priority

settings.py

DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
}

ITEM_PIPELINES = {
    'house.pipelines.HousePipeline': 300,
}

The number 300 is the pipeline's order; when several pipelines are enabled they run from low to high within the 0-1000 range.
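
One more setting worth knowing about: depending on the Scrapy version, a newly generated project sets ROBOTSTXT_OBEY = True, which makes Scrapy skip pages disallowed by the site's robots.txt. If requests appear to be filtered for that reason, the option can be turned off in settings.py (keeping the site's terms of use in mind):

ROBOTSTXT_OBEY = False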

Write the spider's parse function (in a module under house/spiders/)

import scrapy
from house.items import HouseItem

class houseSpider(scrapy.Spider):

    name = "house"
    allowed_domains = ['cd.58.com']
    url = "http://cd.58.com/ershoufang/pn"
    offset = 1
    # starting URL for the crawl
    start_urls = [url + str(offset)]

    def parse(self, response):

        for each in response.xpath("//ul[@class='house-list-wrap']/li"):
            item = HouseItem()
            # listing title
            item['htitle'] = each.xpath(
                "./div[@class='list-info']/h2[@class='title']/a/text()").extract()[0]
            # room layout
            item['hlayout'] = each.xpath(
                "./div[@class='list-info']/p[@class='baseinfo']/span[1]/text()").extract()[0]
            # floor area
            item['harea'] = each.xpath(
                "./div[@class='list-info']/p[@class='baseinfo']/span[2]/text()").extract()[0]
            # total price: the number ...
            htprice1 = each.xpath("./div[@class='price']/p[@class='sum']/b/text()").extract()[0]
            # ... and its unit (万, i.e. 10,000 yuan)
            htprice2 = each.xpath("./div[@class='price']/p[@class='sum']/text()").extract()[0]
            item['htprice'] = htprice1 + htprice2
            # unit price (per square metre)
            item['hsprice'] = each.xpath(
                "./div[@class='price']/p[@class='unit']/text()").extract()[0]

            yield item

        if self.offset < 70:
            self.offset += 1
            # after finishing one page, request the next page and let self.parse handle its response
            yield scrapy.Request(self.url + str(self.offset), callback=self.parse)
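
A robustness note: each.xpath(...).extract()[0] raises IndexError whenever a listing lacks one of the expected nodes. Scrapy selectors also provide extract_first() (and get() in newer releases), which returns None instead, for example:

item['htitle'] = each.xpath(
    "./div[@class='list-info']/h2[@class='title']/a/text()").extract_first()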

Set up the item pipeline

pipelines.py. JSON output is used as the example here; this part could instead write to a database (SQLite/MySQL/MongoDB) or another store.

import json

class HousePipeline(object):

    def __init__(self):
        # output file; each item is appended as one JSON object per line
        self.filename = open("house.json", "wb")

    def process_item(self, item, spider):
        # serialize the item as JSON, keeping non-ASCII characters readable
        text = json.dumps(dict(item), ensure_ascii=False) + ',\n'
        self.filename.write(text.encode("utf-8"))
        return item

    def close_spider(self, spider):
        self.filename.close()
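
A small variant: instead of opening the file in __init__, Scrapy will also call an open_spider(self, spider) hook when the crawl starts, which ties the file's lifetime to the crawl:

import json

class HousePipeline(object):

    def open_spider(self, spider):
        # called once when the spider starts
        self.filename = open("house.json", "wb")

    def process_item(self, item, spider):
        text = json.dumps(dict(item), ensure_ascii=False) + ',\n'
        self.filename.write(text.encode("utf-8"))
        return item

    def close_spider(self, spider):
        # called once when the spider finishes
        self.filename.close()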

Run

List the spiders in the current project

scrapy list

Start the crawl (scrapy crawl followed by the spider name, which is house in this project)

scrapy crawl house
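
As an alternative to the custom JSON pipeline, Scrapy's built-in feed export can write the scraped items straight to a file:

scrapy crawl house -o house.json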