Overview: Web Crawling in Practice with Scrapy
Install
(1) lxml: a Python library for fast, flexible processing of XML and HTML (the parsing itself is done by the C libraries libxml2 and libxslt).
Install command:

    python -m pip install lxml
Download page:

    https://pypi.python.org/pypi/lxml/3.3.1
(2) setuptools: usually already installed; run python -m pip list at a command prompt to check.
Install command:

    python -m pip install setuptools
Download page:

    https://pypi.python.org/packages/2.7/s/setuptools
(3) zope.interface: an interface package for Python, required by Twisted.
Install command:

    python -m pip install zope.interface
Alternatively, use the setuptools installed in step (2) to install an egg file downloaded from:

    https://pypi.python.org/pypi/zope.interface/4.1.0#downloads
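An egg downloaded this way is installed with setuptools' easy_install; the angle-bracketed part is a placeholder for whatever filename you actually downloaded:

    easy_install <downloaded zope.interface .egg file>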
(4) Twisted: an event-driven networking engine written in Python.
Install command:

    python -m pip install Twisted
If the installation fails, download Twisted-17.9.0-cp36-cp36m-win_amd64.whl from:

    https://download.lfd.uci.edu/pythonlibs/n1rrk3iq/Twisted-17.9.0-cp36-cp36m-win_amd64.whl
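Then install the downloaded wheel with pip, run from the directory you saved it to:

    python -m pip install Twisted-17.9.0-cp36-cp36m-win_amd64.whl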
(5) pyOpenSSL: a Python interface to the OpenSSL library.
Install command:

    python -m pip install pyOpenSSL
Download page:

    https://launchpad.net/pyopenssl
(6) pywin32
Download page:

    https://sourceforge.net/projects/pywin32/files/pywin32/Build%20220/
During installation you may hit the error "Python version 3.6-32 required, which was not found in the registry."
Workarounds:
i. Run the registry-fix script win32-error.py from https://github.com/jm199504/Scrapy-demo (a sketch of what it does follows this list).
ii. Or fix the registry by hand: run regedit, search for PythonCore, create a 3.6-32 key alongside the existing 3.6 key, create InstallPath and PythonPath subkeys under it, and give their default values the same data as the corresponding subkeys under 3.6. The pywin32 installer will then complete.
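For reference, here is a minimal sketch of what such a registry-fix script does, assuming the standard PythonCore registry layout (it mirrors the idea of win32-error.py, not its exact code; run it with administrator rights, since it writes to HKEY_LOCAL_MACHINE):

    import sys
    import winreg  # standard library on Windows

    # Version tag the pywin32 installer searches for, e.g. "3.6-32".
    version = "%d.%d-32" % sys.version_info[:2]
    install_path = sys.prefix  # the running Python's install directory
    python_path = "%s\\Lib;%s\\DLLs" % (install_path, install_path)

    base = "SOFTWARE\\Python\\PythonCore\\%s" % version

    # Create InstallPath and PythonPath with the default values the
    # installer expects, so it can find this Python interpreter.
    for sub, value in (("InstallPath", install_path), ("PythonPath", python_path)):
        key = winreg.CreateKey(winreg.HKEY_LOCAL_MACHINE, "%s\\%s" % (base, sub))
        winreg.SetValue(key, "", winreg.REG_SZ, value)
        winreg.CloseKey(key)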
(7)Scrapy
Install command (pip also works: python -m pip install Scrapy):

    easy_install scrapy
    scrapy    # the bare command prints Scrapy's version and help, confirming the install
Create the project
    scrapy startproject house
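For orientation, startproject generates roughly this layout (newer Scrapy versions also add a middlewares.py); the files edited in the following steps all live under the inner house/ package:

    house/
        scrapy.cfg
        house/
            __init__.py
            items.py
            pipelines.py
            settings.py
            spiders/
                __init__.py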
Define the item fields (items.py)
    import scrapy

    class HouseItem(scrapy.Item):
        htitle = scrapy.Field()   # listing title
        hlayout = scrapy.Field()  # floor plan / layout
        harea = scrapy.Field()    # floor area
        htprice = scrapy.Field()  # total price
        hsprice = scrapy.Field()  # unit price
Set the request headers and pipeline priority
settings.py
    DEFAULT_REQUEST_HEADERS = {
        "User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    }
    ITEM_PIPELINES = {
        # Values run 0-1000; lower numbers execute earlier when several pipelines are enabled.
        'house.pipelines.HousePipeline': 300,
    }
Write the parse function (in the spider file under house/spiders/)
    import scrapy
    from house.items import HouseItem

    class HouseSpider(scrapy.Spider):
        name = "house"
        allowed_domains = ['cd.58.com']
        url = "http://cd.58.com/ershoufang/pn"
        offset = 1
        start_urls = [url + str(offset)]

        def parse(self, response):
            # Each li in the listing ul is one second-hand house entry.
            for each in response.xpath("//ul[@class='house-list-wrap']/li"):
                item = HouseItem()
                item['htitle'] = each.xpath("./div[@class='list-info']/h2[@class='title']/a/text()").extract()[0]
                item['hlayout'] = each.xpath("./div[@class='list-info']/p[@class='baseinfo']/span[1]/text()").extract()[0]
                item['harea'] = each.xpath("./div[@class='list-info']/p[@class='baseinfo']/span[2]/text()").extract()[0]
                # The total price is split between a bold number and a trailing unit string.
                htprice1 = each.xpath("./div[@class='price']/p[@class='sum']/b/text()").extract()[0]
                htprice2 = each.xpath("./div[@class='price']/p[@class='sum']/text()").extract()[0]
                item['htprice'] = htprice1 + htprice2
                item['hsprice'] = each.xpath("./div[@class='price']/p[@class='unit']/text()").extract()[0]
                yield item

            # Follow pagination, up to page 70.
            if self.offset < 70:
                self.offset += 1
                yield scrapy.Request(self.url + str(self.offset), callback=self.parse)
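Before running the full crawl, the XPath expressions can be sanity-checked interactively with Scrapy's shell (the URL below is page 1 of the listing; 58.com's markup may well have changed since this was written):

    scrapy shell "http://cd.58.com/ershoufang/pn1"
    >>> response.xpath("//ul[@class='house-list-wrap']/li").extract_first()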
Set up the pipeline
pipelines.py
JSON output is used as the example; this stage could instead write to a database (SQLite/MySQL/MongoDB, etc.).
    import json

    class HousePipeline(object):
        def __init__(self):
            # Binary mode: UTF-8 bytes are written explicitly below.
            self.filename = open("house.json", "wb")

        def process_item(self, item, spider):
            # Serialize each item as a JSON object, one per line.
            text = json.dumps(dict(item), ensure_ascii=False) + ',\n'
            self.filename.write(text.encode("utf-8"))
            return item

        def close_spider(self, spider):
            self.filename.close()
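Note that house.json as written above (objects separated by trailing commas) is not one valid JSON document. Here is a sketch of a JSON Lines variant using the same pipeline hooks, which keeps every line independently parseable (the class and file names are illustrative, not from the original project):

    import json

    class HouseJsonLinesPipeline(object):
        # open_spider/close_spider are the standard pipeline lifecycle hooks.
        def open_spider(self, spider):
            self.file = open("house.jl", "w", encoding="utf-8")

        def process_item(self, item, spider):
            # One complete JSON object per line (JSON Lines format).
            self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
            return item

        def close_spider(self, spider):
            self.file.close()

To use it, point ITEM_PIPELINES at house.pipelines.HouseJsonLinesPipeline instead of (or alongside) HousePipeline.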
Run
List the spiders in the current project:
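From the project directory:

    scrapy list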
Start the crawl:
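Launch the spider by the name attribute defined above:

    scrapy crawl house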