Overview: Web Crawling in Practice with Scrapy
Install
(1) lxml: a Python library for fast, flexible processing of XML and HTML (the parsing itself is done by the C libraries libxml2 and libxslt).
Install command:

    python -m pip install lxml
Download page:

    https://pypi.python.org/pypi/lxml/3.3.1
(2) setuptools: usually already installed; run python -m pip list at a command prompt to check.
Install command:

    python -m pip install setuptools
Download page:

    https://pypi.python.org/packages/2.7/s/setuptools
(3) zope.interface: an interface package for Python, required by Twisted.
Install command:

    python -m pip install zope.interface
Alternatively, use the setuptools installed in step (2) to install an egg file downloaded from:

    https://pypi.python.org/pypi/zope.interface/4.1.0#downloads
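An egg downloaded this way is installed with setuptools' easy_install; the angle-bracketed part is a placeholder for whatever filename you actually downloaded:

    easy_install <downloaded zope.interface .egg file>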
(4) Twisted: an event-driven networking engine written in Python.
Install command:

    python -m pip install Twisted
If the installation fails, download Twisted-17.9.0-cp36-cp36m-win_amd64.whl from:

    https://download.lfd.uci.edu/pythonlibs/n1rrk3iq/Twisted-17.9.0-cp36-cp36m-win_amd64.whl
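Then install the downloaded wheel with pip, run from the directory you saved it to:

    python -m pip install Twisted-17.9.0-cp36-cp36m-win_amd64.whl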
(5) pyOpenSSL: a Python interface to the OpenSSL library.
Install command:

    python -m pip install pyOpenSSL
Download page:

    https://launchpad.net/pyopenssl
(6) pywin32
Download page:

    https://sourceforge.net/projects/pywin32/files/pywin32/Build%20220/
During installation you may hit the error "Python version 3.6-32 required, which was not found in the registry."
Workarounds:
i. Run the registry-fix script win32-error.py from https://github.com/jm199504/Scrapy-demo (a sketch of what it does follows this list).
ii. Or fix the registry by hand: run regedit, search for PythonCore, create a 3.6-32 key alongside the existing 3.6 key, create InstallPath and PythonPath subkeys under it, and give their default values the same data as the corresponding subkeys under 3.6. The pywin32 installer will then complete.
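For reference, here is a minimal sketch of what such a registry-fix script does, assuming the standard PythonCore registry layout (it mirrors the idea of win32-error.py, not its exact code; run it with administrator rights, since it writes to HKEY_LOCAL_MACHINE):

    import sys
    import winreg  # standard library on Windows

    # Version tag the pywin32 installer searches for, e.g. "3.6-32".
    version = "%d.%d-32" % sys.version_info[:2]
    install_path = sys.prefix  # the running Python's install directory
    python_path = "%s\\Lib;%s\\DLLs" % (install_path, install_path)

    base = "SOFTWARE\\Python\\PythonCore\\%s" % version

    # Create InstallPath and PythonPath with the default values the
    # installer expects, so it can find this Python interpreter.
    for sub, value in (("InstallPath", install_path), ("PythonPath", python_path)):
        key = winreg.CreateKey(winreg.HKEY_LOCAL_MACHINE, "%s\\%s" % (base, sub))
        winreg.SetValue(key, "", winreg.REG_SZ, value)
        winreg.CloseKey(key)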
(7)Scrapy
Install command (pip also works: python -m pip install Scrapy):

    easy_install scrapy
    scrapy    # the bare command prints Scrapy's version and help, confirming the install
Create the project
    scrapy startproject house
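For orientation, startproject generates roughly this layout (newer Scrapy versions also add a middlewares.py); the files edited in the following steps all live under the inner house/ package:

    house/
        scrapy.cfg
        house/
            __init__.py
            items.py
            pipelines.py
            settings.py
            spiders/
                __init__.py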
Define the item fields (items.py)
    import scrapy

    class HouseItem(scrapy.Item):
        htitle = scrapy.Field()   # listing title
        hlayout = scrapy.Field()  # floor plan / layout
        harea = scrapy.Field()    # floor area
        htprice = scrapy.Field()  # total price
        hsprice = scrapy.Field()  # unit price
Set the request headers and pipeline priority
settings.py
    DEFAULT_REQUEST_HEADERS = {
        "User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    }
    ITEM_PIPELINES = {
        # Values run 0-1000; lower numbers execute earlier when several pipelines are enabled.
        'house.pipelines.HousePipeline': 300,
    }
Write the parse function (in the spider file under house/spiders/)
    import scrapy
    from house.items import HouseItem

    class HouseSpider(scrapy.Spider):
        name = "house"
        allowed_domains = ['cd.58.com']
        url = "http://cd.58.com/ershoufang/pn"
        offset = 1
        start_urls = [url + str(offset)]

        def parse(self, response):
            # Each li in the listing ul is one second-hand house entry.
            for each in response.xpath("//ul[@class='house-list-wrap']/li"):
                item = HouseItem()
                item['htitle'] = each.xpath("./div[@class='list-info']/h2[@class='title']/a/text()").extract()[0]
                item['hlayout'] = each.xpath("./div[@class='list-info']/p[@class='baseinfo']/span[1]/text()").extract()[0]
                item['harea'] = each.xpath("./div[@class='list-info']/p[@class='baseinfo']/span[2]/text()").extract()[0]
                # The total price is split between a bold number and a trailing unit string.
                htprice1 = each.xpath("./div[@class='price']/p[@class='sum']/b/text()").extract()[0]
                htprice2 = each.xpath("./div[@class='price']/p[@class='sum']/text()").extract()[0]
                item['htprice'] = htprice1 + htprice2
                item['hsprice'] = each.xpath("./div[@class='price']/p[@class='unit']/text()").extract()[0]
                yield item

            # Follow pagination, up to page 70.
            if self.offset < 70:
                self.offset += 1
                yield scrapy.Request(self.url + str(self.offset), callback=self.parse)
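Before running the full crawl, the XPath expressions can be sanity-checked interactively with Scrapy's shell (the URL below is page 1 of the listing; 58.com's markup may well have changed since this was written):

    scrapy shell "http://cd.58.com/ershoufang/pn1"
    >>> response.xpath("//ul[@class='house-list-wrap']/li").extract_first()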
Set up the pipeline
pipelines.py
JSON output is used as the example; this stage could instead write to a database (SQLite/MySQL/MongoDB, etc.).
    import json

    class HousePipeline(object):
        def __init__(self):
            # Binary mode: UTF-8 bytes are written explicitly below.
            self.filename = open("house.json", "wb")

        def process_item(self, item, spider):
            # Serialize each item as a JSON object, one per line.
            text = json.dumps(dict(item), ensure_ascii=False) + ',\n'
            self.filename.write(text.encode("utf-8"))
            return item

        def close_spider(self, spider):
            self.filename.close()
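Note that house.json as written above (objects separated by trailing commas) is not one valid JSON document. Here is a sketch of a JSON Lines variant using the same pipeline hooks, which keeps every line independently parseable (the class and file names are illustrative, not from the original project):

    import json

    class HouseJsonLinesPipeline(object):
        # open_spider/close_spider are the standard pipeline lifecycle hooks.
        def open_spider(self, spider):
            self.file = open("house.jl", "w", encoding="utf-8")

        def process_item(self, item, spider):
            # One complete JSON object per line (JSON Lines format).
            self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
            return item

        def close_spider(self, spider):
            self.file.close()

To use it, point ITEM_PIPELINES at house.pipelines.HouseJsonLinesPipeline instead of (or alongside) HousePipeline.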
Run
List the spiders in the current project:
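From the project directory:

    scrapy list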
Start the crawl:
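Launch the spider by the name attribute defined above:

    scrapy crawl house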