
Problem Solving: Scrapy (1)

Tutorial Objective

Solve the problem of installing the open-source Scrapy crawler package on the Windows operating system.

Key Concepts

First, to install the open-source Scrapy crawler package on Windows, install Python 3.x.

Next, complete the prerequisites by installing three packages:

  1. Install the pywin32 package.
  2. Install the lxml package.
  3. Install the twisted package.

Installing the pywin32 package

pywin32 is a library for accessing the Windows operating system API, allowing us to build Win32 applications in Python.

Install pywin32 directly with pip (the pypiwin32 package on PyPI is a pip-installable distribution of pywin32):

> pip install pypiwin32
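
As a quick sanity check that pywin32 is usable — a minimal sketch, not part of the original post, assuming the install succeeded on a Windows machine:

import win32api  # provided by the pypiwin32 install

# GetComputerName wraps the Win32 API function of the same name.
print(win32api.GetComputerName())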

Installing the lxml package

lxml is a high-performance Python XML library built on top of libxml2 and libxslt; its main tasks are XML parsing and transformation.

First, download the .whl file matching your Python version from http://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml.

Install the lxml package with pip; note that the file name should match the file you actually downloaded.

> pip install lxml-3.8.0-cp36-cp36m-win32.whl
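
To confirm lxml works, here is a small parsing sketch (the XML snippet is made up for illustration):

from lxml import etree

# Parse a tiny XML document and pull out the title text with XPath.
root = etree.fromstring('<posts><post><title>Hello Scrapy</title></post></posts>')
print(root.xpath('//title/text()'))  # ['Hello Scrapy']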

Installing the twisted package

twisted is an event-driven networking engine framework that supports many common transport and application-layer protocols, such as TCP, HTTP, and SSL.

First, download the .whl file matching your Python version from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted.

Install the twisted package with pip; again, use the name of the file you actually downloaded.

> pip install Twisted-17.5.0-cp36-cp36m-win32.whl
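
A minimal event-driven sketch to confirm Twisted runs (illustrative only, not part of the original post):

from twisted.internet import reactor

def say_hello():
    # Called from the reactor's event loop, then shut the loop down.
    print('hello from the Twisted reactor')
    reactor.stop()

reactor.callLater(0, say_hello)  # schedule the callback for the next loop iteration
reactor.run()                    # block until reactor.stop() is called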

Installing the Scrapy crawler package

Now install Scrapy itself, which is very simple: a single command does it.

Install the Scrapy package with pip:

> pip install Scrapy

Check the installed Scrapy version:

> scrapy -v
Scrapy 1.4.0 - no active project

Testing the Scrapy crawler package

Finally, to test Scrapy, we will crawl the titles of the latest posts on the SAS blogs. First create a myspider.py source file, then run it with Scrapy and write the results to a result.json file.

Contents of the myspider.py file:

import scrapy

class BlogSpider(scrapy.spiders.Spider):
    name = 'blogspider'
    start_urls = ['http://blogs.sas.com/content/all-posts/']

    def parse(self, response):
        # Each post title lives in an <a> element under the post's content div.
        for title in response.css('article.post > div.content > a'):
            yield {'title': title.css('a ::text').extract_first()}
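
As the log below shows, the extracted titles keep the page's leading whitespace ('\n\t\t...'). If cleaner output is wanted, one possible variant of parse — a sketch, not from the original post — strips the text before yielding:

    def parse(self, response):
        for title in response.css('article.post > div.content > a'):
            text = title.css('a ::text').extract_first()
            if text:
                yield {'title': text.strip()}  # drop the '\n\t...' prefix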

Run the myspider.py file:

> scrapy runspider myspider.py -o result.json
2017-09-13 23:28:52 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapybot)
2017-09-13 23:28:52 [scrapy.utils.log] INFO: Overridden settings: {'FEED_FORMAT': 'json', 'FEED_URI': 'result.json', 'SPIDER_LOADER_WARN_ONLY': True}
2017-09-13 23:28:52 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2017-09-13 23:28:52 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-09-13 23:28:52 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-09-13 23:28:52 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-09-13 23:28:52 [scrapy.core.engine] INFO: Spider opened
2017-09-13 23:28:52 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-09-13 23:28:52 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-09-13 23:28:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://blogs.sas.com/content/all-posts/> (referer: None)
2017-09-13 23:28:58 [scrapy.core.scraper] DEBUG: Scraped from <200 http://blogs.sas.com/content/all-posts/>
{'title': '\n\t\t\t\t\t\tBusiness intelligence, business users and agility'}
2017-09-13 23:28:58 [scrapy.core.scraper] DEBUG: Scraped from <200 http://blogs.sas.com/content/all-posts/>
{'title': '\n\t\t\t\t\t\tUsing government data for good: Hear leaders share analytics stories'}
2017-09-13 23:28:58 [scrapy.core.scraper] DEBUG: Scraped from <200 http://blogs.sas.com/content/all-posts/>
{'title': '\n\t\t\t\t\t\tCould your company survive a fake news attack?'}
2017-09-13 23:28:58 [scrapy.core.scraper] DEBUG: Scraped from <200 http://blogs.sas.com/content/all-posts/>
{'title': '\n\t\t\t\t\t\tBaby Led Weaning: What it is and How to do it'}
2017-09-13 23:28:58 [scrapy.core.scraper] DEBUG: Scraped from <200 http://blogs.sas.com/content/all-posts/>
{'title': '\n\t\t\t\t\t\tEdge Analytics - So kommt Analytics in den Truck (Teil 3) HEUTE: Konfiguration der Software'}
2017-09-13 23:28:58 [scrapy.core.scraper] DEBUG: Scraped from <200 http://blogs.sas.com/content/all-posts/>
{'title': '\n\t\t\t\t\t\tSimulate multivariate clusters in SAS'}
2017-09-13 23:28:58 [scrapy.core.scraper] DEBUG: Scraped from <200 http://blogs.sas.com/content/all-posts/>
{'title': '\n\t\t\t\t\t\tInvertir en gestión y análisis de datos será decisivo para las empresas colombianas'}
2017-09-13 23:28:58 [scrapy.core.scraper] DEBUG: Scraped from <200 http://blogs.sas.com/content/all-posts/>
{'title': '\n\t\t\t\t\t\tTop machine learning techniques: Add features to training data'}
2017-09-13 23:28:58 [scrapy.core.scraper] DEBUG: Scraped from <200 http://blogs.sas.com/content/all-posts/>
{'title': '\n\t\t\t\t\t\tHow did one school system save money, improve local traffic and make students happier? With fewer bus stops and better bus schedules'}
2017-09-13 23:28:58 [scrapy.core.scraper] DEBUG: Scraped from <200 http://blogs.sas.com/content/all-posts/>
{'title': '\n\t\t\t\t\t\tWhere do hurricanes strike Florida? (110 years of data)'}
2017-09-13 23:28:58 [scrapy.core.scraper] DEBUG: Scraped from <200 http://blogs.sas.com/content/all-posts/>
{'title': '\n\t\t\t\t\t\t3 new outcomes you can expect from self-service data preparation'}
2017-09-13 23:28:58 [scrapy.core.scraper] DEBUG: Scraped from <200 http://blogs.sas.com/content/all-posts/>
{'title': '\n\t\t\t\t\t\tSAS Viya: What’s in it for me, the business?'}
2017-09-13 23:28:58 [scrapy.core.scraper] DEBUG: Scraped from <200 http://blogs.sas.com/content/all-posts/>
{'title': '\n\t\t\t\t\t\tProblem Relatives: Google Doc Add-on vs. Wordy and Misplaced Clauses'}
2017-09-13 23:28:58 [scrapy.core.scraper] DEBUG: Scraped from <200 http://blogs.sas.com/content/all-posts/>
{'title': '\n\t\t\t\t\t\t6 Gründe warum für Reisebloggerin Kerstin Beck das australische Perth das Paradies ist'}
2017-09-13 23:28:58 [scrapy.core.scraper] DEBUG: Scraped from <200 http://blogs.sas.com/content/all-posts/>
{'title': '\n\t\t\t\t\t\tSymbolic derivatives in SAS'}
2017-09-13 23:28:58 [scrapy.core.engine] INFO: Closing spider (finished)
2017-09-13 23:28:58 [scrapy.extensions.feedexport] INFO: Stored json feed (15 items) in: result.json
2017-09-13 23:28:58 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 229,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 54085,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 9, 13, 15, 28, 58, 104852),
'item_scraped_count': 15,
'log_count/DEBUG': 17,
'log_count/INFO': 8,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2017, 9, 13, 15, 28, 52, 792298)}
2017-09-13 23:28:58 [scrapy.core.engine] INFO: Spider closed (finished)

Contents of the result.json crawl output file:

[
{"title": "\n\t\t\t\t\t\tBusiness intelligence, business users and agility"},
{"title": "\n\t\t\t\t\t\tUsing government data for good: Hear leaders share analytics stories"},
{"title": "\n\t\t\t\t\t\tCould your company survive a fake news attack?"},
{"title": "\n\t\t\t\t\t\tBaby Led Weaning: What it is and How to do it"},
{"title": "\n\t\t\t\t\t\tEdge Analytics - So kommt Analytics in den Truck (Teil 3) HEUTE: Konfiguration der Software"},
{"title": "\n\t\t\t\t\t\tSimulate multivariate clusters in SAS"},
{"title": "\n\t\t\t\t\t\tInvertir en gesti\u00f3n y an\u00e1lisis de datos ser\u00e1 decisivo para las empresas colombianas"},
{"title": "\n\t\t\t\t\t\tTop machine learning techniques: Add features to training data"},
{"title": "\n\t\t\t\t\t\tHow did one school system save money, improve local traffic and make students happier? With fewer bus stops and better bus schedules"},
{"title": "\n\t\t\t\t\t\tWhere do hurricanes strike Florida? (110 years of data)"},
{"title": "\n\t\t\t\t\t\t3 new outcomes you can expect from self-service data preparation"},
{"title": "\n\t\t\t\t\t\tSAS Viya: What\u2019s in it for me, the business?"},
{"title": "\n\t\t\t\t\t\tProblem Relatives: Google Doc Add-on vs. Wordy and Misplaced Clauses"},
{"title": "\n\t\t\t\t\t\t6 Gru\u0308nde warum f\u00fcr Reisebloggerin Kerstin Beck das australische Perth das Paradies ist"},
{"title": "\n\t\t\t\t\t\tSymbolic derivatives in SAS"}
]

In summary, once the Scrapy package is installed on Windows, we can crawl data from the web very efficiently, ready for the data preparation that follows.
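
As a small illustration of that follow-on data preparation — a sketch assuming result.json sits in the current working directory:

import json

# Load the crawl output and normalize the titles for downstream use.
with open('result.json', encoding='utf-8') as f:
    items = json.load(f)

titles = [item['title'].strip() for item in items]
print(len(titles), 'titles; first:', titles[0])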

