Leo Yeh's Blog

Problem Solving with Scrapy (2)

Tutorial Goal

Use the Scrapy framework in Python to implement a multi-level crawler that solves the problem of collecting specific content.

Key Concepts

Almost any website that provides a service offers a directory or search feature so that we can find the information we need. But if all we know is a keyword, how do we first search for it and then collect the specific content behind the results? That is the problem this tutorial solves: we implement a multi-level crawler with the Scrapy framework in Python, using YouTube videos as the example.

Introduction to the Scrapy Architecture

Next, we implement the multi-level crawler as a Scrapy example program that works in three stages:

  1. Preparation stage: open the keyword-search page via its URL.
  2. Search stage: from the keyword-search page, obtain the URLs of the content matching the keyword.
  3. Collection stage: from those URLs, collect the specific content.

The data flow in the Scrapy architecture involves five main components (a minimal Item Pipeline sketch follows the list):

  1. Scrapy Engine
  2. Scheduler
  3. Downloader
  4. Spiders
  5. Item Pipeline
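
The example in the next section uses only the first three components directly. As an illustration of the last one, the snippet below is a minimal Item Pipeline sketch, not part of the original program; the class name, module path, and field names are assumptions. It would normalize the scraped count fields before export:

import scrapy

# pipelines.py -- a hypothetical Item Pipeline; Scrapy calls process_item()
# once for every item the Spider yields.
class NormalizeCountsPipeline:
    def process_item(self, item, spider):
        # Convert count fields scraped as strings (e.g. '9860') to integers.
        for field in ('view', 'like', 'dislike'):
            if field in item:
                item[field] = int(item[field])
        return item

# It would be enabled in the project's settings.py, for example:
# ITEM_PIPELINES = {'myproject.pipelines.NormalizeCountsPipeline': 300}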

Scrapy Example Program

The example program below uses only three of these components: the Scrapy Engine, Spiders, and the Downloader.

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    # Preparation stage: open the keyword-search page via its URL.
    keyword = 'keyword'
    start_urls = ['http://www.example.com?q=' + keyword]

    # Search stage: from the keyword-search page, obtain the URL of the
    # content matching the keyword.
    def parse(self, response):
        url = response.xpath('…')[0].extract()
        keyword = response.xpath('…')[0].extract()
        yield scrapy.http.Request(url, callback=self.parse2, meta={'keyword': keyword})

    # Collection stage: from that URL, collect the specific content.
    def parse2(self, response):
        content = response.xpath('…').extract()
        yield {'keyword': response.meta['keyword']}
        yield {'url': response.url}
        yield {'content': content}

The scrapy.http.Request object represents an HTTP request. Its main parameters are url, callback, and meta: url is the address to request, callback is the function that handles the response, and meta carries extra data between callbacks. Requests are generated in the Spider and executed by the Downloader, which produces a Response object. The scrapy.http.Response object represents the HTTP response and is handed back to the Spider for processing.
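
To isolate just this request/response cycle, here is a minimal, self-contained sketch; the spider name, URLs, and meta key are placeholders, not part of the original tutorial:

import scrapy

class RequestDemoSpider(scrapy.Spider):
    # A hypothetical spider that only demonstrates the Request/Response cycle.
    name = 'request_demo'
    start_urls = ['http://www.example.com']

    def parse(self, response):
        # Build a Request; the Downloader executes it and produces a Response.
        yield scrapy.http.Request(
            url='http://www.example.com/detail',  # the address to request
            callback=self.parse_detail,           # handles the Response
            meta={'keyword': 'example'},          # data carried to the callback
        )

    def parse_detail(self, response):
        # The Response delivered back to the Spider exposes the carried data.
        yield {'keyword': response.meta['keyword'], 'url': response.url}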

Applying the Scrapy Example Program

Finally, we apply the example program to the YouTube site to collect the view, like, and dislike counts of videos matching specific keywords, again in three stages:

  1. Preparation stage: open the YouTube search page for each keyword via its URL.
  2. Search stage: from the YouTube search-results page, obtain the YouTube video URL matching the keyword.
  3. Collection stage: from that video URL, collect the video's view, like, and dislike counts.

Edit the YoutubeDataSpider.py code

import scrapy
import urllib.parse

class YoutubeDataSpider(scrapy.spiders.Spider):
    name = 'YoutubeDataSpider'
    # Preparation stage: open the YouTube search page for each keyword.
    keywords = ['SAS Viya', 'SAS 9']
    start_urls = []
    for keyword in keywords:
        start_urls.append('https://www.youtube.com/results?search_query=' + urllib.parse.quote_plus(keyword))

    # Search stage: from the YouTube search-results page, obtain the URL of
    # the first matching video.
    def parse(self, response):
        keyword = response.xpath('//title/text()')[0].extract()
        url = 'https://www.youtube.com' + response.xpath('//div[@class="yt-lockup-content"]//a//@href')[0].extract()
        yield scrapy.http.Request(url, callback=self.parse2, meta={'keyword': keyword})

    # Collection stage: from the video page, collect the view, like, and
    # dislike counts by stripping the surrounding label text.
    def parse2(self, response):
        views = response.xpath('//div[@class="watch-view-count"]/text()')[0].extract().replace("views", "").replace(",", "").replace(" ", "")
        likes = response.xpath('//button[contains(@aria-label, "like")]//@aria-label')[0].extract().replace("like this video along with", "").replace(" other people", "").replace(" other person", "").replace(",", "").replace(" ", "")
        dislikes = response.xpath('//button[contains(@aria-label, "dislike")]//@aria-label')[0].extract().replace("dislike this video along with", "").replace(" other people", "").replace(" other person", "").replace(",", "").replace(" ", "")
        yield {'keyword': response.meta["keyword"]}
        yield {'url': response.url}
        yield {'view': views}
        yield {'like': likes}
        yield {'dislike': dislikes}
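
The chained .replace() calls above are brittle: they depend on YouTube's exact English label text, and the XPath selectors reflect the page layout at the time of writing, so they may break as the markup changes. A more tolerant sketch, where extract_count is a hypothetical helper assuming the count is the first digit run in the label, would pull the number out with a regular expression instead:

import re

def extract_count(label):
    # Find the first run of digits (commas allowed) in a label such as
    # '9,860 views' or 'like this video along with 25 other people'.
    match = re.search(r'\d[\d,]*', label)
    return int(match.group().replace(',', '')) if match else None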

Run the YoutubeDataSpider.py code

> scrapy runspider YoutubeDataSpider.py -o result.json
2017-10-28 20:09:33 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapybot)
2017-10-28 20:09:33 [scrapy.utils.log] INFO: Overridden settings: {'FEED_FORMAT': 'json', 'FEED_URI': 'result.json', 'SPIDER_LOADER_WARN_ONLY': True}
2017-10-28 20:09:33 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2017-10-28 20:09:33 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-10-28 20:09:33 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-10-28 20:09:33 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-10-28 20:09:33 [scrapy.core.engine] INFO: Spider opened
2017-10-28 20:09:33 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-10-28 20:09:33 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-10-28 20:09:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.youtube.com/results?search_query=SAS+9> (referer: None)
2017-10-28 20:09:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.youtube.com/results?search_query=SAS+Viya> (referer: None)
2017-10-28 20:09:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.youtube.com/watch?v=6cIppmnzL6M> (referer: https://www.youtube.com/results?search_query=SAS+9)
2017-10-28 20:09:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.youtube.com/watch?v=6cIppmnzL6M>
{'keyword': 'SAS 9 - YouTube'}
2017-10-28 20:09:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.youtube.com/watch?v=6cIppmnzL6M>
{'url': 'https://www.youtube.com/watch?v=6cIppmnzL6M'}
2017-10-28 20:09:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.youtube.com/watch?v=6cIppmnzL6M>
{'view': '9860'}
2017-10-28 20:09:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.youtube.com/watch?v=6cIppmnzL6M>
{'like': '25'}
2017-10-28 20:09:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.youtube.com/watch?v=6cIppmnzL6M>
{'dislike': '1'}
2017-10-28 20:09:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.youtube.com/watch?v=2wXAUBLJGuo&list=PLVBcK_IpFVi8gMnQgAwBWrn0yqjyBnCLV> (referer: https://www.youtube.com/results?search_query=SAS+Viya)
2017-10-28 20:09:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.youtube.com/watch?v=2wXAUBLJGuo&list=PLVBcK_IpFVi8gMnQgAwBWrn0yqjyBnCLV>
{'keyword': 'SAS Viya - YouTube'}
2017-10-28 20:09:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.youtube.com/watch?v=2wXAUBLJGuo&list=PLVBcK_IpFVi8gMnQgAwBWrn0yqjyBnCLV>
{'url': 'https://www.youtube.com/watch?v=2wXAUBLJGuo&list=PLVBcK_IpFVi8gMnQgAwBWrn0yqjyBnCLV'}
2017-10-28 20:09:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.youtube.com/watch?v=2wXAUBLJGuo&list=PLVBcK_IpFVi8gMnQgAwBWrn0yqjyBnCLV>
{'view': '3866'}
2017-10-28 20:09:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.youtube.com/watch?v=2wXAUBLJGuo&list=PLVBcK_IpFVi8gMnQgAwBWrn0yqjyBnCLV>
{'like': '22'}
2017-10-28 20:09:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.youtube.com/watch?v=2wXAUBLJGuo&list=PLVBcK_IpFVi8gMnQgAwBWrn0yqjyBnCLV>
{'dislike': '4'}
2017-10-28 20:09:35 [scrapy.core.engine] INFO: Closing spider (finished)
2017-10-28 20:09:35 [scrapy.extensions.feedexport] INFO: Stored json feed (10 items) in: result.json
2017-10-28 20:09:35 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1260,
'downloader/request_count': 4,
'downloader/request_method_count/GET': 4,
'downloader/response_bytes': 154656,
'downloader/response_count': 4,
'downloader/response_status_count/200': 4,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 10, 28, 12, 9, 35, 613798),
'item_scraped_count': 10,
'log_count/DEBUG': 15,
'log_count/INFO': 8,
'request_depth_max': 1,
'response_received_count': 4,
'scheduler/dequeued': 4,
'scheduler/dequeued/memory': 4,
'scheduler/enqueued': 4,
'scheduler/enqueued/memory': 4,
'start_time': datetime.datetime(2017, 10, 28, 12, 9, 33, 973120)}
2017-10-28 20:09:35 [scrapy.core.engine] INFO: Spider closed (finished)
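
Since Scrapy's JSON feed export writes all items as a single JSON array, the exported result.json can be inspected with a few lines of Python (a quick sketch, assuming the -o option shown above):

import json

# Load the feed that `scrapy runspider YoutubeDataSpider.py -o result.json` produced.
with open('result.json') as f:
    items = json.load(f)

for item in items:
    print(item)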

In summary, we implemented a multi-level crawler with the Scrapy framework in Python, using YouTube videos as the example and collecting each video's view, like, and dislike counts.
