Scrapy Notes

Creating project

Create a project in PyCharm called scrapy_learn, then run:

scrapy startproject quote  # project name, created inside scrapy_learn

A spider must be created inside the spiders folder.

import scrapy

class QuoteSpider(scrapy.Spider):
    name = 'quotes'  # name of the spider
    start_urls = ['http://quotes.toscrape.com/']  # Scrapy looks up 'name', 'start_urls' and 'parse' by name, so don't rename them

    def parse(self, response):
        title = response.css('title::text').extract()
        yield {'titletext': title}

Running the spider

cd quote  # name of the project

scrapy crawl quotes # name of the spider

Using the shell (CSS selectors) inside Scrapy

scrapy shell "http://quotes.toscrape.com/"

response.css("title::text").extract()

response.css("title::text")[0].extract()  # get a single item out of the list

response.css("span.text::text").extract()  # '.' selects a class, '#' selects an ID

Using xpath selector

response.xpath("//title").extract()

response.xpath("//title/text()").extract()

response.xpath("//span[@class='text']/text()").extract()  # all quote texts

response.xpath("//span[@class='text']/text()")[1].extract()  # second quote; for an ID use [@id='...'] instead of [@class='...']

response.css("li.next a").extract()  # the li tag contains an 'a' tag inside it

Extracting data -> temporary containers (items) -> storing in a database

In items.py

import scrapy

class QuoteItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    author = scrapy.Field()
    tag = scrapy.Field()

In spider(quotes_spider.py)

import scrapy
from ..items import QuoteItem

class QuoteSpider(scrapy.Spider):
    name = 'quotes'  # name of the spider
    start_urls = ['http://quotes.toscrape.com/']  # Scrapy looks up 'name', 'start_urls' and 'parse' by name, so don't rename them

    def parse(self, response):
        items = QuoteItem()  # create an instance of the item class

        all_div_quotes = response.css('div.quote')

        for quotes in all_div_quotes:
            title = quotes.css('span.text::text').extract()
            author = quotes.css('.author::text').extract()
            tag = quotes.css('.tag::text').extract()

            items['title'] = title
            items['author'] = author
            items['tag'] = tag

            yield items  # yield inside the loop, one item per quote

Storing the data as JSON, XML or CSV

scrapy crawl quotes -o items.json  # -o: output file; the extension picks the format

scrapy crawl quotes -o items.csv

scrapy crawl quotes -o items.xml

scraped data -> item containers -> JSON/CSV

scraped data -> item containers -> pipeline -> MySQL database

In settings.py, uncomment the "Configure item pipelines" section:

ITEM_PIPELINES = {
    'quote.pipelines.QuotePipeline': 300,  # the lower the number, the higher the priority
}

Install the MySQL driver: pip install mysql-connector-python
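A minimal sketch of what that pipeline could look like in pipelines.py, assuming a local MySQL server and a quote_tb table you have created yourself (the credentials, database name and table name below are all placeholders):

```python
class QuotePipeline:
    def open_spider(self, spider):
        # imported lazily so the project still loads if the driver is missing
        import mysql.connector
        # placeholder connection details; change these to match your server
        self.conn = mysql.connector.connect(
            host='localhost', user='root', password='', database='quotes_db'
        )
        self.cur = self.conn.cursor()

    def process_item(self, item, spider):
        # the item fields are lists (extract() returns lists), so join them
        self.cur.execute(
            "INSERT INTO quote_tb (title, author, tag) VALUES (%s, %s, %s)",
            (''.join(item['title']), ''.join(item['author']), ','.join(item['tag'])),
        )
        self.conn.commit()
        return item  # return the item so later pipelines can still see it

    def close_spider(self, spider):
        self.cur.close()
        self.conn.close()
```

open_spider and close_spider are called once per crawl, so the connection is opened and closed only once rather than per item.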