Scrapy Notes

Creating project

Create a project in PyCharm called scrapy_learn, then run:

scrapy startproject quote  # project name, created inside scrapy_learn

A spider must be created inside the spiders folder.

import scrapy

class QuoteSpider(scrapy.Spider):
    name = 'quotes'  # name of the spider
    start_urls = ['http://quotes.toscrape.com/']  # Scrapy looks up 'name', 'start_urls' and 'parse' by name, so don't rename them

    def parse(self, response):
        title = response.css('title::text').extract()
        yield {'titletext': title}

Running the spider

cd quote  # name of the project

scrapy crawl quotes # name of the spider

Using the shell (CSS selectors) inside Scrapy

scrapy shell "http://quotes.toscrape.com/"

response.css("title::text").extract()

response.css("title::text")[0].extract()  # get a single item out of the list

response.css("span.text::text").extract()  # '.' selects a class, '#' selects an ID

Using xpath selector

response.xpath("//title").extract()

response.xpath("//title/text()").extract()

response.xpath("//span[@class='text']/text()").extract()  # all quote texts

response.xpath("//span[@class='text']/text()")[1].extract()  # second quote; for an ID use [@id='...'] instead of [@class='...']

response.css("li.next a").extract()  # the li tag contains an 'a' tag inside it

Extracting data -> temporary containers (items) -> storing in a database

In items.py

import scrapy

class QuoteItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    author = scrapy.Field()
    tag = scrapy.Field()

In spider(quotes_spider.py)

import scrapy
from ..items import QuoteItem

class QuoteSpider(scrapy.Spider):
    name = 'quotes'  # name of the spider
    start_urls = ['http://quotes.toscrape.com/']  # Scrapy looks up 'name', 'start_urls' and 'parse' by name, so don't rename them

    def parse(self, response):
        items = QuoteItem()  # create an instance of the item class

        all_div_quotes = response.css('div.quote')

        for quotes in all_div_quotes:
            title = quotes.css('span.text::text').extract()
            author = quotes.css('.author::text').extract()
            tag = quotes.css('.tag::text').extract()

            items['title'] = title
            items['author'] = author
            items['tag'] = tag

            yield items  # yield inside the loop, one item per quote

Storing the data as JSON, XML or CSV

scrapy crawl quotes -o items.json  # -o: output file; the extension picks the format

scrapy crawl quotes -o items.csv

scrapy crawl quotes -o items.xml

scraped data -> item containers -> JSON/CSV

scraped data -> item containers -> pipeline -> MySQL database

In settings.py, uncomment the "Configure item pipelines" section:

ITEM_PIPELINES = {
    'quote.pipelines.QuotePipeline': 300,  # the lower the number, the higher the priority
}

Install the MySQL driver: pip install mysql-connector-python
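A minimal sketch of what that pipeline could look like in pipelines.py, assuming a local MySQL server and a quote_tb table you have created yourself (the credentials, database name and table name below are all placeholders):

```python
class QuotePipeline:
    def open_spider(self, spider):
        # imported lazily so the project still loads if the driver is missing
        import mysql.connector
        # placeholder connection details; change these to match your server
        self.conn = mysql.connector.connect(
            host='localhost', user='root', password='', database='quotes_db'
        )
        self.cur = self.conn.cursor()

    def process_item(self, item, spider):
        # the item fields are lists (extract() returns lists), so join them
        self.cur.execute(
            "INSERT INTO quote_tb (title, author, tag) VALUES (%s, %s, %s)",
            (''.join(item['title']), ''.join(item['author']), ','.join(item['tag'])),
        )
        self.conn.commit()
        return item  # return the item so later pipelines can still see it

    def close_spider(self, spider):
        self.cur.close()
        self.conn.close()
```

open_spider and close_spider are called once per crawl, so the connection is opened and closed only once rather than per item.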