How to Connect Scrapy to your PostgreSQL database

Nicolas Bourriez
3 min read · Apr 2, 2021


Directly store your scraped data in your database with pipelines

Use pipelines to connect your web scraper to your database

For a side project, I wanted to store the items I scraped in my personal PostgreSQL database, hosted on my OVH VPS.

The problem was that the output data, which came in the form of JSON files, was too heavy and awkward to manipulate. I simply couldn't find an efficient way to export it to the Postgres database without risking losing some data along the way, or spending ages uploading the files one by one with

scp dummy_heavy_data.json myvps@36.374.394.43.dummy:/illustrative/directory

So I dug for information and racked my brain to find a solution to this problem, and when I finally found one (that was, in addition, user-friendly and fast to implement), I definitely praised Scrapy and its pipelines.py file.

Let me walk you through the method to connect your web scraper to your SQL database.

1. Be sure to have a schema configured for your items

For the method to work, the schema of your SQL table must match the item you configure in your items.py file.

Let’s consider both situations:

Option n°1: No table is configured in your database

In this case, start from scratch and create your table in your database, e.g. a table containing tweets about blockchain:

CREATE TABLE blockchain_tweets (
    user_id serial PRIMARY KEY,
    username VARCHAR ( 50 ) UNIQUE NOT NULL,
    tweet VARCHAR ( 255 ) NOT NULL,
    nb_of_retweets INT,
    nb_of_likes INT,
    nb_of_comments INT,
    tweeted_on TIMESTAMP NOT NULL
);

Once the table is initiated inside your SQL database, configure your Scrapy item schema in items.py:

import scrapy


class TweetItem(scrapy.Item):
    # define the fields for your item here like:
    user_id = scrapy.Field()
    username = scrapy.Field()
    tweet = scrapy.Field()
    nb_of_retweets = scrapy.Field()
    nb_of_likes = scrapy.Field()
    nb_of_comments = scrapy.Field()
    tweeted_on = scrapy.Field()
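
To see where these items come from, here is a minimal spider sketch that yields TweetItem objects. The spider name, start URL and CSS selectors are hypothetical placeholders for whatever site you are actually scraping:

import scrapy

from webscrapping.items import TweetItem  # "webscrapping" is the project name used in settings.py below


class BlockchainTweetSpider(scrapy.Spider):
    name = "blockchain_tweets"  # placeholder spider name
    start_urls = ["https://example.com/blockchain"]  # placeholder URL

    def parse(self, response):
        # Placeholder selectors: adapt them to the page you actually scrape
        for row in response.css("div.tweet"):
            item = TweetItem()
            item["user_id"] = row.attrib.get("data-user-id")  # the pipeline below inserts user_id explicitly
            item["username"] = row.css(".username::text").get()
            item["tweet"] = row.css(".text::text").get()
            item["nb_of_retweets"] = row.css(".retweets::text").get()
            item["nb_of_likes"] = row.css(".likes::text").get()
            item["nb_of_comments"] = row.css(".comments::text").get()
            item["tweeted_on"] = row.css("time::attr(datetime)").get()
            yield item  # each yielded item is handed to the pipeline's process_item()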

Option n°2: A table is already configured and awaits data

If a table is already sitting there waiting for data, simply make sure your item uses the exact same schema so that the pipeline can do its job.
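
If you are not sure what the existing table looks like, a quick way to check (assuming you can reach the database with psql) is to describe it with \d blockchain_tweets, or to list its columns in plain SQL:

SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_name = 'blockchain_tweets';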

2. Configure your pipeline

The whole game is played in your pipelines.py file. Following the blockchain tweets example, here is the configuration of that file (with dummy values for the database config):

import psycopg2


class TweetPipeline(object):
    """Tweets pipeline for storing scraped items in the database"""

    def open_spider(self, spider):
        # Configure the connection to the database and connect to it
        hostname = 'localhost'
        username = 'USERNAME_USED_TO_CONNECT_TO_DATABASE'
        password = 'PASSWORD_TO_CONNECT_TO_DATABASE'
        database = 'NAME_OF_DATABASE'
        self.connection = psycopg2.connect(
            host=hostname, user=username, password=password,
            dbname=database)
        self.cur = self.connection.cursor()

    def close_spider(self, spider):
        # Disconnect from the database
        self.cur.close()
        self.connection.close()

    def process_item(self, item, spider):
        # Process each scraped item and insert it into the PostgreSQL table
        try:
            # Execute the SQL command to insert the data into the table
            self.cur.execute(
                "insert into blockchain_tweets(user_id, username, tweet, nb_of_retweets, nb_of_likes, nb_of_comments, tweeted_on) values(%s, %s, %s, %s, %s, %s, %s)",
                (item['user_id'], item['username'], item['tweet'],
                 item['nb_of_retweets'], item['nb_of_likes'],
                 item['nb_of_comments'], item['tweeted_on']))
            self.connection.commit()
        except psycopg2.Error:
            # Undo the failed insert so the connection stays usable, then re-raise
            self.connection.rollback()
            raise
        return item
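
A small variation, if you would rather not hardcode credentials in pipelines.py: the pipeline can read them from settings.py through Scrapy's from_crawler hook. Here is a minimal sketch, assuming you define POSTGRES_* settings of your own choosing (these names are placeholders, not standard Scrapy settings):

import psycopg2


class TweetPipeline(object):
    """Variant of the pipeline that reads connection details from settings.py"""

    def __init__(self, hostname, username, password, database):
        self.hostname = hostname
        self.username = username
        self.password = password
        self.database = database

    @classmethod
    def from_crawler(cls, crawler):
        # Pull the (hypothetical) POSTGRES_* settings defined in settings.py
        return cls(
            hostname=crawler.settings.get('POSTGRES_HOST', 'localhost'),
            username=crawler.settings.get('POSTGRES_USER'),
            password=crawler.settings.get('POSTGRES_PASSWORD'),
            database=crawler.settings.get('POSTGRES_DB'),
        )

    def open_spider(self, spider):
        # Connect using the values pulled from settings.py
        self.connection = psycopg2.connect(
            host=self.hostname, user=self.username,
            password=self.password, dbname=self.database)
        self.cur = self.connection.cursor()

close_spider and process_item stay exactly as above.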

Once everything is configured and ready to go, we can activate the connection before scraping.

3. Activate the connection

The last thing to do is to activate the connection in settings.py so that Scrapy knows it must enable the pipeline for this scrape.

To do so, look for the following piece of code, uncomment it, and modify it with the name of your pipeline:

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'webscrapping.pipelines.TweetPipeline': 300,
}
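
If you went with the settings-based credentials sketched in step 2, this is also where you would declare them (again, these POSTGRES_* names are just placeholders of my own choosing):

# Hypothetical connection settings read by TweetPipeline.from_crawler
POSTGRES_HOST = 'localhost'
POSTGRES_USER = 'USERNAME_USED_TO_CONNECT_TO_DATABASE'
POSTGRES_PASSWORD = 'PASSWORD_TO_CONNECT_TO_DATABASE'
POSTGRES_DB = 'NAME_OF_DATABASE'

Then launch your spider as usual, e.g. scrapy crawl blockchain_tweets (with whichever spider name you defined), and every yielded item will flow through the pipeline into the table.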

And off you go! You can now scrape without having to worry about your data; it will automatically be stored in your connected database :)
