21.7. Spidering Twitter using a database¶

Note

Spidering Twitter uses the Tweepy API at https://docs.tweepy.org/en/stable/api.html. This requires a developer account, which you may or may not be granted. To apply for a developer account see the instructions at https://developer.twitter.com/en/support/twitter-api/developer-account.

In this section, we will create a simple spidering program that will go through Twitter accounts and build a database of them.

One of the problems of any kind of spidering program is that it needs to be able to be stopped and restarted many times and you do not want to lose the data that you have retrieved so far. You don’t want to always restart your data retrieval at the very beginning so we want to store data as we retrieve it so our program can start back up and pick up where it left off.

Q-1: Which of the following is a problem with spidering programs?

Spidering programs need to be stopped and restarted frequently.
One of the problems of any kind of spidering program is that it needs to be able to be stopped and restarted many times and you do not want to lose the data that you have retrieved so far.
You can lose the data you already gathered.
You won't lose any data if you store the data in a database
Spidering programs cannot be restarted.
Spidering programs can be restarted from where they leave off.

We will start by retrieving one person’s Twitter friends and their statuses, looping through the list of friends, and adding each of the friends to a database to be retrieved in the future. After we process one person’s Twitter friends, we check in our database and retrieve one of the friends of the friend. We do this over and over, picking an “unvisited” person, retrieving their friend list, and adding friends we have not seen to our list for a future visit.

We also track how many times we have seen a particular friend in the database to get some sense of their “popularity”.

By storing our list of known accounts and whether we have retrieved the account or not, and how popular the account is in a database on the disk of the computer, we can stop and restart our program as many times as we like.

Q-2: Which of the following are true?

Programs using databases may be stopped and restarted, but do not retain the data.
Spidering programs do retain data by storing it in a database
Spidering programs gather data by looping through data
Spidering programs do loop through data

This program is a bit complex.

Here is the source code for our Twitter spidering application:

import json
import sqlite3
import twitter_info
import tweepy
import os

# Set up OAuth2 for access to twitter
auth = tweepy.OAuth2BearerHandler(twitter_info.bearer_token)
api = tweepy.API(auth, wait_on_rate_limit=True)

# set up the database
dir = os.path.dirname(__file__) + os.sep
conn = sqlite3.connect(dir + 'spider.db')
cur = conn.cursor()
cur.execute('''
        CREATE TABLE IF NOT EXISTS Twitter
        (name TEXT, retrieved INTEGER, friends INTEGER)''')

# loop till the user types quit
while True:
    acct = input('Enter a Twitter screen name, or hit enter, or type quit: ')
    if (acct == 'quit'): break
    if (len(acct) < 1): # enter so pick someone to visit from the database

        # select a screen name that hasn't been visited yet
        cur.execute('SELECT name FROM Twitter WHERE retrieved = 0 LIMIT 1')
        try:
            acct = cur.fetchone()[0]
        except:
            print('No unretrieved Twitter accounts found')
            continue

    # Fetch friends for this screen name
    print("Fetching friends for screen name: " + acct)
    idList = api.get_friend_ids(screen_name=acct)

    # update the database to show that you have retrieved friends for this one
    cur.execute('UPDATE Twitter SET retrieved=1 WHERE name = ?', (acct, ))

    # loop through 5 ids of the friends (to not get too many at once)
    countnew = 0
    countold = 0
    num = min(5, len(idList)) # in case they do not have as least 5 friends
    for i in range(num):

        # get the current id
        currId = idList[i]
        # print(currId)
        # get the user information for this id
        friend = api.get_user(user_id=currId)
        screenName = friend.screen_name

        # check if we already have this screen name in the database
        cur.execute('SELECT friends FROM Twitter WHERE name = ? LIMIT 1',
               (screenName, ))
        try:
            # if there update the count
            count = cur.fetchone()[0]
            cur.execute('UPDATE Twitter SET friends = ? WHERE name = ?',
                   (count+1, screenName))
            countold = countold + 1
        except:
            # not there, so insert it
            cur.execute('''INSERT INTO Twitter (name, retrieved, friends)
                       VALUES (?, 0, 1)''', (screenName, ))
            countnew = countnew + 1
    print('New accounts=', countnew, ' revisited=', countold)
    conn.commit()

cur.close()

Our database is stored in the file spider.db and it has one table named Twitter. Each row in the Twitter table has a column for the account name, whether we have retrieved the friends of this account, and how many times this account has been “friended”.

In the main loop of the program, we tell the user to enter a Twitter account name, hit enter, or type “quit” to exit the program. If the user enters a Twitter account name, we retrieve the list of friends for that user and add up to five friends to the database if the friend is not already in the database. If the friend is already in the list, we add 1 to the friends field in the row in the database.

If the user presses enter, we look in the database for the next Twitter account that we have not yet retrieved, retrieve the friends for that account, add them to the database or update them, and increase their friends count.

Once we retrieve the list of friends, we loop through all of the user items in the returned JSON and retrieve the screen_name for each user. Then we use the SELECT statement to see if we already have stored this particular screen_name in the database and retrieve the friend count (friends) if the record exists.

# loop through 5 ids of the friends (to not get too many at once)
countnew = 0
countold = 0
num = min(5, len(idList)) # in case they do not have as least 5 friends
for i in range(num):

    # get the current id
    currId = idList[i]
    # print(currId)
    # get the user information for this id
    friend = api.get_user(user_id=currId)
    screenName = friend.screen_name

    # check if we already have this screen name in the database
    cur.execute('SELECT friends FROM Twitter WHERE name = ? LIMIT 1',
               (screenName, ))
    try:
        # if there update the count
        count = cur.fetchone()[0]
        cur.execute('UPDATE Twitter SET friends = ? WHERE name = ?',
                   (count+1, screenName))
        countold = countold + 1
    except:
        # not there, so insert it
        cur.execute('''INSERT INTO Twitter (name, retrieved, friends)
                       VALUES (?, 0, 1)''', (screenName, ))
        countnew = countnew + 1
print('New accounts=', countnew, ' revisited=', countold)
conn.commit()

Once the cursor executes the SELECT statement, we must retrieve the rows. We could do this with a for statement, but since we are only retrieving one row (LIMIT 1), we can use the fetchone() method to fetch the first (and only) row that is the result of the SELECT operation. Since fetchone() returns the row as a tuple (even though there is only one field), we take the first value from the tuple using to get the current friend count into the variable count.

If this retrieval is successful, we use the SQL UPDATE statement with a WHERE clause to add 1 to the friends column for the row that matches the friend’s account. Notice that there are two placeholders (i.e., question marks) in the SQL, and the second parameter to the execute() is a two-element tuple that holds the values to be substituted into the SQL in place of the question marks.

If the code in the try block fails, it is probably because no record matched the WHERE name = ? clause on the SELECT statement. So in the except block, we use the SQL INSERT statement to add the friend’s screen_name to the table with an indication that we have not yet retrieved the screen_name and set the friend count to zero.

So the first time the program runs and we enter a Twitter account, the program runs as follows:

Enter a Twitter screen name, or hit enter, or type quit: drchuck
Fetching friends for screen name: drchuck
New accounts= 5  revisited= 0
Enter a Twitter screen name, or hit enter, or type quit: quit

Since this is the first time we have run the program, the database is empty and we create the database in the file spider.db and add a table named Twitter to the database. Then we retrieve five friends and add them all to the database since the database is empty.

At this point, we might want to write a simple database dumper to take a look at what is in our spider.db file:

import sqlite3
import os

dir = os.path.dirname(__file__) + os.sep
conn = sqlite3.connect(dir + 'spider.db')
cur.execute('SELECT * FROM Twitter')
count = 0
for row in cur:
    print(row)
    count = count + 1
print(count, 'rows.')
cur.close()

This program simply opens the database and selects all of the columns of all of the rows in the table Twitter, then loops through the rows and prints out each row.

If we run this program after the first execution of our Twitter spider above, its output will be as follows:

('ravenmaster1', 0, 1)
('BrentSeverance', 0, 1)
('prairycat', 0, 1)
('lionelrobertjr', 0, 1)
('LockPickingLwyr', 0, 1)
5 rows.

We see one row for each screen_name, that we have not retrieved the data for that screen_name, and everyone in the database has one friend.

Now our database reflects the retrieval of the friends of our first Twitter account (drchuck). We can run the program again and tell it to retrieve the friends of the next “unprocessed” account by simply pressing enter instead of a Twitter account as follows:

Enter a Twitter account, or quit:
Enter a Twitter screen name, or hit enter, or type quit:
Fetching friends for screen name: ravenmaster1
New accounts= 5  revisited= 0
Enter a Twitter screen name, or hit enter, or type quit:
Fetching friends for screen name: BrentSeverance
New accounts= 4  revisited= 1
Enter a Twitter screen name, or hit enter, or type quit: quit

Since we pressed enter (i.e., we did not specify a Twitter account), the following code is executed:

if (len(acct) < 1): # enter so pick someone to visit from the database

   # select a screen name that hasn't been visited yet
   cur.execute('SELECT name FROM Twitter WHERE retrieved = 0 LIMIT 1')
   try:
       acct = cur.fetchone()[0]
   except:
       print('No unretrieved Twitter accounts found')
       continue

We use the SQL SELECT statement to retrieve the name of the first (LIMIT 1) user who still has their “have we retrieved this user” value set to zero. We also use the fetchone()[0] pattern within a try/except block to either extract a screen_name from the retrieved data or put out an error message and loop back up.

If we successfully retrieved an unprocessed screen_name, we retrieve their data as follows:

# Fetch friends for this screen name
print("Fetching friends for screen name: " + acct)
idList = api.get_friend_ids(screen_name=acct)

# update the database to show that you have retrieved friends for this one
cur.execute('UPDATE Twitter SET retrieved=1 WHERE name = ?', (acct, ))

Once we retrieve the data successfully, we use the UPDATE statement to set the retrieved column to 1 to indicate that we have completed the retrieval of the friends of this account. This keeps us from retrieving the same data over and over and keeps us progressing forward through the network of Twitter friends.

If we run the friend program and press enter twice to retrieve the next unvisited friend’s friends, then run the dumping program, it will give us the following output:

('ravenmaster1', 1, 1)
('BrentSeverance', 1, 1)
('prairycat', 0, 2)
('lionelrobertjr', 0, 1)
('LockPickingLwyr', 0, 1)
('myldn', 0, 1)
('DickieDover', 0, 1)
('Ukraine', 0, 1)
('AlisonMoyet', 0, 1)
('PhilipPullman', 0, 1)
('botimer', 0, 1)
('nessimonstar', 0, 1)
('educause', 0, 1)
('lindafeng', 0, 1)
14 rows.

We can see that we have properly recorded that we have visited ravenmaster1 and BrentSeverance. Also the accounts prairycat already has two followers. Since we now have retrieved the friends of three people (drchuck, ravenmaster1, and BrentSeverance) our table has 14 rows of friends to retrieve.

Each time we run the program and press enter it will pick the next unvisited account (e.g., the next account will be prairycat), retrieve their friends, mark them as retrieved, and for each of the friends of prairycat either add them to the end of the database or update their friend count if they are already in the database.

Since the program’s data is all stored on disk in a database, the spidering activity can be suspended and resumed as many times as you like with no loss of data.

You have attempted of activities on this page