10.3. Market Basket Analysis

The key question in market basket analysis is what products are most frequently purchased together.

To answer this question we will build a table where each row is a product, each column is a product, and the cell where two products intersect holds the count of the number of times those two products ended up in the same shopping cart.

mb = pd.DataFrame({'doritos': {'oreos': 2, 'snickers':6}, 'oreos': {'doritos':2, 'snickers':3}, 'snickers': {'doritos': 6, 'oreos':3}})
mb
         doritos  oreos  snickers
doritos      NaN    2.0       6.0
oreos        2.0    NaN       3.0
snickers     6.0    3.0       NaN

Looking at the table it is clear that doritos and snickers are the pair most often purchased together, with a count of 6. The next most common pair is snickers and oreos (3), which just beats out doritos and oreos (2).

For a table this small that is not too hard. We have the shopping cart data, so we know which products ended up in the same cart, for thousands of carts.

The challenge is that the real table has 49,688 columns and 49,688 rows. That means we have 2,468,897,344 (that's 2+ billion!) cells in our table. How many of those cells do you think are empty? Hint: lots of them!
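
A quick back-of-the-envelope estimate makes the point. The order count and average cart size below are rough assumptions for illustration, not measured values:

```python
# Rough estimate of how sparse the item-item matrix must be.
n_products = 49_688
total_cells = n_products ** 2                      # every (row, column) pair

n_orders = 3_200_000                               # roughly the number of carts (assumed)
avg_items = 10                                     # assumed average cart size
pairs_per_cart = avg_items * (avg_items - 1) // 2  # 45 distinct pairs in a 10-item cart
max_filled = n_orders * pairs_per_cart             # upper bound on non-empty cells

print(f"{total_cells:,} cells, at most {max_filled:,} ever touched")
print(f"at most {max_filled / total_cells:.1%} of the table can be non-empty")
```

Even if every pair of products in every cart were unique, well over 90% of the cells would still be empty.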

products = pd.read_csv('ecomm/products.csv')
products.head()
product_id product_name aisle_id department_id
0 1 Chocolate Sandwich Cookies 61 19
1 2 All-Seasons Salt 104 13
2 3 Robust Golden Unsweetened Oolong Tea 94 7
3 4 Smart Ones Classic Favorites Mini Rigatoni Wit... 38 1
4 5 Green Chile Anytime Sauce 5 13

The most common operation we are going to want to perform on this table is looking up a product id to get more information about the product. So, let's make product_id the index of the DataFrame to make those lookups faster.

products.set_index('product_id', inplace=True)
products.head()
product_name aisle_id department_id
product_id
1 Chocolate Sandwich Cookies 61 19
2 All-Seasons Salt 104 13
3 Robust Golden Unsweetened Oolong Tea 94 7
4 Smart Ones Classic Favorites Mini Rigatoni Wit... 38 1
5 Green Chile Anytime Sauce 5 13
len(products)**2
2468897344

10.3.1. Constructing an item-item matrix

Constructing a matrix of the kind we showed above will take a bit of thought (and time!), so let's do some design first.

  1. It's a very good assumption that this data is sparse, so let's start by using a data structure that supports sparsity: a dictionary of dictionaries. In fact, scroll back just a bit and look carefully at how our DataFrame was constructed. You will notice a dictionary like this:

    {'doritos': {'oreos': 2, 'snickers':6},
     'oreos': {'doritos':2, 'snickers':3},
     'snickers': {'doritos': 6, 'oreos':3}
     }
    

    If you think about it you will realize that we are storing twice as much data as we need to. The matrix we are building is symmetric: the value at i,j will always match the value at j,i because the order of the products is not important.

  2. The primary source of our data will be the order_products__prior data. There the data is sorted by the order number and the order in which products were added to the cart. We want to take each order as a group and add every pair of items in a cart to the matrix.

  3. We can take advantage of the symmetry by ordering the products in the same cart by their product id and always using the lower number as the first index.
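
The three design points can be sketched as a tiny helper (add_pair is a hypothetical name, not part of the code we build below) that normalizes each pair so the lower product id is always the outer key:

```python
def add_pair(mat, a, b):
    """Record one co-purchase of products a and b in a dict-of-dicts.

    The pair is normalized so the smaller id is the outer key, which keeps
    the matrix lower-triangular and stores each pair only once.
    """
    lo, hi = min(a, b), max(a, b)
    if lo not in mat:
        mat[lo] = {}
    mat[lo][hi] = mat[lo].get(hi, 0) + 1

mat = {}
add_pair(mat, 3, 1)   # stored as mat[1][3]
add_pair(mat, 1, 3)   # same pair, same cell
print(mat)            # {1: {3: 2}}
```

Because both (1, 3) and (3, 1) land in the same cell, the symmetry never costs us duplicate storage.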

Let's make a simple data file that matches our example and see how we can build a sparse version of it.

1 = doritos, 2 = oreos, 3 = snickers

ordernum,product
1, 1
1, 2
1, 3
2, 1
2, 3
3, 1
4, 1
4, 2
5, 1
5, 3
6, 1
6, 3
7, 1
7, 3
7, 3
8, 2
8, 3

small_o = pd.read_csv('small_orders.csv')
groups = small_o.groupby('ordernum')
groups.get_group(6)
ordernum product
10 6 1
11 6 3

The get_group function is awesome! It gives us a mini DataFrame containing only the items in one particular order. Now, if we are smart and process the items from smallest product id to largest, we can build our dictionary-based matrix no problem.

groups.get_group(1)['product'].sort_values()
0    1
1    2
2    3
Name: product, dtype: int64
cart = groups.get_group(1)['product'].sort_values()
cart.loc[1:]
1    2
2    3
Name: product, dtype: int64
for g in range(1,9):
    cart = groups.get_group(g)['product'].sort_values()
    for i in cart.index:
        for j in cart.loc[i+1:]:
            print(f"products {cart[i]} and {j} in cart")
    print("--")
products 1 and 2 in cart
products 1 and 3 in cart
products 2 and 3 in cart
--
products 1 and 3 in cart
--
--
products 1 and 3 in cart
--
products 1 and 3 in cart
--
products 1 and 3 in cart
--
products 1 and 2 in cart
products 1 and 3 in cart
products 2 and 3 in cart
--
products 2 and 3 in cart
--
mat = {}
for g in range(1,9):
    cart = groups.get_group(g)['product'].sort_values().tolist()
    for i in range(len(cart)):
        print(mat)
        if cart[i] not in mat:
            mat[cart[i]] = {}
        for j in cart[i+1:]:
            print(cart[i],j)
            mat[cart[i]][j] = mat[cart[i]].get(j,0) + 1
    print("--")
pd.DataFrame(mat)
{}
1 2
1 3
{1: {2: 1, 3: 1}}
2 3
{1: {2: 1, 3: 1}, 2: {3: 1}}
--
{1: {2: 1, 3: 1}, 2: {3: 1}, 3: {}}
1 3
{1: {2: 1, 3: 2}, 2: {3: 1}, 3: {}}
--
{1: {2: 1, 3: 2}, 2: {3: 1}, 3: {}}
--
{1: {2: 1, 3: 2}, 2: {3: 1}, 3: {}}
1 3
{1: {2: 1, 3: 3}, 2: {3: 1}, 3: {}}
--
{1: {2: 1, 3: 3}, 2: {3: 1}, 3: {}}
1 3
{1: {2: 1, 3: 4}, 2: {3: 1}, 3: {}}
--
{1: {2: 1, 3: 4}, 2: {3: 1}, 3: {}}
1 3
{1: {2: 1, 3: 5}, 2: {3: 1}, 3: {}}
--
{1: {2: 1, 3: 5}, 2: {3: 1}, 3: {}}
1 2
1 3
{1: {2: 2, 3: 6}, 2: {3: 1}, 3: {}}
2 3
{1: {2: 2, 3: 6}, 2: {3: 2}, 3: {}}
--
{1: {2: 2, 3: 6}, 2: {3: 2}, 3: {}}
2 3
{1: {2: 2, 3: 6}, 2: {3: 3}, 3: {}}
--
   1    2    3
2  2  NaN  NaN
3  6  3.0  NaN
mat
{1: {2: 2, 3: 6}, 2: {3: 3}, 3: {}}

Ok, now we have a “co-occurrence matrix”. Given one product, we can tell how often that product was in the same shopping cart as each other product. The matrix we have built turns out to be a “lower triangular” matrix because we are only storing the lower left. The upper right is symmetric, so we save half the storage!

Important: Saving storage often comes with an additional cost in complexity. In this case, because we are building a “lower triangular” matrix, we have to be careful if we want to find all of the products that are purchased together with a given product. We cannot just look at the column corresponding to the product, and we cannot just look at the row corresponding to the product. If we wanted to know everything purchased with product 2, we would have to look at the row for 2 as well as the column for 2. The row for 2 tells us that 2 was purchased with 1 (2 times), and the column for 2 tells us that 2 was purchased with 3 (3 times). If we kept both triangles we could look at either the row or the column.
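
On the small dictionary version, the row-plus-column lookup can be sketched like this (lookup_all is a hypothetical helper, not part of the lesson's code):

```python
def lookup_all(mat, pid):
    """Collect every co-purchase count for pid from a lower-triangular
    dict-of-dicts by combining pid's column with pid's row."""
    counts = {}
    # pairs where pid was the smaller id: pid's own inner dictionary
    counts.update(mat.get(pid, {}))
    # pairs where pid was the larger id: scan the other inner dictionaries
    for other, inner in mat.items():
        if pid in inner:
            counts[other] = inner[pid]
    return counts

mat = {1: {2: 2, 3: 6}, 2: {3: 3}, 3: {}}
print(lookup_all(mat, 2))   # {3: 3, 1: 2}
```

For product 2 this finds the 3 co-purchases with product 3 from its own dictionary and the 2 co-purchases with product 1 from the scan, exactly the row-plus-column combination described above.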

Let’s build the item-item matrix for the instacart data and see what we can learn!

The first thing we’ll need is a list of unique order ids. In the toy example above we were able to just use a range of numbers, because we knew that the order numbers started at 1 and went up sequentially.

order_products = pd.read_csv("ecomm/order_products__prior.csv")
order_products.head()
order_id product_id add_to_cart_order reordered
0 2 33120 1 1
1 2 28985 2 1
2 2 9327 3 0
3 2 45918 4 1
4 2 30035 5 0
import ipywidgets
def log_progress(sequence, every=None, size=None, name='Items'):
    from ipywidgets import IntProgress, HTML, VBox
    from IPython.display import display

    is_iterator = False
    if size is None:
        try:
            size = len(sequence)
        except TypeError:
            is_iterator = True
    if size is not None:
        if every is None:
            if size <= 200:
                every = 1
            else:
                every = int(size / 200)     # every 0.5%
    else:
        assert every is not None, 'sequence is iterator, set every'

    if is_iterator:
        progress = IntProgress(min=0, max=1, value=1)
        progress.bar_style = 'info'
    else:
        progress = IntProgress(min=0, max=size, value=0)
    label = HTML()
    box = VBox(children=[label, progress])
    display(box)

    index = 0
    try:
        for index, record in enumerate(sequence, 1):
            if index == 1 or index % every == 0:
                if is_iterator:
                    label.value = '{name}: {index} / ?'.format(
                        name=name,
                        index=index
                    )
                else:
                    progress.value = index
                    label.value = u'{name}: {index} / {size}'.format(
                        name=name,
                        index=index,
                        size=size
                    )
            yield record
    except:
        progress.bar_style = 'danger'
        raise
    else:
        progress.bar_style = 'success'
        progress.value = index
        label.value = "{name}: {index}".format(
            name=name,
            index=str(index or '?')
        )
%%time

groups = order_products.groupby('order_id')
unique_orderids = order_products.order_id.unique()
mat = {}
for g in log_progress(unique_orderids, size=len(unique_orderids)):
    cart = groups.get_group(g)['product_id'].sort_values().tolist()
    for i in range(len(cart)):
        if cart[i] not in mat:
            mat[cart[i]] = {}
        for j in cart[i+1:]:
            mat[cart[i]][j] = mat[cart[i]].get(j,0) + 1
CPU times: user 39min 39s, sys: 21.5 s, total: 40min
Wall time: 40min 15s

A bit of analysis revealed that there are a HUGE number of entries in the matrix with a count of 1. These one-time “co-purchases” don’t give us much useful information for recommending products, so let’s save some memory and remove them.

You can’t remove things from a dictionary while you are iterating over it. So we will need to make a list of keys to remove in one pass and then delete them in a second pass.

delkeys = []
for i in mat.keys():
    for k,v in mat[i].items():
        if v == 1:
            delkeys.append((i,k))

len(delkeys)
21944168

Yep, 21.9 million entries in our matrix are 1’s.

for i,j in delkeys:
    del mat[i][j]
%%time

smat = pd.SparseDataFrame(mat)
smat.head()
CPU times: user 10min 55s, sys: 33.8 s, total: 11min 29s
Wall time: 11min 41s

We can check on the density of our sparse data structure by looking at its density attribute.

smat.density
0.008275774966857377

And we see that it is only 0.8% full!
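
One caution: pd.SparseDataFrame was removed in pandas 1.0. In current pandas the same idea is expressed with sparse column dtypes; a rough equivalent on the small example matrix (not the full instacart data) might look like:

```python
import pandas as pd

mat = {1: {2: 2, 3: 6}, 2: {3: 3}, 3: {}}

# Build a regular DataFrame, then convert every column to a sparse dtype;
# the NaN fill value is not physically stored.
sparse = pd.DataFrame(mat).astype(pd.SparseDtype("float"))

# The .sparse accessor replaces the old SparseDataFrame attributes.
print(sparse.sparse.density)   # 0.5 -- 3 stored values out of 6 cells
```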

We can use idxmax to get a Series that, for each column, gives the row label of the maximum value in that column.

maxcols = smat.idxmax()
maxcols = maxcols.dropna()
%%time

maxcc = 0
maxrow = None
maxcol = None
for col, row in maxcols.astype(int).iteritems():
    if smat.loc[row, col] > maxcc:
        maxrow = row
        maxcol = col
        maxcc = smat.loc[row,col]
CPU times: user 1.46 s, sys: 1.86 s, total: 3.32 s
Wall time: 5.95 s
maxcc
62341.0
maxrow
47209
maxcol
13176

10.3.1.1. Testing our item-item matrix

Let’s test the matrix by doing some exploring. What are the two products most commonly purchased together?

print(f"product {maxrow} was purchased with {maxcol} {maxcc} times")
product 47209 was purchased with 13176 62341.0 times

Because we were smart before and made product_id the index of the products table, we can use this nice lookup syntax to get the product name!

products.loc[maxrow, 'product_name']
'Organic Hass Avocado'
products.loc[maxcol, 'product_name']
'Bag of Organic Bananas'
def get_product_byid(df, idx):
    return df.loc[idx].product_name

OK, so now let’s see what our real data has to say about the products that are bought with Doritos.

products[products.product_name.str.contains('Dorito')]
product_name aisle_id department_id
product_id
2144 Doritos 107 19
12540 Doritos Nacho Cheese Sandwich Crackers 78 19
42541 Cheetos Flamin' Hot & Doritos Dinamita Chile L... 107 19
def get_product_count(sp_mat, ix1, ix2):
    if ix1 > ix2:
        return sp_mat.loc[ix1, ix2]
    else:
        return sp_mat.loc[ix2, ix1]
get_product_count(smat, 47209, 13176)
62341.0
def get_all_cocart(sp_mat, pid):
    """
    Return a Pandas Series indexed by the product ids of products that were
    in the same shopping cart as pid.  The values are the co-occurrence
    counts.  NA's are dropped.
    """
    return pd.concat((sp_mat[pid], sp_mat.loc[pid])).dropna()
get_all_cocart(smat, 2144).nlargest(10)
24852    68.0
16797    41.0
19734    34.0
16696    25.0
23909    25.0
45064    24.0
28199    23.0
10673    21.0
17122    21.0
13249    17.0
Name: 2144, dtype: float64
for idx, val in get_all_cocart(smat, 2144).nlargest(10).iteritems():
    print(get_product_byid(products,idx), val)
Banana 68.0
Strawberries 41.0
Classic Mix Variety 34.0
Coke Classic 25.0
2% Reduced Fat Milk 25.0
Honey Wheat Bread 24.0
Clementines, Bag 23.0
Original Nooks & Crannies English Muffins 21.0
Honeycrisp Apples 21.0
Skim Milk 17.0
get_product_byid(products, 2144)
'Doritos'
def product_search(df, name):
    prods = df.product_name.str.lower()
    return df[prods.str.contains(name)].product_name
product_search(products, 'diapers')
product_id
15                                Overnight Diapers Size 6
682                   Cruisers Diapers Jumbo Pack - Size 5
765              Swaddlers Diapers Jumbo Pack Size Newborn
879                                Baby Dry Diapers Size 4
1304              Little Movers Comfort Fit Size 3 Diapers
1716     Baby Dry Pampers Baby Dry Diapers Size 5 78 Co...
3087              Baby Dry Pampers Baby Dry Diapers Size 2
3277                 Overnight Diapers Sleepy Sheep Size 4
4630     Baby Dry Pampers Baby Dry Newborn Diapers Size...
5444     Little Snugglers Jumbo Pack Size 2 Disney Diap...
5657                              Baby Dry Diapers  Size 5
5897                               Baby Dry Diapers Size 3
6401               Tender Care Diapers Jumbo Pack - Size 4
6986     Diapers, Overnight, Free & Clear, Size 6 (35+ ...
7487                              Swaddlers Diapers Size 1
7489                   Swaddlers Size 4 Giant Pack Diapers
8102                         Naty Diapers Size 1, 8-14 lbs
9121                    Diapers Cruisers Size 4 Super Pack
9356                              Swaddlers Size 2 Diapers
9482                                        Diapers Size 1
9927                             Size 4 Snug & Dry Diapers
10011                                  Baby Diapers Size 2
10420                                Honest Diapers Size 4
11660                         Tribal Pastel Size 3 Diapers
11745    Swaddlers Sensitive Diapers Jumbo Pack Size Ne...
11922    Pants Pampers Easy Ups Training Pants Boys Siz...
12340                     Free & Clear Size 4 Baby Diapers
13377                    Swaddlers Diapers Jumbo Pack Size
13801                Free & Clear Overnight Diapers Size 5
14009                      Snug & Dry Diapers Step 1 Jumbo
                               ...
35954           Little Movers Diapers, Giant Pack - Size 5
36200                      Baby Dry Diapers Size 6 Diapers
36453                             Size 3 M  Skulls Diapers
36831                  Cruisers Diapers Giant Pack, Size 6
37172                                       Size 4 Diapers
37872     Free & Clear Size 4 22-37 Lbs Disposable Diapers
37949                  Diapers Swaddlers Size 2 (12-18 lb)
38365                   Size 5 Cruisers Diapers Super Pack
38899                         Little Movers Size 3 Diapers
40110                            Giraffes Diapers Size 4 L
40343                 Baby Dry Diapers Giant Pack - Size 6
40355                   Baby Dry Size 4 Disposable Diapers
40537          Free & Clear Stage 1 8-14 Lbs. Baby Diapers
40916                                       Size 2 Diapers
41393                              Baby Dry Size 4 Diapers
41475                   Baby Dry Diapers Jumbo Pack Size 4
41595                            Snug & Dry Size 2 Diapers
41705                                Honest Diapers Size 3
42923           Baby Free & Clear Size 3 16-28 Lbs Diapers
43217    Honest Diapers Eco-Friendly & Premium Diapers ...
43481                  Cruisers Diapers Jumbo Pack  Size 3
43989                            Cruisers Diapers - Size 6
44950                 Swaddlers Diapers Super Pack, Size 3
45786            Little Movers Diapers Giant Pack - Size 3
46583                       Tribal Pastel Size 4/L Diapers
46599    Ultra Leakguards Value Pack Diapers Size 3 (16...
46608       Free & Clear Newborn Up To 10 lbs Baby Diapers
47578                                              Diapers
47632                                       Honest Diapers
48263                                Honest Diapers Size 5
Name: product_name, Length: 93, dtype: object
# snickers - 14261

for idx, val in get_all_cocart(smat, 682).nlargest(10).iteritems():
    print(get_product_byid(products,idx), val)
Strawberries 13.0
Banana 13.0
Zero Rise Orange 7.0
Organic Fuji Apple 7.0
Baby Fresh Pampers Baby Wipes Baby Fresh 1X 64 count  Baby Wipes 6.0
Black Beans 6.0
Honey Nut Cheerios 6.0
Baby Wipes Sensitive 6.0
Select-A-Size Paper Towels, White, 2 Huge Rolls = 5 Regular Rolls  Towels/Napkins 5.0
Peach Yoghurt 5.0

10.3.2. Cleaning up and saving

Since building the item-item matrix takes some time we should save it in a format that is convenient for us to reload so we don’t need to remake it every time.

We already reduced the size of our sparse matrix by eliminating all of the cells with a count of 1; those don’t really tell us anything that we would want to use in making a recommendation.

We can also free memory by deleting our original dictionary, now that the sparse DataFrame holds the data.

smat.to_pickle('item_item.pkl')
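
Reloading later is just the reverse with pd.read_pickle. A quick round trip on a small frame (the temporary path here is illustrative):

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({1: {2: 2, 3: 6}, 2: {3: 3}})

# Pickle preserves dtypes and the index exactly, so the reloaded frame
# is identical to the one we saved.
path = os.path.join(tempfile.mkdtemp(), "item_item.pkl")
df.to_pickle(path)
reloaded = pd.read_pickle(path)
print(reloaded.equals(df))   # True
```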

10.3.2.1. Understanding the item-item matrix

This kind of shopping cart analysis is useful in many areas. Whether the items are news articles, stocks, search terms, or products, this kind of recommender is widely used.

  • Create a histogram that shows the distribution of the shopping cart co-occurrence counts.
  • How many items in this item-item matrix have a count of 1? That is probably not good information, and you could save a lot more memory by deleting all of the items with a count of 1 from smat.
  • Can you make a visualization of this item-item matrix?
forhist = pd.DataFrame({'allvals': smat.values.flatten()})
forhist = forhist.dropna()
alt.Chart(forhist).mark_bar().encode(x=alt.X('allvals', bin=True),y='count()')
[Histogram of the shopping cart co-occurrence counts]

10.3.2.2. Experimenting with item-item recommendations

  • The histogram above shows that the vast majority of the items are in the 0-200 co-occurrence range. But the items purchased together outside that big bar are interesting. Write a function to print out the item pairs that have been in the same shopping cart more than 200 times.
  • Re-do the histogram so that it focuses on the products that have between 0 and 200 co-occurrences.
  • Write a function called top_n that takes a product name to search for, allows the user to select the best match, and then returns the top n recommendations for products that have been purchased with the selected item.
  • Write a function that takes a product id as its parameter and then recommends the top 10 products to go with the given product, drawn from the same department or the same aisle.
  • One of the problems with a recommender like this one is that it tends to recommend a lot of popular items. We might call this the banana problem in this dataset! Can you devise a strategy to recommend things that are not just the popular items?
  • Design an experiment whereby you can train an item-item model like we have done above and then test it. Perhaps in the training set you withhold the last item added to each shopping cart and see how frequently you can predict the last item from the first items.
  • Challenge - The original collaborative filtering recommender system was not item-item like this was. It was user-user where the recommendations came from finding a group of users similar to the subject user based on their ratings or purchase behavior. The system would then recommend items to the subject user based on items that their similar users had purchased but the subject had not. Can you write such a recommender and devise an experiment to compare it to the item-item recommender?
