5.2. Pandas exercises

Before attempting this exercise, make sure you’ve read through the first four pages of Chapter 3 of the Python Data Science Handbook.

We’re going to be using a dataset about movies to try out processing some data with Pandas.

We start with some standard imports:

import ast
import pandas as pd
import numpy as np

We are providing you with data for this exercise that comes from The Movie Database. To create this lesson we used the TMDb API but our book is not endorsed or certified by TMDb. Their API also provides access to data on many additional movies, actors and actresses, crew members, and TV shows.

Then we load the data from a local file and take a first look at it:

df = pd.read_csv('../Data/movies_metadata.csv').dropna(axis=1, how='all')
df.head()
belongs_to_collection budget genres homepage id imdb_id original_language original_title overview popularity ... release_date revenue runtime spoken_languages status tagline title video vote_average vote_count
0 {'id': 10194, 'name': 'Toy Story Collection', ... 30000000 [{'id': 16, 'name': 'Animation'}, {'id': 35, '... http://toystory.disney.com/toy-story 862.0 tt0114709 en Toy Story Led by Woody, Andy's toys live happily in his ... 21.946943 ... 1995-10-30 373554033.0 81.0 [{'iso_639_1': 'en', 'name': 'English'}] Released NaN Toy Story False 7.7 5415.0
1 NaN 65000000 [{'id': 12, 'name': 'Adventure'}, {'id': 14, '... NaN 8844.0 tt0113497 en Jumanji When siblings Judy and Peter discover an encha... 17.015539 ... 1995-12-15 262797249.0 104.0 [{'iso_639_1': 'en', 'name': 'English'}, {'iso... Released Roll the dice and unleash the excitement! Jumanji False 6.9 2413.0
2 {'id': 119050, 'name': 'Grumpy Old Men Collect... 0 [{'id': 10749, 'name': 'Romance'}, {'id': 35, ... NaN 15602.0 tt0113228 en Grumpier Old Men A family wedding reignites the ancient feud be... 11.712900 ... 1995-12-22 0.0 101.0 [{'iso_639_1': 'en', 'name': 'English'}] Released Still Yelling. Still Fighting. Still Ready for... Grumpier Old Men False 6.5 92.0
3 NaN 16000000 [{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam... NaN 31357.0 tt0114885 en Waiting to Exhale Cheated on, mistreated and stepped on, the wom... 3.859495 ... 1995-12-22 81452156.0 127.0 [{'iso_639_1': 'en', 'name': 'English'}] Released Friends are the people who let you be yourself... Waiting to Exhale False 6.1 34.0
4 {'id': 96871, 'name': 'Father of the Bride Col... 0 [{'id': 35, 'name': 'Comedy'}] NaN 11862.0 tt0113041 en Father of the Bride Part II Just when George Banks has recovered from his ... 8.387519 ... 1995-02-10 76578911.0 106.0 [{'iso_639_1': 'en', 'name': 'English'}] Released Just When His World Is Back To Normal... He's ... Father of the Bride Part II False 5.7 173.0

5 rows × 23 columns
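
The `dropna(axis=1, how='all')` call in the load step removes only those columns in which every value is missing; columns with some missing values are kept. Here is a minimal sketch of that behaviour on a small made-up frame (the column names and values are illustrative assumptions, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 'empty' is entirely NaN, 'tagline' is only partly NaN.
toy = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "tagline": [np.nan, "Some tagline", np.nan],
    "empty": [np.nan, np.nan, np.nan],
})

# axis=1 targets columns; how='all' drops a column only if every value is missing.
cleaned = toy.dropna(axis=1, how="all")

print(list(cleaned.columns))  # ['title', 'tagline']
```

The partly empty `tagline` column survives, which matches the `NaN` values still visible in the output above.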

5.2.1. Exploring the data

This dataset was obtained from Kaggle; it was originally collected through the TMDB API.

The movies in this dataset correspond to the movies listed in the MovieLens Latest Full Dataset.

Let’s see what data we have:

df.shape
(45453, 23)

Twenty-three columns of data for over 45,000 movies is going to be a lot to look at, but let’s start by looking at what the columns represent:

df.columns
Index(['belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count'],
      dtype='object')
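
Beyond the column names, a per-column summary of types and missing values can help decide where to look first. The following is a minimal sketch on a made-up frame (the columns and values are illustrative assumptions); the same two expressions could be applied to `df`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the movie data.
toy = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "runtime": [81.0, np.nan, 101.0],
    "tagline": [np.nan, "Some tagline", np.nan],
})

# One row per column: its dtype and how many values are missing.
summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "missing": toy.isna().sum(),
})

print(summary)
```

`df.info()` prints similar information in one call, but building the summary as a DataFrame makes it easy to sort or filter.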

Here’s an explanation of each column:

  • belongs_to_collection: A stringified dictionary that identifies the collection that a movie belongs to (if any).
  • budget: The budget of the movie in dollars.
  • genres: A stringified list of dictionaries that list out all the genres associated with the movie.
  • homepage: The Official Homepage of the movie.
  • id: An arbitrary ID for the movie.
  • imdb_id: The IMDB ID of the movie.
  • original_language: The language in which the movie was filmed.
  • original_title: The title of the movie in its original language.
  • overview: A short blurb describing the movie.
  • popularity: The Popularity Score assigned by TMDB.
  • poster_path: The URL of the poster image (relative to http://image.tmdb.org/t/p/w185/).
  • production_companies: A stringified list of production companies involved with the making of the movie.
  • production_countries: A stringified list of countries where the movie was filmed or produced.
  • release_date: Theatrical release date of the movie.
  • revenue: World-wide revenue of the movie in dollars.
  • runtime: Duration of the movie in minutes.
  • spoken_languages: A stringified list of spoken languages in the film.
  • status: Released, To Be Released, Announced, etc.
  • tagline: The tagline of the movie.
  • title: The official title of the movie.
  • video: Indicates whether TMDB has a video of the movie.
  • vote_average: The average rating of the movie on TMDB.
  • vote_count: The number of votes by users, as counted by TMDB.
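
Several of the columns above are described as "stringified" lists or dictionaries, meaning each cell holds a Python literal stored as text. Since `ast` is among the imports, one plausible way to turn such a cell back into Python objects is `ast.literal_eval`. The sketch below assumes the format shown in the `df.head()` output; the rows are made up and `parse_genres` is a hypothetical helper, not part of the exercise:

```python
import ast
import pandas as pd

# Made-up rows that mimic the format of the 'genres' column.
toy = pd.DataFrame({
    "title": ["Movie A", "Movie B"],
    "genres": [
        "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]",
        "[{'id': 12, 'name': 'Adventure'}]",
    ],
})

def parse_genres(text):
    """Hypothetical helper: return the genre names from one stringified list."""
    if not isinstance(text, str):
        # Missing values (NaN) are floats, so treat them as "no genres".
        return []
    return [entry["name"] for entry in ast.literal_eval(text)]

toy["genre_names"] = toy["genres"].apply(parse_genres)

print(toy["genre_names"].tolist())  # [['Animation', 'Comedy'], ['Adventure']]
```

`ast.literal_eval` only evaluates Python literals (strings, numbers, lists, dictionaries, and so on), which makes it a safer choice than `eval` for data read from a file.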