5.6. Dealing with Multiple DataFrames

Forget about budget or runtime as criteria for selecting a movie; let’s take a look at popular opinion instead. Our dataset has two relevant columns: vote_average and vote_count.

Let’s create a variable called df_high_rated that only contains movies that have received more than 20 votes, and whose average score is greater than 8.

import pandas as pd

# Load the movie metadata and drop any column that is entirely empty.
df = pd.read_csv("https://runestone.academy/ns/books/published/httlads/_static/movies_metadata.csv").dropna(axis=1, how='all')

# Keep movies with more than 20 votes, then those whose average score is above 8.
df_highly_voted = df[df.vote_count > 20]
df_high_rated = df_highly_voted[df_highly_voted.vote_average > 8]
df_high_rated[['title', 'vote_average', 'vote_count']].head()

	title	vote_average	vote_count
46	Se7en	8.1	5915.0
49	The Usual Suspects	8.1	3334.0
109	Taxi Driver	8.1	2632.0
256	Star Wars	8.1	6778.0
289	Leon: The Professional	8.2	4293.0

Here we have some high-quality movies, at least according to some people.
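The two-step filter above can also be written as a single boolean mask. The sketch below is only an illustration: it uses a small made-up DataFrame (the names toy and toy_high_rated and every row in it are hypothetical) rather than the real movie data, so that it runs on its own.

```python
import pandas as pd

# Toy stand-in for the movies DataFrame; these rows are made up for illustration.
toy = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "vote_average": [8.5, 9.0, 7.9, 8.2],
    "vote_count": [150.0, 3.0, 4000.0, 21.0],
})

# Each comparison yields a boolean Series; & combines them element-wise.
# The parentheses are required because & binds more tightly than > in Python.
mask = (toy.vote_count > 20) & (toy.vote_average > 8)
toy_high_rated = toy[mask]

print(toy_high_rated[["title", "vote_average", "vote_count"]])
```

Only the rows that satisfy both conditions survive, which is the same result the two chained filters would give.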

But what about my opinion?

Here are my favorite movies and the scores I gave them. Create a DataFrame called compare_votes that uses the title as its index and has both vote_average and my_vote as its columns. Keep only the movies that are both my favorites and popular favorites.

Hint: You’ll need to create two Series, one for my ratings and one that maps titles to vote_average.

my_votes = {
    "Star Wars": 9,
    "Paris is Burning": 8,
    "Dead Poets Society": 7,
    "The Empire Strikes Back": 9.5,
    "The Shining": 8,
    "Return of the Jedi": 8,
    "1941": 8,
    "Forrest Gump": 7.5,
}

There should be only 6 movies remaining.
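If you are unsure how two Series line up, here is one possible approach sketched on made-up data. Everything in it (toy, my_toy_votes, popular, mine, compare, and the movie names and scores) is hypothetical, and it assumes every title is unique; duplicate titles would have to be dealt with before aligning on the title index.

```python
import pandas as pd

# Made-up popular ratings, standing in for the filtered movie data.
toy = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "vote_average": [8.0, 6.5, 7.2],
})

# Made-up personal ratings; "Movie Z" has no match in toy.
my_toy_votes = {"Movie A": 9, "Movie C": 6, "Movie Z": 10}

# Series 1: title -> vote_average.
popular = toy.set_index("title")["vote_average"]

# Series 2: title -> my_vote. The name becomes the column label below.
mine = pd.Series(my_toy_votes, name="my_vote")

# Place the two Series side by side, aligned on their shared index.
# join="inner" keeps only the titles present in both.
compare = pd.concat([popular, mine], axis=1, join="inner")

print(compare)
```

Because pandas aligns on index labels rather than on position, the order of the titles in each Series does not matter. Titles that appear in only one of the two Series are dropped by the inner join.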

Now add a column to compare_votes that measures the percentage difference between the popular rating and my rating for each movie. You’ll need to take the difference between the vote_average and my_vote and divide it by my_vote.

compare_votes
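The arithmetic for the new column works element-wise on whole columns. As a minimal sketch on made-up numbers (the DataFrame compare, the column name pct_diff, and the values are all hypothetical):

```python
import pandas as pd

# Made-up ratings for two movies, indexed by title.
compare = pd.DataFrame(
    {"vote_average": [8.0, 7.2], "my_vote": [10.0, 6.0]},
    index=["Movie A", "Movie C"],
)

# (popular rating - my rating) / my rating, computed row by row.
# A negative value means the public rated the movie lower than I did.
compare["pct_diff"] = (compare.vote_average - compare.my_vote) / compare.my_vote

print(compare)
```

The result is a fraction; multiply it by 100 if you would rather read it as a percentage.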

Q-3: Make up 3 questions you would like to answer about this movie data using the techniques you have learned in this lesson and write them in the box.

Q-4: Summarize the answers to your questions here.

