Winning Jeopardy

13 minute read

Jeopardy is a popular TV show in the US where participants answer questions to win money. It’s been running for a few decades, and is a major force in popular culture.

Let’s say we want to compete on Jeopardy, and we’re looking for any edge we can get to win. In this project, we’ll work with a dataset of Jeopardy questions to figure out some patterns in the questions that could help us win.


This project originated as part of the guided project series on Dataquest; the analysis here is an extended version of that guided project.

Jeopardy Questions

Let’s explore the Jeopardy questions. The dataset contains the following columns:

  • Show Number – the Jeopardy episode number of the show this question was in.
  • Air Date – the date the episode aired.
  • Round – the round of Jeopardy that the question was asked in. Jeopardy has several rounds as each episode progresses.
  • Category – the category of the question.
  • Value – the number of dollars answering the question correctly is worth.
  • Question – the text of the question.
  • Answer – the text of the answer.

Let’s start our exploration by loading the complete Jeopardy questions CSV file from this reddit post. The CSV file was downloaded and placed in the project folder.

import pandas as pd
from string import punctuation
import re
import matplotlib.pyplot as plt
from scipy.stats import chisquare

# read in the dataset
jeopardy = pd.read_csv('jeopardy.csv')
jeopardy.columns
Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')

We can see that most column names contain a leading space. Let’s remove all spaces; note that this also joins multi-word names such as Show Number into ShowNumber.

# clean column names from additional spaces
jeopardy.columns = jeopardy.columns.str.replace(' ', '')
jeopardy.columns
Index(['ShowNumber', 'AirDate', 'Round', 'Category', 'Value', 'Question',
       'Answer'],
      dtype='object')
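If we would rather keep the spaces inside multi-word names such as Show Number, stripping only the surrounding whitespace works too; here is a minimal alternative (we stick with the no-space names below):

# alternative: strip only leading/trailing whitespace, keeping internal spaces
# jeopardy.columns = jeopardy.columns.str.strip()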
jeopardy.shape
(216930, 7)
jeopardy.head()
ShowNumber AirDate Round Category Value Question Answer
0 4680 2004-12-31 Jeopardy! HISTORY $200 For the last 8 years of his life, Galileo was ... Copernicus
1 4680 2004-12-31 Jeopardy! ESPN's TOP 10 ALL-TIME ATHLETES $200 No. 2: 1912 Olympian; football star at Carlisl... Jim Thorpe
2 4680 2004-12-31 Jeopardy! EVERYBODY TALKS ABOUT IT... $200 The city of Yuma in this state has a record av... Arizona
3 4680 2004-12-31 Jeopardy! THE COMPANY LINE $200 In 1963, live on "The Art Linkletter Show", th... McDonald's
4 4680 2004-12-31 Jeopardy! EPITAPHS & TRIBUTES $200 Signer of the Dec. of Indep., framer of the Co... John Adams

Our dataset contains 216,930 questions.

Normalizing Text

Before we start our analysis, we need to normalize the text, that is, lowercase all words and remove punctuation. There are two ways we can go about this; let’s see which one is more efficient.

# function to normalize the text using python string library
def normalize_text(s):
    s = str(s).lower()
    s = s.translate(str.maketrans('', '', punctuation))
    return s

# function to normalize the text using regex
def normalize_text_reg(s):
    s = str(s).lower()
    s = re.sub(r"[^A-Za-z0-9\s]", "", s)
    return s

s = 'Hello.?! World.*'
%timeit normalize_text(s)
%timeit normalize_text_reg(s)
3.26 µs ± 38.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
3.26 µs ± 194 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

We can observe that the two functions perform almost identically, so speed does not favor either one. We will use the regex version to normalize our Question and Answer columns, since the same pattern can be reused for the Value column below.

# normalize our columns
jeopardy['clean_question'] = jeopardy['Question'].apply(normalize_text_reg)
jeopardy['clean_answer'] = jeopardy['Answer'].apply(normalize_text_reg)
jeopardy.head()
ShowNumber AirDate Round Category Value Question Answer clean_question clean_answer
0 4680 2004-12-31 Jeopardy! HISTORY $200 For the last 8 years of his life, Galileo was ... Copernicus for the last 8 years of his life galileo was u... copernicus
1 4680 2004-12-31 Jeopardy! ESPN's TOP 10 ALL-TIME ATHLETES $200 No. 2: 1912 Olympian; football star at Carlisl... Jim Thorpe no 2 1912 olympian football star at carlisle i... jim thorpe
2 4680 2004-12-31 Jeopardy! EVERYBODY TALKS ABOUT IT... $200 The city of Yuma in this state has a record av... Arizona the city of yuma in this state has a record av... arizona
3 4680 2004-12-31 Jeopardy! THE COMPANY LINE $200 In 1963, live on "The Art Linkletter Show", th... McDonald's in 1963 live on the art linkletter show this c... mcdonalds
4 4680 2004-12-31 Jeopardy! EPITAPHS & TRIBUTES $200 Signer of the Dec. of Indep., framer of the Co... John Adams signer of the dec of indep framer of the const... john adams

Normalizing Columns

Now that we have normalized the string columns, let’s also normalize the Value and AirDate columns. We will remove the $ sign (and commas) from Value and convert it to numeric; we will also convert AirDate to datetime.

# function to normalize the value column
def normalize_value(s):
    # strip the $ sign, commas and any other punctuation
    s = re.sub(r"[^A-Za-z0-9\s]", "", s)
    try:
        s = int(s)
    except ValueError:
        # e.g. the string 'None' for questions without a value
        s = 0
    return s

# clean value column
jeopardy['clean_value'] = jeopardy['Value'].apply(normalize_value)
jeopardy['AirDate'] = pd.to_datetime(jeopardy['AirDate'])
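As an aside, the same cleaning can be done without a row-wise apply. Here is a minimal vectorized sketch, assuming any non-numeric value (such as 'None') should map to 0:

# vectorized alternative: strip non-digits, coerce failures to NaN, fill with 0
clean_value_alt = (pd.to_numeric(jeopardy['Value'].str.replace(r"[^0-9]", "", regex=True),
                                 errors='coerce')
                   .fillna(0)
                   .astype(int))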

Answers in Questions

In order to figure out whether to study past questions, study general knowledge, or not study it all, it would be helpful to figure out two things:

  • How often the answer is deducible from the question.
  • How often new questions are repeats of older questions.

We can answer the first question by seeing how many times words in the answer also occur in the question. We can answer the second question by seeing how often meaningful words recur across questions, where meaningful means words that are not stopwords. Let’s tackle the first question first.

# function to count how many answer words appear in the question
def count_terms(row):
    # split the column values by space
    split_answer = row['clean_answer'].split()
    split_question = row['clean_question'].split()

    # initialize count of matched words
    match_count = 0

    # remove 'the' from answers since it is uninformative
    if 'the' in split_answer:
        split_answer.remove('the')

    # guard against empty answers
    if len(split_answer) == 0:
        return 0

    # count occurrences of answer words in the question
    for word in split_answer:
        if word in split_question:
            match_count += 1

    # return the proportion of matched answer words
    return match_count / len(split_answer)
    
# count the words
jeopardy['answer_in_question'] = jeopardy.apply(count_terms, axis=1)

# calculate the mean
jeopardy['answer_in_question'].mean()
0.05792070323661065
jeopardy.head()
ShowNumber AirDate Round Category Value Question Answer clean_question clean_answer clean_value answer_in_question
0 4680 2004-12-31 Jeopardy! HISTORY $200 For the last 8 years of his life, Galileo was ... Copernicus for the last 8 years of his life galileo was u... copernicus 200 0.0
1 4680 2004-12-31 Jeopardy! ESPN's TOP 10 ALL-TIME ATHLETES $200 No. 2: 1912 Olympian; football star at Carlisl... Jim Thorpe no 2 1912 olympian football star at carlisle i... jim thorpe 200 0.0
2 4680 2004-12-31 Jeopardy! EVERYBODY TALKS ABOUT IT... $200 The city of Yuma in this state has a record av... Arizona the city of yuma in this state has a record av... arizona 200 0.0
3 4680 2004-12-31 Jeopardy! THE COMPANY LINE $200 In 1963, live on "The Art Linkletter Show", th... McDonald's in 1963 live on the art linkletter show this c... mcdonalds 200 0.0
4 4680 2004-12-31 Jeopardy! EPITAPHS & TRIBUTES $200 Signer of the Dec. of Indep., framer of the Co... John Adams signer of the dec of indep framer of the const... john adams 200 0.0

We can see that, on average, the answer shows up in the question only around 6% of the time. This tells us that it’s quite difficult to deduce the answer from the question alone.
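To make the proportion concrete, here is a quick sanity check on a made-up row: the answer has two words and only one of them ('adams') appears in the question, so the function returns 0.5.

# hypothetical row, for illustration only
toy_row = {'clean_question': 'this founding father adams helped draft the declaration',
           'clean_answer': 'john adams'}
count_terms(toy_row)  # returns 0.5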

Recycled Questions

Let’s move to tackling our second question, where we need to find out how often new questions are repeats of older ones.

We should also remove uninformative words (e.g. stopwords) in order to better measure the question overlap. For this we will use the nltk package’s list of English stopwords.

from nltk.corpus import stopwords
# note: the stopwords corpus must be available locally;
# run nltk.download('stopwords') once if it is not

# sort jeopardy rows by air date in ascending order
jeopardy.sort_values('AirDate', inplace=True)

# running set of terms seen in previous questions, and the English stopword list
terms_used = set()
stop_words = set(stopwords.words('english'))

# function to calculate the number of times 
# a term from previous questions has been used in new ones
def calculate_overlap(row):
    """
    Returns a word match count (proportion) for each row containing 
    a question back to the 'question_overlap' columns. 
    The count is calculated used a set of words used in previous questions
    
    Parameters:
        row (pandas.Series): row from pandas dataframe       
    
    Returns:
        match_count (float): match count proportion
    
    """
    
    # split the question by space
    split_question = row.split()
    
    # keep only non-stop words
    split_question = [word for word in split_question if word not in stop_words]

    # initialize count
    match_count = 0
    
    # loop through words in split_question: count matches
    for w in split_question:
        if w in terms_used:
            match_count += 1

        # add words to terms used set
        terms_used.add(w)
            
    if len(split_question) > 0:
        match_count /= len(split_question)

    return match_count
        
# run the function on the clean_question column
jeopardy['question_overlap'] = jeopardy['clean_question'].apply(calculate_overlap)

# print the mean overlap
print("Average overlap of questions based on individual words is {}".format(jeopardy['question_overlap'].mean()))
Average overlap of questions based on individual words is 0.9285999315781669

Our results show a roughly 93% overlap between terms in new questions and terms in old questions. This is a strong indicator that new questions tend to reuse terms from older ones. Nevertheless, this only tells us so much, since we are looking at individual terms rather than phrases.

We could investigate further by looking at phrases, or by measuring how similar whole questions are to each other. To keep this analysis brief, we will not cover that in depth today.
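As a rough illustration of what a phrase-level check could look like, here is a minimal sketch that tracks two-word phrases (bigrams) instead of single terms. This is a hypothetical extension, not run as part of this analysis:

# hypothetical sketch: measure overlap on bigrams instead of single terms
bigrams_used = set()

def bigram_overlap(question):
    # drop stopwords, then form consecutive word pairs
    words = [w for w in question.split() if w not in stop_words]
    bigrams = list(zip(words, words[1:]))
    if not bigrams:
        return 0
    # proportion of bigrams already seen in earlier questions
    match_count = sum(b in bigrams_used for b in bigrams)
    bigrams_used.update(bigrams)
    return match_count / len(bigrams)

# jeopardy['bigram_overlap'] = jeopardy['clean_question'].apply(bigram_overlap)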

Low value vs high value questions

Nevertheless, we can also look at questions with a high monetary value; focusing on those could help us earn more when on Jeopardy. We can figure out which terms correspond to high-value questions using a chi-squared test. We’ll first need to split the questions into two categories:

  • Low value – Any row where clean_value is 800 or less.
  • High value – Any row where clean_value is greater than 800.

Let’s write the code which will help us identify value pairs for terms.

# function to flag high-value questions
def calculate_value(row):
    value = 1 if row['clean_value'] > 800 else 0
    return value

# determine questions with high value
jeopardy['high_value'] = jeopardy.apply(calculate_value, axis=1)
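As a side note, the same flag can be computed without a row-wise apply; this vectorized one-liner is equivalent:

# vectorized equivalent of the row-wise apply above
jeopardy['high_value'] = (jeopardy['clean_value'] > 800).astype(int)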

# function to calculate value pairs (high, low count) for any word
def calc_word(word):
    high_count = 0
    low_count = 0
    
    for index, row in jeopardy.iterrows():
        q = row['clean_question'].split()
        if word in q:
            if row['high_value'] == 1:
                high_count += 1
            else:
                low_count += 1
    
    return high_count, low_count

# empty dictionary to hold each word and the
# counts of high and low value questions the term appears in
observed_expected = {}

# sample terms to run the value function on
comparison_terms = list(terms_used)[:10]

# calculate value pairs (high and low count) for each word in the sample
for word in comparison_terms:
    observed_expected[word] = calc_word(word)
observed_expected
{'flighty': (2, 6),
 'pennants': (0, 2),
 'miltons': (2, 3),
 'hrefhttpwwwjarchivecommedia20110602j12jpg': (0, 1),
 'civilizations': (4, 19),
 'gethsemane': (1, 6),
 'theatera': (1, 1),
 'hotheaded': (1, 1),
 'hrefhttpwwwjarchivecommedia20080403dj30jpg': (1, 0),
 'targetblankolivia': (0, 1)}

Applying chi-squared test

Now that we’ve found the observed counts for a few terms, we can compute the expected counts and the chi-squared value. (Note, in passing, that a couple of the sampled “terms” above are leftover HTML from media links, which suggests the question text could be cleaned even further.)
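Concretely, for a term that occurs $n$ times across all $N$ questions, with $O_{high}$ of those occurrences in high-value questions and $O_{low}$ in low-value ones, the expected counts and test statistic computed below are

$$E_{high} = \frac{n}{N}\,N_{high}, \qquad E_{low} = \frac{n}{N}\,N_{low}$$

$$\chi^2 = \frac{(O_{high} - E_{high})^2}{E_{high}} + \frac{(O_{low} - E_{low})^2}{E_{low}}$$

where $N_{high}$ and $N_{low}$ are the total numbers of high- and low-value questions in the dataset.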

# find total sum of high and low value questions
high_value_count = jeopardy["high_value"].sum()
low_value_count = jeopardy.shape[0] - high_value_count

# dictionary to hold chisquare and p-value for each term
chi_squared = {}

# loop through the dictionary 
for key, value in observed_expected.items():
    
    # calculate total value of high and low
    total = value[0] + value[1]
    
    # calculate word occurrence proportion across the dataset
    total_prop = total / jeopardy.shape[0]
    
    # find expected term for high and low value rows
    high_exp, low_exp = total_prop * high_value_count, total_prop * low_value_count
    
    # calculate chi-squared statistic value and associated p-value
    chi_squared[key] = chisquare(value, (high_exp, low_exp))
chi_squared
{'flighty': Power_divergenceResult(statistic=0.043292301416985354, pvalue=0.8351758561462266),
 'pennants': Power_divergenceResult(statistic=0.7899529284667026, pvalue=0.3741143592744989),
 'miltons': Power_divergenceResult(statistic=0.3363947754070794, pvalue=0.5619176551024535),
 'hrefhttpwwwjarchivecommedia20110602j12jpg': Power_divergenceResult(statistic=0.3949764642333513, pvalue=0.5296950912486695),
 'civilizations': Power_divergenceResult(statistic=1.3519672340861204, pvalue=0.24493449653850868),
 'gethsemane': Power_divergenceResult(statistic=0.6787070195906365, pvalue=0.4100323130964867),
 'theatera': Power_divergenceResult(statistic=0.46338644448358013, pvalue=0.49604555208958945),
 'hotheaded': Power_divergenceResult(statistic=0.46338644448358013, pvalue=0.49604555208958945),
 'hrefhttpwwwjarchivecommedia20080403dj30jpg': Power_divergenceResult(statistic=2.5317964247338085, pvalue=0.11157312838169751),
 'targetblankolivia': Power_divergenceResult(statistic=0.3949764642333513, pvalue=0.5296950912486695)}

The results of the chi-squared test show that none of the sampled terms is statistically significant: at a significance level of $p=0.05$, none of the terms above corresponds to high-value questions. Note also that the observed frequencies are all quite small, which makes the chi-squared test less reliable here; it would be better to re-run it only with terms that occur more often.
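A hypothetical way to do that (the frequency threshold of 50 is an arbitrary choice for illustration) would be to count, for each term, the number of questions it appears in, and sample only frequent terms:

from collections import Counter

# count, per term, the number of questions containing it (ignoring stopwords)
term_counts = Counter(w for q in jeopardy['clean_question']
                        for w in set(q.split()) if w not in stop_words)

# keep only terms appearing in at least 50 questions, then take 10 of them
frequent_terms = [t for t, c in term_counts.items() if c >= 50][:10]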

Exploring Categories

Apart from investigating the values of questions, we can also prepare for Jeopardy by knowing how likely certain categories are to appear in each round. Our dataset has a Category column, which we will use to answer the following questions:

  • Which categories appear most often?
  • What is the probability of each category appearing in each round?

The first question is straightforward: we simply need to count the occurrences of each category. Since there are many unique categories, we will only look at the top 10.

# count number of categories
unique_cats = len(jeopardy["Category"].unique())

print("Number of unique categories {}".format(unique_cats))
Number of unique categories 27995
# count category occurrences (value_counts already sorts in descending order)
cat_occurs = jeopardy["Category"].value_counts().head(10)
fig, ax = plt.subplots(figsize=(18, 12))
ax.bar(cat_occurs.index, cat_occurs)
# hide the plot spines for a cleaner look
for side in ax.spines:
    ax.spines[side].set_visible(False)
plt.xticks(rotation=45)
plt.show()

[Bar chart: number of questions for the top 10 Jeopardy categories]

These top 10 categories have more questions than any other category, with ‘Before & After’ the most popular. For more information on what this category means, you can read here.

Now, to find the probability of each category appearing in each round, we first group the questions by round, then calculate the proportion of questions belonging to each category within that round. Let’s get to it.

# list of possible rounds
jeopardy["Round"].value_counts()
Jeopardy!           107384
Double Jeopardy!    105912
Final Jeopardy!       3631
Tiebreaker               3
Name: Round, dtype: int64

Jeopardy has three main rounds, plus a potential tiebreaker in case of a tie. Let’s calculate the occurrence of each category by round.

jeopardy["Round"].value_counts()
Jeopardy!           107384
Double Jeopardy!    105912
Final Jeopardy!       3631
Tiebreaker               3
Name: Round, dtype: int64
# get number of categories by round
catbyround = jeopardy[["Round", "Category", "Question"]].groupby(["Round", "Category"]).agg({'Question': 'count'})

# calculate the proportion of questions for each category in that round
catbyround["Percentage"] = catbyround.groupby(level=0).apply(lambda x: 100 * x / float(x.sum()))

# check results
catbyround

Since there are many categories in each round, we need to condense the output into a more readable format. We can look at just the top 10 categories in each round by probability of occurrence.

# groupby Round then take the top 10 largest
catbyround['Percentage'].groupby(level=0, group_keys=False).nlargest(10)
Round             Category               
Double Jeopardy!  BEFORE & AFTER              0.424881
                  LITERATURE                  0.359733
                  SCIENCE                     0.279477
                  WORLD GEOGRAPHY             0.239822
                  OPERA                       0.236045
                  WORLD HISTORY               0.223771
                  BALLET                      0.217161
                  COLLEGES & UNIVERSITIES     0.207720
                  ART                         0.202999
                  ISLANDS                     0.202999
Final Jeopardy!   U.S. PRESIDENTS             1.377031
                  WORD ORIGINS                0.936381
                  STATE CAPITALS              0.853759
                  AUTHORS                     0.743597
                  WORLD LEADERS               0.716056
                  FAMOUS AMERICANS            0.660975
                  FAMOUS NAMES                0.660975
                  HISTORIC NAMES              0.660975
                  BUSINESS & INDUSTRY         0.633434
                  ORGANIZATIONS               0.523272
Jeopardy!         POTPOURRI                   0.237466
                  STUPID ANSWERS              0.237466
                  SPORTS                      0.235603
                  ANIMALS                     0.216978
                  AMERICAN HISTORY            0.211391
                  SCIENCE                     0.202079
                  STATE CAPITALS              0.195560
                  TELEVISION                  0.186247
                  U.S. CITIES                 0.181591
                  BUSINESS & INDUSTRY         0.172279
Tiebreaker        CHILD'S PLAY               33.333333
                  LITERARY CHARACTERS        33.333333
                  THE AMERICAN REVOLUTION    33.333333
Name: Percentage, dtype: float64

The above table is a nice summary of the probability of each category occurring within each round. Some notable observations:

  • Tiebreakers have only three categories, each with an equal probability of occurrence, since only three tiebreaker questions exist in the dataset.
  • U.S. Presidents is a very popular topic in the Final Jeopardy! round, with a roughly 1.38% chance of occurrence, which is high relative to other categories.
  • Literature, science and sports appear among the popular topics across rounds.

Summary

Here, we have explored the wonderful world of Jeopardy and tried to find information that could give us an edge in preparing for the game and ultimately winning it.

For a complete jupyter notebook file please refer to the github page of the project.
