This is my project that uses machine learning to predict the winners of each of the five major individual NBA awards (Most Valuable Player, Defensive Player of the Year, Rookie of the Year, Most Improved Player, and Sixth Man of the Year). On this website, you will find the notebooks that I created in order to scrape my data, clean and process it, and create my models.
You can find the source code and data for this project on GitHub at:
https://github.com/tomerzur/NBA-Award-Prediction
If you want to read the final report for the project, you can find it at:
https://github.com/tomerzur/NBA-Award-Prediction/blob/main/Final%20Report.pdf
Image Source: https://hoopshype.com/wp-content/uploads/sites/92/2019/10/gettyimages-168122500.jpg?w=1000&h=600&crop=1
import numpy as np
import pandas as pd
import urllib.request as urllib
from bs4 import BeautifulSoup, Comment
from selenium import webdriver
from datetime import datetime
import time
import random
import os
import string
To create the award models, the first step is to run this notebook, which scrapes all of the data we will use. The data comes from basketball-reference.com, and we scrape it with geckodriver, which automatically opens a Firefox window. The driver visits basketball-reference.com, opening each player profile page to collect every player's season data and each award voting page to collect the award data. In total, this script runs for a few hours before it has scraped all of the data.
First, set the path of your geckodriver. I put my geckodriver file in the same directory as this project.
PATH=os.path.abspath(os.getcwd()) + '/geckodriver'
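Note that I am using the older Selenium API here. If you are on Selenium 4 or newer, the executable_path argument to webdriver.Firefox is deprecated (and later removed) in favor of a Service object; a minimal equivalent sketch (assuming Selenium 4+ and a geckodriver binary at PATH) would be:
# Selenium 4+ driver setup (sketch; not the version used in this project)
from selenium.webdriver.firefox.service import Service
browser = webdriver.Firefox(service=Service(PATH))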
I will create two dataframes that will be used to store all of the data that I am scraping.
# get totals
player_seasons = pd.DataFrame(columns=['player', 'season', 'age', 'team', 'position', 'g', 'gs', 'mp', 'fg', 'fga', 'fg_pct',
'three_p', 'three_pa', 'three_pct', 'two_p', 'two_pa', 'two_pct', 'efg', 'ft',
'fta', 'ft_pct', 'orb', 'drb', 'trb', 'ast', 'stl', 'blk', 'tov','pf', 'pts', 'trp_dbl'])
player_seasons.set_index(['player', 'season'], inplace = True)
award_data = pd.DataFrame(columns=['player', 'season', 'award', 'first_place_votes', 'award_pts_won', 'award_pts_max'])
award_data.set_index(['player', 'season'], inplace = True)
# get url of award voting results for a given year
def get_award_url(year):
return f"https://www.basketball-reference.com/awards/awards_{year}.html"
# extract award data from beautiful soup object and add it to award_rows_list
def scrape_award_data(award_name, soup):
#get rows of award votes from table
awardTable = soup.find("table", {"id": award_name})
if awardTable is None:
awardTable = soup.find("table", {"id": f"nba_{award_name}"})
if awardTable is not None:
awardRows = awardTable.find("tbody").find_all("tr")
print(f"Got rows of {award_name} players from table, starting to iterate through rows")
#iterate through votes on page, filling data into award_data dataframe
for row in awardRows:
if row.get('class') == None:
player_name = row.find("td", {"data-stat":"player"}).find("a").get_text()
first_place_votes = row.find("td", {"data-stat":"votes_first"}).get_text()
award_pts_won = row.find("td", {"data-stat":"points_won"}).get_text()
award_pts_max = row.find("td", {"data-stat":"points_max"}).get_text()
award_rows_list.append({'player': player_name, 'season': year, 'award': award_name,
'first_place_votes': first_place_votes, 'award_pts_won': award_pts_won,
'award_pts_max': award_pts_max})
browser = webdriver.Firefox(executable_path = PATH)
award_rows_list = []
# award voting data is available on bbref from 1956
years = range(1956, datetime.now().year)
for year in years:
browser.get(get_award_url(year))
time.sleep(3)
source = browser.page_source
soup = BeautifulSoup(source, 'html.parser')
print(f"Year: {year}")
scrape_award_data('mvp', soup)
scrape_award_data('roy', soup)
scrape_award_data('dpoy', soup)
scrape_award_data('smoy', soup)
scrape_award_data('mip', soup)
time.sleep(random.randint(0,1))
browser.close()
award_data = pd.DataFrame(award_rows_list, columns=['player', 'season', 'award', 'first_place_votes', 'award_pts_won', 'award_pts_max'])
award_data.set_index(['player', 'season'], inplace = True)
Year: 1956
Got rows of mvp players from table, starting to iterate through rows
Year: 1957
Got rows of mvp players from table, starting to iterate through rows
[... similar progress output repeats for each year through 2020 ...]
Year: 2020
Got rows of mvp players from table, starting to iterate through rows
Got rows of roy players from table, starting to iterate through rows
Got rows of dpoy players from table, starting to iterate through rows
Got rows of smoy players from table, starting to iterate through rows
Got rows of mip players from table, starting to iterate through rows
award_data.to_csv('data/award_data.csv')
# scrape all of the player basic stats data for each season
# get url of all players whose last name starts with the given letter
def get_letter_url(letter):
return f"https://www.basketball-reference.com/players/{letter}/"
browser = webdriver.Firefox(executable_path = PATH)
player_season_list = []
for letter in string.ascii_lowercase[::-1]:
print(f"Letter: {letter}")
time.sleep(random.randint(0,1))
html = urllib.urlopen(get_letter_url(letter))
soup = BeautifulSoup(html.read(), 'html.parser')
html.close()
#get rows of players from table
playerTable = soup.find("table", {"id":"players"})
playerRows = playerTable.find("tbody").find_all("tr")
#iterate through players on page, filling data into players dataframe
for row in playerRows:
if row.get('class') == None:
player_name = row.find("th", {"data-stat":"player"}).find("a").get_text()
player_link = row.find("th", {"data-stat":"player"}).find("a")['href']
full_player_link = f"https://www.basketball-reference.com{player_link}"
print(player_name)
time.sleep(random.randint(0,1))
browser.get(full_player_link)
source = browser.page_source
player_soup = BeautifulSoup(source, 'html.parser')
totalsTable = player_soup.find("table", {"id":"totals"})
totalsRows = totalsTable.find("tbody").find_all("tr")
time.sleep(random.randint(0,1))
prev_yr = 0
for row_t in totalsRows:
league_soup = row_t.find("td", {"data-stat":"lg_id"})
if league_soup is not None and league_soup.find("a") is not None:
league = league_soup.find("a").get_text()
else:
league = 'N/A'
if league == "NBA":
season_str = row_t.find("th", {"data-stat":"season"}).find("a").get_text()[0:4]
year = int(season_str) + 1
if year == prev_yr:
team = row_t.find("td", {"data-stat":"team_id"}).find("a").get_text() + " "
update_team = player_season_list[-1]
update_team['team'] = update_team['team'] + team
player_season_list[-1] = update_team
else:
team_soup = row_t.find("td", {"data-stat":"team_id"})
if team_soup.find("a") is not None:
team = row_t.find("td", {"data-stat":"team_id"}).find("a").get_text() + " "
else:
team = ""
age = row_t.find("td", {"data-stat":"age"}).get_text()
position = row_t.find("td", {"data-stat":"pos"}).get_text()
g = row_t.find("td", {"data-stat":"g"}).get_text()
gs = row_t.find("td", {"data-stat":"gs"}).get_text()
mp = row_t.find("td", {"data-stat":"mp"}).get_text()
fg = row_t.find("td", {"data-stat":"fg"}).get_text()
fga = row_t.find("td", {"data-stat":"fga"}).get_text()
fg_pct = row_t.find("td", {"data-stat":"fg_pct"}).get_text()
if row_t.find("td", {"data-stat":"fg3"}) is not None:
three_p = row_t.find("td", {"data-stat":"fg3"}).get_text()
else:
three_p = 0
if row_t.find("td", {"data-stat":"fg3a"}) is not None:
three_pa = row_t.find("td", {"data-stat":"fg3a"}).get_text()
else:
three_pa = 0
if row_t.find("td", {"data-stat":"fg3_pct"}) is not None:
three_pct = row_t.find("td", {"data-stat":"fg3_pct"}).get_text()
else:
three_pct = 0
if row_t.find("td", {"data-stat":"fg2"}) is not None:
two_p = row_t.find("td", {"data-stat":"fg2"}).get_text()
else:
two_p = fg
if row_t.find("td", {"data-stat":"fg2a"}) is not None:
two_pa = row_t.find("td", {"data-stat":"fg2a"}).get_text()
else:
two_pa = fga
if row_t.find("td", {"data-stat":"fg2_pct"}) is not None:
two_pct = row_t.find("td", {"data-stat":"fg2_pct"}).get_text()
else:
two_pct = fg_pct
if row_t.find("td", {"data-stat":"efg_pct"}) is not None:
efg = row_t.find("td", {"data-stat":"efg_pct"}).get_text()
else:
efg = fg_pct
ft = row_t.find("td", {"data-stat":"ft"}).get_text()
fta = row_t.find("td", {"data-stat":"fta"}).get_text()
ft_pct = row_t.find("td", {"data-stat":"ft_pct"}).get_text()
if row_t.find("td", {"data-stat":"orb"}) is not None:
orb = row_t.find("td", {"data-stat":"orb"}).get_text()
else:
orb = ''
if row_t.find("td", {"data-stat":"drb"}) is not None:
drb = row_t.find("td", {"data-stat":"drb"}).get_text()
else:
drb = ''
trb = row_t.find("td", {"data-stat":"trb"}).get_text()
ast = row_t.find("td", {"data-stat":"ast"}).get_text()
if row_t.find("td", {"data-stat":"stl"}) is not None:
stl = row_t.find("td", {"data-stat":"stl"}).get_text()
else:
stl = ''
if row_t.find("td", {"data-stat":"blk"}) is not None:
blk = row_t.find("td", {"data-stat":"blk"}).get_text()
else:
blk = ''
if row_t.find("td", {"data-stat":"tov"}) is not None:
tov = row_t.find("td", {"data-stat":"tov"}).get_text()
else:
tov = ''
pf = row_t.find("td", {"data-stat":"pf"}).get_text()
pts = row_t.find("td", {"data-stat":"pts"}).get_text()
trp_dbl_soup = row_t.find("td", {"data-stat":"trp_dbl"})
if trp_dbl_soup is None:
trp_dbl = ''
else:
trp_dbl = trp_dbl_soup.get_text()
player_season_list.append({'player': player_name, 'season': year, 'age': age, 'team': team.strip(),
'position': position, 'g': g, 'gs': gs, 'mp': mp, 'fg': fg, 'fga': fga,
'fg_pct': fg_pct, 'three_p': three_p, 'three_pa': three_pa,
'three_pct': three_pct, 'two_p': two_p, 'two_pa': two_pa, 'two_pct': two_pct,
'efg': efg, 'ft': ft, 'fta': fta, 'ft_pct': ft_pct, 'orb': orb, 'drb': drb,
'trb': trb, 'ast': ast, 'stl': stl, 'blk': blk, 'tov': tov,'pf': pf,
'pts': pts, 'trp_dbl': trp_dbl})
prev_yr = year
browser.close()
player_seasons = pd.DataFrame(player_season_list, columns=['player', 'season', 'age', 'team', 'position', 'g', 'gs', 'mp', 'fg', 'fga', 'fg_pct',
'three_p', 'three_pa', 'three_pct', 'two_p', 'two_pa', 'two_pct', 'efg', 'ft',
'fta', 'ft_pct', 'orb', 'drb', 'trb', 'ast', 'stl', 'blk', 'tov','pf', 'pts', 'trp_dbl'])
player_seasons.set_index(['player', 'season'], inplace = True)
player_seasons
player_seasons.to_csv('data/player_seasons.csv')
import pandas as pd
import matplotlib.pyplot as plt
In this notebook, we do some exploratory data analysis. We look at the distribution of award points among players and how it has changed in recent years compared to the early years. We also look at how correlated each feature is with our target variable (award points).
This notebook is intended to be run after the award-specific datasets have been created by the 'get_full_dataset' notebook, since it loads the per-award CSV files that notebook produces.
# mvp
mvp_data = pd.read_csv('data/mvp_data.csv')
mvp_data_1956 = mvp_data[mvp_data['season'] == 1956]
mvp_data_2020 = mvp_data[mvp_data['season'] == 2020]
# dpoy
dpoy_data = pd.read_csv('data/dpoy_data.csv')
dpoy_data_1983 = dpoy_data[dpoy_data['season'] == 1983]
dpoy_data_2020 = dpoy_data[dpoy_data['season'] == 2020]
# roy
roy_data = pd.read_csv('data/roy_data.csv')
roy_data_1966 = roy_data[roy_data['season'] == 1966]
roy_data_2020 = roy_data[roy_data['season'] == 2020]
# mip
mip_data = pd.read_csv('data/mip_data.csv')
mip_data_1986 = mip_data[mip_data['season'] == 1986]
mip_data_2020 = mip_data[mip_data['season'] == 2020]
# smoy
smoy_data = pd.read_csv('data/smoy_data.csv')
smoy_data_1984 = smoy_data[smoy_data['season'] == 1984]
smoy_data_2020 = smoy_data[smoy_data['season'] == 2020]
plt.hist(mvp_data_1956['award_pts_won'], bins=10)
plt.ylabel('# of players')
plt.xlabel('Award Points Received')
plt.title("MVP Voting distribution (1956)")
plt.show()
plt.hist(mvp_data_2020['award_pts_won'], bins=10)
plt.ylabel('# of players')
plt.xlabel('Award Points Received')
plt.title("MVP Voting distribution (2020)")
plt.show()
plt.hist(dpoy_data_1983['award_pts_won'], bins=10)
plt.ylabel('# of players')
plt.xlabel('Award Points Received')
plt.title("DPOY Voting distribution (1983)")
plt.show()
plt.hist(dpoy_data_2020['award_pts_won'], bins=10)
plt.ylabel('# of players')
plt.xlabel('Award Points Received')
plt.title("DPOY Voting distribution (2020)")
plt.show()
plt.hist(roy_data_1966['award_pts_won'], bins=10)
plt.ylabel('# of players')
plt.xlabel('Award Points Received')
plt.title("ROY Voting distribution (1966)")
plt.show()
plt.hist(roy_data_2020['award_pts_won'], bins=10)
plt.ylabel('# of players')
plt.xlabel('Award Points Received')
plt.title("ROY Voting distribution (2020)")
plt.show()
plt.hist(mip_data_1986['award_pts_won'], bins=10)
plt.ylabel('# of players')
plt.xlabel('Award Points Received')
plt.title("MIP Voting distribution (1986)")
plt.show()
plt.hist(mip_data_2020['award_pts_won'], bins=10)
plt.ylabel('# of players')
plt.xlabel('Award Points Received')
plt.title("MIP Voting distribution (2020)")
plt.show()
plt.hist(smoy_data_1984['award_pts_won'], bins=10)
plt.ylabel('# of players')
plt.xlabel('Award Points Received')
plt.title("SMOY Voting distribution (1983)")
plt.show()
plt.hist(smoy_data_2020['award_pts_won'], bins=10)
plt.ylabel('# of players')
plt.xlabel('Award Points Received')
plt.title("SMOY Voting distribution (2020)")
plt.show()
mvp_data.corr()['award_pts_won']
Unnamed: 0 0.019247 season 0.002749 age 0.020178 g 0.086676 gs 0.169350 mp 0.173564 fg 0.267111 fga 0.244054 fg_pct 0.071227 three_p 0.114388 three_pa 0.116982 three_pct 0.019931 two_p 0.256113 two_pa 0.230353 two_pct 0.069625 efg 0.066746 ft 0.304983 fta 0.307409 ft_pct 0.038090 orb 0.128628 drb 0.223288 trb 0.206886 ast 0.213928 stl 0.184815 blk 0.172898 tov 0.227186 pf 0.090094 pts 0.282150 trp_dbl 0.274015 first_place_votes 0.762333 award_pts_won 1.000000 award_pts_max 0.024172 Name: award_pts_won, dtype: float64
dpoy_data.corr()['award_pts_won']
Unnamed: 0 0.011335 season 0.038860 age 0.006051 g 0.058056 gs 0.111431 mp 0.098778 fg 0.089285 fga 0.075712 fg_pct 0.054293 three_p 0.018555 three_pa 0.020388 three_pct -0.020168 two_p 0.092770 two_pa 0.079960 two_pct 0.048388 efg 0.046254 ft 0.103890 fta 0.127202 ft_pct -0.021682 orb 0.128232 drb 0.177603 trb 0.167210 ast 0.053317 stl 0.102887 blk 0.205363 tov 0.085688 pf 0.075432 pts 0.091997 trp_dbl 0.066889 first_place_votes 0.828236 award_pts_won 1.000000 award_pts_max 0.056521 Name: award_pts_won, dtype: float64
roy_data.corr()['award_pts_won']
Unnamed: 0 -0.003692 season 0.034789 age -0.078611 g 0.043330 gs 0.074855 mp 0.067852 fg 0.070875 fga 0.074720 fg_pct 0.010842 three_p 0.044714 three_pa 0.050558 three_pct 0.015627 two_p 0.064556 two_pa 0.066084 two_pct 0.011416 efg 0.012403 ft 0.070829 fta 0.072894 ft_pct 0.016343 orb 0.045228 drb 0.063805 trb 0.056960 ast 0.062237 stl 0.056279 blk 0.039203 tov 0.089210 pf 0.047087 pts 0.073712 trp_dbl 0.031105 first_place_votes 0.741657 award_pts_won 1.000000 award_pts_max 0.050702 Name: award_pts_won, dtype: float64
mip_data.corr()['award_pts_won']
Unnamed: 0 0.010919 season 0.041060 age -0.057126 g 0.070200 gs 0.117863 mp 0.113685 fg 0.129356 fga 0.124338 fg_pct 0.038491 three_p 0.092617 three_pa 0.090196 three_pct 0.018871 two_p 0.117753 two_pa 0.112425 two_pct 0.036240 efg 0.041010 ft 0.126453 fta 0.123841 ft_pct 0.029229 orb 0.080966 drb 0.112896 trb 0.106205 ast 0.093343 stl 0.100106 blk 0.067625 tov 0.119548 pf 0.086820 pts 0.133474 trp_dbl 0.036710 first_place_votes 0.849248 award_pts_won 1.000000 award_pts_max 0.061515 Name: award_pts_won, dtype: float64
smoy_data.corr()['award_pts_won']
Unnamed: 0 0.019247 season 0.002749 age 0.020178 g 0.086676 gs 0.169350 mp 0.173564 fg 0.267111 fga 0.244054 fg_pct 0.071227 three_p 0.114388 three_pa 0.116982 three_pct 0.019931 two_p 0.256113 two_pa 0.230353 two_pct 0.069625 efg 0.066746 ft 0.304983 fta 0.307409 ft_pct 0.038090 orb 0.128628 drb 0.223288 trb 0.206886 ast 0.213928 stl 0.184815 blk 0.172898 tov 0.227186 pf 0.090094 pts 0.282150 trp_dbl 0.274015 first_place_votes 0.762333 award_pts_won 1.000000 award_pts_max 0.024172 Name: award_pts_won, dtype: float64
import numpy as np
import pandas as pd
In this notebook I join the award data and player data and then split up the data into individual datasets to be used for each award. I also add a few new features to my dataset (such as % of games started and season number).
This notebook should be run after you have successfully scraped the data using the 'scrape_data' notebook.
award_data = pd.read_csv('data/award_data.csv')
award_data.head()
| | player | season | award | first_place_votes | award_pts_won | award_pts_max |
|---|---|---|---|---|---|---|
| 0 | Bob Pettit | 1956 | mvp | 33.0 | 33.0 | 80 |
| 1 | Paul Arizin | 1956 | mvp | 21.0 | 21.0 | 80 |
| 2 | Bob Cousy | 1956 | mvp | 11.0 | 11.0 | 80 |
| 3 | Mel Hutchins | 1956 | mvp | 9.0 | 9.0 | 80 |
| 4 | Dolph Schayes | 1956 | mvp | 2.0 | 2.0 | 80 |
player_season_data = pd.read_csv('data/player_seasons.csv')
player_season_data.tail()
| | player | season | age | team | position | g | gs | mp | fg | fga | ... | orb | drb | trb | ast | stl | blk | tov | pf | pts | trp_dbl |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 22383 | Kelenna Azubuike | 2008 | 24 | GSW | SG | 81 | 17.0 | 1732.0 | 261 | 587 | ... | 107.0 | 217.0 | 324.0 | 75 | 45.0 | 34.0 | 58.0 | 159 | 692 | NaN |
| 22384 | Kelenna Azubuike | 2009 | 25 | GSW | SF | 74 | 51.0 | 2375.0 | 392 | 845 | ... | 112.0 | 258.0 | 370.0 | 117 | 57.0 | 50.0 | 95.0 | 183 | 1063 | NaN |
| 22385 | Kelenna Azubuike | 2010 | 26 | GSW | SF | 9 | 7.0 | 231.0 | 48 | 88 | ... | 12.0 | 29.0 | 41.0 | 10 | 5.0 | 9.0 | 7.0 | 16 | 125 | NaN |
| 22386 | Kelenna Azubuike | 2012 | 28 | DAL | SG | 3 | 0.0 | 18.0 | 3 | 8 | ... | 0.0 | 0.0 | 0.0 | 0 | 1.0 | 0.0 | 4.0 | 1 | 7 | NaN |
| 22387 | Udoka Azubuike | 2021 | 21 | UTA | C | 12 | 0.0 | 49.0 | 4 | 7 | ... | 4.0 | 9.0 | 13.0 | 0 | 1.0 | 4.0 | 3.0 | 8 | 12 | NaN |
5 rows × 31 columns
# add season number as column to data
player_season_data["season_num"] = player_season_data.groupby("player")["season"].rank(method="first", ascending=True)
# add games_started_pct feature
player_season_data["gs_pct"] = player_season_data["gs"] / player_season_data["g"]
player_season_data.tail()
| | player | season | age | team | position | g | gs | mp | fg | fga | ... | trb | ast | stl | blk | tov | pf | pts | trp_dbl | season_num | gs_pct |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 22383 | Kelenna Azubuike | 2008 | 24 | GSW | SG | 81 | 17.0 | 1732.0 | 261 | 587 | ... | 324.0 | 75 | 45.0 | 34.0 | 58.0 | 159 | 692 | NaN | 2.0 | 0.209877 |
| 22384 | Kelenna Azubuike | 2009 | 25 | GSW | SF | 74 | 51.0 | 2375.0 | 392 | 845 | ... | 370.0 | 117 | 57.0 | 50.0 | 95.0 | 183 | 1063 | NaN | 3.0 | 0.689189 |
| 22385 | Kelenna Azubuike | 2010 | 26 | GSW | SF | 9 | 7.0 | 231.0 | 48 | 88 | ... | 41.0 | 10 | 5.0 | 9.0 | 7.0 | 16 | 125 | NaN | 4.0 | 0.777778 |
| 22386 | Kelenna Azubuike | 2012 | 28 | DAL | SG | 3 | 0.0 | 18.0 | 3 | 8 | ... | 0.0 | 0 | 1.0 | 0.0 | 4.0 | 1 | 7 | NaN | 5.0 | 0.000000 |
| 22387 | Udoka Azubuike | 2021 | 21 | UTA | C | 12 | 0.0 | 49.0 | 4 | 7 | ... | 13.0 | 0 | 1.0 | 4.0 | 3.0 | 8 | 12 | NaN | 1.0 | 0.000000 |
5 rows × 33 columns
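As a quick illustration of what the rank(method="first") call above does, here is a toy example on a made-up mini-frame (not part of the real dataset): it numbers each player's seasons 1, 2, 3, ... in chronological order, which is exactly what season_num represents.
# toy illustration of groupby + rank(method="first"); 'demo' is a hypothetical mini-frame
demo = pd.DataFrame({"player": ["A", "A", "B"], "season": [2019, 2020, 2020]})
demo["season_num"] = demo.groupby("player")["season"].rank(method="first", ascending=True)
# demo["season_num"] is now [1.0, 2.0, 1.0]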
award_data.set_index(['player', 'season'], inplace=True)
player_season_data.set_index(['player', 'season'], inplace=True)
# fg_pct, two_pct, efg, ft_pct - fill with 0 if NA (because there were no shots attempted - pct should be 0)
# trp dbl - if na, fill with 0
player_season_data[['fg_pct', 'two_pct', 'efg', 'ft_pct', 'trp_dbl']] = player_season_data[['fg_pct', 'two_pct', 'efg', 'ft_pct', 'trp_dbl']].fillna(0)
player_season_data.fillna('N/A', inplace=True)
First, I split up my award data into five datasets (one dataset for each award).
mvp_data = award_data[award_data['award'] == 'mvp']
dpoy_data = award_data[award_data['award'] == 'dpoy']
roy_data = award_data[award_data['award'] == 'roy']
mip_data = award_data[award_data['award'] == 'mip']
smoy_data = award_data[award_data['award'] == 'smoy']
Then, I join my player season data to each award's data in order to get my award-specific dataset.
mvp_data = player_season_data.join(mvp_data, on=['player', 'season'])
dpoy_data = player_season_data.join(dpoy_data, on=['player', 'season'])
roy_data = player_season_data.join(roy_data, on=['player', 'season'])
mip_data = player_season_data.join(mip_data, on=['player', 'season'])
smoy_data = player_season_data.join(smoy_data, on=['player', 'season'])
mvp_data = mvp_data.reset_index()
dpoy_data = dpoy_data.reset_index()
roy_data = roy_data.reset_index()
mip_data = mip_data.reset_index()
smoy_data = smoy_data.reset_index()
Next, I filter out player seasons from the individual award datasets that occurred before that award was established or after 2020 (the last full season of data I had when building these models).
mvp_data = mvp_data[mvp_data['season'] >= 1956]
dpoy_data = dpoy_data[dpoy_data['season'] >= 1983]
roy_data = roy_data[roy_data['season'] >= 1964]
mip_data = mip_data[mip_data['season'] >= 1986]
smoy_data = smoy_data[smoy_data['season'] >= 1984]
mvp_data = mvp_data[mvp_data['season'] <= 2020]
dpoy_data = dpoy_data[dpoy_data['season'] <= 2020]
roy_data = roy_data[roy_data['season'] <= 2020]
mip_data = mip_data[mip_data['season'] <= 2020]
smoy_data = smoy_data[smoy_data['season'] <= 2020]
mvp_data.fillna(value={'first_place_votes': 0, 'award_pts_won': 0, 'award_pts_max': 0}, inplace=True)
dpoy_data.fillna(value={'first_place_votes': 0, 'award_pts_won': 0, 'award_pts_max': 0}, inplace=True)
roy_data.fillna(value={'first_place_votes': 0, 'award_pts_won': 0, 'award_pts_max': 0}, inplace=True)
mip_data.fillna(value={'first_place_votes': 0, 'award_pts_won': 0, 'award_pts_max': 0}, inplace=True)
smoy_data.fillna(value={'first_place_votes': 0, 'award_pts_won': 0, 'award_pts_max': 0}, inplace=True)
Then, I calculate the maximum number of award points that a player in a given season could have received and add it as a feature to my award datasets. This number varies from year to year because the number of voters changes: it ranges from under 100 in the early years, when few people voted for each award, to over 1,000 in recent years, when many media members vote. It is important to include this feature because it has a big impact on the number of award points that the leading vote-getters receive.
def get_yearly_max_pts(award_data):
max_pts_by_yr = {}
for yr in range(1950, 2100):
max_pts_by_yr[yr] = 0
for index, row in award_data.iterrows():
year = row['season']
award_pts_max = row['award_pts_max']
if max_pts_by_yr[year] < award_pts_max:
max_pts_by_yr[year] = award_pts_max
return max_pts_by_yr
mvp_max_pts_by_yr = get_yearly_max_pts(mvp_data)
dpoy_max_pts_by_yr = get_yearly_max_pts(dpoy_data)
roy_max_pts_by_yr = get_yearly_max_pts(roy_data)
mip_max_pts_by_yr = get_yearly_max_pts(mip_data)
smoy_max_pts_by_yr = get_yearly_max_pts(smoy_data)
mvp_data['award_pts_max'] = mvp_data['season'].map(mvp_max_pts_by_yr)
dpoy_data['award_pts_max'] = dpoy_data['season'].map(dpoy_max_pts_by_yr)
roy_data['award_pts_max'] = roy_data['season'].map(roy_max_pts_by_yr)
mip_data['award_pts_max'] = mip_data['season'].map(mip_max_pts_by_yr)
smoy_data['award_pts_max'] = smoy_data['season'].map(smoy_max_pts_by_yr)
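As an aside, the same yearly maximum can also be computed more concisely with a groupby. This is just an equivalent sketch for the MVP dataset (running it after the cell above would not change anything); the other awards would follow the same pattern.
# equivalent, more concise computation of the yearly maximum award points (sketch)
mvp_max_pts_alt = mvp_data.groupby('season')['award_pts_max'].max()
mvp_data['award_pts_max'] = mvp_data['season'].map(mvp_max_pts_alt)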
Next, I filter out non-rookies from my rookie of the year data.
# filter out non-rookies for roy data
roy_data = roy_data[roy_data['season_num'] == 1]
Then, in the next two cells, I add columns with net-change features for each stat in the Most Improved Player dataset. These are important for predicting the Most Improved Player award because they capture how much a player improved in each statistic compared to their previous season.
# add net change compared to previous season for mip data
stat_cols = ['g', 'gs', 'mp', 'fg', 'fga', 'fg_pct', 'three_p', 'three_pa',
'three_pct', 'two_p', 'two_pa', 'two_pct', 'efg', 'ft', 'fta',
'ft_pct', 'orb', 'drb', 'trb', 'ast', 'stl', 'blk', 'tov',
'pf', 'pts', 'gs_pct']
net_change_cols = [f'net_{stat}' for stat in stat_cols]
mip_data[net_change_cols] = pd.DataFrame([np.zeros(len(net_change_cols))], index=mip_data.index)
mip_data = mip_data.sort_values(by=['player', 'season_num'])
mip_data = mip_data.reset_index()
for row_num in range(len(mip_data)):
cur_row = mip_data.iloc[row_num]
season_num = cur_row['season_num']
if season_num > 1:
prev_row = mip_data.iloc[row_num - 1]
for stat in stat_cols:
cur_yr_stat = cur_row[stat]
prev_yr_stat = prev_row[stat]
if cur_yr_stat == 'N/A' or prev_yr_stat == 'N/A':
stat_change = 0
else:
stat_change = cur_row[stat] - prev_row[stat]
cur_row[f'net_{stat}'] = stat_change
mip_data.iloc[row_num] = cur_row
/Users/tomer/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:14: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
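A vectorized alternative that avoids both the explicit loop and the SettingWithCopyWarning is sketched below. It assumes mip_data is still sorted by player and season as above, and it coerces any 'N/A' strings to NaN before differencing, so those changes become 0 just as in the loop. This is only a sketch of an equivalent approach, not the code I used for my results.
# vectorized sketch of the same net-change computation (alternative to the loop above)
numeric_stats = mip_data[stat_cols].apply(pd.to_numeric, errors='coerce')
net_changes = numeric_stats.groupby(mip_data['player']).diff()
mip_data[net_change_cols] = net_changes.fillna(0).values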
I also need to filter out other players who are ineligible for specific awards: I filter rookies out of the Most Improved Player data, and I filter starters (players who started at least 50% of the games they played in) out of the Sixth Man of the Year data.
# filter out rookies for mip data
mip_data = mip_data[mip_data['season_num'] != 1]
# filter out starters from smoy data
smoy_data = smoy_data[smoy_data['gs_pct'] < 0.5]
mvp_data.to_csv('data/mvp_data.csv')
dpoy_data.to_csv('data/dpoy_data.csv')
roy_data.to_csv('data/roy_data.csv')
mip_data.to_csv('data/mip_data.csv')
smoy_data.to_csv('data/smoy_data.csv')
# drop award, first place vote columns
# one hot encode position, team
# train test development split
import pandas as pd
In this notebook I clean up the datasets I have created, one-hot encode the categorical variables (position and team), split the data into train, validation, and test sets, and separate the x values (features) from the y value (the award points a player received).
This notebook should be run after running the 'get_full_dataset' notebook.
mvp_data = pd.read_csv('data/mvp_data.csv')
dpoy_data = pd.read_csv('data/dpoy_data.csv')
roy_data = pd.read_csv('data/roy_data.csv')
mip_data = pd.read_csv('data/mip_data.csv')
smoy_data = pd.read_csv('data/smoy_data.csv')
# if player played with multiple teams, replace team value with 'Multiple' (reduces num cols after one hot encoding from 1000+ to 103)
def cleanup_multiple_teams(team):
if len(team) > 3:
return 'Multiple'
else:
return team
mvp_data['team'] = mvp_data['team'].apply(cleanup_multiple_teams)
dpoy_data['team'] = dpoy_data['team'].apply(cleanup_multiple_teams)
roy_data['team'] = roy_data['team'].apply(cleanup_multiple_teams)
mip_data['team'] = mip_data['team'].apply(cleanup_multiple_teams)
smoy_data['team'] = smoy_data['team'].apply(cleanup_multiple_teams)
# get rid of columns that are not needed, one-hot encode categorical variables
def cleanup_cols(award_data):
award_data = award_data.drop(columns=['Unnamed: 0', 'award', 'first_place_votes'])
award_data = pd.get_dummies(award_data, columns=['position'])
award_data = pd.get_dummies(award_data, columns=['team'])
return award_data
mvp_data = cleanup_cols(mvp_data)
dpoy_data = cleanup_cols(dpoy_data)
roy_data = cleanup_cols(roy_data)
mip_data = cleanup_cols(mip_data)
smoy_data = cleanup_cols(smoy_data)
# split data into train, dev, and test data
def train_test_split(award_data):
test_data = award_data[award_data['season'] >= 2016]
dev_data = award_data[award_data['season'] >= 2011]
dev_data = dev_data[dev_data['season'] <= 2015]
train_data = award_data[award_data['season'] <= 2010]
test_data = test_data.reset_index(drop=True)
dev_data = dev_data.reset_index(drop=True)
train_data = train_data.reset_index(drop=True)
return train_data, dev_data, test_data
def x_y_split(award_data):
x_data = award_data.drop(columns=['award_pts_won'])
y_data = award_data[['award_pts_won']]
return x_data, y_data
# run functions defined above and save new datasets to csv files
for award_name, dataset in [('mvp', mvp_data), ('dpoy', dpoy_data), ('roy', roy_data), ('mip', mip_data), ('smoy', smoy_data)]:
train, dev, test = train_test_split(dataset)
x_train, y_train = x_y_split(train)
x_dev, y_dev = x_y_split(dev)
x_test, y_test = x_y_split(test)
x_train.to_csv(f'data/train_x_{award_name}.csv')
y_train.to_csv(f'data/train_y_{award_name}.csv')
x_dev.to_csv(f'data/dev_x_{award_name}.csv')
y_dev.to_csv(f'data/dev_y_{award_name}.csv')
x_test.to_csv(f'data/test_x_{award_name}.csv')
y_test.to_csv(f'data/test_y_{award_name}.csv')
import copy
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import rbo
In this notebook, I use a linear regression to make predictions for each award. It includes the code I used to create the linear regression models and generate predictions for each award.
This code should be run after you have finished running the 'preprocessing' notebook.
Here I define the helper functions that I wrote to load my data, scale it, fill in missing/NA values, and calculate and print my accuracy.
GET_TEST_RESULTS = True
# retrieve data for the specified award
def get_data(award):
x_train_pnames = pd.read_csv(f'data/train_x_{award}.csv', index_col=0)
y_train = pd.read_csv(f'data/train_y_{award}.csv', index_col=0)
x_dev_pnames = pd.read_csv(f'data/dev_x_{award}.csv', index_col=0)
y_dev = pd.read_csv(f'data/dev_y_{award}.csv', index_col=0)
x_test_pnames = pd.read_csv(f'data/test_x_{award}.csv', index_col=0)
y_test = pd.read_csv(f'data/test_y_{award}.csv', index_col=0)
x_train = x_train_pnames.drop(columns=['player', 'season'])
x_dev = x_dev_pnames.drop(columns=['player', 'season'])
x_test = x_test_pnames.drop(columns=['player', 'season'])
return x_train_pnames, x_train, y_train, x_dev_pnames, x_dev, y_dev, x_test_pnames, x_test, y_test
There are some missing statistics in my dataset. These occur mostly for older players, since some statistics (such as steals and blocks) were not tracked when the NBA first started. I deal with these missing statistics in three ways:
- Method 2 - fill in zeros for all of the missing stats, and use all of the data
- Method 4 - use linear regressions to predict each of the missing stats, and use all of the data
- Method 5 - fill in mean values for all of the missing stats, and use all of the data
Some other approaches to missing statistics that I may explore in the future are:
- Remove all of the rows with missing stats (this should end up using all player seasons after 1979)
- Drop all of the columns with missing stats, and use all of the data
# fill in the missing statistics (that were not kept track of for older players)
# using one of the three methods mentioned above
def fill_missing_stats(x_train, x_train_pnames, x_dev, x_dev_pnames, x_test, x_test_pnames, method=4):
if method == 2:
x_train_filled = x_train.fillna(0)
x_dev_filled = x_dev.fillna(0)
x_test_filled = x_test.fillna(0)
elif method == 4:
x_train_filled = copy.copy(x_train_pnames)
x_dev_filled = copy.copy(x_dev_pnames)
x_test_filled = copy.copy(x_test_pnames)
# three_p and three_pa - if season < 1980, set to NA
x_train_filled['three_p'] = np.where(x_train_filled.season < 1980, float('NaN'), x_train_filled.three_p)
x_train_filled['three_pa'] = np.where(x_train_filled.season < 1980, float('NaN'), x_train_filled.three_pa)
x_train_filled = x_train_filled.drop(columns=['player', 'season'])
x_dev_filled = x_dev_filled.drop(columns=['player', 'season'])
x_test_filled = x_test_filled.drop(columns=['player', 'season'])
# predict all missing values for these stats using lin reg
# train data: player seasons where stat != NaN
# test data: player seasons where stat == NaN
for stat in ['three_p', 'three_pa', 'gs', 'orb', 'drb', 'stl', 'blk', 'tov']:
# check if this stat has missing values
if x_train_filled[stat].isnull().values.any():
train_stat = x_train_filled[x_train_filled[stat].notna()]
# for all lin reg predictions, don't use any of the columns that have missing values
x_train_stat = train_stat.drop(columns=['three_p', 'three_pa', 'gs', 'orb', 'drb', 'stl', 'blk', 'tov', 'three_pct', 'gs_pct'])
y_train_stat = train_stat[[stat]]
test_stat = x_train_filled[~x_train_filled[stat].notna()]
x_test_stat = test_stat.drop(columns=['three_p', 'three_pa', 'gs', 'orb', 'drb', 'stl', 'blk', 'tov', 'three_pct', 'gs_pct'])
reg_stat = LinearRegression().fit(x_train_stat, y_train_stat)
pred_test_stat = reg_stat.predict(x_test_stat)
pred_stat_dict = {index: round(value[0], 3) for index, value in zip(x_train_filled[~x_train_filled[stat].notna()].index, pred_test_stat)}
x_train_filled[stat] = x_train_filled[stat].fillna(pred_stat_dict)
# fill in three_pct - three_p / three_pa, gs_pct - gs / g
x_train_filled['three_pct'] = x_train_filled['three_p'] / x_train_filled['three_pa']
x_train_filled['gs_pct'] = x_train_filled['gs'] / x_train_filled['g']
x_train_filled = x_train_filled.fillna(0)
x_dev_filled['three_pct'] = x_dev_filled['three_p'] / x_dev_filled['three_pa']
x_dev_filled = x_dev_filled.fillna(0)
x_test_filled['three_pct'] = x_test_filled['three_p'] / x_test_filled['three_pa']
x_test_filled = x_test_filled.fillna(0)
elif method == 5:
x_train_filled = x_train.fillna(x_train.mean())
x_dev_filled = x_dev.fillna(x_dev.mean())
x_test_filled = x_test.fillna(x_test.mean())
else:
print('method of dealing with missing stats must be either 2, 4, or 5')
return x_train_filled, x_dev_filled, x_test_filled
I decided to measure my models' performance using three metrics: mean squared error, % of correct winner predictions, and rank-biased overlap. I chose rank-biased overlap because it is an accuracy metric for rankings that weights higher-ranked items more than lower-ranked items. In addition, when comparing two lists, rank-biased overlap can handle items that appear in one list but not in the other. For more information on rank-biased overlap, see this article: http://codalism.com/research/papers/wmz10_tois.pdf
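As a quick illustration of how the metric behaves, here is a toy example using the same rbo package that the code below relies on (the lists are made up, not real voting results): identical rankings score 1.0, and the score drops as the top of the lists diverge.
# toy example of rank-biased overlap (lists are illustrative, not real voting results)
list_a = ['Player A', 'Player B', 'Player C']
list_b = ['Player B', 'Player A', 'Player C']
print(rbo.RankingSimilarity(list_a, list_a).rbo())  # 1.0 - identical rankings
print(rbo.RankingSimilarity(list_a, list_b).rbo())  # below 1.0 - top two swapped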
# print accuracy metrics by comparing the given lists (y_pred and y_actual)
# accuracy metrics that I am using are % of correct winners picked, rank biased overlap, and mean squared error
def print_accuracy(y_pred, y_actual, x_pnames, rbo_cutoff = None, verbose=0):
all_data = copy.copy(x_pnames)
all_data['award_pts_actual'] = y_actual['award_pts_won']
all_data['award_pts_pred'] = y_pred
num_correct = 0
num_yrs = 0
rbo_vals = []
for year in set(all_data.season):
# a. % of correct winners picked
data_in_yr = all_data[all_data['season'] == year]
pred_winner_row = data_in_yr['award_pts_pred'].argmax()
actual_winner_row = data_in_yr['award_pts_actual'].argmax()
pred_winner, pred_pts = data_in_yr.iloc[pred_winner_row]['player'], data_in_yr.iloc[pred_winner_row]['award_pts_pred']
actual_winner, actual_pts = data_in_yr.iloc[actual_winner_row]['player'], data_in_yr.iloc[actual_winner_row]['award_pts_actual']
if verbose > 0:
print(f'{year}')
print(f'Predicted Winner: {pred_winner} ({pred_pts} award pts)')
print(f'Actual Winner: {actual_winner} ({actual_pts} award pts)')
if pred_winner == actual_winner:
num_correct += 1
num_yrs += 1
# b. Rank-Biased Overlap
# calculate RBO:
# get rows in given year with players that received votes - sorted by num votes
vote_getters_df = data_in_yr[data_in_yr['award_pts_actual'] > 0]
num_vote_getters = len(vote_getters_df)
vote_getters_df = vote_getters_df.sort_values(by=['award_pts_actual'], ascending=False)
# get top-(num_vote_getters) rows from predictions
pred_vote_getters_df = data_in_yr.sort_values(by=['award_pts_pred'], ascending=False)
if rbo_cutoff == None:
pred_vote_getters_df = pred_vote_getters_df[:num_vote_getters]
else:
cutoff = min(rbo_cutoff, num_vote_getters)
vote_getters_df = vote_getters_df[:cutoff]
pred_vote_getters_df = pred_vote_getters_df[:cutoff]
vote_getters = vote_getters_df['player'].values
pred_vote_getters = pred_vote_getters_df['player'].values
# deal with edge case where two vote getters have the exact same name
# (dedupe while preserving ranking order)
vote_getters = list(dict.fromkeys(vote_getters))
pred_vote_getters = list(dict.fromkeys(pred_vote_getters))
#print(len(vote_getters))
#print(len(pred_vote_getters))
if verbose > 1:
print('Actual vote getters:')
print(vote_getters)
print(f'Predicted top-{num_vote_getters} vote getters:')
print(pred_vote_getters)
# compute RBO from these two lists
rbo_num = rbo.RankingSimilarity(vote_getters, pred_vote_getters).rbo()
rbo_vals.append(rbo_num)
if verbose > 0:
print(f'Rank Biased Overlap: {rbo_num}')
print(f'% of winners predicted correctly: {round(num_correct / num_yrs * 100, 2)}%')
print(f'Average Rank-Biased Overlap: {round(sum(rbo_vals) / len(rbo_vals), 3)}')
# c. MSE
mse = mean_squared_error(y_actual, y_pred)
print(f'Mean Squared Error for Linear Regression: {mse}')
Here, I implement a separate linear regression for predicting each of the five awards. I also print out the accuracy of each linear regression using the three accuracy metrics mentioned above.
for award in ['mvp', 'dpoy', 'roy', 'mip', 'smoy']:
print(f'\n\n*****{award}*****')
x_train_pnames, x_train, y_train, x_dev_pnames, x_dev, y_dev, x_test_pnames, x_test, y_test = get_data(award)
for fill_na_method in [4, 2, 5]:
print(f'\nfilling na values using method {fill_na_method}')
x_train_filled, x_dev_filled, x_test_filled = fill_missing_stats(x_train, x_train_pnames, x_dev, x_dev_pnames,
x_test, x_test_pnames, method=fill_na_method)
reg = LinearRegression().fit(x_train_filled, y_train)
train_pred = reg.predict(x_train_filled)
dev_pred = reg.predict(x_dev_filled)
print('\nTrain accuracy:')
print_accuracy(train_pred, y_train, x_train_pnames)
print('\nDev accuracy:')
print_accuracy(dev_pred, y_dev, x_dev_pnames)
*****mvp*****

filling na values using method 4

Train accuracy:
% of winners predicted correctly: 43.64%
Average Rank-Biased Overlap: 0.568
Mean Squared Error for Linear Regression: 2530.6445213701336

Dev accuracy:
% of winners predicted correctly: 60.0%
Average Rank-Biased Overlap: 0.698
Mean Squared Error for Linear Regression: 4041.8058795340175

filling na values using method 2

Train accuracy:
% of winners predicted correctly: 43.64%
Average Rank-Biased Overlap: 0.574
Mean Squared Error for Linear Regression: 2541.3643425513865

Dev accuracy:
% of winners predicted correctly: 40.0%
Average Rank-Biased Overlap: 0.673
Mean Squared Error for Linear Regression: 4001.1947572209865

filling na values using method 5

Train accuracy:
% of winners predicted correctly: 41.82%
Average Rank-Biased Overlap: 0.583
Mean Squared Error for Linear Regression: 2534.300604737239

Dev accuracy:
% of winners predicted correctly: 60.0%
Average Rank-Biased Overlap: 0.693
Mean Squared Error for Linear Regression: 4010.8966288581823

*****dpoy*****

filling na values using method 4

Train accuracy:
% of winners predicted correctly: 28.57%
Average Rank-Biased Overlap: 0.342
Mean Squared Error for Linear Regression: 204.7148158312748

Dev accuracy:
% of winners predicted correctly: 20.0%
Average Rank-Biased Overlap: 0.36
Mean Squared Error for Linear Regression: 523.1768281454953

filling na values using method 2

Train accuracy:
% of winners predicted correctly: 28.57%
Average Rank-Biased Overlap: 0.342
Mean Squared Error for Linear Regression: 204.71481547322094

Dev accuracy:
% of winners predicted correctly: 20.0%
Average Rank-Biased Overlap: 0.36
Mean Squared Error for Linear Regression: 523.1768566474012

filling na values using method 5

Train accuracy:
% of winners predicted correctly: 28.57%
Average Rank-Biased Overlap: 0.347
Mean Squared Error for Linear Regression: 204.6933151380242

Dev accuracy:
% of winners predicted correctly: 20.0%
Average Rank-Biased Overlap: 0.345
Mean Squared Error for Linear Regression: 523.199263213497

*****roy*****

filling na values using method 4

Train accuracy:
% of winners predicted correctly: 65.96%
Average Rank-Biased Overlap: 0.541
Mean Squared Error for Linear Regression: 1050.8713637655542

Dev accuracy:
% of winners predicted correctly: 40.0%
Average Rank-Biased Overlap: 0.586
Mean Squared Error for Linear Regression: 4517.666729629855

filling na values using method 2

Train accuracy:
% of winners predicted correctly: 65.96%
Average Rank-Biased Overlap: 0.543
Mean Squared Error for Linear Regression: 1029.9108156377913

Dev accuracy:
% of winners predicted correctly: 60.0%
Average Rank-Biased Overlap: 0.562
Mean Squared Error for Linear Regression: 4476.544595591514

filling na values using method 5

Train accuracy:
% of winners predicted correctly: 63.83%
Average Rank-Biased Overlap: 0.538
Mean Squared Error for Linear Regression: 1030.1871555038801

Dev accuracy:
% of winners predicted correctly: 60.0%
Average Rank-Biased Overlap: 0.562
Mean Squared Error for Linear Regression: 4464.292050653677

*****mip*****

filling na values using method 4

Train accuracy:
% of winners predicted correctly: 28.0%
Average Rank-Biased Overlap: 0.417
Mean Squared Error for Linear Regression: 208.58511792790853

Dev accuracy:
% of winners predicted correctly: 0.0%
Average Rank-Biased Overlap: 0.392
Mean Squared Error for Linear Regression: 533.810351093619

filling na values using method 2

Train accuracy:
% of winners predicted correctly: 28.0%
Average Rank-Biased Overlap: 0.416
Mean Squared Error for Linear Regression: 208.58510588546207

Dev accuracy:
% of winners predicted correctly: 0.0%
Average Rank-Biased Overlap: 0.392
Mean Squared Error for Linear Regression: 533.8111070547918

filling na values using method 5

Train accuracy:
% of winners predicted correctly: 28.0%
Average Rank-Biased Overlap: 0.418
Mean Squared Error for Linear Regression: 208.59917282173424

Dev accuracy:
% of winners predicted correctly: 0.0%
Average Rank-Biased Overlap: 0.384
Mean Squared Error for Linear Regression: 533.8946490819793

*****smoy*****

filling na values using method 4

Train accuracy:
% of winners predicted correctly: 51.85%
Average Rank-Biased Overlap: 0.458
Mean Squared Error for Linear Regression: 372.3787418148702

Dev accuracy:
% of winners predicted correctly: 60.0%
Average Rank-Biased Overlap: 0.552
Mean Squared Error for Linear Regression: 1070.6729631340697

filling na values using method 2

Train accuracy:
% of winners predicted correctly: 51.85%
Average Rank-Biased Overlap: 0.458
Mean Squared Error for Linear Regression: 372.37868068854

Dev accuracy:
% of winners predicted correctly: 60.0%
Average Rank-Biased Overlap: 0.552
Mean Squared Error for Linear Regression: 1070.6727699891849

filling na values using method 5

Train accuracy:
% of winners predicted correctly: 51.85%
Average Rank-Biased Overlap: 0.455
Mean Squared Error for Linear Regression: 372.34742398764087

Dev accuracy:
% of winners predicted correctly: 60.0%
Average Rank-Biased Overlap: 0.551
Mean Squared Error for Linear Regression: 1070.6539209511527
if GET_TEST_RESULTS:
for award in ['mvp', 'dpoy', 'roy', 'mip', 'smoy']:
print(f'\n\n*****{award}*****')
x_train_pnames, x_train, y_train, x_dev_pnames, x_dev, y_dev, x_test_pnames, x_test, y_test = get_data(award)
for fill_na_method in [4]:
print(f'\nfilling na values using method {fill_na_method}')
x_train_filled, x_dev_filled, x_test_filled = fill_missing_stats(x_train, x_train_pnames, x_dev,
x_dev_pnames, x_test, x_test_pnames,
method=fill_na_method)
reg = LinearRegression().fit(x_train_filled, y_train)
test_pred = reg.predict(x_test_filled)
print('\nTest accuracy:')
print_accuracy(test_pred, y_test, x_test_pnames)
Linear regression test results (na values filled using method 4; winners correct = % of winners predicted correctly, avg RBO = Average Rank-Biased Overlap, MSE = Mean Squared Error for Linear Regression):

*****mvp***** - Test: 20.0% winners correct, avg RBO 0.726, MSE 2892.411088615582
*****dpoy***** - Test: 20.0% winners correct, avg RBO 0.251, MSE 576.6248621263551
*****roy***** - Test: 20.0% winners correct, avg RBO 0.63, MSE 2818.265612266596
*****mip***** - Test: 0.0% winners correct, avg RBO 0.365, MSE 626.4988654704664
*****smoy***** - Test: 40.0% winners correct, avg RBO 0.462, MSE 731.9334129988895
import copy
import numpy as np
import pandas as pd
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam, SGD
from tensorflow.keras.regularizers import l1, l2
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import MinMaxScaler
import rbo
In this notebook, I use a multi-layer perceptron to make predictions for each award. It contains the code I used to build the neural network model and generate predictions for each of the five awards.
This code should be run after you have finished running the 'preprocessing' notebook.
Here I define the helper functions I wrote to load my data, scale it, fill in missing/NA values, and calculate and print my accuracy metrics.
GET_TEST_RESULTS = True
# retrieves train, dev, and test data for the specified award
def get_data(award):
x_train_pnames = pd.read_csv(f'data/train_x_{award}.csv', index_col=0)
y_train = pd.read_csv(f'data/train_y_{award}.csv', index_col=0)
x_dev_pnames = pd.read_csv(f'data/dev_x_{award}.csv', index_col=0)
y_dev = pd.read_csv(f'data/dev_y_{award}.csv', index_col=0)
x_test_pnames = pd.read_csv(f'data/test_x_{award}.csv', index_col=0)
y_test = pd.read_csv(f'data/test_y_{award}.csv', index_col=0)
x_train = x_train_pnames.drop(columns=['player', 'season'])
x_dev = x_dev_pnames.drop(columns=['player', 'season'])
x_test = x_test_pnames.drop(columns=['player', 'season'])
return x_train_pnames, x_train, y_train, x_dev_pnames, x_dev, y_dev, x_test_pnames, x_test, y_test
There are some missing statistics in my dataset. These occur mostly for older players, as there are some statistics (such as steals and blocks) that weren't kept track of when the NBA first started. I will deal with these missing statistics in three ways:
Method 2 - Fill in zeros for all of the missing stats, and use all the data
Method 5 - Fill in mean values for all of the missing stats, and use all of the data
Method 4 - Use linear regressions to predict each of the missing stats, and use all of the data
Some other ways I may explore in the future for dealing with missing statistics are (a short sketch of both follows this list):
Remove all rows with missing stats, and use only the complete player seasons (this should end up keeping all player seasons after 1979)
Drop all the columns with missing stats, and use all of the data
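For reference, here is a minimal sketch of what those two alternatives might look like with pandas. The toy DataFrame and its values are made up purely for illustration; this code is not part of the current pipeline.
import numpy as np
import pandas as pd
# toy frame with the same kind of gaps as the real data (e.g. steals and blocks missing for older seasons)
toy = pd.DataFrame({'season': [1965, 1995], 'pts': [1800, 1600], 'stl': [np.nan, 120], 'blk': [np.nan, 40]})
# option 1: drop every player season that has any missing stat (keeps only fully tracked seasons)
rows_dropped = toy.dropna(axis=0)
# option 2: drop every column that has any missing value, keeping all player seasons
cols_dropped = toy.dropna(axis=1)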
# fill in the missing statistics (that were not kept track of for older players)
# using one of the three methods mentioned above
def fill_missing_stats(x_train, x_train_pnames, x_dev, x_dev_pnames, x_test, x_test_pnames, method=4):
if method == 2:
x_train_filled = x_train.fillna(0)
x_dev_filled = x_dev.fillna(0)
x_test_filled = x_test.fillna(0)
elif method == 4:
x_train_filled = copy.copy(x_train_pnames)
x_dev_filled = copy.copy(x_dev_pnames)
x_test_filled = copy.copy(x_test_pnames)
# three_p and three_pa - if season < 1980, set to NA
x_train_filled['three_p'] = np.where(x_train_filled.season < 1980, float('NaN'), x_train_filled.three_p)
x_train_filled['three_pa'] = np.where(x_train_filled.season < 1980, float('NaN'), x_train_filled.three_pa)
x_train_filled = x_train_filled.drop(columns=['player', 'season'])
x_dev_filled = x_dev_filled.drop(columns=['player', 'season'])
x_test_filled = x_test_filled.drop(columns=['player', 'season'])
# predict all missing values for these stats using lin reg
# train data: player seasons where stat != NaN
# test data: player seasons where stat == NaN
for stat in ['three_p', 'three_pa', 'gs', 'orb', 'drb', 'stl', 'blk', 'tov']:
# check if this stat has missing values
if x_train_filled[stat].isnull().values.any():
train_stat = x_train_filled[x_train_filled[stat].notna()]
# for all lin reg predictions, don't use any of the columns that have missing values
x_train_stat = train_stat.drop(columns=['three_p', 'three_pa', 'gs', 'orb', 'drb', 'stl', 'blk', 'tov', 'three_pct', 'gs_pct'])
y_train_stat = train_stat[[stat]]
test_stat = x_train_filled[~x_train_filled[stat].notna()]
x_test_stat = test_stat.drop(columns=['three_p', 'three_pa', 'gs', 'orb', 'drb', 'stl', 'blk', 'tov', 'three_pct', 'gs_pct'])
reg_stat = LinearRegression().fit(x_train_stat, y_train_stat)
pred_test_stat = reg_stat.predict(x_test_stat)
pred_stat_dict = {index: round(value[0], 3) for index, value in zip(x_train_filled[~x_train_filled[stat].notna()].index, pred_test_stat)}
x_train_filled[stat] = x_train_filled[stat].fillna(pred_stat_dict)
# fill in three_pct - three_p / three_pa, gs_pct - gs / g
x_train_filled['three_pct'] = x_train_filled['three_p'] / x_train_filled['three_pa']
x_train_filled['gs_pct'] = x_train_filled['gs'] / x_train_filled['g']
x_train_filled = x_train_filled.fillna(0)
x_dev_filled['three_pct'] = x_dev_filled['three_p'] / x_dev_filled['three_pa']
x_dev_filled = x_dev_filled.fillna(0)
x_test_filled['three_pct'] = x_test_filled['three_p'] / x_test_filled['three_pa']
x_test_filled = x_test_filled.fillna(0)
elif method == 5:
x_train_filled = x_train.fillna(x_train.mean())
x_dev_filled = x_dev.fillna(x_dev.mean())
x_test_filled = x_test.fillna(x_test.mean())
else:
raise ValueError('method of dealing with missing stats must be either 2, 4, or 5')
return x_train_filled, x_dev_filled, x_test_filled
The award points that players receive are very skewed, as most players received zero points for a given award. To reduce this skew, I scaled all of the award point values logarithmically.
I also applied a MinMaxScaler so that each feature lies in the range [0, 1].
# log scale all the y values, then min-max scale all the x and y values
def scale_vals(x_vals, y_vals):
# add 1 so that you can take log of players with 0 award points
y_log_vals = np.log(y_vals.award_pts_won + 1).values.reshape(-1, 1)
x_scaler = MinMaxScaler().fit(x_vals)
x_vals = x_scaler.transform(x_vals)
y_scaler = MinMaxScaler()
y_vals = y_scaler.fit_transform(y_log_vals)
return x_vals, y_vals, y_scaler
def unscale_vals(y_vals_scaled, y_scaler):
y_log_vals = y_scaler.inverse_transform(y_vals_scaled)
y_vals = np.expm1(y_log_vals)
return y_vals
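To make the round trip concrete, here is a small usage sketch of the two helpers above on a toy feature matrix and toy award point totals. All of the numbers are made up for illustration.
# toy example: two features and award point totals for three hypothetical player seasons
toy_x = pd.DataFrame({'pts': [2000, 1500, 800], 'ast': [400, 700, 150]})
toy_y = pd.DataFrame({'award_pts_won': [1000.0, 50.0, 0.0]})
toy_x_scaled, toy_y_scaled, toy_y_scaler = scale_vals(toy_x, toy_y)
# toy_x_scaled and toy_y_scaled are now in [0, 1]; the log transform compresses the gap
# between the runaway winner (1000 pts) and the players who received few or no votes
recovered = unscale_vals(toy_y_scaled, toy_y_scaler)
# recovered is approximately [[1000.], [50.], [0.]] again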
I measure my models' performance using three metrics: Mean Squared Error, the percentage of award winners predicted correctly, and Rank-Biased Overlap.
I chose rank-biased overlap because it is a similarity metric for rankings that weights agreement near the top of the lists more heavily than agreement near the bottom. It also handles items that appear in one list but not in the other.
For more information on rank-biased overlap, see this article: http://codalism.com/research/papers/wmz10_tois.pdf
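As a quick illustration, here is a small sketch using the same rbo package that the accuracy function below relies on. The player names and rankings are made up for the example.
# two made-up rankings of award vote getters, ordered from most to fewest votes
actual_ranking = ['Player A', 'Player B', 'Player C', 'Player D']
predicted_ranking = ['Player A', 'Player C', 'Player B', 'Player E']
# rank-biased overlap is 1.0 for identical rankings and 0.0 for rankings with nothing in common;
# agreement near the top of the lists counts for more than agreement near the bottom
similarity = rbo.RankingSimilarity(actual_ranking, predicted_ranking).rbo()
print(similarity)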
# print accuracy metrics by comparing the given lists (y_pred and y_actual)
# accuracy metrics that I am using are % of correct winners picked, rank biased overlap, and mean squared error
def print_accuracy(y_pred, y_actual, x_pnames, rbo_cutoff = None, verbose=0):
all_data = copy.copy(x_pnames)
all_data['award_pts_actual'] = y_actual['award_pts_won']
all_data['award_pts_pred'] = y_pred
num_correct = 0
num_yrs = 0
rbo_vals = []
for year in set(all_data.season):
# a. % of correct winners picked
data_in_yr = all_data[all_data['season'] == year]
pred_winner_row = data_in_yr['award_pts_pred'].argmax()
actual_winner_row = data_in_yr['award_pts_actual'].argmax()
pred_winner, pred_pts = data_in_yr.iloc[pred_winner_row]['player'], data_in_yr.iloc[pred_winner_row]['award_pts_pred']
actual_winner, actual_pts = data_in_yr.iloc[actual_winner_row]['player'], data_in_yr.iloc[actual_winner_row]['award_pts_actual']
if verbose > 0:
print(f'{year}')
print(f'Predicted Winner: {pred_winner} ({pred_pts} award pts)')
print(f'Actual Winner: {actual_winner} ({actual_pts} award pts)')
if pred_winner == actual_winner:
num_correct += 1
num_yrs += 1
# b. Rank-Biased Overlap
# calculate RBO:
# get rows in given year with players that received votes - sorted by num votes
vote_getters_df = data_in_yr[data_in_yr['award_pts_actual'] > 0]
num_vote_getters = len(vote_getters_df)
vote_getters_df = vote_getters_df.sort_values(by=['award_pts_actual'], ascending=False)
# get top-(num_vote_getters) rows from predictions
pred_vote_getters_df = data_in_yr.sort_values(by=['award_pts_pred'], ascending=False)
if rbo_cutoff is None:
pred_vote_getters_df = pred_vote_getters_df[:num_vote_getters]
else:
cutoff = min(rbo_cutoff, num_vote_getters)
vote_getters_df = vote_getters_df[:cutoff]
pred_vote_getters_df = pred_vote_getters_df[:cutoff]
vote_getters = vote_getters_df['player'].values
pred_vote_getters = pred_vote_getters_df['player'].values
# deal with edge case where two vote getters have the exact same name
# (deduplicate while preserving the ranking order; converting to a set would scramble the order)
vote_getters = list(dict.fromkeys(vote_getters))
pred_vote_getters = list(dict.fromkeys(pred_vote_getters))
#print(len(vote_getters))
#print(len(pred_vote_getters))
if verbose > 1:
print('Actual vote getters:')
print(vote_getters)
print(f'Predicted top-{num_vote_getters} vote getters:')
print(pred_vote_getters)
# compute RBO from these two lists
rbo_num = rbo.RankingSimilarity(vote_getters, pred_vote_getters).rbo()
rbo_vals.append(rbo_num)
if verbose > 0:
print(f'Rank Biased Overlap: {rbo_num}')
print(f'% of winners predicted correctly: {round(num_correct / num_yrs * 100, 2)}%')
print(f'Average Rank-Biased Overlap: {round(sum(rbo_vals) / len(rbo_vals), 3)}')
# c. MSE
mse = mean_squared_error(y_actual, y_pred)
print(f'Mean Squared Error: {mse}')
Here, I implement a separate neural network model for predicting each of the five awards. I tuned the hyperparameters manually, changing one at a time, to arrive at the best configuration I could find. One area for future improvement would be to use grid search or another systematic method to find better hyperparameters (a sketch of this is included at the end of this notebook).
for award in ['mvp', 'dpoy', 'roy', 'mip', 'smoy']:
print(f'\n\n*****{award}*****')
x_train_pnames, x_train, y_train, x_dev_pnames, x_dev, y_dev, x_test_pnames, x_test, y_test = get_data(award)
for fill_na_method in [4, 2, 5]:
print(f'\nfilling na values using method {fill_na_method}')
x_train_filled, x_dev_filled, x_test_filled = fill_missing_stats(x_train, x_train_pnames, x_dev, x_dev_pnames,
x_test, x_test_pnames, method=fill_na_method)
x_train_scaled, y_train_scaled, y_train_scaler = scale_vals(x_train_filled, y_train)
x_dev_scaled, y_dev_scaled, y_dev_scaler = scale_vals(x_dev_filled, y_dev)
# create model
model = Sequential()
#model.add(Dense(40, activation='relu', kernel_regularizer=l2(l=0.6)))
model.add(Dense(40, input_dim=x_train_filled.shape[1], activation='relu', kernel_regularizer=l2(l=0.6)))
#model.add(Dense(40, activation='relu'))
#model.add(Dense(40, input_dim=x_train_filled.shape[1], activation='relu'))
#model.add(Dense(20, activation='relu', kernel_regularizer=l2(l=0.1)))
#model.add(Dense(1, activation='sigmoid', kernel_regularizer=l2(l=0.1)))
model.add(Dense(40, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
#model.add(Dense(1))
#model.compile(loss='mse', optimizer=Adam(lr=0.01), metrics=['accuracy'])
model.compile(loss='mse', optimizer=SGD(lr=0.01, momentum=0.9, clipnorm=1.0), metrics=['accuracy'])
#model.compile(loss='mse', optimizer=SGD(lr=0.01, momentum=0.9), metrics=['accuracy'])
model.fit(x_train_scaled, y_train_scaled, epochs=50, batch_size=300, verbose=0)
train_pred_scaled = model.predict(x_train_scaled)
dev_pred_scaled = model.predict(x_dev_scaled)
train_pred = unscale_vals(train_pred_scaled, y_train_scaler)
dev_pred = unscale_vals(dev_pred_scaled, y_dev_scaler)
print('\nTrain accuracy:')
print_accuracy(train_pred, y_train, x_train_pnames)
print('\nDev accuracy:')
print_accuracy(dev_pred, y_dev, x_dev_pnames, verbose=0)
Neural network results by award and NA fill method (winners correct = % of winners predicted correctly, avg RBO = Average Rank-Biased Overlap, MSE = Mean Squared Error):

*****mvp*****
filling na values using method 4 - Train: 40.0% winners correct, avg RBO 0.608, MSE 3455.0053425848796 | Dev: 20.0% winners correct, avg RBO 0.448, MSE 5248.896052518296
filling na values using method 2 - Train: 38.18% winners correct, avg RBO 0.59, MSE 3455.0033176317493 | Dev: 20.0% winners correct, avg RBO 0.438, MSE 5248.89100670029
filling na values using method 5 - Train: 38.18% winners correct, avg RBO 0.584, MSE 3454.952450350006 | Dev: 20.0% winners correct, avg RBO 0.439, MSE 5248.842520457459

*****dpoy*****
filling na values using method 4 - Train: 10.71% winners correct, avg RBO 0.308, MSE 227.08927280303638 | Dev: 20.0% winners correct, avg RBO 0.23, MSE 582.2330934135449
filling na values using method 2 - Train: 14.29% winners correct, avg RBO 0.302, MSE 227.0916748798138 | Dev: 20.0% winners correct, avg RBO 0.221, MSE 582.2394186062369
filling na values using method 5 - Train: 14.29% winners correct, avg RBO 0.298, MSE 227.08884748022555 | Dev: 20.0% winners correct, avg RBO 0.228, MSE 582.2319740598647

*****roy*****
filling na values using method 4 - Train: 53.19% winners correct, avg RBO 0.66, MSE 1366.0977613702298 | Dev: 80.0% winners correct, avg RBO 0.511, MSE 6098.135192549801
filling na values using method 2 - Train: 55.32% winners correct, avg RBO 0.653, MSE 1366.214568667455 | Dev: 80.0% winners correct, avg RBO 0.555, MSE 6098.545499102555
filling na values using method 5 - Train: 55.32% winners correct, avg RBO 0.662, MSE 1366.2546153386809 | Dev: 80.0% winners correct, avg RBO 0.555, MSE 6098.6887173563755

*****mip*****
filling na values using method 4 - Train: 0.0% winners correct, avg RBO 0.207, MSE 222.23047162270657 | Dev: 0.0% winners correct, avg RBO 0.29, MSE 568.1234863951729
filling na values using method 2 - Train: 0.0% winners correct, avg RBO 0.235, MSE 222.2224549828748 | Dev: 0.0% winners correct, avg RBO 0.352, MSE 568.1035358804248
filling na values using method 5 - Train: 0.0% winners correct, avg RBO 0.229, MSE 222.21962040530929 | Dev: 0.0% winners correct, avg RBO 0.343, MSE 568.0965813814662

*****smoy*****
filling na values using method 4 - Train: 33.33% winners correct, avg RBO 0.494, MSE 436.618432056016 | Dev: 60.0% winners correct, avg RBO 0.37, MSE 1240.921690824693
filling na values using method 2 - Train: 37.04% winners correct, avg RBO 0.514, MSE 436.61896126415667 | Dev: 80.0% winners correct, avg RBO 0.343, MSE 1240.9229953253944
filling na values using method 5 - Train: 37.04% winners correct, avg RBO 0.495, MSE 436.61534384748603 | Dev: 60.0% winners correct, avg RBO 0.356, MSE 1240.9141818119017
if GET_TEST_RESULTS:
for award in ['mvp', 'dpoy', 'roy', 'mip', 'smoy']:
print(f'\n\n*****{award}*****')
x_train_pnames, x_train, y_train, x_dev_pnames, x_dev, y_dev, x_test_pnames, x_test, y_test = get_data(award)
for fill_na_method in [4]:
print(f'\nfilling na values using method {fill_na_method}')
x_train_filled, x_dev_filled, x_test_filled = fill_missing_stats(x_train, x_train_pnames, x_dev, x_dev_pnames,
x_test, x_test_pnames, method=fill_na_method)
x_train_scaled, y_train_scaled, y_train_scaler = scale_vals(x_train_filled, y_train)
x_dev_scaled, y_dev_scaled, y_dev_scaler = scale_vals(x_dev_filled, y_dev)
x_test_scaled, y_test_scaled, y_test_scaler = scale_vals(x_test_filled, y_test)
# create model
model = Sequential()
#model.add(Dense(40, activation='relu', kernel_regularizer=l2(l=0.6)))
model.add(Dense(40, input_dim=x_train_filled.shape[1], activation='relu', kernel_regularizer=l2(l=0.6)))
#model.add(Dense(40, activation='relu'))
#model.add(Dense(40, input_dim=x_train_filled.shape[1], activation='relu'))
#model.add(Dense(20, activation='relu', kernel_regularizer=l2(l=0.1)))
#model.add(Dense(1, activation='sigmoid', kernel_regularizer=l2(l=0.1)))
model.add(Dense(40, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
#model.add(Dense(1))
#model.compile(loss='mse', optimizer=Adam(lr=0.01), metrics=['accuracy'])
model.compile(loss='mse', optimizer=SGD(lr=0.01, momentum=0.9, clipnorm=1.0), metrics=['accuracy'])
#model.compile(loss='mse', optimizer=SGD(lr=0.01, momentum=0.9), metrics=['accuracy'])
model.fit(x_train_scaled, y_train_scaled, epochs=50, batch_size=300, verbose=0)
train_pred_scaled = model.predict(x_train_scaled)
test_pred_scaled = model.predict(x_test_scaled)
train_pred = unscale_vals(train_pred_scaled, y_train_scaler)
test_pred = unscale_vals(test_pred_scaled, y_test_scaler)
print('\nTest accuracy:')
print_accuracy(test_pred, y_test, x_test_pnames, verbose=0)
Neural network test results (na values filled using method 4; winners correct = % of winners predicted correctly, avg RBO = Average Rank-Biased Overlap, MSE = Mean Squared Error):

*****mvp***** - Test: 20.0% winners correct, avg RBO 0.367, MSE 3766.1329367187127
*****dpoy***** - Test: 0.0% winners correct, avg RBO 0.291, MSE 616.9849134132764
*****roy***** - Test: 60.0% winners correct, avg RBO 0.755, MSE 3684.5887546505296
*****mip***** - Test: 0.0% winners correct, avg RBO 0.407, MSE 660.1662631386804
*****smoy***** - Test: 40.0% winners correct, avg RBO 0.518, MSE 823.7149630759205
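As mentioned above, one avenue for future improvement is replacing manual tuning with a systematic hyperparameter search. Below is a minimal sketch of what a small grid search could look like, reusing the helper functions from this notebook. The candidate layer widths and learning rates, and the choice to select the configuration with the lowest dev MSE, are illustrative assumptions rather than results I have run.
# grid search sketch (illustrative only): try a few hidden layer widths and learning rates
# for one award, and keep the configuration with the lowest dev set MSE
award = 'mvp'
x_train_pnames, x_train, y_train, x_dev_pnames, x_dev, y_dev, x_test_pnames, x_test, y_test = get_data(award)
x_train_filled, x_dev_filled, x_test_filled = fill_missing_stats(x_train, x_train_pnames, x_dev, x_dev_pnames,
                                                                 x_test, x_test_pnames, method=4)
x_train_scaled, y_train_scaled, y_train_scaler = scale_vals(x_train_filled, y_train)
x_dev_scaled, y_dev_scaled, y_dev_scaler = scale_vals(x_dev_filled, y_dev)
best_config, best_dev_mse = None, float('inf')
for hidden_units in [20, 40, 80]:       # candidate layer widths (assumed values)
    for lr_value in [0.001, 0.01]:      # candidate learning rates (assumed values)
        model = Sequential()
        model.add(Dense(hidden_units, input_dim=x_train_filled.shape[1], activation='relu', kernel_regularizer=l2(0.6)))
        model.add(Dense(hidden_units, activation='relu'))
        model.add(Dense(1, activation='sigmoid'))
        model.compile(loss='mse', optimizer=SGD(learning_rate=lr_value, momentum=0.9, clipnorm=1.0))
        model.fit(x_train_scaled, y_train_scaled, epochs=50, batch_size=300, verbose=0)
        dev_pred = unscale_vals(model.predict(x_dev_scaled), y_dev_scaler)
        dev_mse = mean_squared_error(y_dev, dev_pred)
        if dev_mse < best_dev_mse:
            best_config, best_dev_mse = (hidden_units, lr_value), dev_mse
print(f'best (hidden units, learning rate): {best_config}, dev MSE: {best_dev_mse}')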