Team Gaming Rankings
Ranking My Friends on Generals
During the pandemic, I started playing a decade-old game called Command & Conquer Generals: Zero Hour with my brother and a bunch of friends.
Out of curiosity, we started keeping track of the stats from every game, showing wins and losses for each person.
Since this is a team game, though, pulling out an individual’s overall skill and measuring how much better one person is than another is a little trickier.
For this project, I wanted to ingest the stats, create a model for predicting each person’s relative strength, and end up with a score that could be used to create more even teams for future games.
The scores are stored in a Google Sheet located here.
The output Tableau dashboard can be found here, or embedded at the end of this page.
#import all the libraries we'll use
import math
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from numpy import mean, std
import matplotlib.pyplot as plt
from itertools import combinations
import pickle
import os.path
from googleapiclient.discovery import build
from google_auth_oauthlib.flow import InstalledAppFlow
from google.auth.transport.requests import Request
RANDOM_STATE = 42
MAX_NUMBER_OF_GAMES = 25
The games are stored in a Google Sheet, so the following functions ingest the most up-to-date data.
def gsheet_api_check(SCOPES):
    creds = None
    if os.path.exists('token.pickle'):
        with open('token.pickle', 'rb') as token:
            creds = pickle.load(token)
    if not creds or not creds.valid:
        if creds and creds.expired and creds.refresh_token:
            creds.refresh(Request())
        else:
            flow = InstalledAppFlow.from_client_secrets_file(
                'credentials.json', SCOPES)
            creds = flow.run_local_server(port=0)
        with open('token.pickle', 'wb') as token:
            pickle.dump(creds, token)
    return creds
def pull_sheet_data(SCOPES, SPREADSHEET_ID, DATA_TO_PULL):
    creds = gsheet_api_check(SCOPES)
    service = build('sheets', 'v4', credentials=creds)
    sheet = service.spreadsheets()
    result = sheet.values().get(
        spreadsheetId=SPREADSHEET_ID,
        range=DATA_TO_PULL).execute()
    values = result.get('values', [])
    if not values:
        print('No data found.')
        return []
    print("COMPLETE: Data copied")
    return values
def push_sheet_data(SCOPES, SPREADSHEET_ID, RANGE, DATA_TO_PUSH):
    creds = gsheet_api_check(SCOPES)
    service = build('sheets', 'v4', credentials=creds)
    sheet = service.spreadsheets()
    body = {'values': DATA_TO_PUSH}
    result = sheet.values().update(
        spreadsheetId=SPREADSHEET_ID, range=RANGE,
        valueInputOption='USER_ENTERED', body=body).execute()
    data = result.get('updatedCells')
    print('{0} cells updated.'.format(data))
    return data
Import Data from Google Sheets
# If modifying these scopes, delete the file token.pickle.
#SCOPES = ['https://www.googleapis.com/auth/spreadsheets.readonly']
SCOPES = ['https://www.googleapis.com/auth/spreadsheets']
# The ID and range of a sample spreadsheet.
SPREADSHEET_ID = '1ks6mqMbTgVFkQE-rZDByKVnGH4WMRMOdLSdabeMfZaA'
#Pulls data from the entire spreadsheet tab.
#DATA_TO_PULL = 'Games'
#or
#Pulls data only from the specified range of cells.
DATA_TO_PULL = 'Games!A1:Q4000'
data = pull_sheet_data(SCOPES,SPREADSHEET_ID,DATA_TO_PULL)
games = pd.DataFrame(data[1:], columns=data[0])
games = games.set_index('Index',drop=True)
numeric_columns = ['Team', 'Win', 'Game', 'Units Created',
                   'Units Lost', 'Units Destroyed', 'Buildings Constructed',
                   'Buildings Lost', 'Buildings Destroyed', 'Supplies Collected', 'Rank',
                   'Inverse Rank', 'Normalized Rank']
for col in numeric_columns:
    games[col] = pd.to_numeric(games[col])
games['Date'] = pd.to_datetime(games['Date'])
games.head()
COMPLETE: Data copied
Date | Name | Faction | Team | Win | Game | Units Created | Units Lost | Units Destroyed | Buildings Constructed | Buildings Lost | Buildings Destroyed | Supplies Collected | Rank | Inverse Rank | Normalized Rank | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Index | ||||||||||||||||
1 | 2021-02-03 | Matt | USA | 1.0 | 1.0 | 1 | 216.0 | 195.0 | 140.0 | 47.0 | 6.0 | 14.0 | 360860.0 | 1.0 | 7.0 | 0.125 |
2 | 2021-02-03 | Skippy | USA | 2.0 | NaN | 1 | 129.0 | 132.0 | 115.0 | 63.0 | 12.0 | 9.0 | 260440.0 | 2.0 | 6.0 | 0.250 |
3 | 2021-02-03 | Neo | China | 1.0 | 1.0 | 1 | 175.0 | 83.0 | 83.0 | 53.0 | 12.0 | 8.0 | 233450.0 | 3.0 | 5.0 | 0.375 |
4 | 2021-02-03 | TVH | USA | 1.0 | 1.0 | 1 | 98.0 | 93.0 | 93.0 | 47.0 | 11.0 | 8.0 | 211565.0 | 4.0 | 4.0 | 0.500 |
5 | 2021-02-03 | Pancake | China | 2.0 | NaN | 1 | 122.0 | 64.0 | 137.0 | 40.0 | 11.0 | 6.0 | 192521.0 | 5.0 | 3.0 | 0.625 |
Each row has the player’s name (which must be consistent across the dataset), the faction they played (USA, GLA, or China), their team number (which, by convention, is 1 for the winning team), whether they won (1 if so, blank if not), stats from the end-game screen, and then three ways of calculating rank.
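The three rank columns appear to be simple transforms of one another. Here’s a minimal sketch of the relationship as I infer it from the first few rows (the formulas are my assumption, not something stored in the sheet):
#Sketch of how the three rank columns appear to relate (inferred from the sample rows)
n_players = 8                        #e.g., the first game recorded eight players
rank = 1                             #1 is the best finisher in that game
inverse_rank = n_players - rank      #7: flipped so that higher is better
normalized_rank = rank / n_players   #0.125: comparable across games of different sizes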
For basic familiarization, here are some summary statistics from the columns:
games.describe()
Team | Win | Game | Units Created | Units Lost | Units Destroyed | Buildings Constructed | Buildings Lost | Buildings Destroyed | Supplies Collected | Rank | Inverse Rank | Normalized Rank | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 1962.000000 | 961.0 | 2017.000000 | 1992.000000 | 1992.000000 | 1992.000000 | 1992.000000 | 1992.000000 | 1992.000000 | 1.992000e+03 | 1998.000000 | 1998.000000 | 1998.000000 |
mean | 1.544852 | 1.0 | 167.371344 | 130.930723 | 124.909137 | 116.313755 | 32.239960 | 13.498996 | 13.303715 | 1.515431e+05 | 3.644645 | 4.355355 | 0.589339 |
std | 0.593417 | 0.0 | 107.399041 | 109.309164 | 136.530457 | 132.316169 | 22.944272 | 16.695097 | 17.853459 | 1.763671e+05 | 2.037910 | 2.037910 | 0.286096 |
min | 1.000000 | 1.0 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000e+00 | 1.000000 | 0.000000 | 0.125000 |
25% | 1.000000 | 1.0 | 75.000000 | 58.000000 | 43.000000 | 35.000000 | 17.000000 | 2.000000 | 2.000000 | 5.753900e+04 | 2.000000 | 3.000000 | 0.333333 |
50% | 2.000000 | 1.0 | 154.000000 | 96.000000 | 85.000000 | 78.000000 | 27.000000 | 9.000000 | 7.000000 | 9.960250e+04 | 3.000000 | 5.000000 | 0.600000 |
75% | 2.000000 | 1.0 | 265.000000 | 169.000000 | 163.000000 | 149.000000 | 39.000000 | 19.000000 | 19.000000 | 1.808620e+05 | 5.000000 | 6.000000 | 0.833333 |
max | 5.000000 | 1.0 | 357.000000 | 1038.000000 | 2108.000000 | 1953.000000 | 221.000000 | 159.000000 | 226.000000 | 3.142320e+06 | 8.000000 | 7.000000 | 1.000000 |
Basic Cleaning
Some rows of data need cleaning. The main approach is to drop the problem games entirely, since we have enough data regardless.
#Remove any row that doesn't have an index or a game number
games = games[(~games.index.isna()) & (~games.Game.isna())].copy()
#Cast the Game column to integers instead of floats since we'll use it to build ranges
games.loc[:, 'Game'] = games.Game.astype(np.int32)
#Remove any game where team data isn't present and keep only the columns up to Normalized Rank
games = games.loc[~games.Team.isna(), games.columns[0:16]].copy()
#Wins are recorded with a 1 and losses are left blank (NA)
#Fill losses with 0, otherwise they would throw off our average win calculations
games['Win'] = games['Win'].fillna(0)
#For every person, calculate their win ratio and average normalized rank
for name in games.Name.unique():
    games.loc[games.Name == name, 'Win Ratio'] = games.loc[games.Name == name, 'Win'].mean()
    games.loc[games.Name == name, 'Avg Rank'] = games.loc[games.Name == name, 'Normalized Rank'].mean()
#Team 1 designates the winning team elsewhere (to see which people win and lose together),
#but we need to shuffle the team labels here, or the predictive model could use them to cheat
for i in range(int(games.Game.max() + 1)):
    #randomize which team gets label 0 and which gets label 1 for each game
    team_1 = np.random.choice([0, 1])
    team_2 = 1 - team_1
    games.loc[(games.Game == i) & (games.Team == 1), 'Team'] = team_1
    games.loc[(games.Game == i) & (games.Team == 2), 'Team'] = team_2
games.head()
Date | Name | Faction | Team | Win | Game | Units Created | Units Lost | Units Destroyed | Buildings Constructed | Buildings Lost | Buildings Destroyed | Supplies Collected | Rank | Inverse Rank | Normalized Rank | Win Ratio | Avg Rank | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Index | ||||||||||||||||||
1 | 2021-02-03 | Matt | USA | 0.0 | 1.0 | 1 | 216.0 | 195.0 | 140.0 | 47.0 | 6.0 | 14.0 | 360860.0 | 1.0 | 7.0 | 0.125 | 0.606383 | 0.470567 |
2 | 2021-02-03 | Skippy | USA | 1.0 | 0.0 | 1 | 129.0 | 132.0 | 115.0 | 63.0 | 12.0 | 9.0 | 260440.0 | 2.0 | 6.0 | 0.250 | 0.456000 | 0.725281 |
3 | 2021-02-03 | Neo | China | 0.0 | 1.0 | 1 | 175.0 | 83.0 | 83.0 | 53.0 | 12.0 | 8.0 | 233450.0 | 3.0 | 5.0 | 0.375 | 0.521739 | 0.439990 |
4 | 2021-02-03 | TVH | USA | 0.0 | 1.0 | 1 | 98.0 | 93.0 | 93.0 | 47.0 | 11.0 | 8.0 | 211565.0 | 4.0 | 4.0 | 0.500 | 0.439024 | 0.708537 |
5 | 2021-02-03 | Pancake | China | 1.0 | 0.0 | 1 | 122.0 | 64.0 | 137.0 | 40.0 | 11.0 | 6.0 | 192521.0 | 5.0 | 3.0 | 0.625 | 0.450331 | 0.618046 |
Create New Training Data
We’re now going to create synthetic training data. For each game, we’ll take each player’s stats over their previous N games (N = MAX_NUMBER_OF_GAMES, set to 25 above), average them, and use that as the player’s nominal stats for that game. Then we pretend the two teams played each other and have the logistic regression model predict which team wins.
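Here’s the trailing-window idea in miniature, using a made-up win history rather than the real data:
#a player's trailing form over their last N games
import pandas as pd

wins = pd.Series([1, 0, 1, 1, 0, 1])          #hypothetical win history, oldest game first
N = 3
print(wins.tail(N).mean())                    #form over just the last 3 games: ~0.667
print(wins.rolling(N, min_periods=1).mean())  #the same trailing stat evaluated at every game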
#Create a new column that contains the winning team
df = pd.DataFrame(games.loc[games.Win == 1, :].groupby('Game').mean().Team.astype(np.int32))
df.columns = ['Winning_Team']
#We'll use this to segment out which columns to use for predicting the winning team
prediction_columns = ['Units Created',
                      'Units Lost',
                      'Units Destroyed',
                      'Buildings Constructed',
                      'Buildings Lost',
                      'Buildings Destroyed',
                      'Supplies Collected',
                      'Avg Rank',
                      'Win Ratio']
games_copy = games.copy()
df.head()
Winning_Team | |
---|---|
Game | |
1 | 0 |
2 | 1 |
3 | 1 |
4 | 1 |
5 | 1 |
#for each player and each game, compute their average win and rank stats over the previous N games
for i in range(int(games_copy.Game.max() + 1)):
    for name in games_copy.loc[games_copy.Game == i, 'Name'].unique():
        name_bool = games_copy.Name == name
        game_bool = games_copy.Game == i
        games_copy.loc[(name_bool) & (game_bool), 'Win Ratio'] = games_copy.loc[(name_bool) & (games_copy.Game <= i), 'Win'].tail(MAX_NUMBER_OF_GAMES).mean()
        games_copy.loc[(name_bool) & (game_bool), 'Avg Rank'] = games_copy.loc[(name_bool) & (games_copy.Game <= i), 'Normalized Rank'].tail(MAX_NUMBER_OF_GAMES).mean()
games_copy.head()
Date | Name | Faction | Team | Win | Game | Units Created | Units Lost | Units Destroyed | Buildings Constructed | Buildings Lost | Buildings Destroyed | Supplies Collected | Rank | Inverse Rank | Normalized Rank | Win Ratio | Avg Rank | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Index | ||||||||||||||||||
1 | 2021-02-03 | Matt | USA | 0.0 | 1.0 | 1 | 216.0 | 195.0 | 140.0 | 47.0 | 6.0 | 14.0 | 360860.0 | 1.0 | 7.0 | 0.125 | 1.0 | 0.125 |
2 | 2021-02-03 | Skippy | USA | 1.0 | 0.0 | 1 | 129.0 | 132.0 | 115.0 | 63.0 | 12.0 | 9.0 | 260440.0 | 2.0 | 6.0 | 0.250 | 0.0 | 0.250 |
3 | 2021-02-03 | Neo | China | 0.0 | 1.0 | 1 | 175.0 | 83.0 | 83.0 | 53.0 | 12.0 | 8.0 | 233450.0 | 3.0 | 5.0 | 0.375 | 1.0 | 0.375 |
4 | 2021-02-03 | TVH | USA | 0.0 | 1.0 | 1 | 98.0 | 93.0 | 93.0 | 47.0 | 11.0 | 8.0 | 211565.0 | 4.0 | 4.0 | 0.500 | 1.0 | 0.500 |
5 | 2021-02-03 | Pancake | China | 1.0 | 0.0 | 1 | 122.0 | 64.0 | 137.0 | 40.0 | 11.0 | 6.0 | 192521.0 | 5.0 | 3.0 | 0.625 | 0.0 | 0.625 |
Reduce each game to a single row of stats
- To predict each game, we sum the stats for each team, then take the difference (see the toy example after this list).
- A mostly negative difference indicates that team 0 had the stronger stats; a mostly positive one favors team 1.
- Most of the stats should be summed; the exceptions are the game, team, and win columns, which are averaged.
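Here’s a toy version of the difference feature for a single stat, with made-up numbers and hypothetical two-player teams:
#toy version of the team-difference feature for one stat
import pandas as pd

team_1 = pd.Series({'Units Destroyed': 120 + 80})   #team 1's summed stat: 200
team_0 = pd.Series({'Units Destroyed': 150 + 140})  #team 0's summed stat: 290
print(team_1 - team_0)                              #-90: the negative sign leans toward team 0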
#aggregate each team's per-game stats, then take the team 1 minus team 0 difference
agg_dict = {'Team': 'mean',
            'Win': 'mean',
            'Game': 'mean',
            'Units Created': 'sum',
            'Units Lost': 'sum',
            'Units Destroyed': 'sum',
            'Buildings Constructed': 'sum',
            'Buildings Lost': 'sum',
            'Buildings Destroyed': 'sum',
            'Supplies Collected': 'sum',
            'Rank': 'sum',
            'Inverse Rank': 'sum',    #not used
            'Normalized Rank': 'sum', #not used
            'Win Ratio': 'sum',
            'Avg Rank': 'sum'}
values = (games_copy.loc[games_copy.Team == 1, :].iloc[:, 3:].groupby('Game').agg(agg_dict)
          - games_copy.loc[games_copy.Team == 0, :].iloc[:, 3:].groupby('Game').agg(agg_dict))
#create the difference columns
diff_cols = []
for col in prediction_columns:
    column_name = col + '_diff'
    diff_cols += [column_name]
    #these raw differences aren't strictly needed anymore, but the winning team
    #label in df is still needed as the training target
    df.loc[:, column_name] = values.loc[:, col]
#create nominal game stats based on median stats for each player
predicted_games = []
#make stats for each game
for game in games_copy.Game.unique():
    team_values = []
    #make stats for each team
    for team in range(2):
        names = []
        #make stats for each player on this team
        for name in games_copy.loc[(games_copy.Game == game) & (games_copy.Team == team), 'Name'].values:
            name_stats = games_copy.loc[(games_copy.Name == name) & (games_copy.Game <= game)].tail(MAX_NUMBER_OF_GAMES).iloc[:, 6:].median()
            name_stats['Win Ratio'] = games_copy.loc[(games_copy.Name == name) & (games_copy.Game <= game), 'Win'].tail(MAX_NUMBER_OF_GAMES).mean()
            name_stats['Avg Rank'] = games_copy.loc[(games_copy.Name == name) & (games_copy.Game <= game), 'Normalized Rank'].tail(MAX_NUMBER_OF_GAMES).mean()
            names += [name_stats]
            names[-1].loc['Win_avg'] = games_copy.loc[(games_copy.Name == name) &
                                                      (~games_copy.Team.isna()) &
                                                      (games_copy.Game <= game)].Win.fillna(0).tail(MAX_NUMBER_OF_GAMES).mean()
        #combine all the medians and sum them together
        #Summing works better than an average or median: if the teams have uneven numbers
        #of players, the weight falls on the side with more players
        team_values += [pd.concat(names, axis=1).T.sum()]
    predicted_games += [team_values[1] - team_values[0]]
X_generated = pd.concat(predicted_games, axis=1).T[prediction_columns]
X_generated.columns = diff_cols
X_generated.head()
Units Created_diff | Units Lost_diff | Units Destroyed_diff | Buildings Constructed_diff | Buildings Lost_diff | Buildings Destroyed_diff | Supplies Collected_diff | Avg Rank_diff | Win Ratio_diff | |
---|---|---|---|---|---|---|---|---|---|
0 | -90.0 | -79.0 | 75.0 | 4.0 | 0.0 | -1.0 | -99050.0 | 0.500000 | -4.000000 |
1 | 199.0 | 251.5 | 65.0 | 57.5 | -13.0 | 45.0 | 378716.0 | -0.229167 | 2.500000 |
2 | 19.0 | 50.0 | -96.5 | 64.5 | -31.5 | 25.5 | 81672.5 | -0.638889 | 2.166667 |
3 | 208.0 | 202.0 | 15.0 | 54.0 | 58.5 | -37.0 | 173173.5 | 0.173611 | 1.083333 |
4 | 95.5 | 162.5 | 56.5 | 44.5 | 22.5 | 12.0 | 350781.5 | -1.067361 | 1.633333 |
Create the Training Data
y_cols= ['Winning_Team']
X = X_generated
y = np.ravel(df[y_cols])
y[0:5]
array([0, 1, 1, 1, 1])
Gridsearch for parameters
The following is how I decided which solver algorithm and inverse regularization strength C to use for the logistic regression part of the model.
solvers = ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']
Cs = [1, 3, 10, 30, 100]
parameters = {'logisticregression__solver': solvers, 'logisticregression__C': Cs}
#standard test/train split should hold back 20-30% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
#cross validation
cv = KFold(n_splits=10, random_state=10, shuffle=True)
#create model
scaler = StandardScaler()
logreg = LogisticRegression(random_state=RANDOM_STATE)
#simple pipeline of normalizing all the stats then applying logistic regression
pipe = make_pipeline(scaler, logreg)
clf = GridSearchCV(pipe, parameters, cv=cv)
clf.fit(X, y)
#make a heatmap of the results
def make_heatmap(ax, gs):
    """Helper to make a heatmap of grid search scores."""
    results = pd.DataFrame.from_dict(gs.cv_results_)
    results['params_str'] = results.params.apply(str)
    scores_matrix = results.pivot(index='param_logisticregression__solver',
                                  columns='param_logisticregression__C',
                                  values='mean_test_score')
    im = ax.imshow(scores_matrix)
    ax.set_xticks(np.arange(len(Cs)))
    ax.set_xticklabels([x for x in Cs])
    ax.set_xlabel('C', fontsize=15)
    ax.set_yticks(np.arange(len(solvers)))
    ax.set_yticklabels([x for x in solvers])
    ax.set_ylabel('solver', fontsize=15)
    #Rotate the tick labels and set their alignment
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
             rotation_mode="anchor")
    fig.subplots_adjust(right=0.8)
    cbar_ax = fig.add_axes([0.85, 0.15, 0.05, 0.7])
    fig.colorbar(im, cax=cbar_ax)
    cbar_ax.set_ylabel('Mean Test Score', rotation=-90, va="bottom",
                       fontsize=15)
fig, axes = plt.subplots(ncols=1, sharey=True)
ax2 = axes
make_heatmap(ax2, clf)
ax2.set_title('GridSearch', fontsize=15)
plt.show()
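Alongside the heatmap, GridSearchCV also exposes the winning combination directly, which is a quicker check than reading the plot (the printed values here are illustrative):
#the best parameter combination and its mean cross-validated score
print(clf.best_params_)  #e.g., {'logisticregression__C': 3, 'logisticregression__solver': 'lbfgs'}
print(clf.best_score_)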
Prediction Test
Create a logistic regression model with test and training splits, and 10 cross-validation folds for determining accuracy.
Also, since the supplies column is so much larger than the rest of the variables, we normalize the data set with StandardScaler, which standardizes each column (subtract the mean, divide by the standard deviation) so every feature is on a comparable scale.
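As a minimal sketch of what StandardScaler does, here’s a toy supplies column (made-up values, not the project data):
#StandardScaler in miniature: each value becomes (x - mean) / std
import numpy as np
from sklearn.preprocessing import StandardScaler

supplies = np.array([[50000.], [100000.], [300000.]])  #toy supply totals
scaled = StandardScaler().fit_transform(supplies)
print(scaled.ravel())  #approximately [-0.93, -0.46, 1.39]: mean 0, unit variance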
#standard test/train split should hold back 20-30% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
#cross validation
cv = KFold(n_splits=10, random_state=10, shuffle=True)
#create model with the parameters chosen by the grid search
scaler = StandardScaler()
logreg = LogisticRegression(random_state=RANDOM_STATE, C=3, solver='lbfgs')
#simple pipeline of normalizing all the stats then applying logistic regression
pipe = make_pipeline(scaler, logreg)
pipe.fit(X_train, y_train)  #fit with scaling on the training data
pipe.fit(X, y)              #refit on the full data set for the final model
Pipeline(steps=[('standardscaler', StandardScaler()),
('logisticregression',
LogisticRegression(C=3, random_state=42))])
#score it
scores = cross_val_score(pipe, X, y, scoring = 'accuracy', cv=cv, n_jobs = -1)
print('10-fold cross validation accuracy: %.3f (%.3f stdev)' % (mean(scores), std(scores)))
plt.hist(scores)
plt.xlabel("Scores")
plt.show()
10-fold cross validation accuracy: 0.761 (0.075 stdev)
For this run, we got a 10-fold cross-validation accuracy of about 76%, which is typical for this process.
Relative importance of each feature
fig, ax = plt.subplots()
ax.bar(x = np.arange(len(pipe.steps[1][1].coef_[0])), height = -(pipe.steps[1][1].coef_[0]))
ax.set_xticks(np.arange(len(prediction_columns)))
ax.set_xticklabels(prediction_columns)
plt.xticks(rotation=-90)
plt.title('Feature Importance')
plt.show()
We can see that the model rewards Win Ratio heavily: if a person has been winning a lot recently, it guesses they’ll continue to win.
There’s also some reward for losing a lot of units while also destroying units, and a smaller reward for creating a lot of units and collecting a ton of supplies.
Example ranking
Let’s see what the odds would be now for one team to play against another. For instance, how likely is it that a team of Neo, Shift, Matt, and Skippy can fend off four hard AI armies?
#chances of one team against another
first_team = ['Neo', 'Shift', 'Matt', 'Skippy']
second_team = ['Hard', 'Hard', 'Hard', 'Hard']
test_team = [first_team, second_team]
team_values = []
for team in range(2):
    names = []
    for name in test_team[team]:
        #make sure we're getting correct name inputs
        if name not in games.Name.unique():
            print(name + ' not found')
            break
        #each person's nominal stats from their last N games
        name_stats = games_copy.loc[games_copy.Name == name].tail(MAX_NUMBER_OF_GAMES).iloc[:, 6:].mean()
        name_stats['Win Ratio'] = games_copy.loc[games_copy.Name == name, 'Win'].tail(MAX_NUMBER_OF_GAMES).mean()
        name_stats['Avg Rank'] = games_copy.loc[games_copy.Name == name, 'Normalized Rank'].tail(MAX_NUMBER_OF_GAMES).mean()
        names += [name_stats]
        names[-1].loc['Win_avg'] = games_copy.loc[(games_copy.Name == name) &
                                                  (~games_copy.Team.isna())].Win.fillna(0).tail(MAX_NUMBER_OF_GAMES).mean()
    team_values += [pd.concat(names, axis=1).T.sum()]
predicted_games = [team_values[1] - team_values[0]]
X_predict = pd.concat(predicted_games, axis=1).T[prediction_columns]
predicted_win = pipe.predict(X_predict)[0]
probability = pipe.predict_proba(X_predict)[0][predicted_win]
print('Between ' + ', '.join(first_team) + ' and ' + ', '.join(second_team) + ',\nModel predicts ' + ', '.join(test_team[predicted_win]) + ' with a ' + "{:.2%}".format(probability) + ' chance')
Between Neo, Shift, Matt, Skippy and Hard, Hard, Hard, Hard,
Model predicts Neo, Shift, Matt, Skippy with a 83.50% chance
Team Generator
What we’re really interested in, though, is making even teams given a list of contestants.
We’ll use the combinations function from itertools to generate all possible teams that take half the players. For 8 players, there are 70 unique combinations, but each of those teams has a complement made up of the other four players. That means there are effectively 35 distinct ways to split the players into two teams; a quick sketch of this follows.
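Here’s a sketch verifying that count, plus the complement pairing the team generator loop relies on, using eight distinct labels (our real roster repeats 'Hard', so its combination list contains duplicates):
#verify the counting argument and the complement pairing for 8 distinct players
from itertools import combinations

players = list('ABCDEFGH')
combos = list(combinations(players, 4))
print(len(combos))  #70 ways to pick half the players
#in lexicographic order, combo i and combo (len(combos) - 1 - i) are complements,
#so pairing the first 35 with the last 35 covers every two-team split exactly once
print(set(combos[0]) | set(combos[-1]) == set(players))  #True
print(set(combos[1]) | set(combos[-2]) == set(players))  #True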
#all possibilities
all_names = ['Hard', 'Hard', 'Hard', 'Hard', 'Neo', 'Shift','Matt', 'pcap']
#how many combos are possible?
possible_combos = list(combinations(all_names,int(len(all_names)/2)))
non_parity_combos= int(len(possible_combos)/2)
possible_combos
[('Hard', 'Hard', 'Hard', 'Hard'),
('Hard', 'Hard', 'Hard', 'Neo'),
('Hard', 'Hard', 'Hard', 'Shift'),
('Hard', 'Hard', 'Hard', 'Matt'),
('Hard', 'Hard', 'Hard', 'pcap'),
('Hard', 'Hard', 'Hard', 'Neo'),
('Hard', 'Hard', 'Hard', 'Shift'),
('Hard', 'Hard', 'Hard', 'Matt'),
('Hard', 'Hard', 'Hard', 'pcap'),
('Hard', 'Hard', 'Neo', 'Shift'),
('Hard', 'Hard', 'Neo', 'Matt'),
('Hard', 'Hard', 'Neo', 'pcap'),
('Hard', 'Hard', 'Shift', 'Matt'),
('Hard', 'Hard', 'Shift', 'pcap'),
('Hard', 'Hard', 'Matt', 'pcap'),
('Hard', 'Hard', 'Hard', 'Neo'),
('Hard', 'Hard', 'Hard', 'Shift'),
('Hard', 'Hard', 'Hard', 'Matt'),
('Hard', 'Hard', 'Hard', 'pcap'),
('Hard', 'Hard', 'Neo', 'Shift'),
('Hard', 'Hard', 'Neo', 'Matt'),
('Hard', 'Hard', 'Neo', 'pcap'),
('Hard', 'Hard', 'Shift', 'Matt'),
('Hard', 'Hard', 'Shift', 'pcap'),
('Hard', 'Hard', 'Matt', 'pcap'),
('Hard', 'Hard', 'Neo', 'Shift'),
('Hard', 'Hard', 'Neo', 'Matt'),
('Hard', 'Hard', 'Neo', 'pcap'),
('Hard', 'Hard', 'Shift', 'Matt'),
('Hard', 'Hard', 'Shift', 'pcap'),
('Hard', 'Hard', 'Matt', 'pcap'),
('Hard', 'Neo', 'Shift', 'Matt'),
('Hard', 'Neo', 'Shift', 'pcap'),
('Hard', 'Neo', 'Matt', 'pcap'),
('Hard', 'Shift', 'Matt', 'pcap'),
('Hard', 'Hard', 'Hard', 'Neo'),
('Hard', 'Hard', 'Hard', 'Shift'),
('Hard', 'Hard', 'Hard', 'Matt'),
('Hard', 'Hard', 'Hard', 'pcap'),
('Hard', 'Hard', 'Neo', 'Shift'),
('Hard', 'Hard', 'Neo', 'Matt'),
('Hard', 'Hard', 'Neo', 'pcap'),
('Hard', 'Hard', 'Shift', 'Matt'),
('Hard', 'Hard', 'Shift', 'pcap'),
('Hard', 'Hard', 'Matt', 'pcap'),
('Hard', 'Hard', 'Neo', 'Shift'),
('Hard', 'Hard', 'Neo', 'Matt'),
('Hard', 'Hard', 'Neo', 'pcap'),
('Hard', 'Hard', 'Shift', 'Matt'),
('Hard', 'Hard', 'Shift', 'pcap'),
('Hard', 'Hard', 'Matt', 'pcap'),
('Hard', 'Neo', 'Shift', 'Matt'),
('Hard', 'Neo', 'Shift', 'pcap'),
('Hard', 'Neo', 'Matt', 'pcap'),
('Hard', 'Shift', 'Matt', 'pcap'),
('Hard', 'Hard', 'Neo', 'Shift'),
('Hard', 'Hard', 'Neo', 'Matt'),
('Hard', 'Hard', 'Neo', 'pcap'),
('Hard', 'Hard', 'Shift', 'Matt'),
('Hard', 'Hard', 'Shift', 'pcap'),
('Hard', 'Hard', 'Matt', 'pcap'),
('Hard', 'Neo', 'Shift', 'Matt'),
('Hard', 'Neo', 'Shift', 'pcap'),
('Hard', 'Neo', 'Matt', 'pcap'),
('Hard', 'Shift', 'Matt', 'pcap'),
('Hard', 'Neo', 'Shift', 'Matt'),
('Hard', 'Neo', 'Shift', 'pcap'),
('Hard', 'Neo', 'Matt', 'pcap'),
('Hard', 'Shift', 'Matt', 'pcap'),
('Neo', 'Shift', 'Matt', 'pcap')]
Now, for each unique combination, we’ll create the two teams and record which team is predicted to win, and with what probability.
possibilities = []
for combo_index in range(non_parity_combos):
    #create the two teams for this unique combination
    first_team = list(possible_combos[combo_index])
    second_team = list(possible_combos[2*non_parity_combos - combo_index - 1])
    test_team = [first_team, second_team]
    #holders for each team's stats
    team_values = []
    for team in range(2):
        names = []
        for name in test_team[team]:
            #each person's stats from the last N games
            names += [games.loc[games.Name == name].tail(MAX_NUMBER_OF_GAMES).iloc[:, 6:].mean()]
            #except for the win average, which needs extra handling for blank spots
            names[-1].loc['Win_avg'] = games.loc[(games.Name == name) & (~games.Team.isna())].Win.fillna(0).tail(MAX_NUMBER_OF_GAMES).mean()
        #combine each team member's stats into a single row for that team
        team_values += [pd.concat(names, axis=1).T.sum()]
    predicted_game = [team_values[1] - team_values[0]]
    X_predict = pd.concat(predicted_game, axis=1).T[prediction_columns]
    team_predicted_to_win = pipe.predict(X_predict)[0]
    probability_of_winning = pipe.predict_proba(X_predict)[0][team_predicted_to_win]
    possibilities += [{"Team 1": first_team, "Team 2": second_team,
                       "Predicted Team": team_predicted_to_win + 1,
                       "Probability": probability_of_winning}]
all_runs = pd.DataFrame(possibilities)
all_runs
Team 1 | Team 2 | Predicted Team | Probability | |
---|---|---|---|---|
0 | [Hard, Hard, Hard, Hard] | [Neo, Shift, Matt, pcap] | 2 | 0.965438 |
1 | [Hard, Hard, Hard, Neo] | [Hard, Shift, Matt, pcap] | 2 | 0.590536 |
2 | [Hard, Hard, Hard, Shift] | [Hard, Neo, Matt, pcap] | 2 | 0.594433 |
3 | [Hard, Hard, Hard, Matt] | [Hard, Neo, Shift, pcap] | 2 | 0.499961 |
4 | [Hard, Hard, Hard, pcap] | [Hard, Neo, Shift, Matt] | 2 | 0.996667 |
5 | [Hard, Hard, Hard, Neo] | [Hard, Shift, Matt, pcap] | 2 | 0.590536 |
6 | [Hard, Hard, Hard, Shift] | [Hard, Neo, Matt, pcap] | 2 | 0.594433 |
7 | [Hard, Hard, Hard, Matt] | [Hard, Neo, Shift, pcap] | 2 | 0.499961 |
8 | [Hard, Hard, Hard, pcap] | [Hard, Neo, Shift, Matt] | 2 | 0.996667 |
9 | [Hard, Hard, Neo, Shift] | [Hard, Hard, Matt, pcap] | 1 | 0.908797 |
10 | [Hard, Hard, Neo, Matt] | [Hard, Hard, Shift, pcap] | 1 | 0.940660 |
11 | [Hard, Hard, Neo, pcap] | [Hard, Hard, Shift, Matt] | 2 | 0.939488 |
12 | [Hard, Hard, Shift, Matt] | [Hard, Hard, Neo, pcap] | 1 | 0.944947 |
13 | [Hard, Hard, Shift, pcap] | [Hard, Hard, Neo, Matt] | 2 | 0.940093 |
14 | [Hard, Hard, Matt, pcap] | [Hard, Hard, Neo, Shift] | 2 | 0.914317 |
15 | [Hard, Hard, Hard, Neo] | [Hard, Shift, Matt, pcap] | 2 | 0.590536 |
16 | [Hard, Hard, Hard, Shift] | [Hard, Neo, Matt, pcap] | 2 | 0.594433 |
17 | [Hard, Hard, Hard, Matt] | [Hard, Neo, Shift, pcap] | 2 | 0.499961 |
18 | [Hard, Hard, Hard, pcap] | [Hard, Neo, Shift, Matt] | 2 | 0.996667 |
19 | [Hard, Hard, Neo, Shift] | [Hard, Hard, Matt, pcap] | 1 | 0.908797 |
20 | [Hard, Hard, Neo, Matt] | [Hard, Hard, Shift, pcap] | 1 | 0.940660 |
21 | [Hard, Hard, Neo, pcap] | [Hard, Hard, Shift, Matt] | 2 | 0.939488 |
22 | [Hard, Hard, Shift, Matt] | [Hard, Hard, Neo, pcap] | 1 | 0.944947 |
23 | [Hard, Hard, Shift, pcap] | [Hard, Hard, Neo, Matt] | 2 | 0.940093 |
24 | [Hard, Hard, Matt, pcap] | [Hard, Hard, Neo, Shift] | 2 | 0.914317 |
25 | [Hard, Hard, Neo, Shift] | [Hard, Hard, Matt, pcap] | 1 | 0.908797 |
26 | [Hard, Hard, Neo, Matt] | [Hard, Hard, Shift, pcap] | 1 | 0.940660 |
27 | [Hard, Hard, Neo, pcap] | [Hard, Hard, Shift, Matt] | 2 | 0.939488 |
28 | [Hard, Hard, Shift, Matt] | [Hard, Hard, Neo, pcap] | 1 | 0.944947 |
29 | [Hard, Hard, Shift, pcap] | [Hard, Hard, Neo, Matt] | 2 | 0.940093 |
30 | [Hard, Hard, Matt, pcap] | [Hard, Hard, Neo, Shift] | 2 | 0.914317 |
31 | [Hard, Neo, Shift, Matt] | [Hard, Hard, Hard, pcap] | 1 | 0.986479 |
32 | [Hard, Neo, Shift, pcap] | [Hard, Hard, Hard, Matt] | 1 | 0.545065 |
33 | [Hard, Neo, Matt, pcap] | [Hard, Hard, Hard, Shift] | 1 | 0.639590 |
34 | [Hard, Shift, Matt, pcap] | [Hard, Hard, Hard, Neo] | 1 | 0.639314 |
This gives us all the possible matchups, which team is predicted to win, and by how much.
All that’s left is to sort the matchups from the smallest chance of winning to the largest to get the most even teams. In other words, if the model has difficulty guessing which team would win, the teams are more evenly matched.
desired_team_bool = all_runs.Probability == all_runs.Probability.min()
first_team = all_runs.loc[desired_team_bool, 'Team 1'].values[0]
second_team = all_runs.loc[desired_team_bool, 'Team 2'].values[0]
team_predicted_to_win= all_runs.loc[desired_team_bool, 'Predicted Team'].values[0]-1
test_team = [first_team, second_team]
probability = all_runs.Probability.min()
print('\n\nFor '+ ', '.join(all_names)+',\nThe most even teams are '+ ', '.join(first_team)+' and ' + ', '.join(second_team)+ ',\nI predict ' + ', '.join(test_team[team_predicted_to_win]) + ' with a '+"{:.2%}".format(probability)+' chance')
For Hard, Hard, Hard, Hard, Neo, Shift, Matt, pcap,
The most even teams are Hard, Hard, Hard, Matt and Hard, Neo, Shift, pcap,
I predict Hard, Neo, Shift, pcap with a 50.00% chance
Update the Google Sheet with the team maker stats
This boils down to the model’s rating of each person: the ratings can simply be added together for each team, and Google Sheets or Tableau can use that sum to recreate the generated teams. Because logistic regression runs everything through the sigmoid function, the predicted probability shifts as individual ratings are added or subtracted, so finding even teams amounts to finding the split whose two rating sums are closest.
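Since the model is a logistic regression, the natural additive scale is log-odds rather than raw probability. Here’s a minimal sketch of that relationship with two hypothetical player ratings; it mirrors the commented-out -math.log((1 - p)/p) alternative rating in the cell below:
#log-odds make per-player ratings additive; the sigmoid maps the sum back to a probability
import math

def logit(p):
    return math.log(p / (1 - p))   #probability -> log-odds

def sigmoid(z):
    return 1 / (1 + math.exp(-z))  #log-odds -> probability

#two hypothetical player ratings expressed as probabilities
player_a, player_b = 0.95, 0.70
team_strength = sigmoid(logit(player_a) + logit(player_b))
print(team_strength)  #~0.98: adding ratings shifts the sigmoid, as described above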
#Player rankings
stats = {}
sheets_stats = [['Name', 'Predictive Rating']]
for name in games.Name.unique():
    #build a one-player "team" from this player's recent median stats
    name_stats = games_copy.loc[games_copy.Name == name].tail(MAX_NUMBER_OF_GAMES).iloc[:, 6:].median()
    name_stats['Win Ratio'] = games_copy.loc[games_copy.Name == name, 'Win'].tail(MAX_NUMBER_OF_GAMES).mean()
    name_stats['Avg Rank'] = games_copy.loc[games_copy.Name == name, 'Normalized Rank'].tail(MAX_NUMBER_OF_GAMES).mean()
    name_stats.loc['Win_avg'] = games_copy.loc[(games_copy.Name == name) &
                                               (~games_copy.Team.isna())].Win.fillna(0).tail(MAX_NUMBER_OF_GAMES).mean()
    X_predict = pd.concat([name_stats], axis=1).T[prediction_columns]
    probability = pipe.predict_proba(X_predict)[0][1]  #the player's side is always class 1
    #stats[name] = -math.log((1 - probability)/probability)  #alternative: a log-odds rating
    stats[name] = probability
    #sheets_stats += [[name, -math.log((1 - probability)/probability)]]
    sheets_stats += [[name, probability]]
    games.loc[games.Name == name, 'Predictive Rating'] = stats[name]
RANGE = "'Team Maker'!P1:Q" + str(len(sheets_stats))
push_sheet_data(SCOPES, SPREADSHEET_ID, RANGE, sheets_stats)
#make a quick sanity-check plot to ensure no one's rating is out of line
games.groupby('Name')['Predictive Rating'].max().sort_values().plot.barh()
plt.title('Ranking based on last '+str(MAX_NUMBER_OF_GAMES)+' games')
plt.show()
38 cells updated.
Update Tableau spreadsheet
The Tableau data source will have two tabs. One holds all the individual games, a straight copy of our games dataframe.
The other tab is a pre-calculated list of the possible team combinations for a 4v4, 3v3, or 2v2 game, encoded in binary. Both the Google Sheet and Tableau use these bit patterns to enumerate every name combination, build the resulting teams, and rank them (see the sketch below).
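As a sketch of the encoding, with a hypothetical roster and assuming the leftmost bit maps to the first player: each bit assigns one player to a side, so an 8-bit pattern with exactly four 1s defines a 4v4 split.
#how one 4v4 bit pattern splits eight players into two teams
players = ['P1', 'P2', 'P3', 'P4', 'P5', 'P6', 'P7', 'P8']  #hypothetical roster
pattern = '10000111'  #a 1 puts the player on one team, a 0 on the other
team_a = [p for p, bit in zip(players, pattern) if bit == '1']
team_b = [p for p, bit in zip(players, pattern) if bit == '0']
print(team_a, team_b)  #['P1', 'P6', 'P7', 'P8'] ['P2', 'P3', 'P4', 'P5']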
v2 = []
v3 = []
v4 = []
#Do 4v4 first since it will be the longest column; that way we don't have to error check
#the other two lists for rows with no names and ratings
for i in range(256):
    if i == 0:
        sheets_stats[0] += ['4v4']
    number = bin(i + 1).replace('0b', "")
    #keep 8-bit patterns with exactly four 1s: these are the 4v4 splits
    if (sum([int(x) for x in number]) == 4) & (len(number) == 8):
        v4 += [number]
        try:
            sheets_stats[len(v4)] += [number]
        except IndexError:
            #more patterns than named players, so pad with blank name/rating cells
            sheets_stats += [["", "", number]]
for i in range(64):
    if i == 0:
        sheets_stats[0] += ['3v3']
    number = bin(i + 1).replace('0b', "")
    if (sum([int(x) for x in number]) == 3) & (len(number) == 6):
        v3 += [number]
        sheets_stats[len(v3)] += [number]
for i in range(16):
    if i == 0:
        sheets_stats[0] += ['2v2']
    number = bin(i + 1).replace('0b', "")
    if (sum([int(x) for x in number]) == 2) & (len(number) == 4):
        v2 += [number]
        sheets_stats[len(v2)] += [number]
pd.DataFrame(sheets_stats[1:], columns=sheets_stats[0])
Name | Predictive Rating | 4v4 | 3v3 | 2v2 | |
---|---|---|---|---|---|
0 | Matt | 0.945633 | 10000111 | 100011 | 1001 |
1 | Skippy | 0.74034 | 10001011 | 100101 | 1010 |
2 | Neo | 0.872259 | 10001101 | 100110 | 1100 |
3 | TVH | 0.772892 | 10001110 | 101001 | None |
4 | Pancake | 0.698927 | 10010011 | 101010 | None |
5 | Jack | 0.868168 | 10010101 | 101100 | None |
6 | Scottagorn | 0.894642 | 10010110 | 110001 | None |
7 | SS | 0.339708 | 10011001 | 110010 | None |
8 | STM | 0.536517 | 10011010 | 110100 | None |
9 | CoreDawg | 0.964444 | 10011100 | 111000 | None |
10 | Mike | 0.95847 | 10100011 | None | None |
11 | Medium | 0.73781 | 10100101 | None | None |
12 | Shift | 0.884507 | 10100110 | None | None |
13 | Pcap | 0.794414 | 10101001 | None | None |
14 | Hard | 0.822299 | 10101010 | None | None |
15 | Spiff | 0.879811 | 10101100 | None | None |
16 | Copper Kettle | 0.17295 | 10110001 | None | None |
17 | Tytan | 0.96384 | 10110010 | None | None |
18 | 10110100 | None | None | ||
19 | 10111000 | None | None | ||
20 | 11000011 | None | None | ||
21 | 11000101 | None | None | ||
22 | 11000110 | None | None | ||
23 | 11001001 | None | None | ||
24 | 11001010 | None | None | ||
25 | 11001100 | None | None | ||
26 | 11010001 | None | None | ||
27 | 11010010 | None | None | ||
28 | 11010100 | None | None | ||
29 | 11011000 | None | None | ||
30 | 11100001 | None | None | ||
31 | 11100010 | None | None | ||
32 | 11100100 | None | None | ||
33 | 11101000 | None | None | ||
34 | 11110000 | None | None |
Save the data source
games.to_excel('Generals Statistics.xlsx', sheet_name="Games")
from openpyxl import load_workbook
workbook = load_workbook(filename="Generals Statistics.xlsx")
workbook.create_sheet('Teams')
sheet = workbook['Teams']
for i, rows in enumerate(sheets_stats):
    #each row holds between two and five entries: name, rating, and the 4v4/3v3/2v2 patterns
    for j, value in enumerate(rows[:5]):
        sheet.cell(row=i + 1, column=j + 1).value = value
#add a final name of 'None' with a rating of 0 so tableau doesn't need eight people every time
sheet.cell(row=len(games.Name.unique()) + 2, column=1).value = 'None'
sheet.cell(row=len(games.Name.unique()) + 2, column=2).value = 0
workbook.save(filename="Generals Statistics.xlsx")
Conclusion
This was a great project: take each person’s individual stats, create some synthetic games based on them, and have a logistic regression model predict which team would win. With an accuracy in the roughly 76-80% range from run to run, the model could probably be improved with something more complex, but that would be less explainable and harder to implement inside Tableau and Google Sheets. Making this usable by the other team members when I’m not around was more important than an extremely accurate model.
If you’d like to see the Tableau page of the latest stats, it can be found here, but I’ve also attempted to embed it below.
For a more interactive version, click here:
https://public.tableau.com/views/LifegroupTraxxNight/TraxxRaceStats?:language=en-US&publish=yes&:display_count=n&:showVizHome=no