An analysis of Counter-Strike: Global Offensive pro matches from 2015-2020 ("The Golden Age")¶

Summer 2025 Data Science Project¶
Contributions:
Bailey Jones: Project idea, dataset curation/preprocessing, data exploration, ML algo design/analysis, prose, formatting, the entire project and each of the checkpoints. Cleaned up C1. Did C2, C3, ML model.
Angelo Parker: Dataset curation, designed and completed C1 (CT vs T win rate).
Chris Duong: conclusion and project idea.
Jiho Lee: dataset curation.
Introduction¶
Counter-Strike: Global Offensive (CS:GO) was a competitive first-person shooter (FPS) video game that ran from August 2012 to September 2023, when it was replaced by its successor, Counter-Strike 2 (CS2).
Due to the competitive nature of CS:GO, there is data from nearly every professional match played on community forums like HLTV.org.
We would like to explore data from professional matches played between November 2015 and February 2020, a period that some describe as the golden age of CS:GO.
Basic overview of CS:GO¶
In CS:GO, there are two sides: Terrorists (T) and Counter-Terrorists (CT). Teams randomly start on either side, CT (defense) or T (offense), and swap sides after 15 rounds.
To win a round, T side must successfully detonate a bomb on one of the two designated sites, or eliminate all members of CT side. Inversely, CT side must either defuse a planted T side bomb (or rescue a hostage in the Hostage Rescue gamemode) or eliminate all of T side's players.
Every player has a pistol at the start of the game and again after dying. Better weapons and equipment are bought at the start of each round with money. Individual performance (kills, assists, bomb plants or defusals, etc.) earns a player money, and round wins pay out to the entire team. Teams that lose rounds also get a smaller payout that grows with each consecutive loss, which keeps the other team from "snowballing" to a win.
Most competitive matches are played until one team reaches 16 round wins. If the score is tied at 15-15, overtime rules may apply: the teams play an additional 6 rounds (3 per side), and the first team to win 4 of them wins. A tie at the end of overtime (18-18, 21-21, and so forth) restarts the overtime process.
Pro series are usually played as a best-of-X over up to X maps (X = 1, 2, or 3) to decide a winner.
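The scoring rules above can be sketched as a small check for whether a scoreline is final. This is a minimal illustration that assumes the 6-round overtime format described in this overview (first to 4 wins per overtime). `is_final_score` is a hypothetical helper, not part of the dataset or of any library.

```python
def is_final_score(a: int, b: int) -> bool:
    """Return True if (a, b) is a finished map score under the rules above."""
    hi, lo = max(a, b), min(a, b)
    if hi == 16:
        # Regulation: first to 16 with the opponent on 14 or fewer.
        return lo <= 14
    if hi > 16 and (hi - 16) % 3 == 0:
        # Overtime k starts tied at 15 + 3*(k-1) and is won at 16 + 3*k.
        k = (hi - 16) // 3
        tie_score = 15 + 3 * (k - 1)
        return tie_score <= lo <= hi - 2
    return False


for score in [(16, 14), (16, 15), (19, 17), (18, 18), (22, 20)]:
    print(score, is_final_score(*score))
```

A check like this could be used to flag rows in results.csv whose round totals do not add up to a finished map, assuming those rows follow the same overtime format.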
Our analysis¶
With the introduction out of the way, the dataset available to us should let us answer many questions about CS:GO. A common complaint in public chat during a match is "Ugh, why are we losing?", with the response being "Because this map is so CT-sided it's unfair". We would like to test claims like this and more, such as how closely a player's damage output tracks their overall performance.
Data Preprocessing¶
We have chosen to use a dataset from Kaggle (CS:GO Professional Matches), pre-scraped from HLTV.org, that includes data from November 2015 to February 2020.
Our dataset includes 4 CSV files: economy.csv, results.csv, players.csv, and picks.csv.
After late revision, we will not be including economy.csv in any sort of evaluation.
import pandas as pd
import seaborn as sns
from statsmodels.stats.proportion import proportions_ztest
import matplotlib.pyplot as plt
economy_df = pd.read_csv("economy.csv", low_memory=False)
results_df = pd.read_csv("results.csv")
players_df = pd.read_csv("players.csv")
picks_df = pd.read_csv("picks.csv")
Firstly, the players.csv file is massive (around 130MB). It holds a lot of information we do not need for our purposes, such as individual per-game data for each of the pro matches. We can cut a significant portion of the fat by focusing on the end result of each pro match (the combined results after, for example, a best-of-3).
essential_columns = [
'date', 'player_name', 'team', 'opponent', 'country',
'player_id', 'match_id', 'event_id', 'event_name', 'best_of',
'kills', 'assists', 'deaths', 'hs', 'flash_assists',
'kast', 'kddiff', 'adr', 'fkdiff', 'rating',
'kills_ct', 'deaths_ct', 'kddiff_ct', 'adr_ct', 'kast_ct', 'rating_ct',
'kills_t', 'deaths_t', 'kddiff_t', 'adr_t', 'kast_t', 'rating_t'
]
players_df = players_df[essential_columns].copy()
display(players_df)
| | date | player_name | team | opponent | country | player_id | match_id | event_id | event_name | best_of | ... | kddiff_ct | adr_ct | kast_ct | rating_ct | kills_t | deaths_t | kddiff_t | adr_t | kast_t | rating_t |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020-02-26 | Brehze | Evil Geniuses | Liquid | United States | 9136 | 2339385 | 4901 | IEM Katowice 2020 | 3 | ... | 4.0 | 81.6 | 79.2 | 1.10 | 23.0 | 31.0 | -8.0 | 77.5 | 60.0 | 0.97 |
| 1 | 2020-02-26 | CeRq | Evil Geniuses | Liquid | Bulgaria | 11219 | 2339385 | 4901 | IEM Katowice 2020 | 3 | ... | 12.0 | 77.4 | 72.9 | 1.16 | 17.0 | 29.0 | -12.0 | 63.9 | 54.3 | 0.73 |
| 2 | 2020-02-26 | EliGE | Liquid | Evil Geniuses | United States | 8738 | 2339385 | 4901 | IEM Katowice 2020 | 3 | ... | 14.0 | 96.6 | 71.4 | 1.39 | 24.0 | 34.0 | -10.0 | 64.2 | 64.6 | 0.86 |
| 3 | 2020-02-26 | Ethan | Evil Geniuses | Liquid | United States | 10671 | 2339385 | 4901 | IEM Katowice 2020 | 3 | ... | 10.0 | 74.0 | 75.0 | 1.11 | 10.0 | 31.0 | -21.0 | 37.8 | 51.4 | 0.43 |
| 4 | 2020-02-26 | NAF | Liquid | Evil Geniuses | Canada | 8520 | 2339385 | 4901 | IEM Katowice 2020 | 3 | ... | 11.0 | 96.3 | 85.7 | 1.36 | 24.0 | 29.0 | -5.0 | 61.0 | 70.8 | 0.87 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 383312 | 2015-10-07 | kIMERA | ExAequo | RIP Fonty | Italy | 7607 | 2298497 | 1957 | Milan Games Week 2015 League by FACEIT | 2 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 383313 | 2015-10-07 | morphiw0w | ExAequo | RIP Fonty | Italy | 9752 | 2298497 | 1957 | Milan Games Week 2015 League by FACEIT | 2 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 383314 | 2015-10-07 | overfly | RIP Fonty | ExAequo | Italy | 7698 | 2298497 | 1957 | Milan Games Week 2015 League by FACEIT | 2 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 383315 | 2015-10-07 | simozor | RIP Fonty | ExAequo | Italy | 9753 | 2298497 | 1957 | Milan Games Week 2015 League by FACEIT | 2 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 383316 | 2015-10-07 | xullE | RIP Fonty | ExAequo | Italy | 9754 | 2298497 | 1957 | Milan Games Week 2015 League by FACEIT | 2 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
383317 rows × 32 columns
Next, picks.csv has a lot of useless information including some columns that look like identifiers for the site. We don't need those.
essential_columns = [
'date', 'team_1', 'team_2', 'match_id', 'event_id', 'best_of',
't1_removed_1', 't1_removed_2', 't1_removed_3',
't2_removed_1', 't2_removed_2', 't2_removed_3',
't1_picked_1', 't2_picked_1', 'left_over'
]
picks_df = picks_df[essential_columns].copy()
picks_df['date'] = pd.to_datetime(picks_df['date'], errors='coerce')
display(picks_df)
| | date | team_1 | team_2 | match_id | event_id | best_of | t1_removed_1 | t1_removed_2 | t1_removed_3 | t2_removed_1 | t2_removed_2 | t2_removed_3 | t1_picked_1 | t2_picked_1 | left_over |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020-03-18 | TeamOne | Recon 5 | 2340454 | 5151 | 3 | Vertigo | Train | 0.0 | Nuke | Overpass | 0.0 | Dust2 | Inferno | Mirage |
| 1 | 2020-03-18 | Rugratz | Bad News Bears | 2340453 | 5151 | 3 | Dust2 | Nuke | 0.0 | Mirage | Train | 0.0 | Vertigo | Inferno | Overpass |
| 2 | 2020-03-18 | New England Whalers | Station7 | 2340461 | 5243 | 1 | Mirage | Dust2 | Vertigo | Nuke | Train | Overpass | 0.0 | 0.0 | Inferno |
| 3 | 2020-03-17 | Complexity | forZe | 2340279 | 5226 | 3 | Inferno | Nuke | 0.0 | Overpass | Vertigo | 0.0 | Dust2 | Train | Mirage |
| 4 | 2020-03-17 | Singularity | Endpoint | 2340456 | 5247 | 3 | Train | Mirage | 0.0 | Nuke | Inferno | 0.0 | Overpass | Vertigo | Dust2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 16030 | 2016-04-12 | GODSENT | Natus Vincere | 2302059 | 2099 | 1 | Dust2 | Cobblestone | Mirage | Cache | Inferno | Overpass | 0.0 | 0.0 | Train |
| 16031 | 2016-04-12 | Liquid | mousesports | 2302058 | 2099 | 1 | Inferno | Train | Mirage | Overpass | Cobblestone | Cache | 0.0 | 0.0 | Dust2 |
| 16032 | 2016-04-12 | Luminosity | TYLOO | 2302057 | 2099 | 1 | Dust2 | Cache | Inferno | Train | Overpass | Cobblestone | 0.0 | 0.0 | Mirage |
| 16033 | 2016-04-12 | FaZe | Virtus.pro | 2302063 | 2099 | 1 | Overpass | Cobblestone | Cache | Dust2 | Inferno | Mirage | 0.0 | 0.0 | Train |
| 16034 | 2016-04-12 | Tempo Storm | Envy | 2302064 | 2099 | 1 | Cache | Train | Inferno | Overpass | Mirage | Dust2 | 0.0 | 0.0 | Cobblestone |
16035 rows × 15 columns
Let's repeat the process with results.csv and remove the unnecessary columns.
essential_columns = [
'date', 'team_1', 'team_2', '_map',
'ct_1', 't_1', 'ct_2', 't_2',
'map_winner', 'starting_ct',
'match_winner', 'event_id', 'match_id'
]
results_df = results_df[essential_columns].copy()
display(results_df)
| | date | team_1 | team_2 | _map | ct_1 | t_1 | ct_2 | t_2 | map_winner | starting_ct | match_winner | event_id | match_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020-03-18 | Recon 5 | TeamOne | Dust2 | 0 | 0 | 15 | 1 | 2 | 2 | 2 | 5151 | 2340454 |
| 1 | 2020-03-18 | Recon 5 | TeamOne | Inferno | 8 | 5 | 10 | 6 | 2 | 2 | 2 | 5151 | 2340454 |
| 2 | 2020-03-18 | New England Whalers | Station7 | Inferno | 9 | 3 | 10 | 6 | 2 | 1 | 2 | 5243 | 2340461 |
| 3 | 2020-03-18 | Rugratz | Bad News Bears | Inferno | 0 | 7 | 8 | 8 | 2 | 2 | 2 | 5151 | 2340453 |
| 4 | 2020-03-18 | Rugratz | Bad News Bears | Vertigo | 4 | 4 | 11 | 5 | 2 | 2 | 2 | 5151 | 2340453 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 45768 | 2015-11-05 | G2 | E-frag.net | Inferno | 8 | 5 | 9 | 7 | 2 | 1 | 2 | 1970 | 2299059 |
| 45769 | 2015-11-05 | G2 | E-frag.net | Dust2 | 10 | 6 | 8 | 5 | 1 | 1 | 2 | 1970 | 2299059 |
| 45770 | 2015-11-04 | CLG | Liquid | Inferno | 7 | 9 | 4 | 8 | 1 | 1 | 1 | 1934 | 2299011 |
| 45771 | 2015-11-03 | NiP | Dignitas | Train | 4 | 12 | 3 | 1 | 1 | 2 | 1 | 1934 | 2299001 |
| 45772 | 2015-11-03 | NiP | Envy | Cobblestone | 4 | 12 | 3 | 6 | 1 | 2 | 1 | 1934 | 2299003 |
45773 rows × 13 columns
C1: CT vs. T Round Win Rate By Map¶
In CS:GO, there are two sides: Terrorists (T) and Counter-Terrorists (CT). Teams start on one side, CT (defense) or T (offense), and swap after 15 rounds.
Each map is structured differently, so there may be inherent advantages for a certain side given a map's layout. One could assume that, given such a competitive game, all maps would have balanced win rates and it wouldn't matter whether a team was on the CT side or the T side.
The data in results.csv might paint a different picture. We will total the rounds won by the CT and T sides on every map and calculate win rates over time (the data is from 2015-2020). The results will show whether maps have balanced win rates, or whether certain maps give an inherent advantage to one side.
To test for statistical significance, we use a Z-test on proportions with a null hypothesis of CT win rate = 50% and an alternative hypothesis of CT win rate ≠ 50%. In the code, this is done by passing the CT and T round totals to proportions_ztest and comparing the two shares. The test tells us whether a map gives an inherent advantage to the CT or T side.
# 2-Proportion-Z-Test
# Null Hypothesis: CT win rate = 50%
# Alternative Hypothesis: CT win rate ≠ 50%
# Also shows graph of rolling average CT win rates over time per each map (rolling avg window: 500 rounds)
results_df['date'] = pd.to_datetime(results_df['date'], errors='coerce')
results_df = results_df.dropna(subset=['date', '_map'])
ct_1 = results_df[['date', '_map', 'ct_1']].rename(columns={'ct_1': 'ct'})
ct_2 = results_df[['date', '_map', 'ct_2']].rename(columns={'ct_2': 'ct'})
t_1 = results_df[['date', '_map', 't_1']].rename(columns={'t_1': 't'})
t_2 = results_df[['date', '_map', 't_2']].rename(columns={'t_2': 't'})
ct = pd.concat([ct_1, ct_2]).sort_values('date').set_index('date')
t = pd.concat([t_1, t_2]).sort_values('date').set_index('date')
maps = ['Cache', 'Cobblestone', 'Dust2', 'Inferno', 'Mirage', 'Nuke', 'Overpass', 'Train', 'Vertigo']
rolling_ct_avg = {}
hypothesis_results = []
for map_name in maps:
    ct_map = ct[ct['_map'] == map_name]
    t_map = t[t['_map'] == map_name]
    ct_avg = ct_map['ct'].rolling(window=500, min_periods=20, center=True).sum()
    t_avg = t_map['t'].rolling(window=500, min_periods=20, center=True).sum()
    win_pct = (ct_avg / (ct_avg + t_avg)) * 100
    rolling_ct_avg[map_name] = win_pct
    ct_total = ct_map['ct'].sum()
    t_total = t_map['t'].sum()
    z_stat, p_val = proportions_ztest([ct_total, t_total], [ct_total + t_total, ct_total + t_total])
    hypothesis_results.append({
        'Map': map_name,
        'CT Round Wins': ct_total,
        'T Round Wins': t_total,
        'CT Round Win Rate Percentage': round((ct_total / (ct_total + t_total)) * 100, 2),
        'Z-Statistic': round(z_stat, 3),
        'p-value': p_val,
        'Significant? ': 'Yes' if p_val < 0.05 else 'No'
    })
hypothesis_df = pd.DataFrame(hypothesis_results).sort_values('CT Round Win Rate Percentage', ascending=False)
display(hypothesis_df)
plt.figure(figsize=(14, 7))
for map_name, win_pct_data in rolling_ct_avg.items():
    plt.plot(win_pct_data.index, win_pct_data.values, label=map_name, linewidth=1.5, alpha=0.9)
plt.axhline(50, color='black', linestyle='--', linewidth=1)
plt.xlabel('Date')
plt.ylabel('CT Round Win Percentage')
plt.title('CT Round Win Percentage Over Time by Map')
plt.legend(title='Map', loc='center left', bbox_to_anchor=(1, 0.5))
plt.grid(True, linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()
| | Map | CT Round Wins | T Round Wins | CT Round Win Rate Percentage | Z-Statistic | p-value | Significant? |
|---|---|---|---|---|---|---|---|
| 5 | Nuke | 58561 | 48102 | 54.90 | 45.290 | 0.000000e+00 | Yes |
| 7 | Train | 90147 | 76095 | 54.23 | 48.740 | 0.000000e+00 | Yes |
| 6 | Overpass | 75508 | 67366 | 52.85 | 30.463 | 8.113994e-204 | Yes |
| 4 | Mirage | 119334 | 111006 | 51.81 | 24.540 | 5.557588e-133 | Yes |
| 3 | Inferno | 93465 | 96862 | 49.11 | -11.012 | 3.350468e-28 | Yes |
| 1 | Cobblestone | 43390 | 45332 | 48.91 | -9.220 | 2.960643e-20 | Yes |
| 2 | Dust2 | 50899 | 54195 | 48.43 | -14.378 | 7.061988e-47 | Yes |
| 8 | Vertigo | 7559 | 8130 | 48.18 | -6.447 | 1.141327e-10 | Yes |
| 0 | Cache | 55647 | 61750 | 47.40 | -25.190 | 5.142636e-140 | Yes |
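After analysis, every map shows a statistically significant advantage toward one side when rounds are pooled over the whole 2015-2020 period, some more than others. With tens of thousands of rounds per map, even a gap of one or two percentage points is significant, so the win rate itself is the better guide to how lopsided a map is. One caveat: because the CT and T shares sum to 100%, passing them to proportions_ztest as two independent samples inflates each Z-statistic by a factor of √2 compared with a one-sample test against 50%. Every map remains significant after dividing by √2 (Vertigo, the smallest in magnitude, goes from -6.447 to about -4.56).
For example, teams have been more likely to win rounds on the CT side of Nuke (54.90% CT win rate) and on the T side of Cache (47.40% CT win rate).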
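As a cross-check on the table above, a one-sample z-test against 50% can be sketched with the standard library alone. This is an illustration rather than the notebook's method: `one_sample_z` is a hypothetical helper, the standard error is taken under the null hypothesis, and the round counts are copied from the table.

```python
import math


def one_sample_z(successes: int, n: int, p0: float = 0.5) -> tuple[float, float]:
    """z-statistic and two-sided p-value for H0: true proportion == p0."""
    p_hat = successes / n
    se = math.sqrt(p0 * (1 - p0) / n)  # standard error under the null
    z = (p_hat - p0) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return z, p_value


# (CT round wins, T round wins) copied from the table above.
rounds = {"Nuke": (58561, 48102), "Vertigo": (7559, 8130), "Cache": (55647, 61750)}
for name, (ct_wins, t_wins) in rounds.items():
    z, p = one_sample_z(ct_wins, ct_wins + t_wins)
    print(f"{name}: z = {z:.2f}, p = {p:.3g}")
```

Multiplying these z-values by √2 reproduces the Z-Statistic column, which is one way to see that the two approaches differ only by a constant factor on this data.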
CS:GO regularly rotated maps in and out of competitive play, both to avoid stagnation and to make room for major overhauls. This explains why some lines start later than others (as with Cache) or run flat for some periods (Dust2 and Inferno): no professional games were being played on these maps at those times.
Because CS:GO was constantly adjusted with balance changes, there are noticeable dips at certain points in time. For example, in February 2017, Valve introduced a new version of Inferno that significantly changed features of the map in an attempt to balance it. Players naturally develop new strategies over time to account for changes made by developers. A closer look at this history can be seen here: The History of Inferno Banana Control.
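The rolling curves in the plot are ratios of windowed sums (CT rounds divided by all rounds in the window), not averages of per-game percentages. A minimal standard-library sketch of that idea follows. It uses a trailing window for simplicity, whereas the notebook code uses a centered 500-entry pandas window, and `rolling_ct_share` is a hypothetical helper. The inputs are the per-game CT and T round totals of the first three Inferno rows in the results_df preview.

```python
from collections import deque


def rolling_ct_share(ct_rounds, t_rounds, window=2, min_periods=1):
    """Trailing-window CT share (%) computed from per-game (CT, T) round totals."""
    shares = []
    recent = deque(maxlen=window)  # keeps only the last `window` games
    for ct, t in zip(ct_rounds, t_rounds):
        recent.append((ct, t))
        if len(recent) < min_periods:
            shares.append(None)  # not enough games in the window yet
            continue
        ct_sum = sum(c for c, _ in recent)
        total = ct_sum + sum(x for _, x in recent)
        shares.append(100 * ct_sum / total if total else None)
    return shares


# ct_1 + ct_2 and t_1 + t_2 for the three Inferno rows in the preview.
ct_per_game = [18, 19, 8]
t_per_game = [11, 9, 15]
print(rolling_ct_share(ct_per_game, t_per_game, window=2))
```

Summing before dividing weights long games more heavily than short ones, which matches how the overall win rates in the table are computed.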
C2: Has starting side ever mattered for match wins?¶
As an extension of the previous CT vs. T round win rate result, we would like to determine whether the side a team starts on affects whether they win. The code below compares starting_ct with map_winner, so a "match" in this section means a single map.
After the 15th round, the teams swap sides and continue the game from scratch. All money is wiped, and it is effectively like the game beginning again from round 1. Intuitively, the side you randomly start on should not affect who ultimately wins, but we would like to explore this possibility.
results_df['ct_start_win'] = results_df['starting_ct'] == results_df['map_winner']
We would like to chart how the win rate of CT-starting teams has changed over time. As discussed previously, map reworks aim to squash major discrepancies like this.
rolling_window = 1500
plt.figure(figsize=(14, 7))
match_results = []
for map_name in maps:
    map_df = results_df[results_df['_map'] == map_name].sort_values('date')
    ct_wins = (map_df['ct_start_win'].sum())
    total = len(map_df)
    z_stat, p_val = proportions_ztest([ct_wins, total - ct_wins], [total, total])
    win_rate = map_df['ct_start_win'].rolling(window=rolling_window, min_periods=20, center=True).mean()
    match_results.append({
        'Map': map_name,
        'CT Start Wins': ct_wins,
        'Total Matches': total,
        'CT Start Win Rate Percentage': round((ct_wins / total) * 100, 2),
    })
    plt.plot(map_df['date'], win_rate * 100, label=map_name, alpha=0.9)
ct_match_df = pd.DataFrame(match_results).sort_values('CT Start Win Rate Percentage', ascending=False)
display(ct_match_df)
plt.axhline(50, color='gray', linestyle='--')
plt.xlabel("Date")
plt.ylabel("CT-starting Side Team Win Rate (%)")
plt.title(f"CT-Starting Side Team Match Win Rate (Rolling {rolling_window} Matches)")
plt.legend(title='Map', loc='center left', bbox_to_anchor=(1, 0.5))
plt.grid(True)
plt.tight_layout()
plt.show()
| | Map | CT Start Wins | Total Matches | CT Start Win Rate Percentage |
|---|---|---|---|---|
| 1 | Cobblestone | 1828 | 3513 | 52.04 |
| 3 | Inferno | 3864 | 7485 | 51.62 |
| 0 | Cache | 2361 | 4613 | 51.18 |
| 8 | Vertigo | 310 | 609 | 50.90 |
| 2 | Dust2 | 2093 | 4114 | 50.88 |
| 4 | Mirage | 4585 | 9021 | 50.83 |
| 5 | Nuke | 2114 | 4206 | 50.26 |
| 7 | Train | 3280 | 6566 | 49.95 |
| 6 | Overpass | 2786 | 5625 | 49.53 |
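Interestingly, these results differ from the round win rates above. Within the timeframe of our dataset, some maps favor the team that starts on the CT side, including maps such as Cobblestone, Inferno, and Cache that are T-sided by round win rate.
The Z-test in the code above flags Cobblestone, Inferno, Cache, and Mirage as showing a statistically significant CT-start advantage at the 5% level, while the rest do not. Note that its p-values are computed but never added to the table, and that the test overstates the evidence: it treats CT-start wins and losses as two independent samples, which inflates the Z-statistic by a factor of √2 relative to a one-sample test against 50%. With the one-sample test, only Cobblestone (z ≈ 2.41) and Inferno (z ≈ 2.81) clear the 5% threshold; Cache (z ≈ 1.60) and Mirage (z ≈ 1.57) do not.
Inferno's CT-start win rate jumped almost immediately after its 2017 rework, once teams had worked out new strategies as mentioned previously.
An early hypothesis is that rounds 1 and 2 of the match (the pistol round and the round after it) heavily favor the CT side because of the longer-range engagements on these maps.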
This prediction could be supported by the fact that Cobblestone is such an outlier here, with its engagements being some of the longest in the game by distance.
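To gauge how much weight the CT-start win rates in the table can bear, a confidence interval is a useful complement to a yes/no significance label. The sketch below computes a 95% Wilson score interval from the counts in the table, using only the standard library. `wilson_interval` is a hypothetical helper, and the choice of the Wilson interval is ours, not something the notebook uses.

```python
import math
from statistics import NormalDist


def wilson_interval(wins: int, n: int, confidence: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    p_hat = wins / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# (CT start wins, total games) copied from the table above.
ct_start = {
    "Cobblestone": (1828, 3513),
    "Inferno": (3864, 7485),
    "Cache": (2361, 4613),
    "Mirage": (4585, 9021),
}
for name, (wins, n) in ct_start.items():
    lo, hi = wilson_interval(wins, n)
    verdict = "excludes" if lo > 0.5 or hi < 0.5 else "includes"
    print(f"{name}: [{lo:.4f}, {hi:.4f}] {verdict} 50%")
```

An interval that includes 50% means the data are compatible with no starting-side effect on that map, however suggestive the point estimate looks.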
C3: Graphing Pro Player ADR vs Rating¶
Average damage per round (ADR) is a pretty handy way of determining the impact an individual player had on the outcome of the match.
There are many ways of doing damage to your opponents in CS:GO: direct gun damage, grenade damage, and so on. You might lose an engagement to someone you never even saw, yet still have damaged them with a well-placed frag grenade or Molotov around a corner, giving your surviving teammates an easier time taking them down.
Rating is a much more involved calculation of player impact in a match which you can read about here: Introducing Rating 2.0.
As of the data collection date, HLTV.org computed Rating from a kill rating, survival rating, KAST rating, impact rating, and damage rating.
In simpler terms, a rating of 1.00 is average. Anything above it is above-average performance, and anything below it is below average.
We want to try and see if there is a correlation between these two metrics of player performance.
plt.figure(figsize=(10,6))
plt.scatter(players_df['adr'], players_df['rating'], alpha=0.6, s=0.3)
plt.title("Pro Player ADR vs Rating")
plt.xlabel("ADR (Average Damage per Round)")
plt.ylabel("Rating")
plt.grid(True)
plt.show()
correlation = players_df['adr'].corr(players_df['rating'])
display(f"Pearson correlation of ADR vs Rating: {correlation:.3f}")
'Pearson correlation of ADR vs Rating: 0.880'
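With the sheer number of data points (over 300,000; pandas skips any rows where ADR or rating is missing), we can conclude with confidence that there is a strong correlation between HLTV's in-depth Rating system and the basic ADR number.
In fact, the correlation is strong enough that ADR alone looks like a reasonable quick proxy for player performance. Part of this is by construction: damage rating is one of the components of Rating, so the two metrics are not independent measurements.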
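For readers who want to see what the correlation call computes, here is a minimal standard-library sketch of Pearson's r that skips pairs with a missing value. `pearson` is a hypothetical helper. The ten sample pairs are the side-specific values (adr_ct with rating_ct, and adr_t with rating_t) from the five 2020 rows in the players_df preview, not the overall adr and rating columns used in the notebook, so the result will not match 0.880.

```python
import math


def pearson(xs, ys):
    """Pearson r over pairs where both values are present (NaN pairs are skipped)."""
    pairs = [(x, y) for x, y in zip(xs, ys)
             if not (math.isnan(x) or math.isnan(y))]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y)


# Side-specific (ADR, rating) pairs from the preview, plus one all-NaN row
# like the 2015 entries at the bottom of the table.
adr = [81.6, 77.4, 96.6, 74.0, 96.3, 77.5, 63.9, 64.2, 37.8, 61.0, float("nan")]
rating = [1.10, 1.16, 1.39, 1.11, 1.36, 0.97, 0.73, 0.86, 0.43, 0.87, float("nan")]
print(f"r = {pearson(adr, rating):.3f}")
```

Ten rows from a single match are far too few to draw conclusions from; the point is only to make the formula and the handling of missing values concrete.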
Predicting Map Picks with Machine Learning¶
At the start of a pro CS:GO match, the teams go through a short veto process. At any given time, 7 maps are usually in the "active pool", which the developer Valve adjusts at its discretion, taking maps out or putting them back in to make changes and keep the game from going stale.
From this pool, each team bans two maps that it does not want to play and picks one of its choice. This leaves one leftover map. Usually in a best-of-3 match, the 2 picked maps and then the leftover are played until one team wins.
This leads to an interesting observation. With two bans, you would ban maps that your team struggles on or that the opposing team dominates, and you would pick the map you are best at. The leftover map, by contrast, reflects mutual indifference: neither team cared enough to ban or pick it.
So, can we predict what this leftover map will be from the teams and their bans and picks alone? Let's see. We will try a Random Forest classifier.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, ConfusionMatrixDisplay, confusion_matrix
import numpy as np
# drop the target (left_over) plus the date and ID columns; the rest are features
X = picks_df.drop(columns=['left_over', 'date', 'match_id', 'event_id'])
# one hot encoding
X = pd.get_dummies(X)
y = picks_df['left_over']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
model = RandomForestClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# get the top feature importances
importances = model.feature_importances_
indices = np.argsort(importances)[-10:]
top_feats = X.columns[indices]
top_vals = importances[indices]
# plot it
plt.figure(figsize=(10, 6))
plt.barh(top_feats, top_vals)
plt.xlabel("Feature Importance")
plt.title("Top 10 Feature Importances")
plt.tight_layout()
plt.show()
# show accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Random forest accuracy: {accuracy:.2%}")
Random forest accuracy: 74.31%
To interpret the plot, read the y-axis labels as follows: t1 and t2 are the teams that pick first and second, and removed_1 and removed_2 are a team's first and second bans. The map name at the end of each label is self-explanatory.
In conclusion, our model performs reasonably well, with an accuracy of 74.31% on the held-out 25% of the data. Interestingly, Mirage features heavily in this list. This is probably due to Mirage's status as a staple map: most teams enjoy playing it and are good at it, so if it is banned 4th, we can probably infer the leftover map with good accuracy.
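One point worth keeping in mind when judging that accuracy: in a seven-map pool, the six maps named in a veto row leave exactly one map unnamed, so the leftover can in principle be read off by elimination if the pool is known. The sketch below illustrates this with two rows from the picks_df preview. The pools are an assumption inferred from those same rows (the seven maps each row mentions), `leftover_by_elimination` is a hypothetical helper, and this is an observation about the veto format rather than a claim about what the Random Forest learned.

```python
# Assumed active pools, inferred from the seven maps named in each sample row.
POOL_2020 = {"Dust2", "Inferno", "Mirage", "Nuke", "Overpass", "Train", "Vertigo"}
POOL_2016 = {"Cache", "Cobblestone", "Dust2", "Inferno", "Mirage", "Overpass", "Train"}


def leftover_by_elimination(pool, vetoed_or_picked):
    """Return the single unnamed map, or None if the row does not pin it down."""
    named = {m for m in vetoed_or_picked if isinstance(m, str)}  # skip 0.0 placeholders
    remaining = set(pool) - named
    return next(iter(remaining)) if len(remaining) == 1 else None


# Ban and pick columns of row 0 (best-of-3, 2020) and row 16030 (best-of-1, 2016).
row_2020 = ["Vertigo", "Train", 0.0, "Nuke", "Overpass", 0.0, "Dust2", "Inferno"]
row_2016 = ["Dust2", "Cobblestone", "Mirage", "Cache", "Inferno", "Overpass", 0.0, 0.0]

print(leftover_by_elimination(POOL_2020, row_2020))
print(leftover_by_elimination(POOL_2016, row_2016))
```

The model is never told the active pool for a given date, which may be part of why it falls short of a rule like this; comparing against such a baseline would put the 74.31% figure in context.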
Conclusion¶
Our group set out to understand what makes CS:GO matches competitive and what factors help predict outcomes in a game. Since not everyone knows CS:GO the way a pro gamer does, we started by explaining the basics: how the Counter-Terrorist and Terrorist sides work, what teams are trying to accomplish each round, and how the in-game economy functions. This quick briefing on the game helps readers follow the findings below.
Key Findings¶
Our statistical analysis revealed significant map-specific imbalances in CS:GO. The Z-tests showed that Nuke and Train have the largest Counter-Terrorist advantages, with CT round win rates of 54.90% and 54.23% (p < 0.001). This confirms the community's view of these maps as defensive. On the other hand, Cache showed a clear trend toward a 50/50 split during 2018-2019, suggesting map balancing efforts by Valve.
Working with the data also provided valuable insights into predictive modeling in esports. Our Random Forest classifier achieved 74.31% accuracy in predicting the leftover map from the ban/pick phase, indicating that draft strategies contain substantial information about team intentions and capabilities. The strong correlation (r = 0.880) between Average Damage per Round and HLTV Rating supports ADR as a reliable performance metric for player evaluation.
Implications for the CS:GO Community¶
These findings have practical applications for teams, analysts, and tournament organizers. Teams can leverage map-specific win rate data, for example by prioritizing CT-favored maps when they excel on defense. The meta trends suggest that analysts should pay closer attention to early bans and picks as indicators of team strategy and confidence.
Limitations and Future Work¶
Several limitations constrain our conclusions. The economic analysis remains incomplete, representing a significant gap given the importance of in-game economy in CS:GO strategy. Our dataset's temporal scope may not capture recent meta shifts, and we focused primarily on tier-one professional matches, potentially limiting generalizability to other competitive levels.
Future research should integrate round-by-round economic data to build more comprehensive predictive models. Additionally, investigating how teams might adapt their strategies based on these statistical insights could bridge the gap between data science and practical competitive application. Expanding the analysis to include communication patterns and in-game decision-making would provide even richer insights into what drives success in professional CS:GO.
This tutorial demonstrates how data science techniques can illuminate the strategic depth of competitive gaming, providing both statistical rigor and practical insights for the esports community.