Kunal Singh (Senior Data Scientist at StarX), Rustin Domingos (VP of Sports Analytics at StarX) and Michelle Ritter (CEO/Founder at StarX)
As soon as early January hits and you come to terms with the fact that your favorite NFL team will not be playing in the playoffs, Mock Draft Season arrives. Between ESPN, The Athletic, The Ringer, PFF, and seemingly hundreds of other sources, it feels like analysts are updating their mock drafts and big boards each and every day. A high percentage of those mock drafts are driven by film evaluations and occasionally have some data points mixed in — especially after players test at the combine. At StarX, we’ve dedicated ourselves to the development of machine learning-based projections for college football players, something that virtually doesn’t exist in the public domain.
Given that our team has experience with MLB and NFL teams, we understand that data sources are the least of a club’s concern. Between official data from the NCAA, granular play level stats and grades from PFF, and a highly robust dataset of historical internal scouting reports — that contained overall evaluations as well as grades on specific attributes — we were spoiled.
We are thrilled to have a data partnership with Pro Football Focus (PFF). They provide us with dense play-by-play data with a row for every player on every play (22 rows per play). Columns include stats that go beyond the box score: time to pressure, tackles avoided, quarterback pressure allowed, first contact by a defender, and more. In this post we highlight how we leverage this data to build predictive models of how college football players will fare in the NFL.
To help our evaluation process, we use public mock drafts and big boards for expert opinion. We’ve scraped mock draft sources that have data going back to the 2015 draft class and all of them give meaningful lift to our projections. Those sources include Walter Football, Kevin Hanson, ESPN, and MyNFLDraft. We are looking to continually add more sources over time as they are an important part of our models.
Lastly, we pull data from Spotrac to provide salary information about NFL players that include both cash earnings as well as cap numbers.
One of the biggest issues when blending multiple data sources is figuring out how to join them. When these sources use birth names rather than nicknames, write out college names differently, and project the same players at different positions, it leads to a headache for us.
Here’s an example — how #3 pick in the 2023 NFL Draft Will Anderson Jr. is labeled across all our data sources:
By our tally, that’s 3 different spellings of his name, 2 different colleges, and every source labeling his projected position differently. Positions and colleges are easy to clean up, although slightly tedious, because there is a finite number of options. Names are where we need to bring in the fuzzywuzzy package, which uses Levenshtein Distance to score the difference between two sequences behind a simple interface. An example from their website with two strings and the similarity between two strings:
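To make the idea of an edit-distance score concrete, here is a minimal pure-Python sketch of Levenshtein distance and a 0 to 100 similarity derived from it. The function names and the normalization by the longer string are our own illustrative assumptions; fuzzywuzzy normalizes its ratio differently, so its scores will not match these exactly.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes, or substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Illustrative 0-100 score: 100 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


print(levenshtein("Will Anderson", "Will Anderson Jr."))
print(round(similarity("Will Anderson", "Will Anderson Jr."), 1))
```

The only edits separating the two spellings are the four inserted characters of " Jr.", which is why near-duplicate names score high while unrelated names do not.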
We wrote a function match_name that utilizes the ratio above and matches players across the datasets mentioned above.
args — tuple with 3 objects: name1, ds2, min_score
name1 — the name to be matched
ds2 — a list that contains all names in a different dataset to match name1 against
min_score — minimum threshold to categorize a “match”, in our case we use 90
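A rough sketch of what a function with this signature could look like follows. It is not our production code: difflib from the standard library stands in for fuzzywuzzy's scorer so the example runs without extra installs, the helper name score is hypothetical, and every candidate string other than the Will Anderson entry (including the "KYUN" code) is made up for illustration.

```python
from difflib import SequenceMatcher


def score(a: str, b: str) -> int:
    """0-100 similarity; difflib is a stand-in for fuzzywuzzy's ratio here."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def match_name(args):
    """Return (name1, matches), where matches are (candidate, score) pairs
    at or above min_score, best first."""
    # A single tuple argument is convenient when mapping over many names.
    name1, ds2, min_score = args
    matches = [(candidate, score(name1, candidate)) for candidate in ds2]
    matches = [m for m in matches if m[1] >= min_score]
    matches.sort(key=lambda m: m[1], reverse=True)
    return name1, matches


# Hypothetical candidate list from a second dataset.
ds2 = [
    "Will Anderson Jr. DL 2022 ALUN",
    "Will Levis QB 2022 KYUN",
    "Bryce Young QB 2022 ALUN",
]

name, matches = match_name(("Will Anderson DL 2022 ALUN", ds2, 90))
best = matches[0] if matches else None  # highest-scoring match, if any
print(name, "->", best)
```

Appending position, year, and school to the name before scoring, as in the strings above, helps keep two different players with similar names from clearing the threshold.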
Players can have multiple matches, so we have another function that selects the highest score. This method is computationally expensive, so to speed it up we recommend the multiprocessing package coupled with the tqdm package, which adds progress bars to your output. Multiprocessing is a game changer for anything that loops through thousands (in our case tens of thousands) of rows.
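One way the parallel step might be wired together is sketched below. Everything here is illustrative: the CANDIDATES list, best_match, and match_all are hypothetical names, difflib again stands in for the fuzzywuzzy scorer, and tqdm is treated as optional so the sketch still runs if it is not installed.

```python
from difflib import SequenceMatcher
from multiprocessing import Pool

try:
    from tqdm import tqdm  # progress bar, if installed
except ImportError:  # keep the sketch runnable without tqdm
    def tqdm(iterable, **kwargs):
        return iterable

# Hypothetical candidate names from the dataset we are matching against.
CANDIDATES = [
    "Will Anderson Jr. DL 2022 ALUN",
    "Bryce Young QB 2022 ALUN",
]


def best_match(name, min_score=90):
    """Return (name, best_candidate, score), or (name, None, 0) below min_score."""
    scored = [
        (c, round(100 * SequenceMatcher(None, name, c).ratio()))
        for c in CANDIDATES
    ]
    candidate, top = max(scored, key=lambda pair: pair[1])
    return (name, candidate, top) if top >= min_score else (name, None, 0)


def match_all(names, processes=2):
    """Score every name in parallel worker processes."""
    with Pool(processes=processes) as pool:
        # imap yields results lazily in input order, so tqdm can tick per row.
        results = pool.imap(best_match, names, chunksize=100)
        return list(tqdm(results, total=len(names)))


if __name__ == "__main__":
    names = ["Will Anderson DL 2022 ALUN", "Unknown Player WR 2022 XXXX"]
    for row in match_all(names):
        print(row)
```

The worker function lives at module level so it can be pickled and sent to the worker processes, and the main guard keeps those workers from re-running the script body on platforms that spawn rather than fork.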
Finally, here is a quick example of how this process actually gets used. As mentioned above, ds2 is our list of potential match candidates, while “Will Anderson DL 2022 ALUN” is the string we actually want to match. The function process.extract() creates a score using the Levenshtein Distance for each potential match, and “Will Anderson Jr. DL 2022 ALUN” scores 95/100; the “Jr.” in his name is the only difference.
When working for an NFL team, front offices only care about a college player’s value for three days out of the year: the NFL draft. From a data and modeling perspective, that makes life relatively simple: make sure your predictions are ready before draft meetings on April 1st. Here at StarX, we want to be able to provide projections for players at any point throughout the year, whether in the middle of the season or right after the combine. Therefore, all datasets need to be ready for modeling when given a date parameter, which makes them slightly more strenuous to build.
Here is a glimpse into how our mock drafts are stored internally, and we update the mock drafts weekly. As previously mentioned, we want information to be dynamic and be able to value players at any given point, so we create rolling statistics based on mock data. Below is an example of calculating the rolling percentage of mock drafts that a player shows up in. The rolling function where you specify the window works great.
rolling_mock_percent = player_mocks['in_mock'].rolling(window=len(player_mocks), min_periods=1).sum().div(range(1, len(player_mocks)+1))
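For readers who want to try this on toy data, the sketch below builds a small, made-up player_mocks frame (the dates and in_mock values are assumptions) and shows that the calculation is equivalent to an expanding mean, which may be an easier way to read it.

```python
import numpy as np
import pandas as pd

# Hypothetical weekly log: 1 if the player appeared in that week's mock draft.
player_mocks = pd.DataFrame({
    "mock_date": pd.to_datetime(
        ["2023-10-02", "2023-10-09", "2023-10-16", "2023-10-23"]
    ),
    "in_mock": [1, 0, 1, 1],
}).sort_values("mock_date")

# Share of all mocks to date that include the player: an expanding mean.
player_mocks["rolling_mock_percent"] = (
    player_mocks["in_mock"].expanding(min_periods=1).mean()
)

# Same numbers via a full-length rolling sum divided by the mock count so far.
n = len(player_mocks)
check = (
    player_mocks["in_mock"].rolling(window=n, min_periods=1).sum()
    / np.arange(1, n + 1)
)
assert np.allclose(player_mocks["rolling_mock_percent"], check)

print(player_mocks)
```

Because each row only looks at mocks on or before its own date, the statistic can be recomputed for any date parameter without peeking ahead.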
The trickier problem is joining our rolling stats with the mock draft data. These outside sources don’t have a “game_id” or “game_date” we can match on, since most mock drafts aren’t updated the same day as a game but rather during the week, after analysts get another week of film. Instead of a traditional left or inner join, we use pd.merge_asof(). This type of merge is similar to a left join except that it matches on the nearest key rather than requiring an exact match, which makes it super useful for joining time-series data like ours. Both frames must be sorted on their key columns before merging. One more key piece is the final input, direction='backward', which ensures we only match the most recent mock draft on or before our game_date. This is the default, but we state it explicitly: with direction='nearest' instead, a mock draft published after a game could be attached to that game whenever it happens to be the closest match, leaking future information into the row.
player_stats = pd.merge_asof(rolling_player_stat_dataset, rolling_mock_draft_dataset, left_on='game_date', right_on='mock_date_hanson', direction='backward').sort_values('game_date')
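To see what the direction argument changes, here is a small self-contained sketch. The two frames, their column names, and all values are invented for illustration; only pd.merge_asof and its direction options come from pandas itself.

```python
import pandas as pd

# Hypothetical rolling game stats for one player.
games = pd.DataFrame({
    "game_date": pd.to_datetime(["2023-11-04", "2023-11-18"]),
    "rolling_yards": [310.0, 325.5],
})

# Hypothetical mock draft updates, including one published after the combine.
mocks = pd.DataFrame({
    "mock_date": pd.to_datetime(["2023-10-30", "2023-11-20", "2024-03-05"]),
    "mock_pick": [5, 3, 1],
})

# merge_asof requires both frames to be sorted on their key columns.
games = games.sort_values("game_date")
mocks = mocks.sort_values("mock_date")

# backward: latest mock on or before each game date (no look-ahead).
backward = pd.merge_asof(
    games, mocks, left_on="game_date", right_on="mock_date",
    direction="backward",
)

# nearest: closest mock in either direction, which can pull in a later mock.
nearest = pd.merge_asof(
    games, mocks, left_on="game_date", right_on="mock_date",
    direction="nearest",
)

print(backward[["game_date", "mock_date", "mock_pick"]])
print(nearest[["game_date", "mock_date", "mock_pick"]])
```

In this toy data the November 18 game sits two days before a mock update, so the nearest match attaches a mock that did not exist on game day, while the backward match falls back to the October 30 mock.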
From a modeling perspective, our first step is to make sure there is no data leakage. As previously mentioned, our dataset doesn’t simply contain one row per prospect, but many rows across multiple seasons. We pass in a max_date input whenever we update predictions and make sure the training set includes no data past that point. If we want to make projections on this draft class as of March 1st, no historical prospect can have information included past March 1st of his own draft year, which is easier said than done.
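A minimal sketch of such a cutoff filter is below. It assumes one particular reading of the rule, namely that each prospect is cut off at the same month and day within his own draft year; the column names event_date and draft_year, the function filter_to_cutoff, and the sample rows are all hypothetical.

```python
import pandas as pd


def filter_to_cutoff(df: pd.DataFrame, max_date) -> pd.DataFrame:
    """Keep rows dated on or before max_date's month/day in each
    prospect's own draft year (assumed interpretation of the cutoff)."""
    max_date = pd.Timestamp(max_date)
    # Note: a Feb 29 cutoff would need special handling for non-leap years.
    cutoffs = pd.to_datetime(pd.DataFrame({
        "year": df["draft_year"],
        "month": max_date.month,
        "day": max_date.day,
    }))
    return df[df["event_date"] <= cutoffs]


# Hypothetical rows: one historical prospect and one current prospect.
rows = pd.DataFrame({
    "player": ["A", "A", "A", "B", "B"],
    "draft_year": [2020, 2020, 2020, 2024, 2024],
    "event_date": pd.to_datetime([
        "2019-11-02", "2020-02-28", "2020-04-01",  # A: game, combine, late mock
        "2023-11-04", "2024-03-02",                # B: game, post-cutoff mock
    ]),
})

train_ready = filter_to_cutoff(rows, "2024-03-01")
print(train_ready)
```

Applying the same relative cutoff to every draft class keeps historical prospects from carrying late-cycle information that the current class does not have yet.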
In terms of the modeling itself, we’ve found that XGBRegressor() has been the most accurate model for us. We tune it with RandomizedSearchCV(), with cross-validation folds grouped by season, and we pass in a comprehensive set of parameters to test.
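The setup might look roughly like the sketch below. The data is synthetic, the parameter grid is a small illustrative one rather than the set we actually search, and using scikit-learn's GroupKFold for the season grouping is an assumption about one reasonable way to do it. If xgboost is not installed, the sketch falls back to scikit-learn's GradientBoostingRegressor, which accepts the same parameter names used here.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, RandomizedSearchCV

try:
    from xgboost import XGBRegressor as Regressor
except ImportError:  # fallback shares the parameter names used below
    from sklearn.ensemble import GradientBoostingRegressor as Regressor

# Synthetic stand-in data: 5 features, a noisy target, and a season per row.
rng = np.random.default_rng(0)
n_rows = 240
X = rng.normal(size=(n_rows, 5))
y = 0.5 * X[:, 0] + rng.normal(scale=0.1, size=n_rows)
seasons = rng.integers(2015, 2021, size=n_rows)  # seasons 2015 through 2020

# Small illustrative search space (not the authors' actual grid).
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [2, 3, 4],
    "learning_rate": [0.03, 0.1, 0.3],
    "subsample": [0.7, 0.85, 1.0],
}

# GroupKFold keeps every row from a season in the same fold.
cv = GroupKFold(n_splits=3)

search = RandomizedSearchCV(
    Regressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,
    cv=cv,
    scoring="neg_mean_absolute_error",
    random_state=0,
)
search.fit(X, y, groups=seasons)  # groups are forwarded to the splitter

print(search.best_params_)
```

Grouping folds by season means a model is never validated on rows from a season it also trained on, which complements the max_date check above.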
Our target variable is a player’s cap hit as a percentage of that year’s salary cap, x years out from when he is drafted. This normalizes for the salary cap inflating every year and provides a more stable metric than inflation-adjusted cash earnings by year, which can get pretty noisy in the NFL.
Let’s use Will Anderson Jr. again. When signing a rookie contract, the player is paid based on where he is drafted; it isn’t an open market with negotiations like free agency. Anderson Jr. signed a 4-year, $35.2M contract. Upon signing, $22.6M of that contract is in bonus form and paid right away, on top of a year-1 base salary of just $750K. His yearly cash then plummets over the rest of his rookie contract, since most of his income came in year 1: he makes $2.35M, $3.95M, and $5.5M in the following years. His cap hit (relative to the yearly salary cap), however, is a lot smoother, and it is easy to translate that number back into dollar earnings.
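The arithmetic behind that smoothing can be sketched as follows. The contract figures come from the paragraph above; treating the later-year cash amounts as base salaries, spreading the bonus evenly over the four contract years, and the rounded salary cap values are all simplifying assumptions for illustration, not official figures.

```python
# Contract figures from the text, in millions of dollars.
SIGNING_BONUS = 22.6
# Assumption: later-year cash is treated entirely as base salary.
BASE_BY_YEAR = [0.75, 2.35, 3.95, 5.5]
# Rounded, illustrative cap values (an assumption, not official figures).
SALARY_CAP_BY_YEAR = [225.0, 255.0, 280.0, 300.0]


def cash_by_year(signing_bonus, base_salaries):
    """Cash paid each year: the whole bonus lands in year 1."""
    return [signing_bonus + base_salaries[0]] + list(base_salaries[1:])


def cap_hits(signing_bonus, base_salaries):
    """Cap hit each year: bonus spread evenly across the contract years."""
    prorated = signing_bonus / len(base_salaries)
    return [prorated + base for base in base_salaries]


def pct_of_cap(hits, caps):
    """Model target: cap hit as a fraction of that year's salary cap."""
    return [hit / cap for hit, cap in zip(hits, caps)]


cash = cash_by_year(SIGNING_BONUS, BASE_BY_YEAR)
hits = cap_hits(SIGNING_BONUS, BASE_BY_YEAR)
target = pct_of_cap(hits, SALARY_CAP_BY_YEAR)

for year, (c, h, t) in enumerate(zip(cash, hits, target), start=1):
    print(f"year {year}: cash ${c:.2f}M, cap hit ${h:.2f}M, {t:.2%} of cap")

# Translating a predicted percentage back into dollars is one multiplication.
dollars_back = [t * cap for t, cap in zip(target, SALARY_CAP_BY_YEAR)]
```

Under these assumptions the cash column drops sharply after year 1 while the cap hit rises gradually, which is the stability we want from a regression target.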
Here is how we see the consensus top three WRs for this upcoming draft class. Marvin Harrison Jr. has been touted as a generational WR prospect — and we see it the same way. In the last 5 draft classes, he is the highest valued WR prospect we have seen. Given how he looks on tape, it wouldn’t have been shocking if he tested extremely well at the combine which would’ve further boosted his projection.
The more interesting argument is Odunze vs. Nabers, where we have Nabers a tick higher.
Here are some of the most important WR-related metrics that show up in our projections. 8 of the 25 are related to mock drafts, and 8 of the 25 are related to having attended the combine — both logical. If you show up in mock drafts, you’re likely a higher rated prospect coming out of college, and the same logic applies to the players invited to the combine. Nabers put up some incredible numbers in yards after contact per reception, first downs gained per pass route snap, and tackles avoided per reception which are sticky stats when projecting WRs into the NFL. Although Odunze tested well at the combine, we see Nabers as a better prospect because of his outstanding production, specifically after the catch.
Another topic of discussion is this year’s offensive tackle class. Similarly to the WR class, we see a consensus #1 prospect followed by a lot of discussion about a few really talented players at the position.
We are slightly higher on Tyler Guyton than a consensus big board might be. Although the mock draft data is slightly better for some of the other prospects on this list, we lean Guyton because of how he tested and his production, with one key feature being his lack of post-snap penalties in his last season. The table below showcases one of our closest comps for Guyton coming out of college, Giants starting LT Andrew Thomas.
As you can see, they have extremely similar testing numbers as well as similar production: both had very few penalties (virtually none) and low quarterback pressure rates allowed. Andrew Thomas recently signed a 5-year, $117.5M extension with the Giants.
At the corner position, we find a unique example. Since we predict players’ salaries from year to year, there are times within a position group where we see a player as the top of the class during the first four years, then dropping below other players in later years. This could be interpreted as an overvalued player in the draft. If we could predict salaries perfectly over the first four years, we would essentially be predicting the draft order, given that the rookie wage scale has a set payout for each pick. A key subtlety is that a higher ranking in the first year indicates our model’s prediction that a player will be selected early in the draft, which doesn’t necessarily imply that the player ought to be chosen that early. Conversely, a higher ranking in the sixth year reflects the player’s long-term value, suggesting that such a player could be an undervalued pick in the draft. This means that our stack in year 1 is not always how we would see it in year 6.
In our example here, we see the order as Wiggins, Mitchell, Arnold, DeJean over their rookie contracts. DeJean has pretty consistently been mocked as the fourth corner throughout the season, and we see him as a late 1st/early 2nd round pick. However, in year 6, where the market is completely open (the 5th-year option can still be relatively cost-controlled for 1st round draft picks), we see a different order than before. DeJean takes the top spot by a small margin over Mitchell and Arnold, and Wiggins falls behind the pack. Since the 2020 draft class, here are our highest year-6 salary predictions (adjusted relative to percentage of the salary cap): Pat Surtain II, Derek Stingley Jr., Sauce Gardner, Cooper DeJean, and Jeff Okudah.
As we continue to refine our models and incorporate broader datasets, the potential to accurately forecast a player’s NFL trajectory and financial outlook becomes increasingly within reach. We hope you found this interesting and would love to hear about any similar projects you have worked on. Please feel free to reach out!