NFL Player Performance Prediction
Full report and implementation details are available at:
Tools & Techniques: Python, sklearn, XGBoost, Random Forest, NLP (VADER), Reddit API, PCA, K-Means
Overview
In this project, we developed a predictive framework to forecast NFL player performance using both in-game statistics and public sentiment. Our target metric was Average Expected Points Added (Avg_EPA), enabling consistent evaluation across multiple positions (QB, RB, WR/TE).
We explored a variety of models:
- Linear Models (OLS, Ridge, Lasso, Best Subset)
- Generalized Additive Models (GAM)
- Random Forest Regressors
- XGBoost with the DART booster
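
A minimal sketch of how a comparison across these model families could be set up with scikit-learn. The synthetic data, the hyperparameters, and the `compare_models` helper are illustrative assumptions, not the project's actual configuration. Best Subset and GAM are omitted because scikit-learn has no built-in estimator for them, and XGBoost is added only if the package is installed.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score


def compare_models(X, y, cv=5, seed=0):
    """Return the mean cross-validated R^2 for each candidate model."""
    models = {
        "ols": LinearRegression(),
        "ridge": Ridge(alpha=1.0),
        "lasso": Lasso(alpha=0.01),
        "random_forest": RandomForestRegressor(n_estimators=100, random_state=seed),
    }
    try:
        # Optional: XGBoost with the DART booster, if xgboost is available.
        from xgboost import XGBRegressor

        models["xgboost_dart"] = XGBRegressor(
            booster="dart", n_estimators=50, random_state=seed
        )
    except ImportError:
        pass

    return {
        name: float(cross_val_score(model, X, y, cv=cv, scoring="r2").mean())
        for name, model in models.items()
    }


# Synthetic stand-in for the real feature matrix and Avg_EPA target (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -0.5, 0.0, 0.25, 0.0]) + rng.normal(scale=0.1, size=200)

scores = compare_models(X, y)
for name, r2 in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: R^2 = {r2:.3f}")
```

Scoring every model with the same cross-validation folds and the same metric keeps the comparison fair across model families.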
To examine the psychological side of player performance, we ran VADER sentiment analysis on 688,000+ Reddit posts and fed the weekly compound sentiment scores into our models as features. Although we hypothesized that sentiment would affect performance, it proved statistically insignificant in our models.
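
The weekly aggregation step can be sketched as follows. This is a simplified illustration, not the project's pipeline: the `(player, date, text)` post format, the use of ISO calendar weeks in place of NFL game weeks, and the helper names are all assumptions. The scorer is passed in as a function so the example runs without extra dependencies; `vader_scorer` shows how the real VADER compound score could be plugged in if the `vaderSentiment` package is installed.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def weekly_compound_sentiment(posts, score_fn):
    """Average a per-post sentiment score by (player, ISO year, ISO week).

    posts: iterable of (player, posted_on, text) tuples (assumed format).
    score_fn: callable mapping text to a score in [-1, 1].
    """
    buckets = defaultdict(list)
    for player, posted_on, text in posts:
        iso = posted_on.isocalendar()  # (year, week, weekday)
        buckets[(player, iso[0], iso[1])].append(score_fn(text))
    return {key: mean(values) for key, values in buckets.items()}


def vader_scorer():
    """Build a scorer backed by VADER's compound score (requires vaderSentiment)."""
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()
    return lambda text: analyzer.polarity_scores(text)["compound"]


# Toy scorer and hypothetical posts, used only to demonstrate the aggregation.
def toy_scorer(text):
    return 0.5 if "great" in text else -0.5


posts = [
    ("Player A", date(2022, 9, 12), "great game"),
    ("Player A", date(2022, 9, 14), "rough outing"),
    ("Player A", date(2022, 9, 19), "great comeback"),
]
weekly = weekly_compound_sentiment(posts, toy_scorer)
```

Swapping `toy_scorer` for `vader_scorer()` leaves the aggregation unchanged, and the resulting weekly scores can be joined to the game-level features by player and week.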
Key Findings
- XGBoost outperformed all other models, achieving R² ≈ 0.78 for quarterback performance.
- Rolling average EPA and in-game stats like passing yards, completions, and interceptions were among the most important predictors.
- Contrary to expectations, online public sentiment had minimal predictive value.
- Player performance became more stable over time, which suggests that early volatility may stem from inexperience or injury.
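
The rolling average EPA named among the top predictors above could be built along these lines. The column names (`player_id`, `week`, `avg_epa`), the window size, and the one-game shift are assumptions for illustration. The shift ensures each row sees only earlier games, which avoids leaking the target into its own feature.

```python
import pandas as pd


def add_rolling_epa(df, window=3):
    """Add a per-player rolling mean of Avg_EPA over the previous `window` games."""
    df = df.sort_values(["player_id", "week"]).copy()
    df["rolling_epa"] = df.groupby("player_id")["avg_epa"].transform(
        # shift(1) drops the current game so only past games are averaged
        lambda s: s.shift(1).rolling(window, min_periods=1).mean()
    )
    return df


# Hypothetical game-level data for two players.
games = pd.DataFrame(
    {
        "player_id": ["A", "A", "A", "A", "B", "B"],
        "week": [1, 2, 3, 4, 1, 2],
        "avg_epa": [0.1, 0.3, 0.5, 0.7, -0.2, 0.4],
    }
)
features = add_rolling_epa(games, window=2)
print(features)
```

A player's first game has no history, so its `rolling_epa` is missing and needs to be imputed or dropped before modeling.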
Data Sources
- nflFastR datasets (roster, play-by-play, next-gen stats)
- Reddit submission data (2018–2023, Academic Torrents)
- Web scraping from NFL.com (limited by infinite scroll)
Modeling Techniques
- Data preprocessing with one-hot encoding and rolling EPA features
- Dimensionality reduction (PCA) + clustering (K-Means)
- Sentiment integration pipeline: VADER → Aggregation → Modeling
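
The preprocessing, PCA, and K-Means steps listed above can be chained as in the sketch below. It assumes scikit-learn and uses synthetic data; the column names, the number of components, the number of clusters, and the decision to put everything in a single pipeline are illustrative choices rather than the project's actual settings.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def cluster_players(df, numeric_cols, categorical_cols,
                    n_components=2, n_clusters=3, seed=0):
    """One-hot encode and scale features, reduce with PCA, then cluster with K-Means."""
    preprocess = ColumnTransformer(
        [
            ("num", StandardScaler(), numeric_cols),
            ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        ],
        sparse_threshold=0.0,  # always return a dense array, which PCA requires
    )
    pipeline = Pipeline(
        [
            ("preprocess", preprocess),
            ("pca", PCA(n_components=n_components, random_state=seed)),
            ("kmeans", KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)),
        ]
    )
    return pipeline.fit_predict(df)


# Synthetic stand-in for player-level features (assumption).
rng = np.random.default_rng(0)
players = pd.DataFrame(
    {
        "position": rng.choice(["QB", "RB", "WR"], size=60),
        "avg_epa": rng.normal(size=60),
        "yards": rng.normal(loc=50, scale=20, size=60),
    }
)
labels = cluster_players(players, ["avg_epa", "yards"], ["position"])
print(pd.Series(labels).value_counts())
```

Scaling the numeric columns before PCA keeps features with large units, such as yards, from dominating the principal components.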
