NFL Player Performance Prediction

Full report and implementation details are available at:

Tools & Techniques: Python, sklearn, XGBoost, Random Forest, NLP (VADER), Reddit API, PCA, K-Means

Overview

In this project, we developed a predictive framework to forecast NFL player performance using both in-game statistics and public sentiment. Our target metric was Average Expected Points Added (Avg_EPA), enabling consistent evaluation across multiple positions (QB, RB, WR/TE).

We explored a variety of models:

  • Linear Models (OLS, Ridge, Lasso, Best Subset)
  • Generalized Additive Models (GAM)
  • Random Forest Regressors
  • XGBoost with the DART booster
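
A comparison across model families like the ones listed above could be set up roughly as follows. This is a minimal sketch, not the project's actual evaluation code: it assumes scikit-learn, uses synthetic data in place of real player-week features, and leaves out Best Subset, GAM, and XGBoost to keep dependencies small. The function name `compare_models` and all hyperparameters are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score


def compare_models(X, y, cv=5, seed=0):
    """Return the mean cross-validated R^2 for a few candidate regressors."""
    # Hyperparameters below are placeholders, not tuned values from the project.
    models = {
        "ols": LinearRegression(),
        "ridge": Ridge(alpha=1.0),
        "lasso": Lasso(alpha=0.01),
        "random_forest": RandomForestRegressor(n_estimators=100, random_state=seed),
    }
    return {
        name: float(cross_val_score(model, X, y, cv=cv, scoring="r2").mean())
        for name, model in models.items()
    }


# Synthetic stand-in for player-week features and Avg_EPA (not real NFL data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
true_coefs = np.array([0.5, -0.3, 0.0, 0.2, 0.0])
y_demo = X_demo @ true_coefs + rng.normal(scale=0.1, size=200)

scores = compare_models(X_demo, y_demo)
for name, r2 in scores.items():
    print(f"{name}: R^2 = {r2:.3f}")
```

Scoring every model with the same folds and the same metric keeps the comparison consistent across positions and model types.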

To examine the psychological side of player performance, we ran sentiment analysis on more than 688,000 Reddit posts, scoring each post with VADER and aggregating the compound scores by week. Although we hypothesized that sentiment would affect performance, it was not statistically significant in our models.
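
The weekly aggregation step described above might look something like the sketch below. It assumes posts have already been matched to a player and an NFL week, which the project may do differently. `make_vader_scorer` relies on the third-party `vaderSentiment` package, while `weekly_sentiment` and the toy scorer are hypothetical names introduced only for illustration.

```python
from collections import defaultdict
from statistics import mean


def make_vader_scorer():
    """Build a text -> compound-score function (needs the vaderSentiment package)."""
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()
    return lambda text: analyzer.polarity_scores(text)["compound"]


def weekly_sentiment(posts, score_fn):
    """Average compound scores per (player, week).

    posts: iterable of (player, week, text) tuples (assumed layout).
    score_fn: callable mapping text to a score in [-1, 1].
    """
    buckets = defaultdict(list)
    for player, week, text in posts:
        buckets[(player, week)].append(score_fn(text))
    return {key: mean(values) for key, values in buckets.items()}


# Toy scorer so the example runs without vaderSentiment installed;
# pass make_vader_scorer() instead to get real VADER scores.
toy_scores = {"great game": 0.5, "solid outing": 0.25, "terrible pick": -0.5}
posts = [
    ("QB1", 1, "great game"),
    ("QB1", 1, "solid outing"),
    ("QB1", 2, "terrible pick"),
]
weekly = weekly_sentiment(posts, toy_scores.get)
print(weekly)  # {('QB1', 1): 0.375, ('QB1', 2): -0.5}
```

Passing the scorer in as an argument keeps the aggregation logic independent of the sentiment library.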

Key Findings

  • XGBoost outperformed all other models, achieving R² ≈ 0.78 for quarterback performance.
  • Rolling average EPA and in-game stats like passing yards, completions, and interceptions were among the most important predictors.
  • Contrary to expectations, online public sentiment had minimal predictive value.
  • Player performance became more stable over time, suggesting that early volatility may stem from inexperience or injury.
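
As an illustration of the rolling average EPA predictor mentioned above, here is one way such a feature could be computed with pandas. The column names (`player_id`, `week`), the window size, and the choice to use only prior weeks (to avoid leaking the current week's target) are assumptions, not details confirmed by the project.

```python
import pandas as pd


def add_rolling_epa(df, window=3):
    """Add a per-player rolling mean of Avg_EPA over the preceding weeks."""
    df = df.sort_values(["player_id", "week"]).copy()
    # shift(1) excludes the current week, so the feature only sees past games.
    df["rolling_epa"] = df.groupby("player_id")["Avg_EPA"].transform(
        lambda s: s.shift(1).rolling(window, min_periods=1).mean()
    )
    return df


# Toy data for illustration only (not real player statistics).
games = pd.DataFrame(
    {
        "player_id": ["A", "A", "A", "A", "B", "B"],
        "week": [1, 2, 3, 4, 1, 2],
        "Avg_EPA": [0.10, 0.30, -0.20, 0.40, 0.00, 0.20],
    }
)
features = add_rolling_epa(games, window=2)
print(features)
```

Each player's first week has no history, so `rolling_epa` is missing there and would need to be imputed or dropped before modeling.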

Data Sources

  • nflFastR datasets (roster, play-by-play, next-gen stats)
  • Reddit submission data (2018–2023, Academic Torrents)
  • Web scraping from NFL.com (limited by infinite scroll)

Modeling Techniques

  • Data preprocessing with one-hot encoding and rolling EPA features
  • Dimensionality reduction (PCA) + clustering (K-Means)
  • Sentiment integration pipeline: VADER → Aggregation → Modeling
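
The encoding, PCA, and K-Means steps listed above can be chained in a single scikit-learn pipeline. The sketch below is one plausible arrangement under stated assumptions: the feature columns, the standardization step, the number of components, and the number of clusters are all illustrative, and the data is synthetic. `cluster_players` is a hypothetical helper, not a function from the project.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def cluster_players(df, n_components=2, n_clusters=3, seed=0):
    """One-hot encode position, standardize, project with PCA, cluster with K-Means."""
    X = pd.get_dummies(df, columns=["position"], dtype=float)
    pipeline = make_pipeline(
        StandardScaler(),  # PCA is scale-sensitive, so standardize first
        PCA(n_components=n_components, random_state=seed),
        KMeans(n_clusters=n_clusters, n_init=10, random_state=seed),
    )
    labels = pipeline.fit_predict(X)
    return labels, pipeline


# Synthetic player rows for illustration only (not real NFL data).
rng = np.random.default_rng(0)
players = pd.DataFrame(
    {
        "position": rng.choice(["QB", "RB", "WR/TE"], size=60),
        "passing_yards": rng.normal(200, 50, size=60),
        "completions": rng.normal(20, 5, size=60),
        "rolling_epa": rng.normal(0, 0.2, size=60),
    }
)
labels, pipeline = cluster_players(players)
print(pd.Series(labels).value_counts().sort_index())
```

Fitting the scaler, PCA, and K-Means as one pipeline ensures the same transformations are reapplied when assigning new players to clusters with `pipeline.predict`.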

[Figure: workflow]