A machine learning system for equity selection that predicts which S&P stocks will outperform the market benchmark. The project combines market data, company fundamentals, and sentiment indicators to construct long-only portfolios through a rigorous ML pipeline: data ingestion → feature engineering → expanding-window cross-validation → portfolio construction → comprehensive backtesting against SPY.

| Metric | Value |
|---|---|
| Sharpe Ratio | 0.93 |
| Annual Return | 20.2% |
| Max Drawdown | -22.9% |
| Alpha vs Random | 1.72% |
| Win Rate | 69.8% |
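
The README does not define how the metrics above are calculated. As a rough sketch, assuming monthly returns, a zero risk-free rate, and the conventional textbook definitions, three of them could be computed like this. The function names are illustrative and are not taken from the repository.

```python
import math
from statistics import mean, stdev

def annualized_sharpe(monthly_returns, periods_per_year=12):
    """Annualized Sharpe ratio (assumption: risk-free rate of zero)."""
    return mean(monthly_returns) / stdev(monthly_returns) * math.sqrt(periods_per_year)

def annualized_return(monthly_returns, periods_per_year=12):
    """Geometric average return, scaled to one year."""
    growth = 1.0
    for r in monthly_returns:
        growth *= 1.0 + r
    return growth ** (periods_per_year / len(monthly_returns)) - 1.0

def max_drawdown(monthly_returns):
    """Largest peak-to-trough decline of the compounded equity curve (<= 0)."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in monthly_returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = min(worst, equity / peak - 1.0)
    return worst
```

With these definitions, a series that gains 10%, loses 50%, then gains 20% has a maximum drawdown of -50%, because the partial recovery never reaches the earlier peak.
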

| Data Source | Features | Purpose |
|---|---|---|
| Market Data (Yahoo Finance) | r12 (12m return), mom121 (momentum), vol3, vol12 (volatility) | Momentum and risk regime signals |
| Fundamentals (SEC EDGAR) | BookToMarket, ROE, ROA, NetMargin, Leverage, Asset Growth, Net Share Issuance | Value, quality, and financial health |
| Sentiment (VIX Index) | VIX percentile (12-month rolling) | Market stress detection |

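
To make the market and sentiment features concrete, here is a minimal sketch of how they might be derived from monthly series. The exact definitions are assumptions, not the repository's code: `r12` is taken as the trailing 12-month return, `mom121` as the conventional "12-1" momentum (return from 12 months ago to 1 month ago), `vol3` and `vol12` as the sample standard deviation of the last 3 and 12 monthly returns, and the VIX percentile as the share of the last 12 monthly readings at or below the latest one. `build_features` and its helpers are hypothetical names.

```python
from statistics import stdev

def trailing_return(prices, months, skip=0):
    """Return over `months` months, ending `skip` months before the last price."""
    end = len(prices) - 1 - skip
    start = end - months
    if start < 0:
        raise ValueError("not enough price history")
    return prices[end] / prices[start] - 1.0

def monthly_returns(prices):
    """Simple month-over-month returns."""
    return [b / a - 1.0 for a, b in zip(prices, prices[1:])]

def volatility(prices, months):
    """Sample standard deviation of the last `months` monthly returns."""
    rets = monthly_returns(prices)[-months:]
    if len(rets) < months:
        raise ValueError("not enough price history")
    return stdev(rets)

def rolling_percentile(values, window=12):
    """Share of the last `window` observations that are <= the latest one."""
    if len(values) < window:
        raise ValueError("not enough observations")
    recent = values[-window:]
    return sum(v <= recent[-1] for v in recent) / window

def build_features(prices, vix):
    """Assemble one feature row from month-end prices and month-end VIX levels."""
    return {
        "r12": trailing_return(prices, 12),
        "mom121": trailing_return(prices, 11, skip=1),
        "vol3": volatility(prices, 3),
        "vol12": volatility(prices, 12),
        "vix_pct": rolling_percentile(vix, 12),
    }
```

Both inputs are plain lists ordered oldest to newest; `prices` needs at least 13 month-end values so that a full 12-month return is available.
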
Why Random Forest won: It maintains a wider range of confidence scores, making it better at ranking stocks. Boosting methods were too aggressive and didn't improve results.
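
Ranking matters because a long-only portfolio only needs to know which stocks score highest, not the exact score values. A minimal sketch of that step, assuming an equal-weighted top-`k` rule (the README does not state the actual selection rule, and `top_k_equal_weight` is a hypothetical helper):

```python
def top_k_equal_weight(scores, k):
    """Pick the k highest-scoring tickers and weight them equally (long-only)."""
    if not 0 < k <= len(scores):
        raise ValueError("k must be between 1 and the number of scored tickers")
    ranked = sorted(scores, key=scores.get, reverse=True)
    return {ticker: 1.0 / k for ticker in ranked[:k]}

# Made-up model scores for four placeholder tickers.
scores = {"AAA": 0.62, "BBB": 0.48, "CCC": 0.71, "DDD": 0.55}
weights = top_k_equal_weight(scores, k=2)
```

When scores are spread over a wide range, small estimation noise is less likely to reorder the top of the list, which is one plausible reading of why a wider range of confidence scores helps ranking.
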
The model uses expanding-window time-series cross-validation: at each step it trains on all data up to a given month and is tested on later data it has never seen. This avoids look-ahead bias, gives an honest out-of-sample estimate, and mimics real-world deployment.
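
The splitting scheme can be sketched in a few lines. This is an illustration under assumptions, not the repository's implementation: `expanding_window_splits`, `min_train`, and `test_size` are hypothetical names, and periods are represented by integer indices in chronological order.

```python
def expanding_window_splits(n_periods, min_train, test_size=1):
    """Yield (train_indices, test_indices) pairs with a training window that only grows."""
    if min_train < 1 or test_size < 1:
        raise ValueError("min_train and test_size must be positive")
    start = min_train
    while start + test_size <= n_periods:
        # Train on everything before `start`, test on the block right after it.
        yield list(range(0, start)), list(range(start, start + test_size))
        start += test_size

splits = list(expanding_window_splits(n_periods=5, min_train=3))
# splits == [([0, 1, 2], [3]), ([0, 1, 2, 3], [4])]
```

Every test index is strictly later than every training index in the same fold, so no future information leaks into training. scikit-learn's `TimeSeriesSplit` provides a similar expanding-window scheme for projects that already depend on it.
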

```bash
git clone https://github.com/pmatorras/financial-ml.git
cd financial-ml
python -m venv .venv
source .venv/bin/activate
pip install -e .
```

```bash
make data      # Collect market, fundamentals, sentiment
make train     # Train models
make backtest  # Analyze and backtest
make test      # Run pipeline on subset with debug mode
```

For advanced usage, flags, and development modes, see the [GitHub repository](https://github.com/pmatorras/financial-ml).