Financial-ML:

Predict S&P 500 Outperformers with ML

A machine learning system for equity selection that predicts which S&P 500 stocks will outperform the market benchmark. The project combines market data, company fundamentals, and sentiment indicators to construct long-only portfolios through a rigorous ML pipeline: data ingestion → feature engineering → expanding-window cross-validation → portfolio construction → comprehensive backtesting against SPY.

Key Results

Metric	Value
Sharpe Ratio	0.93
Annual Return	20.2%
Max Drawdown	-22.9%
Alpha vs Random	1.72%
Win Rate	69.8%
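
As a rough illustration of how metrics like these are usually derived from a series of monthly returns, here is a minimal, dependency-free sketch. The function names are hypothetical and not taken from the repository, the risk-free rate defaults to zero, and "win rate" is assumed to mean the fraction of months in which the portfolio beat the benchmark; the project may define any of these differently.

```python
import math
import statistics

def sharpe_ratio(monthly_returns, monthly_rf=0.0):
    """Annualized Sharpe ratio from monthly simple returns."""
    excess = [r - monthly_rf for r in monthly_returns]
    # Scale the monthly mean/stdev ratio by sqrt(12) to annualize.
    return statistics.mean(excess) / statistics.stdev(excess) * math.sqrt(12)

def max_drawdown(monthly_returns):
    """Largest peak-to-trough loss of the compounded equity curve (<= 0)."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in monthly_returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = min(worst, equity / peak - 1.0)
    return worst

def win_rate(portfolio_returns, benchmark_returns):
    """Assumed definition: share of months where the portfolio beat the benchmark."""
    wins = sum(p > b for p, b in zip(portfolio_returns, benchmark_returns))
    return wins / len(portfolio_returns)

# Made-up returns, purely for illustration (not project results).
portfolio = [0.10, -0.20, 0.05]
benchmark = [0.04, -0.10, 0.06]
print(sharpe_ratio(portfolio), max_drawdown(portfolio), win_rate(portfolio, benchmark))
```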

100% Long-Only Strategy: Top 10% of stocks by predicted score, equal-weighted, rebalanced monthly
Transaction costs included: 10 bps per trade, ~0.5% annual drag from 42% turnover
Regime awareness: VIX-based features improve downside protection during volatile periods
Statistically significant: Sharpe ratio outperformance (p < 0.001, Bonferroni-adjusted)
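
The transaction-cost figures above are consistent with one simple reading, sketched here as arithmetic. The assumption (not stated in the source) is that 42% is the one-way turnover at each monthly rebalance and that the 10 bps cost applies to the traded fraction of the portfolio.

```python
COST_PER_TRADE_BPS = 10         # from the results above
TURNOVER_PER_REBALANCE = 0.42   # assumption: one-way turnover per monthly rebalance
REBALANCES_PER_YEAR = 12        # monthly rebalancing

def annual_cost_drag(cost_bps, turnover, rebalances_per_year):
    """Annual return lost to trading costs, as a fraction of portfolio value."""
    return cost_bps / 10_000 * turnover * rebalances_per_year

drag = annual_cost_drag(COST_PER_TRADE_BPS, TURNOVER_PER_REBALANCE, REBALANCES_PER_YEAR)
print(f"{drag:.3%}")  # 0.504%, in line with the ~0.5% quoted above
```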

Data & Features

Data Source	Features	Purpose
Market Data (Yahoo Finance)	r12 (12m return), mom121 (momentum), vol3, vol12 (volatility)	Momentum and risk regime signals
Fundamentals (SEC EDGAR)	BookToMarket, ROE, ROA, NetMargin, Leverage, Asset Growth, Net Share Issuance	Value, quality, and financial health
Sentiment (VIX Index)	VIX percentile (12-month rolling)	Market stress detection
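
To make the market and sentiment features more concrete, here is a minimal sketch of how they might be computed from month-end series. It rests on several assumptions that the table does not confirm: mom121 is read as 12-1 momentum (the 12-month return excluding the most recent month), vol3 and vol12 as the standard deviation of the last 3 and 12 monthly returns, and the VIX percentile as the share of the last 12 monthly readings at or below the current one. The helper names are hypothetical.

```python
import statistics

def trailing_return(prices, months, skip=0):
    """Return from `months` months ago up to `skip` months ago (month-end prices)."""
    return prices[-1 - skip] / prices[-1 - months] - 1.0

def trailing_vol(prices, months):
    """Sample standard deviation of the last `months` monthly returns."""
    window = prices[-months - 1:]
    rets = [b / a - 1.0 for a, b in zip(window[:-1], window[1:])]
    return statistics.stdev(rets)

def rolling_percentile(values, window=12):
    """Share of the last `window` observations at or below the latest one."""
    recent = values[-window:]
    return sum(v <= recent[-1] for v in recent) / len(recent)

# Made-up month-end data: 13 prices (oldest first) and 12 VIX readings.
prices = [100 + i for i in range(13)]
vix = [15, 14, 16, 18, 20, 17, 15, 13, 14, 19, 22, 25]

features = {
    "r12": trailing_return(prices, 12),             # 12-month return
    "mom121": trailing_return(prices, 12, skip=1),  # assumption: 12-1 momentum
    "vol3": trailing_vol(prices, 3),
    "vol12": trailing_vol(prices, 12),
    "vix_pct": rolling_percentile(vix, 12),
}
print(features)
```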

Models Evaluated

Logistic Regression: Simple baseline for comparison
Random Forest: Selected for production—balances accuracy with stable predictions
Gradient Boosting & XGBoost: Tested but tend to overfit in market prediction tasks

Why Random Forest won: It maintains a wider range of confidence scores, making it better at ranking stocks. Boosting methods were too aggressive and didn't improve results.

How It Works

The Pipeline

Collect Data: Download market prices (Yahoo Finance), fundamentals from SEC filings, and VIX sentiment
Engineer Features: Extract 13 features from the data (momentum, volatility, value ratios, regime indicators)
Train Model: Use 15 years of historical data with time series cross-validation to prevent overfitting
Generate Predictions: The Random Forest model predicts which stocks will outperform the S&P 500 over the next month
Construct Portfolio: Select the top 10% highest-confidence stocks, weight them equally, rebalance monthly
Validate Results: Backtest against S&P 500 benchmark with real transaction costs
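
The portfolio construction step can be sketched as follows. This is an illustration under stated assumptions, not the repository's code: `scores` stands in for the model's per-stock confidence that a stock will outperform, and the function names and tickers are hypothetical.

```python
import math

def select_top_fraction(scores, top_fraction=0.10):
    """Equal-weight the highest-scoring fraction of tickers (at least one)."""
    n_select = max(1, math.floor(len(scores) * top_fraction))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return {ticker: 1.0 / n_select for ticker in ranked[:n_select]}

def one_way_turnover(old_weights, new_weights):
    """Half the sum of absolute weight changes between two rebalances."""
    tickers = set(old_weights) | set(new_weights)
    return 0.5 * sum(
        abs(new_weights.get(t, 0.0) - old_weights.get(t, 0.0)) for t in tickers
    )

# Made-up scores for 20 hypothetical tickers: T00 (lowest) to T19 (highest).
scores = {f"T{i:02d}": i / 20 for i in range(20)}
weights = select_top_fraction(scores)  # top 10% of 20 names -> 2 holdings

# Next month, suppose T17 overtakes T18 in the ranking.
next_scores = dict(scores, T17=0.99)
next_weights = select_top_fraction(next_scores)
print(weights, next_weights, one_way_turnover(weights, next_weights))
```

Tracking turnover alongside the weights makes it straightforward to charge a per-trade cost at each rebalance during the backtest.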

Model Validation

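The model uses expanding-window time series cross-validation: at each step it is trained only on data available up to a given date and evaluated on the following period, which it has never seen. The training window then grows to include that period before the next step. This prevents look-ahead leakage from inflating the results and mimics real-world deployment.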
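
A minimal sketch of the expanding-window idea, using integer indices for monthly periods. The function name and the parameters `min_train` and `test_size` are illustrative assumptions, not the project's actual settings; scikit-learn's `TimeSeriesSplit` offers a comparable splitter.

```python
def expanding_window_splits(n_periods, min_train, test_size=1):
    """Yield (train_indices, test_indices) with a training window that only grows."""
    start = min_train
    while start + test_size <= n_periods:
        train = list(range(0, start))                 # everything seen so far
        test = list(range(start, start + test_size))  # the next unseen period(s)
        yield train, test
        start += test_size

# Six monthly periods, with at least three months of history before the first test.
splits = list(expanding_window_splits(n_periods=6, min_train=3))
for train, test in splits:
    print(f"train={train} test={test}")
```

Every test index is strictly later than every training index, so no future information reaches the model at fit time.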

Quick Start

Installation

git clone https://github.com/pmatorras/financial-ml.git
cd financial-ml
python -m venv .venv
source .venv/bin/activate
pip install -e .

Run Full Pipeline

make data     # Collect market, fundamentals, sentiment
make train    # Train models
make backtest # Analyze and backtest

Quick Test

make test    # Run pipeline on subset with debug mode

For advanced usage, flags, and development modes, see the GitHub repository: https://github.com/pmatorras/financial-ml