Project Phoenix Domain

Pipeline Modules

Five Layers · Load → Engineer → Train → Monitor → Deploy

Module Overview

Each module is a narrow, verifiable layer. No module crosses pipeline stages. Outputs are written to disk before downstream modules begin.
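
The disk handoff described above can be sketched as follows. This is a minimal illustration, not the project's implementation: `run_stage`, the JSON artifact format, and the file naming are all assumptions made for the example.

```python
import json
import tempfile
from pathlib import Path


def run_stage(name, func, in_path, out_dir):
    """Run one stage: read the upstream artifact from disk, write this stage's artifact.

    Hypothetical helper; the real modules may use a different artifact format.
    """
    if in_path is not None and not in_path.exists():
        # A downstream stage refuses to start until its input is on disk.
        raise FileNotFoundError(f"{name}: upstream artifact {in_path} is missing")
    payload = json.loads(in_path.read_text()) if in_path is not None else None
    out_path = out_dir / f"{name}.json"
    out_path.write_text(json.dumps(func(payload)))
    return out_path


with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp)
    # Stage 1 has no upstream input and writes its output to disk.
    loaded = run_stage("load", lambda _: {"rows": [1.0, 2.0, 3.0]}, None, out_dir)
    # Stage 2 only ever sees what stage 1 persisted.
    engineered = run_stage(
        "engineer",
        lambda d: {"mean": sum(d["rows"]) / len(d["rows"])},
        loaded,
        out_dir,
    )
    result = json.loads(engineered.read_text())
```

Passing file paths rather than in-memory objects between stages keeps each layer independently checkable, since any intermediate artifact can be inspected or replayed.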

Data Client

OptiverDataClient wraps the Kaggle dataset and provides fast access to price, imbalance, and temporal signals. Data can be sliced by stock, date, or feature subset.

5 tools · 480k rows · 200 stocks

Feature Forge

V1 baseline features and V2 Numba-accelerated microstructure features. Covers WAP (weighted average price), bid-ask spread, triplet imbalance, temporal lags, and global stock statistics.

16 tools · walk-forward safe
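
As a rough illustration of three of the features named above, the sketch below computes them for a single order-book row in plain Python. The WAP formula weights each side's price by the opposite side's size. The triplet imbalance shown, (max - mid) / (mid - min), is the form commonly used in public notebooks for this dataset and is an assumption here, as are the function names and the sample row. The actual V2 code is Numba-accelerated and may differ.

```python
import math


def wap(bid_price, ask_price, bid_size, ask_size):
    """Weighted average price: each price is weighted by the opposite side's size."""
    total = bid_size + ask_size
    if total == 0:
        return math.nan  # no resting size on either side
    return (bid_price * ask_size + ask_price * bid_size) / total


def bid_ask_spread(bid_price, ask_price):
    """Absolute spread between best ask and best bid."""
    return ask_price - bid_price


def triplet_imbalance(a, b, c):
    """(max - mid) / (mid - min) over three values; NaN when mid equals min.

    Assumed definition, not confirmed by the project source.
    """
    low, mid, high = sorted((a, b, c))
    if mid == low:
        return math.nan
    return (high - mid) / (mid - low)


# Hypothetical row for illustration only.
row = {"bid_price": 0.9998, "ask_price": 1.0002, "bid_size": 300.0, "ask_size": 100.0}
row_wap = wap(**row)
row_spread = bid_ask_spread(row["bid_price"], row["ask_price"])
```

With more size resting on the bid than on the ask, the WAP in this example sits above the mid-price, which is the behavior the weighting is meant to capture.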

Model Lab

LightGBM with time-series CV, walk-forward backtesting, and Optuna hyperparameter tuning. SHAP explainability and stacking ensembles.

19 tools · MAE-optimized
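
The idea behind walk-forward evaluation can be sketched without any modeling library: every test window lies strictly after its training window, and the training window grows from fold to fold. The function names, the expanding-window choice, and the fold sizing below are assumptions for illustration; the project's `walk_forward.py` may split differently.

```python
def walk_forward_splits(date_ids, n_folds, min_train):
    """Yield (train_dates, test_dates) pairs with an expanding training window.

    Hypothetical helper: every test date is later than every train date.
    """
    dates = sorted(set(date_ids))
    remaining = len(dates) - min_train
    if n_folds < 1 or remaining < n_folds:
        raise ValueError("not enough dates for the requested number of folds")
    fold_size = remaining // n_folds
    for k in range(n_folds):
        train_end = min_train + k * fold_size
        # The last fold absorbs any leftover dates.
        test_end = train_end + fold_size if k < n_folds - 1 else len(dates)
        yield dates[:train_end], dates[train_end:test_end]


def mean_absolute_error(y_true, y_pred):
    """MAE, the metric the module description says the models are optimized for."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


splits = list(walk_forward_splits(range(10), n_folds=3, min_train=4))
```

Each fold would then fit a model on the training dates and score it on the test dates, so no future information leaks into training.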

Drift Radar

PSI, KS test, JS divergence, concept drift, covariate shift, and structural break detection. Dashboard output gates deployment.

16 tools · deployment gate
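
To make the first of these metrics concrete, here is a minimal PSI (Population Stability Index) sketch in plain Python, followed by a gate check. The equal-width binning, the epsilon floor for empty bins, the function names, and the 0.25 threshold (a common rule of thumb) are all assumptions; the project's `feature_drift.py` may bin and threshold differently.

```python
import math


def psi(reference, current, n_bins=10, eps=1e-6):
    """Population Stability Index between two samples of one feature.

    Bins are equal-width over the reference range; out-of-range values
    are clamped into the first or last bin.
    """
    low, high = min(reference), max(reference)
    width = (high - low) / n_bins or 1.0  # guard against a constant feature

    def proportions(values):
        counts = [0] * n_bins
        for v in values:
            idx = min(max(int((v - low) / width), 0), n_bins - 1)
            counts[idx] += 1
        # Floor empty bins so the logarithm stays finite.
        return [max(c / len(values), eps) for c in counts]

    ref_p = proportions(reference)
    cur_p = proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


def passes_gate(psi_value, threshold=0.25):
    """Hypothetical deployment gate: block when drift exceeds the threshold."""
    return psi_value < threshold


reference = [i / 100 for i in range(100)]
shifted = [x + 0.5 for x in reference]
```

An unchanged feature scores zero and passes, while the shifted sample above leaves half of the reference bins empty and fails the gate.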

Synthesis

Full pipeline orchestration: load → engineer → train → monitor → gate → report. Single command, profile-driven depth.

5 tools · V6 capstone

Module File Map

Module            File                    Responsibility
Data Client       data_client.py          Dataset loading, slicing, and summary statistics
Feature Forge V1  feature_engineering.py  Imbalance, price, and temporal baseline features
Feature Forge V2  feature_engineering.py  Numba-accelerated microstructure features
Model Lab         model_pipeline.py       LightGBM CV, evaluation, zero-sum adjustment
Optuna Tuner      optuna_tuner.py         TPE Bayesian hyperparameter search
Walk-Forward      walk_forward.py         Production-simulating backtesting
Drift Radar       feature_drift.py        PSI, KS, JS divergence feature monitoring
Concept Drift     target_drift.py         Target and feature-target relationship drift
Data Quality      data_quality.py         Missing values, outlier frequency, structural breaks
Alerts            monitoring_alerts.py    Alert routing from drift and quality signals
Synthesis         agentic_engine.py       End-to-end orchestration and deployment gate
Reports           report_templates.py     Structured analysis, model, and monitoring reports

Pipeline Execution Rules

Synthesis Profiles

Quick

Explore mode: load, summary, target analysis, report. Fast turnaround for initial data review.

~2 min

Standard

Model mode: features, LightGBM CV, drift check, basic report. Production-ready validation.

~8 min

Comprehensive

Full mode: parallel features, CV, walk-forward backtest, full monitoring suite, complete report.

~20 min
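
One way to picture profile-driven depth is a mapping from profile name to an ordered list of stages, taken from the three descriptions above. The stage identifiers, the `PROFILES` dictionary, and the `plan` function are hypothetical. The explicit `load` step in the Standard and Comprehensive profiles is also an assumption, since those descriptions start at feature engineering.

```python
# Hypothetical stage names derived from the profile descriptions.
PROFILES = {
    "quick": ["load", "summary", "target_analysis", "report"],
    "standard": ["load", "features", "lightgbm_cv", "drift_check", "basic_report"],
    "comprehensive": [
        "load",
        "parallel_features",
        "lightgbm_cv",
        "walk_forward_backtest",
        "full_monitoring",
        "complete_report",
    ],
}


def plan(profile):
    """Return the ordered stage list for a profile, rejecting unknown names."""
    try:
        return list(PROFILES[profile])
    except KeyError:
        raise ValueError(
            f"unknown profile {profile!r}; expected one of {sorted(PROFILES)}"
        ) from None


quick_plan = plan("quick")
```

Keeping the profiles as data rather than branching logic means a new depth level is a new dictionary entry, not a new code path.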