How To Rank and Screen Candidate Factors
Use ranking to identify factors worth deeper review. Ranking is a screening step, not final variable selection.
With GLMStudy
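# `study` is an existing GLMStudy instance.
study.split(seed=42)
ranking = study.rank_candidates(
    ["score", "age", "segment", "region"],
    bins=6,
    max_groups=6,
)
# Show the factors with the largest validation deviance improvement first.
ranking.sort_values("deviance_improvement", ascending=False).head(20)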
Useful columns:
- deviance_improvement: validation deviance improvement versus intercept-only
- relative_improvement: improvement divided by baseline validation deviance
- p_value: chi-square style screening p-value when SciPy is available
- train_missing_rate
- validation_missing_rate
- train_measure_coverage
- validation_measure_coverage
- bins
- min_bin_size
- small_bins
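To make the deviance_improvement and relative_improvement columns concrete, the sketch below computes both by hand for a single categorical factor. It assumes a Poisson model with an exposure offset and uses made-up rows; it illustrates the definitions given above and is not the library's implementation, which may use other families or fitting methods.

```python
import math

def poisson_deviance(observed, expected):
    """Poisson deviance: 2 * sum(y * log(y / mu) - (y - mu)), with 0 * log(0) = 0."""
    total = 0.0
    for y, mu in zip(observed, expected):
        term = y * math.log(y / mu) if y > 0 else 0.0
        total += term - (y - mu)
    return 2.0 * total

# Hypothetical rows: (segment, exposure, events).
train = [("a", 100.0, 5), ("a", 120.0, 7), ("b", 80.0, 12), ("b", 100.0, 14)]
valid = [("a", 110.0, 7), ("b", 90.0, 11)]

# Intercept-only model: one event rate fitted on the training data.
base_rate = sum(e for _, _, e in train) / sum(x for _, x, _ in train)

# One-factor model: a separate training rate per segment.
seg_rate = {}
for seg in {s for s, _, _ in train}:
    events = sum(e for s, _, e in train if s == seg)
    exposure = sum(x for s, x, _ in train if s == seg)
    seg_rate[seg] = events / exposure

# Score both models on the validation rows.
observed = [e for _, _, e in valid]
baseline = poisson_deviance(observed, [base_rate * x for _, x, _ in valid])
with_factor = poisson_deviance(observed, [seg_rate[s] * x for s, x, _ in valid])

deviance_improvement = baseline - with_factor
relative_improvement = deviance_improvement / baseline
print(round(deviance_improvement, 3), round(relative_improvement, 3))
```

A positive deviance_improvement means the factor predicts the validation data better than the intercept-only baseline. Dividing by the baseline makes the figure comparable across datasets of different sizes.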
With the Low-Level API
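from glm_factor_optimizer import rank_factors

# `train` and `valid` are the training and validation datasets.
ranking = rank_factors(
    train,
    valid,
    target="events",
    exposure="hours",
    factors=["score", "segment", "region"],
    # Factors listed here are treated as categorical.
    factor_kinds={"segment": "categorical", "region": "categorical"},
)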
Recommended Review Rules
Prefer factors that:
- improve validation deviance materially
- have stable train and validation behavior
- cover enough exposure, weight, or rows
- have low missing rate or a meaningful missing group
- can be explained to someone reviewing the model
- do not need many bins to gain only a tiny improvement
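Some of the rules above could be turned into a first-pass filter, as in the sketch below. The rows are made up, and the thresholds, the "factor" key, the reading of small_bins as a count, and the reading of coverage as a fraction between 0 and 1 are all assumptions for illustration. Stability and explainability still need a human review.

```python
# Hypothetical ranking rows using the column names listed earlier.
# The "factor" key and the count-style meaning of "small_bins" are assumptions.
rows = [
    {"factor": "score", "relative_improvement": 0.041, "validation_missing_rate": 0.01,
     "validation_measure_coverage": 0.99, "bins": 6, "small_bins": 0},
    {"factor": "region", "relative_improvement": 0.0004, "validation_missing_rate": 0.00,
     "validation_measure_coverage": 1.00, "bins": 6, "small_bins": 2},
    {"factor": "age", "relative_improvement": 0.018, "validation_missing_rate": 0.35,
     "validation_measure_coverage": 0.65, "bins": 5, "small_bins": 0},
]

# Illustrative thresholds only; tune them to the dataset and the review standard.
MIN_RELATIVE_IMPROVEMENT = 0.005
MAX_MISSING_RATE = 0.20
MIN_COVERAGE = 0.80

def review_flags(row):
    """Return the reasons a factor needs a closer look (empty list if none)."""
    flags = []
    if row["relative_improvement"] < MIN_RELATIVE_IMPROVEMENT:
        flags.append("improvement too small")
    if row["validation_missing_rate"] > MAX_MISSING_RATE:
        flags.append("high missing rate")
    if row["validation_measure_coverage"] < MIN_COVERAGE:
        flags.append("low coverage")
    if row["small_bins"] > 0:
        flags.append("has small bins")
    return flags

flagged = {row["factor"]: review_flags(row) for row in rows}
shortlist = [name for name, flags in flagged.items() if not flags]
print(shortlist)
for name, flags in flagged.items():
    print(name, flags)
```

If ranking is a pandas DataFrame, as the sort_values call suggests, ranking.to_dict("records") would produce rows in this shape. A flag is a prompt for review rather than a rejection: a high missing rate, for example, may be acceptable when the missing group is meaningful.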
Do not accept factors only because the screening p-value is small. In large datasets, tiny effects can be statistically significant but not useful.
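The effect of dataset size on the p-value can be seen with a small calculation. The sketch below assumes the deviance improvement is treated as a chi-square statistic with 1 degree of freedom, which is a simplification and may differ from how the library computes p_value. It holds the relative improvement fixed at a negligible 0.01% and changes only the baseline deviance, which grows with the number of rows.

```python
import math

def chi2_sf_1df(statistic):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))

relative_improvement = 0.0001  # a 0.01% reduction in validation deviance

results = {}
for baseline_deviance in (1_000.0, 1_000_000.0):
    deviance_improvement = relative_improvement * baseline_deviance
    p_value = chi2_sf_1df(deviance_improvement)
    results[baseline_deviance] = p_value
    print(f"baseline={baseline_deviance:>11,.0f}  "
          f"improvement={deviance_improvement:6.1f}  p={p_value:.3g}")
```

The same relative effect is far from significant on the smaller baseline and overwhelmingly significant on the larger one, which is why relative_improvement is a better guide to usefulness than p_value alone.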