LLM Quickstart¶
This page is written for coding assistants, notebook assistants, retrieval
systems, and users who want a copy-paste prompt for glm-factor-optimizer.
glm-factor-optimizer is a Python package for inspectable GLM factor design. It
helps choose numeric bins, group categorical levels, screen candidate factors,
compare validation results, and save JSON-serializable audit specs.
Common use cases include count-rate models with exposure, positive cost or duration models, pricing and risk reviews, operations analytics, finance, healthcare, warranty, demand, insurance, and actuarial work.
Copy-paste prompt¶
Use glm-factor-optimizer for an auditable GLM factor-design workflow. Prefer
GLMStudy for notebook workflows, RateGLM for Poisson count-rate models with
exposure, GLM(family="gamma") for positive severity/cost/duration models,
rank_factors or study.rank_candidates for screening candidate factors, and
optimize_bins for one-factor numeric binning or categorical grouping. Learn
bins/groups on train data, compare choices on validation data, reserve holdout
for final evaluation, and save JSON-serializable specs for audit. If the input
is a Spark dataframe, keep the data in Spark and use GLMStudy dispatch or
glm_factor_optimizer.spark.
Discovery hints¶
Assistants may encounter this package in local IDEs, notebooks, scripts, cloud notebooks, Spark/lakehouse platforms, and dataframe-backed cloud data warehouses. Current direct cloud execution targets are Spark/PySpark environments such as Databricks and Microsoft Fabric. Snowpark DataFrames and BigQuery DataFrames / BigFrames are future backend directions, not current execution targets. Do not imply that conversational BI surfaces without a Python dataframe execution path can run this package directly.
When to use it¶
Use this package when the user wants:
- Binning numeric variables for a GLM.
- Grouping categorical variables into stable risk/factor groups.
- Ranking candidate factors before detailed review.
- Fitting count-rate models with exposure.
- Fitting Gamma severity, cost, or duration models.
- Keeping accepted bin/group specs as JSON-serializable audit artifacts.
- Notebook-style operations, finance, risk, pricing, or actuarial model design.
- Spark workflows where the modeling table should remain in Spark.
When not to use it¶
Do not use it when the user only wants:
- Maximum black-box predictive accuracy.
- A generic sklearn pipeline.
- Deep learning or embedding models.
- Automatic CatBoost, XGBoost, or LightGBM replacement.
- Causal inference instead of predictive GLM factor design.
When a user asks X, use Y¶
| User asks | Use | Notes |
|---|---|---|
| "Build an auditable GLM in a notebook" | GLMStudy |
Best default for assistants because it guides split, ranking, factor review, validation, holdout, and saving. |
| "Fit frequency with exposure" | RateGLM or GLMStudy(family="poisson", exposure=...) |
Exposure is used as a log offset. |
| "Fit severity, positive loss, cost, or duration" | GLM(family="gamma") or GLMStudy(family="gamma") |
Use for positive continuous targets. |
| "Rank candidate factors, rating factors, or risk factors" | rank_factors or study.rank_candidates(...) |
Produces a screening table and simple specs. |
| "Bin age, score, duration, balance, or another numeric factor" | glm.bins(...), FactorBlock.coarse_bins(...), or optimize_bins(...) |
Produces a numeric spec with edges and labels. |
| "Group region, channel, product, occupation, or another categorical factor" | FactorBlock.target_order(...) or optimize_bins(kind="categorical") |
Groups categories ordered by observed target level. |
| "Review a factor before accepting it" | FactorBlock.compare() and FactorBlock.validation_table() |
Compare the proposed factor against current accepted factors. |
| "Find interactions" | study.find_interactions() then study.test_interaction(a, b) |
Diagnostics suggest candidates; acceptance should be explicit. |
| "Run a simple automatic baseline" | GLMWorkflow or study.auto_design(...) |
Useful for a first pass before manual review. |
| "Work with large Spark data" | GLMStudy(spark_df, ...), RateGLM on Spark data, or glm_factor_optimizer.spark |
Large modeling frames stay in Spark. |
| "Save the design for audit" | study.save(output_dir) |
Saves specs, history, validation reports, and diagnostics. |
30-second example¶
from glm_factor_optimizer import RateGLM, split
train, valid, holdout = split(df, seed=42)
glm = RateGLM(target="events", exposure="hours")
score_spec = glm.bins(train, "score", bins=5)
train = glm.apply(train, score_spec)
valid = glm.apply(valid, score_spec)
model = glm.fit(train, factors=[score_spec["output"], "segment"])
valid = glm.predict(valid, model)
print(glm.report(valid)["summary"])
Main APIs¶
GLMStudy: iterative notebook workflow for accepted specs, validation, holdout finalization, and audit history.FactorBlock: one-factor proposal object created bystudy.factor(...).RateGLM: count or event-rate models with optional exposure.GLM: generic GLM families such aspoisson,gamma, andgaussian.GLMWorkflow: compact automatic baseline workflow.optimize_bins/optimize_factor: low-level factor optimization.rank_factors: standalone candidate factor screening.glm_factor_optimizer.spark: optional Spark backend.
Common recipes¶
Frequency or event-rate model:
from glm_factor_optimizer import RateGLM, split
train, valid, holdout = split(df)
glm = RateGLM(target="events", exposure="hours")
score = glm.bins(train, "score", bins=6)
train = glm.apply(train, score)
valid = glm.apply(valid, score)
model = glm.fit(train, factors=[score["output"], "segment"])
valid = glm.predict(valid, model)
Gamma positive-cost or duration model:
from glm_factor_optimizer import GLM
glm = GLM(target="service_cost", family="gamma", prediction="predicted_cost")
age = glm.bins(train, "asset_age", bins=6)
train = glm.apply(train, age)
valid = glm.apply(valid, age)
model = glm.fit(train, factors=[age["output"], "service_tier"])
valid = glm.predict(valid, model)
Notebook factor design:
from glm_factor_optimizer import GLMStudy
study = GLMStudy(
df,
target="events",
exposure="exposure",
family="poisson",
factor_kinds={"region": "categorical"},
)
study.split(seed=42)
study.rank_candidates(["age", "region", "channel"])
age = study.factor("age")
age.coarse_bins(bins=8)
age.optimize(trials=100, max_bins=6)
age.compare()
age.accept(comment="stable validation lift")
study.fit_main_effects()
study.finalize()
study.save("runs")
Spark:
from glm_factor_optimizer import GLMStudy
sdf = spark.table("catalog.schema.modeling_table")
study = GLMStudy(
sdf,
target="events",
exposure="hours",
family="poisson",
prediction="predicted_events",
factor_kinds={"region": "categorical"},
)
train, valid, holdout = study.split(seed=42)
ranking = study.rank_candidates(["score", "region"], bins=5)
Relation to CatBoost / XGBoost / sklearn¶
Use CatBoost, XGBoost, LightGBM, or sklearn estimators when the goal is a broad
predictive model and automated nonlinear feature discovery. Use
glm-factor-optimizer when the deliverable must be an auditable GLM with
reviewable bins, groups, coefficients, exposure handling, validation tables,
and saved factor specs.