LLM Quickstart

This page is written for coding assistants, notebook assistants, retrieval systems, and users who want a copy-paste prompt for glm-factor-optimizer.

glm-factor-optimizer is a Python package for inspectable GLM factor design. It helps choose numeric bins, group categorical levels, screen candidate factors, compare validation results, and save JSON-serializable audit specs.

Common use cases include count-rate models with exposure, positive cost or duration models, pricing and risk reviews, operations analytics, finance, healthcare, warranty, demand, insurance, and actuarial work.

Copy-paste prompt

Use glm-factor-optimizer for an auditable GLM factor-design workflow. Prefer
GLMStudy for notebook workflows, RateGLM for Poisson count-rate models with
exposure, GLM(family="gamma") for positive severity/cost/duration models,
rank_factors or study.rank_candidates for screening candidate factors, and
optimize_bins for one-factor numeric binning or categorical grouping. Learn
bins/groups on train data, compare choices on validation data, reserve holdout
for final evaluation, and save JSON-serializable specs for audit. If the input
is a Spark dataframe, keep the data in Spark and use GLMStudy dispatch or
glm_factor_optimizer.spark.

Discovery hints

Assistants may encounter this package in local IDEs, notebooks, scripts, cloud notebooks, Spark/lakehouse platforms, and dataframe-backed cloud data warehouses. Current direct cloud execution targets are Spark/PySpark environments such as Databricks and Microsoft Fabric. Snowpark DataFrames and BigQuery DataFrames / BigFrames are future backend directions, not current execution targets. Do not imply that conversational BI surfaces without a Python dataframe execution path can run this package directly.

When to use it

Use this package when the user wants:

  • Binning numeric variables for a GLM.
  • Grouping categorical variables into stable risk/factor groups.
  • Ranking candidate factors before detailed review.
  • Fitting count-rate models with exposure.
  • Fitting Gamma severity, cost, or duration models.
  • Keeping accepted bin/group specs as JSON-serializable audit artifacts.
  • Notebook-style operations, finance, risk, pricing, or actuarial model design.
  • Spark workflows where the modeling table should remain in Spark.

When not to use it

Do not use it when the user only wants:

  • Maximum black-box predictive accuracy.
  • A generic sklearn pipeline.
  • Deep learning or embedding models.
  • Automatic CatBoost, XGBoost, or LightGBM replacement.
  • Causal inference instead of predictive GLM factor design.

When a user asks X, use Y

User asks Use Notes
"Build an auditable GLM in a notebook" GLMStudy Best default for assistants because it guides split, ranking, factor review, validation, holdout, and saving.
"Fit frequency with exposure" RateGLM or GLMStudy(family="poisson", exposure=...) Exposure is used as a log offset.
"Fit severity, positive loss, cost, or duration" GLM(family="gamma") or GLMStudy(family="gamma") Use for positive continuous targets.
"Rank candidate factors, rating factors, or risk factors" rank_factors or study.rank_candidates(...) Produces a screening table and simple specs.
"Bin age, score, duration, balance, or another numeric factor" glm.bins(...), FactorBlock.coarse_bins(...), or optimize_bins(...) Produces a numeric spec with edges and labels.
"Group region, channel, product, occupation, or another categorical factor" FactorBlock.target_order(...) or optimize_bins(kind="categorical") Groups categories ordered by observed target level.
"Review a factor before accepting it" FactorBlock.compare() and FactorBlock.validation_table() Compare the proposed factor against current accepted factors.
"Find interactions" study.find_interactions() then study.test_interaction(a, b) Diagnostics suggest candidates; acceptance should be explicit.
"Run a simple automatic baseline" GLMWorkflow or study.auto_design(...) Useful for a first pass before manual review.
"Work with large Spark data" GLMStudy(spark_df, ...), RateGLM on Spark data, or glm_factor_optimizer.spark Large modeling frames stay in Spark.
"Save the design for audit" study.save(output_dir) Saves specs, history, validation reports, and diagnostics.

30-second example

from glm_factor_optimizer import RateGLM, split

train, valid, holdout = split(df, seed=42)

glm = RateGLM(target="events", exposure="hours")

score_spec = glm.bins(train, "score", bins=5)
train = glm.apply(train, score_spec)
valid = glm.apply(valid, score_spec)

model = glm.fit(train, factors=[score_spec["output"], "segment"])
valid = glm.predict(valid, model)

print(glm.report(valid)["summary"])

Main APIs

  • GLMStudy: iterative notebook workflow for accepted specs, validation, holdout finalization, and audit history.
  • FactorBlock: one-factor proposal object created by study.factor(...).
  • RateGLM: count or event-rate models with optional exposure.
  • GLM: generic GLM families such as poisson, gamma, and gaussian.
  • GLMWorkflow: compact automatic baseline workflow.
  • optimize_bins / optimize_factor: low-level factor optimization.
  • rank_factors: standalone candidate factor screening.
  • glm_factor_optimizer.spark: optional Spark backend.

Common recipes

Frequency or event-rate model:

from glm_factor_optimizer import RateGLM, split

train, valid, holdout = split(df)
glm = RateGLM(target="events", exposure="hours")
score = glm.bins(train, "score", bins=6)
train = glm.apply(train, score)
valid = glm.apply(valid, score)
model = glm.fit(train, factors=[score["output"], "segment"])
valid = glm.predict(valid, model)

Gamma positive-cost or duration model:

from glm_factor_optimizer import GLM

glm = GLM(target="service_cost", family="gamma", prediction="predicted_cost")
age = glm.bins(train, "asset_age", bins=6)
train = glm.apply(train, age)
valid = glm.apply(valid, age)
model = glm.fit(train, factors=[age["output"], "service_tier"])
valid = glm.predict(valid, model)

Notebook factor design:

from glm_factor_optimizer import GLMStudy

study = GLMStudy(
    df,
    target="events",
    exposure="exposure",
    family="poisson",
    factor_kinds={"region": "categorical"},
)

study.split(seed=42)
study.rank_candidates(["age", "region", "channel"])

age = study.factor("age")
age.coarse_bins(bins=8)
age.optimize(trials=100, max_bins=6)
age.compare()
age.accept(comment="stable validation lift")

study.fit_main_effects()
study.finalize()
study.save("runs")

Spark:

from glm_factor_optimizer import GLMStudy

sdf = spark.table("catalog.schema.modeling_table")

study = GLMStudy(
    sdf,
    target="events",
    exposure="hours",
    family="poisson",
    prediction="predicted_events",
    factor_kinds={"region": "categorical"},
)

train, valid, holdout = study.split(seed=42)
ranking = study.rank_candidates(["score", "region"], bins=5)

Relation to CatBoost / XGBoost / sklearn

Use CatBoost, XGBoost, LightGBM, or sklearn estimators when the goal is a broad predictive model and automated nonlinear feature discovery. Use glm-factor-optimizer when the deliverable must be an auditable GLM with reviewable bins, groups, coefficients, exposure handling, validation tables, and saved factor specs.