Tutorial: Notebook GLM Factor Study Workflow¶

In this notebook workflow, you rank candidate factors, optimize bins/groups, accept a small model, test an interaction, and save the artifacts.

The example models incident counts per site operating hour. The same pattern also works for defects per machine hour, orders per visit, support tickets per active account, or positive continuous outcomes with a Gamma GLM.

1. Create Example Data¶

The example data is generated in the notebook.

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
rows = 800
hours = rng.uniform(0.5, 2.5, size=rows)
machine_age = rng.normal(6.0, 2.5, size=rows).clip(0.2, 15.0)
usage_score = rng.normal(size=rows)
site_region = rng.choice(["north", "south", "east", "west"], size=rows)
equipment_type = rng.choice(["standard", "compact", "heavy"], size=rows)

region_effect = pd.Series(site_region).map(
    {"north": -0.10, "south": 0.15, "east": 0.0, "west": 0.08}
).to_numpy()
equipment_effect = pd.Series(equipment_type).map(
    {"standard": 0.0, "compact": -0.15, "heavy": 0.25}
).to_numpy()
mean_rate = np.exp(
    -1.0
    + 0.25 * (machine_age > 8.0)
    + 0.35 * (usage_score > 0.7)
    + region_effect
    + equipment_effect
)
events = rng.poisson(mean_rate * hours)

df = pd.DataFrame(
    {
        "events": events,
        "hours": hours,
        "machine_age": machine_age,
        "usage_score": usage_score,
        "site_region": site_region,
        "equipment_type": equipment_type,
    }
)

2. Create the Study¶

from glm_factor_optimizer import GLMStudy

study = GLMStudy(
    df,
    target="events",
    exposure="hours",
    family="poisson",
    prediction="predicted_count",
    factor_kinds={
        "site_region": "categorical",
        "equipment_type": "categorical",
    },
    min_bin_size=20.0,
    seed=42,
)

GLMStudy keeps track of the modeling state:

raw data
train, validation, and holdout splits
accepted factor specs
accepted interaction specs
model versions
ranking results
validation reports
audit history

3. Split Once¶

train, validation, holdout = study.split(
    train_fraction=0.6,
    validation_fraction=0.2,
    holdout_fraction=0.2,
    seed=42,
)

Use validation for design decisions. Keep holdout for final evaluation; ordinary ranking and factor optimization do not score it.

4. Rank Candidate Factors¶

candidate_factors = [
    "machine_age",
    "usage_score",
    "site_region",
    "equipment_type",
]

ranking = study.rank_candidates(candidate_factors, bins=5, max_groups=4)
ranking[
    [
        "factor",
        "kind",
        "deviance_improvement",
        "p_value",
        "validation_missing_rate",
        "validation_measure_coverage",
        "bins",
        "min_bin_size",
    ]
].head(20)

Ranking is only screening, not final variable selection. Before accepting a factor, check stability, interpretation, and whether the grouping will be usable in practice.

5. Build One Factor Block¶

age = study.factor("machine_age", kind="numeric")

age.coarse_bins(bins=5)
age.bin_table()

For a categorical factor:

region = study.factor("site_region", kind="categorical")
region.target_order(max_groups=4)
region.bin_table()

6. Optimize and Compare¶

age_result = age.optimize(
    trials=5,  # use 100+ for a real model review
    max_bins=5,
    n_prebins=8,
    min_bin_size=20.0,
)

age.compare()
age.validation_table()

compare() fits a candidate model using:

all currently accepted factors fixed
the proposed factor spec added or replacing the old spec
validation deviance as the primary comparison score

7. Accept or Reject¶

age.accept(comment="Five-bin machine_age proposal looked consistent")

Rejecting also records an audit event:

region.reject(comment="Validation improvement too small")

Accepted specs become part of the current model design. Rejected proposals stay in the study history.

8. Add Another Factor¶

equipment = study.factor("equipment_type", kind="categorical")
equipment.optimize(trials=5, min_bin_size=20.0)  # use more trials in production
equipment.compare()
equipment.accept(comment="Equipment grouping looked consistent")

9. Fit the Current Main-Effects Model¶

model = study.fit_main_effects()

report = study.validation_report()
report["summary"]
report["train_validation"]
report["model_versions"]

The validation report includes:

overall summary
calibration table
lift table
by-factor reports for accepted transformed factors
train-vs-validation comparison
model version comparison

10. Refine With the Full Model Fixed¶

Once a baseline main-effects model exists, refine one accepted factor while all other accepted factors remain fixed:

refined_age = study.refine_factor(
    "machine_age",
    trials=5,  # use 100+ for a real model review
    max_bins=5,
    n_prebins=8,
)

refined_age.compare()
refined_age.accept(comment="Full-model refinement accepted")

Or propose refinements for all accepted factors:

proposals = study.refine_all(trials=5, accept=False)

You can set accept=True, but review proposals first for any model you intend to keep.

11. Search for Interactions¶

interactions = study.find_interactions(min_bin_size=20.0)
interactions.head(20)

Interactions are diagnostic candidates. They are not added automatically.

Test one interaction:

test = study.test_interaction("machine_age", "equipment_type")
test

Accept only after checking that the pattern is consistent and explainable:

study.accept_interaction(
    "machine_age",
    "equipment_type",
    comment="Accepted after checking interaction cells",
)
study.fit_main_effects()

12. Finalize on Holdout¶

holdout_report = study.finalize()
holdout_report["summary"]

finalize() scores holdout and records a final audit event. Use it when the model design is ready for final evaluation.

13. Save the Study¶

run_path = study.save("runs")
run_path

Saved artifacts include:

params
accepted specs
interaction specs
factor ranking
trial tables
validation reports
holdout reports
coefficient table
model versions
audit history