Public API Reference

The tables below cover the public API that is meant for notebooks, scripts, and workflows. Private helper functions are not listed.

GLMStudy

Import:

from glm_factor_optimizer import GLMStudy

Constructor:

GLMStudy(
    df,
    target,
    family="poisson",
    exposure=None,
    weight=None,
    prediction="predicted",
    factor_kinds=None,
    min_bin_size=100.0,
    seed=42,
)

GLMStudy accepts pandas and Spark DataFrames. Pandas inputs use the pandas study implementation. Spark inputs dispatch to a Spark-backed study that keeps raw, split, and scored modeling tables as Spark DataFrames while returning small aggregate metadata tables for notebook inspection.

Primary methods:

Method                                          Purpose
split(...)                                      Create train, validation, and holdout samples.
rank_candidates(factors, ...)                   Screen candidate raw factors.
factor(name, kind=None)                         Return a FactorBlock.
accept(block_or_name, spec=None, comment=None)  Accept a factor spec.
reject(block_or_name, comment=None)             Record a rejected proposal.
fit_main_effects()                              Fit the current accepted model.
validation_report()                             Return validation tables for the current model.
holdout_report()                                Explicitly score holdout.
refine_factor(name, ...)                        Re-optimize one accepted factor with other factors fixed.
refine_all(...)                                 Propose refinements for all accepted factors.
find_interactions(...)                          Diagnose candidate pair interactions.
test_interaction(a, b)                          Evaluate one coarse interaction.
accept_interaction(a, b=None, comment=None)     Accept a tested interaction or test and accept a pair.
auto_design(factors, ...)                       Build an automatic baseline through the same acceptance path.
finalize()                                      Fit current model and score holdout.
save(output_dir)                                Save specs, history, reports, and diagnostics.

Important attributes:

Attribute             Meaning
specs                 Accepted raw factor specs keyed by raw factor name.
interaction_specs     Accepted interaction specs keyed by interaction output.
accepted_raw_factors  Raw factors accepted as main effects.
selected_factors      Transformed model columns used by the fitted model.
ranking               Latest candidate ranking dataframe.
history               Audit events.
model_versions        Stored scored model-version summaries.
current_model         Latest fitted FittedGLM.
train_scored          Latest scored train dataframe.
validation_scored     Latest scored validation dataframe.
holdout_scored        Holdout scored by finalize() or holdout_report().

For Spark studies, train, validation, holdout, and scored frame attributes are Spark DataFrames. Ranking, comparison, report, and audit tables are bounded pandas metadata.

When time= is supplied, split(...) uses exact ordered splits by default. Spark users can opt into distributed approximate time thresholds with time_split="approximate" when exact row-number boundaries are less important than avoiding a global ordered window. Pandas inputs support only the exact strategy.
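
To make the idea of an exact ordered split concrete, here is a small standalone sketch in plain Python. It is not the library's implementation: the helper name ordered_split, its parameters, and the way boundaries are rounded are assumptions chosen for illustration. It sorts all rows by the time key and cuts at exact row positions, which is the kind of global ordering that an approximate, threshold-based strategy avoids on large distributed tables.

```python
def ordered_split(rows, time_key, train_fraction=0.6, validation_fraction=0.2):
    """Split rows into train/validation/holdout by exact position in time order.

    Hypothetical helper for illustration; not part of glm_factor_optimizer.
    """
    # A global sort is what makes the boundaries exact.
    ordered = sorted(rows, key=lambda row: row[time_key])
    n = len(ordered)
    # Assumption: boundaries are rounded to the nearest whole row.
    train_end = round(n * train_fraction)
    valid_end = train_end + round(n * validation_fraction)
    return ordered[:train_end], ordered[train_end:valid_end], ordered[valid_end:]


# Ten rows with shuffled "day" values stand in for a modeling table.
rows = [{"day": d, "events": d % 3} for d in (5, 1, 9, 3, 7, 2, 8, 4, 10, 6)]
train, valid, holdout = ordered_split(rows, "day")

print([r["day"] for r in train])
print([r["day"] for r in valid])
print([r["day"] for r in holdout])
```

Every training row precedes every validation row in time, and every validation row precedes every holdout row, so later samples never leak into earlier ones.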

FactorBlock

Import:

from glm_factor_optimizer import FactorBlock

Usually created by:

block = study.factor("machine_age")

Spark studies return SparkFactorBlock, which mirrors the same notebook methods while running factor operations on Spark.

Methods:

Method                                   Purpose
coarse_bins(bins=10, method="quantile")  Create simple train-derived bins or groups.
target_order(max_groups=None)            Calculate categorical target order.
set_spec(spec)                           Use a manual JSON-serializable spec.
optimize(...)                            Run Optuna for this factor with accepted factors fixed.
bin_table(sample="train")                Inspect train, validation, or holdout bin sizes and actuals.
validation_table()                       Return validation diagnostics by proposed output.
compare()                                Compare current model against this proposed spec.
accept(comment=None)                     Accept into the owning study.
reject(comment=None)                     Record rejection in the owning study.
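
As a rough illustration of what "train-derived" quantile bins mean, the sketch below computes cut points from the training values only and then applies those same cut points to another sample. It uses only the Python standard library. The helper names quantile_edges and assign_bin are hypothetical, and the edge convention (a value equal to a cut point goes to the upper bin) is an assumption; the library's coarse_bins may differ.

```python
import bisect
import statistics


def quantile_edges(train_values, bins=4):
    """Interior cut points computed from the training sample only."""
    cuts = statistics.quantiles(train_values, n=bins, method="inclusive")
    # Ties in the data can produce duplicate cut points; keep each once.
    return sorted(set(cuts))


def assign_bin(value, edges):
    """Bin index for a value; a value equal to an edge goes to the upper bin."""
    return bisect.bisect_right(edges, value)


train_ages = [1, 2, 3, 4, 5, 6, 7, 8, 9]
edges = quantile_edges(train_ages, bins=4)

# The validation sample reuses the training edges instead of recomputing them.
validation_ages = [0, 3, 6.5, 100]
validation_bins = [assign_bin(age, edges) for age in validation_ages]

print(edges)
print(validation_bins)
```

Reusing the training edges is what keeps validation and holdout diagnostics comparable: out-of-range values simply fall into the first or last bin.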

Low-Level Modeling

from glm_factor_optimizer import GLM, RateGLM

GLM supports:

  • family="poisson"
  • family="gamma"
  • family="gaussian"

RateGLM is a convenience wrapper for Poisson-style count/exposure models and simple candidate-factor screening. Exposure is optional; when supplied, it is used as a log offset.
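
The effect of a log offset can be shown with a few lines of standard-library Python. This is a sketch of the usual log-link formula, not of RateGLM's internals: the function predicted_count and the coefficient values are made up for illustration. With a log link, adding log(exposure) to the linear predictor is the same as multiplying the predicted rate by the exposure.

```python
import math


def predicted_count(coefficients, features, exposure=None):
    """Mean count under a log link: exp(linear predictor + log(exposure))."""
    eta = coefficients["intercept"] + sum(
        coefficients[name] * value for name, value in features.items()
    )
    # Without exposure the offset is zero and the result is a rate per unit.
    offset = math.log(exposure) if exposure is not None else 0.0
    return math.exp(eta + offset)


# Illustrative coefficients: base rate 0.02, relativity 1.5 for old machines.
coefs = {"intercept": math.log(0.02), "is_old": math.log(1.5)}

rate = predicted_count(coefs, {"is_old": 1})
count = predicted_count(coefs, {"is_old": 1}, exposure=100.0)

print(rate, count)
```

Because the offset enters with a fixed coefficient of 1, doubling the exposure doubles the predicted count while the fitted relativities stay unchanged.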

Core methods:

  • fit(df, factors)
  • predict(df, model)
  • report(df, bins=10)
  • bins(df, factor, bins=10, method="quantile")
  • apply(df, spec)
  • optimize(train_df, validation_df, factor, ...)
  • rank(train_df, validation_df, factors, ...)

Optimization

from glm_factor_optimizer import optimize_bins, optimize_factor

optimize_bins is an alias for optimize_factor.

Important arguments:

  • train_df
  • validation_df
  • target
  • exposure
  • factor
  • kind: "numeric" or "categorical"
  • family
  • fixed_factors
  • weight
  • trials
  • max_bins
  • n_prebins
  • min_bin_size
  • penalties

The objective uses validation deviance plus complexity and bin-size penalties.
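
One way such an objective could be assembled is sketched below for the Poisson case. The deviance formula is the standard unit Poisson deviance; everything else is an assumption made for illustration, including the function names, the penalty weights, and the choice to penalize each bin by its relative shortfall below min_bin_size. The library's actual penalty terms are configured through the penalties argument and may be defined differently.

```python
import math


def poisson_deviance(actual, predicted):
    """Total Poisson deviance: 2 * sum(y * log(y / mu) - (y - mu))."""
    total = 0.0
    for y, mu in zip(actual, predicted):
        # y * log(y / mu) is taken as 0 when y == 0.
        log_term = y * math.log(y / mu) if y > 0 else 0.0
        total += 2.0 * (log_term - (y - mu))
    return total


def penalized_objective(actual, predicted, bin_sizes, min_bin_size=100.0,
                        complexity_weight=0.01, small_bin_weight=1.0):
    """Validation deviance plus hypothetical complexity and bin-size penalties."""
    deviance = poisson_deviance(actual, predicted)
    # Assumed complexity penalty: a fixed cost per bin.
    complexity = complexity_weight * len(bin_sizes)
    # Assumed bin-size penalty: relative shortfall below the minimum size.
    shortfall = sum(
        max(0.0, min_bin_size - size) / min_bin_size for size in bin_sizes
    )
    return deviance + complexity + small_bin_weight * shortfall


actual = [0, 1, 2]
predicted = [0.5, 1.0, 2.0]

balanced = penalized_objective(actual, predicted, bin_sizes=[100, 100])
undersized = penalized_objective(actual, predicted, bin_sizes=[150, 50])

print(balanced, undersized)
```

With identical predictions, the spec containing an undersized bin scores worse, which is how a penalty of this kind steers the search away from thin bins.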

Screening

from glm_factor_optimizer import rank_factors

rank_factors returns a dataframe with one row per candidate factor, including validation deviance improvement, missing rates, coverage, screening p-value, bin counts, and the simple screening spec.

Validation

from glm_factor_optimizer import (
    validation_report,
    by_factor_report,
    train_validation_comparison,
    weighted_mae,
    weighted_rmse,
)

Use these for custom reports outside GLMStudy.
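
For orientation, the usual definitions of weighted MAE and weighted RMSE are sketched below in plain Python. The functions carry a _sketch suffix to make clear they are illustrations, not the library's weighted_mae and weighted_rmse, whose signatures and weighting conventions are not documented here and may differ.

```python
import math


def weighted_mae_sketch(actual, predicted, weights):
    """Weighted mean absolute error: sum(w * |y - p|) / sum(w)."""
    total_weight = sum(weights)
    return sum(
        w * abs(y - p) for y, p, w in zip(actual, predicted, weights)
    ) / total_weight


def weighted_rmse_sketch(actual, predicted, weights):
    """Weighted root mean squared error: sqrt(sum(w * (y - p)^2) / sum(w))."""
    total_weight = sum(weights)
    return math.sqrt(
        sum(w * (y - p) ** 2 for y, p, w in zip(actual, predicted, weights))
        / total_weight
    )


actual = [1.0, 2.0, 4.0]
predicted = [1.0, 3.0, 2.0]
weights = [1.0, 1.0, 2.0]

mae = weighted_mae_sketch(actual, predicted, weights)
rmse = weighted_rmse_sketch(actual, predicted, weights)

print(mae, rmse)
```

The last row has twice the weight of the others, so its error of 2 dominates both metrics.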

Spark Backend

from glm_factor_optimizer.spark import SparkGLM, SparkGLMStudy, SparkGLMWorkflow

The Spark backend is optional and imports PySpark lazily. Install with:

pip install "glm-factor-optimizer[spark]"

Top-level GLM, RateGLM, and GLMStudy dispatch on the input dataframe. With Spark dataframes, manual fitting uses Spark ML generalized linear regression:

from glm_factor_optimizer import RateGLM

glm = RateGLM(target_col="events", exposure_col="hours")
model = glm.fit(train_sdf, factors=["event_type", "region"])
scored = glm.predict(valid_sdf, model)

In long Spark Connect notebooks, call model.release() after a fitted model is no longer needed. Study and optimization workflows release temporary Spark ML models automatically.

Spark users can use the same study workflow without converting the modeling table to pandas:

from glm_factor_optimizer import GLMStudy

study = GLMStudy(
    spark.table("catalog.schema.modeling_table"),
    target="events",
    exposure="hours",
    family="poisson",
    prediction="predicted_events",
    factor_kinds={"region": "categorical"},
)

train, valid, holdout = study.split(
    train_fraction=0.6,
    validation_fraction=0.2,
    holdout_fraction=0.2,
    seed=42,
)
ranking = study.rank_candidates(["machine_age", "region"], bins=5)

age = study.factor("machine_age")
age.coarse_bins(bins=5)
age.compare()
age.accept()

study.fit_main_effects()
report = study.validation_report()
final = study.finalize()

Spark study reports are aggregate metadata collected for display and saved run artifacts; the large modeling frames remain Spark DataFrames. Report summaries combine totals, deviance, MAE, and RMSE in compact Spark aggregations.

For Spark-native factor screening, keep the data in Spark:

from glm_factor_optimizer import RateGLM, split

train, valid, holdout = split(spark.table("catalog.schema.modeling_table"))

glm = RateGLM(
    candidate_factors=["event_type", "region", "service_channel"],
    target_col="events",
    exposure_col="hours",
    family="poisson",
    prediction_col="predicted_events",
)
glm.fit(train, validation_df=valid)

display(glm.identified_factors_)
scored_holdout = glm.predict(holdout)