Architecture

glm-factor-optimizer keeps the low-level pieces visible. You can use the automatic workflows, or call the same binning, fitting, scoring, and logging functions yourself.

Layer 1: Specs and Model Primitives

Core modules:

  • bins.py
  • model.py
  • metrics.py
  • optimize.py
  • penalties.py

Responsibilities:

  • create JSON-serializable numeric and categorical specs
  • apply specs to dataframes
  • fit GLMs
  • score predictions
  • optimize one factor at a time
  • define reusable optimization penalties

These modules hold the core mechanics that the higher layers reuse.
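As an illustration of the first two responsibilities, the sketch below defines one numeric and one categorical spec as plain dictionaries and applies them to rows. The key names (`kind`, `column`, `edges`, `groups`, `default`) and the `apply_spec` helper are assumptions made for this example, not the actual layout or API of bins.py. Rows are plain dicts rather than a dataframe to keep the example dependency-free.

```python
from bisect import bisect_right

# Hypothetical spec layouts; the real key names in bins.py may differ.
numeric_spec = {"kind": "numeric", "column": "driver_age", "edges": [25, 40, 60]}
categorical_spec = {
    "kind": "categorical",
    "column": "region",
    "groups": {"north": "N", "south": "S"},
    "default": "other",
}


def apply_spec(spec, rows):
    """Return one bin label per row (hypothetical helper, rows are plain dicts)."""
    column = spec["column"]
    if spec["kind"] == "numeric":
        # bisect_right counts the edges <= value, so bin i covers
        # the half-open interval [edges[i-1], edges[i]).
        return [bisect_right(spec["edges"], row[column]) for row in rows]
    if spec["kind"] == "categorical":
        # Levels without an explicit group fall back to the default label.
        return [spec["groups"].get(row[column], spec["default"]) for row in rows]
    raise ValueError(f"unknown spec kind: {spec['kind']!r}")


rows = [
    {"driver_age": 19, "region": "north"},
    {"driver_age": 40, "region": "east"},
    {"driver_age": 71, "region": "south"},
]
age_bins = apply_spec(numeric_spec, rows)         # [0, 2, 3]
region_bins = apply_spec(categorical_spec, rows)  # ['N', 'other', 'S']
```

Because the spec carries no code, only data, the same dictionary could in principle drive a pandas or Spark implementation of `apply_spec`.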

Layer 2: Analysis Helpers

Helper modules:

  • screening.py
  • aggregation.py
  • sampling.py
  • diagnostics.py
  • validation.py
  • runs.py

Responsibilities:

  • rank candidate factors
  • create representative samples
  • aggregate model development tables
  • find candidate interactions
  • produce validation reports
  • save run artifacts

Use these helpers directly in notebooks or as building blocks for automatic workflows.

Layer 3: Notebook Study

Main modules:

  • study.py
  • factor.py

Responsibilities:

  • own the train, validation, and holdout splits
  • manage accepted and rejected factor specs
  • keep audit history
  • compare candidate bins to the current model
  • refine factors with the current full model fixed
  • test interactions explicitly
  • finalize and save model artifacts

Use GLMStudy when you want to design factors step by step and keep a record of the choices you made.

Layer 4: Automatic Workflow

Main module:

  • workflow.py

GLMWorkflow is a compact automatic wrapper for baselines, experiments, and examples. When you need to review each modeling decision closely, use GLMStudy instead.

Layer 5: Spark Backend

Spark modules live under:

  • glm_factor_optimizer.spark

Spark support lives in its own package namespace, so pandas users can install the core package without PySpark. Spark imports are lazy, and the same JSON specs can be used with Spark dataframes and Spark GLM jobs.
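One common way to keep an optional dependency lazy is to import it inside the function that needs it rather than at module import time. The sketch below shows that general pattern. `require_optional` is a hypothetical helper written for this example, and it is not taken from glm_factor_optimizer.spark, which may implement lazy imports differently.

```python
import importlib


def require_optional(module_name):
    """Import an optional dependency at call time (hypothetical helper).

    Importing here, instead of at the top of the module, means the
    dependency is only needed when the feature is actually used.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name} is required for this feature but is not installed"
        ) from exc


# A module that is always available imports normally.
json_module = require_optional("json")

# A missing module only fails when the feature is requested.
try:
    require_optional("module_that_is_not_installed_xyz")
    missing_raised = False
except ImportError:
    missing_raised = True
```

With this pattern, a Spark-specific function would call `require_optional("pyspark")` on first use, so importing the core package never touches PySpark.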

Data Flow

A common flow is:

raw dataframe
  -> split
  -> rank candidate raw factors
  -> propose factor spec
  -> apply spec to train and validation
  -> fit model with accepted factors fixed
  -> score validation
  -> accept or reject
  -> fit main effects
  -> refine accepted factors
  -> test interactions
  -> finalize on holdout
  -> save artifacts
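
The accept-or-reject part of this flow can be sketched as a greedy loop. This is a minimal illustration under stated assumptions: `select_factors` and `fit_and_score` are hypothetical names, the score is assumed to be "lower is better" (for example a validation deviance), and a toy scorer stands in for applying specs and fitting a GLM. It is not the logic of study.py or workflow.py.

```python
def select_factors(candidates, fit_and_score, min_gain=0.0):
    """Keep a candidate spec only if it improves the validation score.

    fit_and_score(specs) is a hypothetical callable that fits a model
    with the given specs and returns a validation score (lower is better).
    """
    accepted = []
    history = []  # audit trail of every decision
    best = fit_and_score(accepted)  # baseline with no factors
    for spec in candidates:
        # Previously accepted factors stay fixed while one candidate is tested.
        score = fit_and_score(accepted + [spec])
        gain = best - score
        decision = "accept" if gain > min_gain else "reject"
        history.append(
            {"column": spec["column"], "score": score, "decision": decision}
        )
        if decision == "accept":
            accepted.append(spec)
            best = score
    return accepted, history


# Toy stand-in for "apply spec, fit model, score validation".
toy_gain = {"driver_age": 5.0, "region": 0.0, "vehicle_age": 2.0}


def toy_score(specs):
    return 100.0 - sum(toy_gain[spec["column"]] for spec in specs)


candidates = [{"column": name} for name in toy_gain]
accepted, history = select_factors(candidates, toy_score)
accepted_columns = [spec["column"] for spec in accepted]
decisions = [entry["decision"] for entry in history]
```

Recording rejected candidates alongside accepted ones is what makes the history usable for audit review. The holdout sample is deliberately absent from the loop and is only scored at the finalize step.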

Why Specs Are JSON Dictionaries

Specs are plain dictionaries, which makes them:

  • easy to inspect in notebooks
  • easy to write to JSON
  • independent of Python class versions
  • portable to Spark or other execution engines
  • suitable for audit review

A saved spec can be reloaded and applied later without depending on the Python classes that produced it.
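
A short round trip through the standard `json` module shows why plain dictionaries are convenient here. The spec layout and file name are assumptions made for this example, not the package's actual format.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical spec layout; only JSON-native types are used
# (str, int, float, bool, None, list, dict with string keys).
spec = {"kind": "numeric", "column": "driver_age", "edges": [25, 40, 60]}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "driver_age_spec.json"
    # sort_keys and indent give stable, diff-friendly files for review.
    path.write_text(json.dumps(spec, indent=2, sort_keys=True))
    restored = json.loads(path.read_text())

round_trip_ok = restored == spec
```

The round trip is lossless only while a spec sticks to JSON-native types: tuples come back as lists and non-string dictionary keys come back as strings, so those are best avoided in a spec.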