Binning and Grouping Specs

All specs are plain JSON-serializable dictionaries. Specs are learned from training data and then applied to train, validation, holdout, or future data.

Numeric Spec

Example:

{
  "type": "numeric",
  "column": "machine_age",
  "output": "machine_age_bin",
  "method": "quantile",
  "edges": [17.999999999, 25.0, 35.0, 50.0, 85.000000001],
  "labels": ["bin_1", "bin_2", "bin_3", "bin_4"]
}

Fields:

Field Meaning
type Always "numeric".
column Raw input column.
output Transformed output factor column.
method Source method, such as "quantile" or "optuna_quantile_subset".
edges Ordered numeric cut edges.
labels Bin labels.
prebin_edges Optional Optuna prebin edge list.

Application behavior:

  • Values inside edges are mapped to labels.
  • Missing or nonnumeric values become "missing".
  • The first interval includes the lowest edge.

Categorical Spec

Example:

{
  "type": "categorical",
  "column": "equipment_type",
  "output": "equipment_type_group",
  "order": ["compact", "standard", "heavy", "specialized"],
  "cutpoints": [2, 3],
  "mapping": {
    "compact": "group_01",
    "standard": "group_01",
    "heavy": "group_02",
    "specialized": "group_03"
  },
  "labels": ["group_01", "group_02", "group_03"],
  "default": "other",
  "missing": "__missing__",
  "stats": []
}

Fields:

Field Meaning
type Always "categorical".
column Raw input column.
output Transformed group column.
order Training categories ordered by observed target level.
cutpoints Boundaries over the target-ordered category list.
mapping Category-to-group mapping.
labels Group labels.
default Value for unseen categories.
missing Internal missing key used during grouping.
stats Training target-level table records used to build the grouping.

Application behavior:

  • Categories are converted to strings.
  • Missing categories use the missing key.
  • Unseen categories become default, usually "other".

Interaction Spec

Interaction specs are created by GLMStudy.test_interaction(...):

{
  "type": "interaction",
  "columns": ["machine_age_bin", "equipment_type_group"],
  "output": "machine_age_bin__x__equipment_type_group",
  "method": "string_cross"
}

Fields:

Field Meaning
type Always "interaction".
columns Accepted transformed columns to cross.
output Interaction output column.
method Current method, "string_cross".

The interaction output is a string cross of the two accepted transformed columns. Interactions are only added to the model after explicit acceptance.

Persistence Contract

Keep specs:

  • JSON serializable
  • independent of Python objects
  • learned from training data only
  • directly applicable to validation, holdout, or scoring data
  • suitable for logging and audit review