Cleaner
The glassbox.cleaner module provides scikit-learn-style transformers for data preprocessing. Every transformer follows the fit → transform contract defined by BaseTransformer.
Transformer API
All transformers share the same interface:
transformer.fit(X) # Learn parameters from training data
transformer.transform(X) # Apply the transformation
transformer.fit_transform(X) # Shorthand: fit + transform
X is always an np.ndarray of shape (n_samples, n_features).
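To make the contract concrete, here is a minimal sketch of a class that follows it. MeanCenterer is a hypothetical stand-in written for illustration only; it is not part of glassbox and does not reproduce BaseTransformer. It shows parameters being learned once in fit and then reused by transform on unseen data.

```python
import numpy as np


class MeanCenterer:
    """Hypothetical stand-in transformer; not part of glassbox."""

    def fit(self, X: np.ndarray) -> "MeanCenterer":
        # Learn one parameter per column from the training data.
        self.mean_ = X.mean(axis=0)
        return self

    def transform(self, X: np.ndarray) -> np.ndarray:
        # Apply the learned parameters without re-estimating them.
        return X - self.mean_

    def fit_transform(self, X: np.ndarray) -> np.ndarray:
        # Shorthand for fit followed by transform on the same data.
        return self.fit(X).transform(X)


X_train = np.array([[1.0, 10.0], [3.0, 30.0]])
X_test = np.array([[2.0, 40.0]])

centerer = MeanCenterer()
X_train_out = centerer.fit_transform(X_train)  # parameters come from X_train
X_test_out = centerer.transform(X_test)        # X_test reuses those parameters
```

Calling transform on X_test rather than fit_transform keeps the test rows from influencing the learned column means.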
Imputation
SimpleImputer replaces missing values (NaN) using a chosen strategy.
Available Strategies
| Strategy | Behavior |
|---|---|
| ImputationStrategy.MEAN | Replace with column mean (default). |
| ImputationStrategy.MEDIAN | Replace with column median. |
| ImputationStrategy.MODE | Replace with column mode. |
| ImputationStrategy.CONSTANT | Replace with a user-defined constant. |
Example
from glassbox.cleaner import SimpleImputer, ImputationStrategy
# Mean imputation (default)
imputer = SimpleImputer()
X_clean = imputer.fit_transform(X)
# Constant imputation
imputer = SimpleImputer(
    strategy=ImputationStrategy.CONSTANT,
    constant_value=-1.0,
)
X_clean = imputer.fit_transform(X)
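As a rough illustration of what the mean, median, and constant strategies compute, the following NumPy-only sketch fills NaN entries column by column. It is a sketch of the arithmetic under that assumption, not the SimpleImputer source; the mode strategy is left out.

```python
import numpy as np

X = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
    [8.0, 60.0],
])

# Column statistics computed while ignoring NaN entries.
col_mean = np.nanmean(X, axis=0)      # [4.0, 30.0]
col_median = np.nanmedian(X, axis=0)  # [3.0, 20.0]

# Fill each NaN with the statistic of its own column.
missing = np.isnan(X)
X_mean = np.where(missing, col_mean, X)
X_median = np.where(missing, col_median, X)
X_const = np.where(missing, -1.0, X)  # constant strategy with -1.0
```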
Outlier Capping
OutlierCapper clips values that fall outside the IQR fences [Q1 − 1.5×IQR, Q3 + 1.5×IQR], where IQR = Q3 − Q1, to the nearest fence.
from glassbox.cleaner import OutlierCapper
capper = OutlierCapper()
X_capped = capper.fit_transform(X)
Info
NaN values are preserved — only non-missing values are clipped.
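The IQR rule can be worked through with plain NumPy. This is only a sketch: it assumes per-column quartiles with NumPy's default linear interpolation, which may differ from how OutlierCapper computes them.

```python
import numpy as np

X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0], [np.nan]])

# Quartiles per column, ignoring NaN (linear interpolation is NumPy's default).
q1, q3 = np.nanpercentile(X, [25, 75], axis=0)  # q1 = [2.0], q3 = [4.0]
iqr = q3 - q1                                   # [2.0]
lower = q1 - 1.5 * iqr                          # [-1.0]
upper = q3 + 1.5 * iqr                          # [7.0]

# np.clip propagates NaN, so missing values pass through unchanged.
X_capped = np.clip(X, lower, upper)
```

Here the value 100.0 is pulled down to the upper fence 7.0, while the NaN row stays NaN.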
Scaling
StandardScaler
Standardizes features to zero mean and unit variance: z = (x - μ) / σ, where μ and σ are each feature's mean and standard deviation learned during fit.
from glassbox.cleaner import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
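The formula can be checked by hand with NumPy. The sketch below assumes the population standard deviation (ddof=0) and a guard that replaces a zero σ with 1.0 for constant columns; neither detail is confirmed by this page.

```python
import numpy as np

X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])

mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)               # population std (ddof=0); an assumption
sigma = np.where(sigma == 0, 1.0, sigma)  # assumed guard for constant columns

X_scaled = (X_train - mu) / sigma

# Each column now has mean 0 and standard deviation 1.
col_means = X_scaled.mean(axis=0)
col_stds = X_scaled.std(axis=0)
```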
MinMaxScaler
Scales features to the [0, 1] range: x' = (x - min) / (max - min), where min and max are each feature's minimum and maximum learned during fit.
from glassbox.cleaner import MinMaxScaler
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
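A small worked example of the min-max formula, written with NumPy only. The divide-by-zero guard for constant columns is an assumption, and whether MinMaxScaler clips unseen values is not stated here; the sketch simply applies the formula as written.

```python
import numpy as np

X_train = np.array([[10.0], [20.0], [30.0]])
X_new = np.array([[40.0]])

x_min = X_train.min(axis=0)
x_max = X_train.max(axis=0)
# Assumed guard: avoid dividing by zero when a column is constant.
span = np.where(x_max == x_min, 1.0, x_max - x_min)

X_train_scaled = (X_train - x_min) / span  # [[0.0], [0.5], [1.0]]
X_new_scaled = (X_new - x_min) / span      # [[1.5]]
```

Because min and max come from the training data, a new value above the training maximum maps above 1 under the plain formula.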
Encoding
OneHotEncoder
Converts each categorical column into binary indicator columns — one per unique category.
from glassbox.cleaner import OneHotEncoder
encoder = OneHotEncoder()
X_encoded = encoder.fit_transform(X_categorical)
# shape: (n_samples, total_unique_categories)
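To show where the output width comes from, here is a NumPy-only sketch of one-hot encoding. It assumes categories are ordered alphabetically within each column, which may not match the column order OneHotEncoder produces.

```python
import numpy as np

X_categorical = np.array([
    ["red", "S"],
    ["blue", "M"],
    ["red", "L"],
])

blocks = []
for j in range(X_categorical.shape[1]):
    column = X_categorical[:, j]
    categories = np.unique(column)  # sorted unique categories of this column
    # Compare every row with every category: one indicator column per category.
    blocks.append((column[:, None] == categories[None, :]).astype(float))

# 2 categories in the first column + 3 in the second = 5 output columns.
X_encoded = np.hstack(blocks)
```

Each row contains exactly one 1.0 per original column, so every row of X_encoded sums to 2 in this example.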
LabelEncoder
Maps each unique label to an integer in 0, 1, 2, …, n_classes − 1.
from glassbox.cleaner import LabelEncoder
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y.reshape(-1, 1))
Tip
Use LabelEncoder for target labels and OneHotEncoder for input features.
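The label-to-integer mapping can be sketched with np.unique. This assumes labels are numbered in sorted order; LabelEncoder may assign integers differently (for example, in order of first appearance).

```python
import numpy as np

y = np.array(["cat", "dog", "bird", "dog"])

# classes holds the sorted unique labels; y_encoded holds each label's index.
classes, y_encoded = np.unique(y, return_inverse=True)
# classes   -> ['bird' 'cat' 'dog']
# y_encoded -> [1 2 0 2]

# Indexing classes with the codes recovers the original labels.
decoded = classes[y_encoded]
```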
Full Pipeline Example
import numpy as np
from glassbox.cleaner import (
    SimpleImputer,
    OutlierCapper,
    StandardScaler,
    OneHotEncoder,
    LabelEncoder,
)
# Assume X_num holds the numeric features and X_cat the categorical features
# Numeric pipeline
X_num = SimpleImputer().fit_transform(X_num)
X_num = OutlierCapper().fit_transform(X_num)
X_num = StandardScaler().fit_transform(X_num)
# Categorical pipeline
X_cat = OneHotEncoder().fit_transform(X_cat)
# Combine
X_final = np.hstack([X_num, X_cat])
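The pipeline above fits and transforms the same arrays. When held-out data must go through the same steps, one option is to keep the fitted instances and call only transform on the new rows. The sketch below illustrates that pattern with two hypothetical stand-in steps (Center and Scale) and two hypothetical helpers (fit_steps and apply_steps); none of these names come from glassbox.

```python
import numpy as np


class Center:
    """Hypothetical stand-in step: subtracts the training column mean."""

    def fit(self, X):
        self.mean_ = X.mean(axis=0)
        return self

    def transform(self, X):
        return X - self.mean_


class Scale:
    """Hypothetical stand-in step: divides by the training max absolute value."""

    def fit(self, X):
        self.max_abs_ = np.abs(X).max(axis=0)
        return self

    def transform(self, X):
        return X / self.max_abs_


def fit_steps(steps, X):
    # Fit each step on the output of the previous one; return the final output.
    for step in steps:
        X = step.fit(X).transform(X)
    return X


def apply_steps(steps, X):
    # Reuse already-fitted steps on new data, in the same order.
    for step in steps:
        X = step.transform(X)
    return X


steps = [Center(), Scale()]
X_train = np.array([[0.0], [2.0], [4.0]])
X_test = np.array([[6.0]])

X_train_out = fit_steps(steps, X_train)  # [[-1.0], [0.0], [1.0]]
X_test_out = apply_steps(steps, X_test)  # [[2.0]]
```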
API Reference
SimpleImputer
Bases: BaseTransformer
Replaces missing values using a specified statistical strategy.
Notes
This imputer supports basic strategies like mean, median, mode, or a constant value.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| strategy | ImputationStrategy | The strategy used for missing value imputation. | ImputationStrategy.MEAN |
| constant_value | Union[float, str, None] | The value to use when strategy is CONSTANT. | 0.0 |
Source code in glassbox/cleaner/imputers.py
fit
Learn the imputation values from the training data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| X | ndarray | Input array of shape (n_samples, n_features). | required |
Returns:
| Type | Description |
|---|---|
| Self | Fitted imputer instance. |
Source code in glassbox/cleaner/imputers.py
transform
Impute missing values in the given dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| X | ndarray | Input array of shape (n_samples, n_features). | required |
Returns:
| Type | Description |
|---|---|
| ndarray | Transformed array with missing values imputed. |
Source code in glassbox/cleaner/imputers.py
ImputationStrategy
Bases: Enum
Strategies available for imputing missing values.
OutlierCapper
Bases: BaseTransformer
Identifies and caps numerical outliers based on specified bounds.
Source code in glassbox/cleaner/outliers.py
fit
Detect boundaries for outlier capping from the training data.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| X | ndarray | Input array of shape (n_samples, n_features). | required |
Returns:
| Type | Description |
|---|---|
| Self | Fitted outlier capper instance. |
Source code in glassbox/cleaner/outliers.py
transform
Cap outliers in the input dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| X | ndarray | Input array of shape (n_samples, n_features). | required |
Returns:
| Type | Description |
|---|---|
| ndarray | Transformed array with outliers capped. |
Source code in glassbox/cleaner/outliers.py
StandardScaler
Bases: BaseTransformer
Standardizes features by removing the mean and scaling to unit variance.
Source code in glassbox/cleaner/scalers.py
fit
Compute the mean and standard deviation to be used for later scaling.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| X | ndarray | Input array of shape (n_samples, n_features). | required |
Returns:
| Type | Description |
|---|---|
| Self | Fitted scaler instance. |
Source code in glassbox/cleaner/scalers.py
transform
Perform standardization by centering and scaling.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| X | ndarray | Input array of shape (n_samples, n_features). | required |
Returns:
| Type | Description |
|---|---|
| ndarray | Transformed array with features standardized. |
Source code in glassbox/cleaner/scalers.py
MinMaxScaler
Bases: BaseTransformer
Transforms features by scaling each feature to a given range.
Source code in glassbox/cleaner/scalers.py
fit
Compute the minimum and maximum to be used for later scaling.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| X | ndarray | Input array of shape (n_samples, n_features). | required |
Returns:
| Type | Description |
|---|---|
| Self | Fitted scaler instance. |
Source code in glassbox/cleaner/scalers.py
transform
Scale the features of X according to the feature range.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| X | ndarray | Input array of shape (n_samples, n_features). | required |
Returns:
| Type | Description |
|---|---|
| ndarray | Transformed array with features scaled. |
Source code in glassbox/cleaner/scalers.py
OneHotEncoder
Bases: BaseTransformer
Encode categorical features as a one-hot numeric array.
Source code in glassbox/cleaner/encoders.py
fit
Learn the categorical levels for encoding.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| X | ndarray | Input array of shape (n_samples, n_features). | required |
Returns:
| Type | Description |
|---|---|
| Self | Fitted encoder instance. |
Source code in glassbox/cleaner/encoders.py
transform
Transform the dataset into a one-hot encoded representation.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| X | ndarray | Input array of shape (n_samples, n_features). | required |
Returns:
| Type | Description |
|---|---|
| ndarray | Transformed array, one-hot encoded. |
Source code in glassbox/cleaner/encoders.py
LabelEncoder
Bases: BaseTransformer
Encode target labels as integer values between 0 and n_classes-1.
Source code in glassbox/cleaner/encoders.py
fit
Learn the vocabulary of the labels.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| X | ndarray | Input array of shape (n_samples, n_features). | required |
Returns:
| Type | Description |
|---|---|
| Self | Fitted encoder instance. |
Source code in glassbox/cleaner/encoders.py
transform
Transform labels to normalized encoding.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| X | ndarray | Input array of shape (n_samples, n_features). | required |
Returns:
| Type | Description |
|---|---|
| ndarray | Transformed array of encoded labels. |