Inspector

The glassbox.inspector module performs a non-destructive audit of raw data, producing a comprehensive EDAReport with feature typing, statistics, outlier detection, and association analysis.

Running a Full Audit

The DataAuditor orchestrates the entire EDA pipeline in a single call:

from glassbox.frame import read_csv
from glassbox.inspector import DataAuditor

ds = read_csv("data.csv")
auditor = DataAuditor()
report = auditor.run_audit(ds)

The returned EDAReport contains five sections:

| Field | Type | Description |
| ----- | ---- | ----------- |
| feature_types | Dict[str, FeatureType] | Auto-detected type per column. |
| missing_values | Dict[str, MissingInfo] | Missing count & percentage. |
| outliers_info | Dict[str, OutlierInfo] | IQR-based outlier bounds & counts. |
| summary_stats | Dict[str, NumericStats \| CategoricalStats] | Descriptive statistics. |
| collinearity_map | List[CollinearityPair] | Pairwise associations. |

Auto-Typing

The AutoTyper classifies each column into one of four logical types:

| FeatureType | Criteria |
| ----------- | -------- |
| BOOLEAN | Exactly 2 unique values. |
| ORDINAL | Integer values with cardinality < 20. |
| NUMERICAL | Continuous float values. |
| NOMINAL | Non-numeric (string/categorical). |

from glassbox.inspector import AutoTyper

typer = AutoTyper()
types = typer.infer_types(ds)
# {'Age': NUMERICAL, 'Gender': NOMINAL, 'Passed': BOOLEAN, ...}
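
The criteria table can be sketched as a standalone heuristic. This is a toy re-implementation, not the library's actual code: the 20-value ordinal cutoff and the check order (BOOLEAN before the numeric branches) are assumptions drawn from the table above.

```python
ORDINAL_MAX_CARDINALITY = 20  # assumed cutoff from the criteria table

def infer_type(values):
    """Toy sketch of the AutoTyper decision order."""
    vals = [v for v in values if v is not None]
    uniques = set(vals)
    if len(uniques) == 2:                      # BOOLEAN is checked first
        return "BOOLEAN"
    if all(isinstance(v, (int, float)) for v in vals):
        if (all(float(v).is_integer() for v in vals)
                and len(uniques) < ORDINAL_MAX_CARDINALITY):
            return "ORDINAL"
        return "NUMERICAL"
    return "NOMINAL"

print(infer_type([0, 1, 0, 1]))      # BOOLEAN
print(infer_type([1, 2, 3, 4, 5]))   # ORDINAL
print(infer_type([1.5, 2.7, 3.1]))   # NUMERICAL
print(infer_type(["a", "b", "c"]))   # NOMINAL
```

Note that under these criteria a two-category string column (e.g. `["yes", "no"]`) is also classified as BOOLEAN, since the BOOLEAN check precedes the numeric ones.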

Statistical Profiling

StatProfiler computes summary statistics split by feature type:

Numeric features → NumericStats:

  • Mean, Median, Standard Deviation, Skewness, Kurtosis

Categorical features → CategoricalStats:

  • Mode, Cardinality (number of unique values)

from glassbox.inspector import StatProfiler

profiler = StatProfiler()

num_stats = profiler.calculate_numeric_stats(ds, ["Age", "Score"])
cat_stats = profiler.calculate_categorical_stats(ds, ["Gender"])
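
As a reference point, skewness and excess kurtosis can be computed directly with NumPy via standardized moments. This is a sketch; the library's exact estimator conventions (sample vs. population moments, bias correction) are not specified here.

```python
import numpy as np

def skew_kurtosis(x):
    """Population skewness and excess kurtosis via standardized moments."""
    x = np.asarray(x, dtype=float)
    x = x[~np.isnan(x)]           # ignore missing values, as the profiler does
    z = (x - x.mean()) / x.std()  # standardize (population std)
    skew = float((z ** 3).mean())
    kurt = float((z ** 4).mean() - 3.0)  # excess kurtosis: normal -> 0
    return skew, kurt

skew, kurt = skew_kurtosis([1, 2, 3, 4, 5])
print(round(skew, 6), round(kurt, 6))  # 0.0 -1.3
```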

Outlier Detection

OutlierDetector uses the Interquartile Range (IQR) method:

  • Lower bound: Q1 − 1.5 × IQR
  • Upper bound: Q3 + 1.5 × IQR

from glassbox.inspector import OutlierDetector

detector = OutlierDetector()
outliers = detector.flag_outliers(ds, ["Age", "Score"])

print(outliers["Age"])
# OutlierInfo(count=12, lower_bound=5.0, upper_bound=65.0)
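
The same fences can be reproduced with NumPy's percentile function. This is a sketch of the IQR rule only; the library's quantile interpolation method may differ, which can shift the bounds slightly on small samples.

```python
import numpy as np

def iqr_bounds(x):
    """IQR fence: [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    x = np.asarray(x, dtype=float)
    x = x[~np.isnan(x)]                 # drop missing values first
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

data = [10, 12, 11, 13, 12, 95]
lower, upper = iqr_bounds(data)
outliers = [v for v in data if v < lower or v > upper]
print(lower, upper, outliers)  # 9.0 15.0 [95]
```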

Association Analysis

AssociationAnalyzer computes pairwise associations:

  • Numeric ↔ Numeric: Pearson correlation coefficient
  • Categorical ↔ Categorical: Cramér's V statistic

from glassbox.inspector import AssociationAnalyzer

analyzer = AssociationAnalyzer()
pairs = analyzer.build_associations(
    ds,
    num_cols=["Age", "Score"],
    cat_cols=["Gender", "Region"],
)

for pair in pairs:
    print(f"{pair.feature_a} ↔ {pair.feature_b}: "
          f"{pair.score:.3f} ({pair.metric})")
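
For categorical pairs, the bias-uncorrected Cramér's V can be sketched from a contingency table as below. Whether the library applies a bias correction is not specified, so treat this as an illustration of the statistic, not the exact implementation.

```python
import numpy as np

def cramers_v(x, y):
    """Bias-uncorrected Cramér's V: sqrt(chi2 / (n * (min(r, k) - 1)))."""
    xc = np.unique(x, return_inverse=True)[1]  # integer codes per category
    yc = np.unique(y, return_inverse=True)[1]
    table = np.zeros((xc.max() + 1, yc.max() + 1))
    np.add.at(table, (xc, yc), 1)              # build the contingency table
    n = table.sum()
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    r, k = table.shape
    return float(np.sqrt(chi2 / (n * (min(r, k) - 1))))

print(cramers_v(["a", "a", "b", "b"], ["p", "p", "q", "q"]))  # 1.0 (perfect association)
print(cramers_v(["a", "a", "b", "b"], ["p", "q", "p", "q"]))  # 0.0 (independent)
```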

Serialization

The full report can be serialized to JSON:

json_str = report.to_json()

Note

NaN values are serialized as null and FeatureType enums are serialized by name (e.g., "NUMERICAL").
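
The behaviour described in the note can be reproduced with a small custom encoder. This standalone sketch uses a stand-in enum rather than the library's actual encoder; note that `json.dumps` would otherwise emit the non-standard literal `NaN` for bare floats, so NaN values are mapped to `None` (→ `null`) before encoding.

```python
import json
import math
from enum import Enum

class FeatureType(Enum):  # stand-in for the library's enum
    NUMERICAL = "numerical"
    NOMINAL = "nominal"

class EnumEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, Enum):
            return obj.name  # serialize enums by name, e.g. "NUMERICAL"
        return super().default(obj)

payload = {"type": FeatureType.NUMERICAL, "mean": float("nan")}
# Map NaN to None up front: json.dumps would otherwise emit the
# non-standard literal NaN, which strict JSON parsers reject.
clean = {k: None if isinstance(v, float) and math.isnan(v) else v
         for k, v in payload.items()}
print(json.dumps(clean, cls=EnumEncoder))  # {"type": "NUMERICAL", "mean": null}
```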


API Reference

DataAuditor

Orchestrates the EDA process to generate a complete report.

run_audit

run_audit(data)

Perform a full audit on the dataset.

Parameters:

| Name | Type | Description | Default |
| ---- | ---- | ----------- | ------- |
| data | Dataset | The dataset to audit. | required |

Returns:

| Type | Description |
| ---- | ----------- |
| EDAReport | A comprehensive report containing EDA results. |

Source code in glassbox/inspector/auditor.py
def run_audit(self, data: Dataset) -> EDAReport:
    """
    Perform a full audit on the dataset.

    Parameters
    ----------
    data : Dataset
        The dataset to audit.

    Returns
    -------
    EDAReport
        A comprehensive report containing EDA results.
    """
    auto_typer = AutoTyper()
    outlier_detector = OutlierDetector()
    stat_profiler = StatProfiler()
    association_analyzer = AssociationAnalyzer()

    feature_types = auto_typer.infer_types(data)
    n_samples = data.shape[0]

    missing_values = {}
    for col_name in data.columns:
        col_data = data.get_columns(col_name).data[:, 0]
        if np.issubdtype(col_data.dtype, np.number):
            missing = int(np.isnan(col_data).sum())
        else:
            missing = 0
            for v in col_data:
                if v is None or (isinstance(v, float) and np.isnan(v)):
                    missing += 1
        missing_values[col_name] = MissingInfo(
            count=missing, percentage=missing / n_samples
        )

    numeric_cols = [c for c, t in feature_types.items() if t == FeatureType.NUMERICAL]
    categorical_cols = [
        c
        for c, t in feature_types.items()
        if t in (FeatureType.NOMINAL, FeatureType.ORDINAL, FeatureType.BOOLEAN)
    ]

    outliers_info = outlier_detector.flag_outliers(data, numeric_cols)
    num_stats = stat_profiler.calculate_numeric_stats(data, numeric_cols)
    cat_stats = stat_profiler.calculate_categorical_stats(data, categorical_cols)
    collinearity_map = association_analyzer.build_associations(
        data, numeric_cols, categorical_cols
    )

    summary_stats = {**num_stats, **cat_stats}

    return EDAReport(
        feature_types=feature_types,
        missing_values=missing_values,
        outliers_info=outliers_info,
        summary_stats=summary_stats,
        collinearity_map=collinearity_map,
    )

AutoTyper

Infers logical data types for dataset columns.

infer_types

infer_types(data)

Infer feature types for all columns in the dataset.

Parameters:

| Name | Type | Description | Default |
| ---- | ---- | ----------- | ------- |
| data | Dataset | The dataset to analyze. | required |

Returns:

| Type | Description |
| ---- | ----------- |
| Dict[str, FeatureType] | Mapping from column names to their inferred FeatureType. |

Source code in glassbox/inspector/auto_typer.py
def infer_types(self, data: Dataset) -> Dict[str, FeatureType]:
    """
    Infer feature types for all columns in the dataset.

    Parameters
    ----------
    data : Dataset
        The dataset to analyze.

    Returns
    -------
    Dict
        Mapping from column names to their inferred FeatureType.
    """
    results = {}
    for col_name in data.columns:
        col_data = data.get_columns(col_name).data[:, 0]

        if self._is_boolean(col_data):
            results[col_name] = FeatureType.BOOLEAN
        elif self._is_numeric(col_data):
            if self._is_ordinal(col_data):
                results[col_name] = FeatureType.ORDINAL
            else:
                results[col_name] = FeatureType.NUMERICAL
        else:
            results[col_name] = FeatureType.NOMINAL
    return results

StatProfiler

Calculates summary statistics for dataset columns.

calculate_numeric_stats

calculate_numeric_stats(data, cols)

Compute statistics for numerical columns.

Parameters:

| Name | Type | Description | Default |
| ---- | ---- | ----------- | ------- |
| data | Dataset | The dataset containing the inputs. | required |
| cols | List[str] | List of column names to analyze. | required |

Returns:

| Type | Description |
| ---- | ----------- |
| Dict[str, NumericStats] | Mapping from column names to NumericStats objects. |

Source code in glassbox/inspector/statistics.py
def calculate_numeric_stats(
    self, data: Dataset, cols: List[str]
) -> Dict[str, NumericStats]:
    """
    Compute statistics for numerical columns.

    Parameters
    ----------
    data : Dataset
        The dataset containing the inputs.
    cols : List[str]
        List of column names to analyze.

    Returns
    -------
    Dict
        Mapping from column names to NumericStats objects.
    """
    results = {}
    for col_name in cols:
        col_data = data.get_columns(col_name).data[:, 0].astype(float)
        col_valid = col_data[~np.isnan(col_data)]
        if len(col_valid) == 0:
            results[col_name] = NumericStats(
                mean=float("nan"),
                median=float("nan"),
                std=float("nan"),
                skew=float("nan"),
                kurt=float("nan"),
            )
            continue

        results[col_name] = NumericStats(
            mean=self._calc_mean(col_valid),
            median=self._calc_median(col_valid),
            std=self._calc_std(col_valid),
            skew=self._calc_skew(col_valid),
            kurt=self._calc_kurtosis(col_valid),
        )
    return results

calculate_categorical_stats

calculate_categorical_stats(data, cols)

Compute statistics for categorical columns.

Parameters:

| Name | Type | Description | Default |
| ---- | ---- | ----------- | ------- |
| data | Dataset | The dataset containing the inputs. | required |
| cols | List[str] | List of column names to analyze. | required |

Returns:

| Type | Description |
| ---- | ----------- |
| Dict[str, CategoricalStats] | Mapping from column names to CategoricalStats objects. |

Source code in glassbox/inspector/statistics.py
def calculate_categorical_stats(
    self, data: Dataset, cols: List[str]
) -> Dict[str, CategoricalStats]:
    """
    Compute statistics for categorical columns.

    Parameters
    ----------
    data : Dataset
        The dataset containing the inputs.
    cols : List[str]
        List of column names to analyze.

    Returns
    -------
    Dict
        Mapping from column names to CategoricalStats objects.
    """
    results = {}
    for col_name in cols:
        col_data = data.get_columns(col_name).data[:, 0]
        valid_mask = np.array(
            [
                v is not None and not (isinstance(v, float) and np.isnan(v))
                for v in col_data
            ]
        )
        col_valid = col_data[valid_mask]

        if len(col_valid) == 0:
            results[col_name] = CategoricalStats(mode=float("nan"), cardinality=0)
            continue

        unique_vals = np.unique(col_valid)
        results[col_name] = CategoricalStats(
            mode=self._calc_mode(col_valid), cardinality=len(unique_vals)
        )
    return results

AssociationAnalyzer

Analyzes pairwise correlations and associations between features.

build_associations

build_associations(data, num_cols, cat_cols)

Compute pairwise correlation and associations across specified columns.

Parameters:

| Name | Type | Description | Default |
| ---- | ---- | ----------- | ------- |
| data | Dataset | Input dataset. | required |
| num_cols | List[str] | Numerical columns to inspect with Pearson. | required |
| cat_cols | List[str] | Categorical columns to inspect with Cramér's V. | required |

Returns:

| Type | Description |
| ---- | ----------- |
| List[CollinearityPair] | A list of CollinearityPair objects containing scores. |

Source code in glassbox/inspector/statistics.py
def build_associations(
    self, data: Dataset, num_cols: List[str], cat_cols: List[str]
) -> List[CollinearityPair]:
    """
    Compute pairwise correlation and associations across specified columns.

    Parameters
    ----------
    data : Dataset
        Input dataset.
    num_cols : List[str]
        Numerical columns to inspect with Pearson.
    cat_cols : List[str]
        Categorical columns to inspect with Cramer's V.

    Returns
    -------
    List
        A list of CollinearityPair objects containing scores.
    """
    pairs = []
    n_num = len(num_cols)
    for i in range(n_num):
        for j in range(i + 1, n_num):
            col_x_name = num_cols[i]
            col_y_name = num_cols[j]
            col_x = data.get_columns(col_x_name).data[:, 0].astype(float)
            col_y = data.get_columns(col_y_name).data[:, 0].astype(float)

            valid_mask = ~(np.isnan(col_x) | np.isnan(col_y))
            x_val = col_x[valid_mask]
            y_val = col_y[valid_mask]

            score = self._calc_pearson(x_val, y_val)
            pairs.append(
                CollinearityPair(
                    feature_a=col_x_name,
                    feature_b=col_y_name,
                    score=score,
                    metric="pearson",
                )
            )

    n_cat = len(cat_cols)
    for i in range(n_cat):
        for j in range(i + 1, n_cat):
            col_x_name = cat_cols[i]
            col_y_name = cat_cols[j]
            col_x = data.get_columns(col_x_name).data[:, 0]
            col_y = data.get_columns(col_y_name).data[:, 0]

            valid_mask = np.array(
                [
                    v_x is not None
                    and not (isinstance(v_x, float) and np.isnan(v_x))
                    and v_y is not None
                    and not (isinstance(v_y, float) and np.isnan(v_y))
                    for v_x, v_y in zip(col_x, col_y)
                ]
            )
            x_val = col_x[valid_mask]
            y_val = col_y[valid_mask]

            score = self._calc_cramers_v(x_val, y_val)
            pairs.append(
                CollinearityPair(
                    feature_a=col_x_name,
                    feature_b=col_y_name,
                    score=score,
                    metric="cramers_v",
                )
            )

    return pairs

OutlierDetector

Detects outliers within numerical columns of a dataset.

flag_outliers

flag_outliers(data, cols)

Identify outliers for specified columns.

Parameters:

| Name | Type | Description | Default |
| ---- | ---- | ----------- | ------- |
| data | Dataset | The dataset containing the columns. | required |
| cols | List[str] | A list of column names to check for outliers. | required |

Returns:

| Type | Description |
| ---- | ----------- |
| Dict[str, OutlierInfo] | Mapping from column names to OutlierInfo objects. |

Source code in glassbox/inspector/outliers.py
def flag_outliers(self, data: Dataset, cols: List[str]) -> Dict[str, OutlierInfo]:
    """
    Identify outliers for specified columns.

    Parameters
    ----------
    data : Dataset
        The dataset containing the columns.
    cols : List[str]
        A list of column names to check for outliers.

    Returns
    -------
    Dict
        Mapping from column names to OutlierInfo objects.
    """
    results = {}
    for col_name in cols:
        # Extract and cast to float so vectorized NaN checks and comparisons
        # work on mixed-dtype object arrays
        col_data = data.get_columns(col_name).data[:, 0].astype(float)
        col_valid = col_data[~np.isnan(col_data)]
        if len(col_valid) == 0:
            results[col_name] = OutlierInfo(
                count=0, lower_bound=float("nan"), upper_bound=float("nan")
            )
            continue
        lower, upper = calc_iqr(col_valid)

        count = int(np.sum((col_valid < lower) | (col_valid > upper)))
        results[col_name] = OutlierInfo(
            count=count, lower_bound=lower, upper_bound=upper
        )
    return results

EDAReport dataclass

EDAReport(
    feature_types,
    missing_values,
    outliers_info,
    summary_stats,
    collinearity_map,
)

Container for the complete Exploratory Data Analysis report.

to_json

to_json()

Serialize the entire EDA report to a JSON string.

Returns:

| Type | Description |
| ---- | ----------- |
| str | JSON representation of the report. |

Source code in glassbox/inspector/report.py
def to_json(self) -> str:
    """
    Serialize the entire EDA report to a JSON string.

    Returns
    -------
    str
        JSON representation of the report.
    """

    class EnumEncoder(json.JSONEncoder):
        def default(self, obj):
            if isinstance(obj, Enum):
                return obj.name
            if isinstance(obj, float) and np.isnan(obj):
                return None
            return super().default(obj)

    # Note: json.dumps serializes bare NaN floats as the non-standard
    # literal NaN; the default() hook above is only invoked for objects
    # json cannot serialize natively (such as Enum members).
    return json.dumps(dataclasses.asdict(self), cls=EnumEncoder)

OutlierInfo dataclass

OutlierInfo(count, lower_bound, upper_bound)

Stores outlier bounds and count for a single feature.