Inspector
The glassbox.inspector module performs a non-destructive audit of raw data and produces an EDAReport covering feature typing, missing values, summary statistics, outlier detection, and association analysis.
Running a Full Audit
The DataAuditor orchestrates the entire EDA pipeline in a single call:
from glassbox.frame import read_csv
from glassbox.inspector import DataAuditor
ds = read_csv("data.csv")
auditor = DataAuditor()
report = auditor.run_audit(ds)
The returned EDAReport contains five sections:
| Field | Type | Description |
|---|---|---|
| feature_types | Dict[str, FeatureType] | Auto-detected type per column. |
| missing_values | Dict[str, MissingInfo] | Missing count & percentage. |
| outliers_info | Dict[str, OutlierInfo] | IQR-based outlier bounds & counts. |
| summary_stats | Dict[str, NumericStats \| CategoricalStats] | Descriptive statistics. |
| collinearity_map | List[CollinearityPair] | Pairwise associations. |
Auto-Typing
The AutoTyper classifies each column into one of four logical types:
| FeatureType | Criteria |
|---|---|
| BOOLEAN | Exactly 2 unique values. |
| ORDINAL | Integer values with cardinality < 20. |
| NUMERICAL | Continuous float values. |
| NOMINAL | Non-numeric (string/categorical). |
from glassbox.inspector import AutoTyper
typer = AutoTyper()
types = typer.infer_types(ds)
# {'Age': NUMERICAL, 'Gender': NOMINAL, 'Passed': BOOLEAN, ...}
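The four criteria in the table can be sketched as a small standalone classifier. This is an illustration only, not the AutoTyper implementation: the order in which the rules are checked, the handling of missing values, and the fallback for high-cardinality integers are all assumptions, and `infer_type_sketch` is a hypothetical name.

```python
from enum import Enum
from typing import Any, List


class FeatureType(Enum):
    # Stand-in enum; only the four member names come from the documentation.
    BOOLEAN = "BOOLEAN"
    ORDINAL = "ORDINAL"
    NUMERICAL = "NUMERICAL"
    NOMINAL = "NOMINAL"


def infer_type_sketch(values: List[Any], max_ordinal_cardinality: int = 20) -> FeatureType:
    """Classify one column of values using the documented criteria."""
    # Assumption: missing values (None) do not count toward cardinality.
    present = [v for v in values if v is not None]
    unique = set(present)

    # Assumption: the two-unique-values rule is checked first.
    if len(unique) == 2:
        return FeatureType.BOOLEAN

    # Any non-numeric value makes the column nominal.
    is_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    if not is_numeric:
        return FeatureType.NOMINAL

    # Integer-only columns with fewer than 20 distinct values are ordinal.
    if all(isinstance(v, int) for v in present) and len(unique) < max_ordinal_cardinality:
        return FeatureType.ORDINAL

    # Assumption: everything else, including high-cardinality integers, is numerical.
    return FeatureType.NUMERICAL
```

Under these assumptions a yes/no column is typed BOOLEAN before the nominal check runs, so the precedence matters for string columns with exactly two values.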
Statistical Profiling
StatProfiler computes summary statistics split by feature type:
Numeric features → NumericStats:
- Mean, Median, Standard Deviation, Skewness, Kurtosis
Categorical features → CategoricalStats:
- Mode, Cardinality (number of unique values)
from glassbox.inspector import StatProfiler
profiler = StatProfiler()
num_stats = profiler.calculate_numeric_stats(ds, ["Age", "Score"])
cat_stats = profiler.calculate_categorical_stats(ds, ["Gender"])
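For readers who want to see what these statistics amount to, here is a minimal standard-library sketch. It is not the StatProfiler code: the function names are hypothetical, and the choice of sample standard deviation, moment-based skewness, and excess kurtosis is an assumption, since the documentation does not say which definitions the library uses.

```python
import statistics
from collections import Counter
from typing import Dict, List


def numeric_stats_sketch(values: List[float]) -> Dict[str, float]:
    """Mean, median, standard deviation, skewness, and kurtosis of one column.

    Requires at least two values that are not all identical.
    """
    n = len(values)
    mean = statistics.fmean(values)
    # Population central moments (divide by n); a library may instead
    # apply small-sample corrections.
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return {
        "mean": mean,
        "median": statistics.median(values),
        "std": statistics.stdev(values),  # sample standard deviation (n - 1)
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2 - 3.0,  # excess kurtosis: 0 for a normal distribution
    }


def categorical_stats_sketch(values: List[str]) -> Dict[str, object]:
    """Mode and cardinality (number of unique values) of one column."""
    counts = Counter(values)
    return {"mode": counts.most_common(1)[0][0], "cardinality": len(counts)}
```

For a symmetric column such as 1, 2, 3, 4, 5 the skewness is 0 and the excess kurtosis is negative, which reflects a distribution flatter than the normal.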
Outlier Detection
OutlierDetector uses the Interquartile Range (IQR) method:
- Lower bound: Q1 − 1.5 × IQR
- Upper bound: Q3 + 1.5 × IQR
from glassbox.inspector import OutlierDetector
detector = OutlierDetector()
outliers = detector.flag_outliers(ds, ["Age", "Score"])
print(outliers["Age"])
# OutlierInfo(count=12, lower_bound=5.0, upper_bound=65.0)
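The IQR rule above can be worked through with the standard library. This is a sketch rather than the OutlierDetector source: `OutlierBounds` and `iqr_outliers_sketch` are hypothetical names, and the quartile interpolation method is an assumption, since different methods give slightly different bounds on small samples.

```python
import statistics
from typing import List, NamedTuple


class OutlierBounds(NamedTuple):
    # Hypothetical stand-in mirroring the fields printed for OutlierInfo above.
    count: int
    lower_bound: float
    upper_bound: float


def iqr_outliers_sketch(values: List[float], k: float = 1.5) -> OutlierBounds:
    """Flag values outside [Q1 - k * IQR, Q3 + k * IQR]."""
    # Assumption: quartiles use linear interpolation with both endpoints included.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    count = sum(1 for x in values if x < lower or x > upper)
    return OutlierBounds(count, lower, upper)


bounds = iqr_outliers_sketch([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
# Q1 = 3.25, Q3 = 7.75, IQR = 4.5, so the bounds are -3.5 and 14.5
# and only the value 100 falls outside them.
```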
Association Analysis
AssociationAnalyzer computes pairwise associations:
- Numeric ↔ Numeric: Pearson correlation coefficient
- Categorical ↔ Categorical: Cramér's V statistic
from glassbox.inspector import AssociationAnalyzer
analyzer = AssociationAnalyzer()
pairs = analyzer.build_associations(
    ds,
    num_cols=["Age", "Score"],
    cat_cols=["Gender", "Region"],
)
for pair in pairs:
    print(f"{pair.feature_a} ↔ {pair.feature_b}: "
          f"{pair.score:.3f} ({pair.metric})")
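The two metrics can be written out in plain Python to show what each score measures. This is a sketch under assumptions, not the AssociationAnalyzer implementation: the function names are hypothetical, and Cramér's V is computed here without bias correction, which the library may or may not apply.

```python
import math
from collections import Counter
from typing import Sequence


def pearson_sketch(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length numeric columns."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)  # undefined (division by zero) for a constant column


def cramers_v_sketch(a: Sequence[str], b: Sequence[str]) -> float:
    """Cramér's V from the chi-squared statistic of the contingency table."""
    n = len(a)
    joint = Counter(zip(a, b))
    rows = Counter(a)
    cols = Counter(b)
    chi2 = 0.0
    for r, r_count in rows.items():
        for c, c_count in cols.items():
            expected = r_count * c_count / n
            observed = joint.get((r, c), 0)
            chi2 += (observed - expected) ** 2 / expected
    # Normalise by the smaller table dimension minus one; each column
    # needs at least two categories for this to be non-zero.
    k = min(len(rows), len(cols)) - 1
    return math.sqrt(chi2 / (n * k))
```

Pearson ranges from -1 to 1 and carries a direction, while Cramér's V ranges from 0 to 1 and only measures strength, so the two scores are not directly comparable across pair types.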
Serialization
The full report can be serialized to a JSON string by calling to_json() on the EDAReport.
Note
NaN values are serialized as null and FeatureType enums are serialized by name (e.g., "NUMERICAL").
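The conversion rules in the note can be illustrated with the standard json module. The sketch below does not reproduce EDAReport.to_json: `MiniReport`, `to_jsonable`, and the trimmed-down `FeatureType` enum are hypothetical stand-ins used only to show NaN becoming null and enum members being written by name.

```python
import json
import math
from dataclasses import asdict, dataclass, is_dataclass
from enum import Enum
from typing import Any, Dict


class FeatureType(Enum):
    # Stand-in enum with two of the documented member names.
    NUMERICAL = 1
    NOMINAL = 2


@dataclass
class MiniReport:
    # Hypothetical, trimmed-down stand-in for EDAReport.
    feature_types: Dict[str, FeatureType]
    summary_stats: Dict[str, Dict[str, float]]


def to_jsonable(value: Any) -> Any:
    """Recursively convert a report into JSON-safe Python objects."""
    if isinstance(value, Enum):
        return value.name  # enums are written by name, e.g. "NUMERICAL"
    if isinstance(value, float) and math.isnan(value):
        return None  # NaN is not valid JSON, so it becomes null
    if is_dataclass(value) and not isinstance(value, type):
        return to_jsonable(asdict(value))
    if isinstance(value, dict):
        return {str(k): to_jsonable(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_jsonable(v) for v in value]
    return value


report = MiniReport(
    feature_types={"Age": FeatureType.NUMERICAL},
    summary_stats={"Age": {"mean": 31.5, "skewness": float("nan")}},
)
# allow_nan=False makes json.dumps raise if a NaN slipped through the conversion.
text = json.dumps(to_jsonable(report), allow_nan=False)
```

Writing NaN as null keeps the output valid for strict JSON parsers, which reject the bare NaN token that json.dumps emits by default.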
API Reference
DataAuditor
Orchestrates the EDA process to generate a complete report.
run_audit
Perform a full audit on the dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | Dataset | The dataset to audit. | required |
Returns:
| Type | Description |
|---|---|
| EDAReport | A comprehensive report containing EDA results. |
Source code in glassbox/inspector/auditor.py
AutoTyper
Infers logical data types for dataset columns.
infer_types
Infer feature types for all columns in the dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | Dataset | The dataset to analyze. | required |
Returns:
| Type | Description |
|---|---|
| Dict | Mapping from column names to their inferred FeatureType. |
Source code in glassbox/inspector/auto_typer.py
StatProfiler
Calculates summary statistics for dataset columns.
calculate_numeric_stats
Compute statistics for numerical columns.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | Dataset | The dataset containing the inputs. | required |
| cols | List[str] | List of column names to analyze. | required |
Returns:
| Type | Description |
|---|---|
| Dict | Mapping from column names to NumericStats objects. |
Source code in glassbox/inspector/statistics.py
calculate_categorical_stats
Compute statistics for categorical columns.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | Dataset | The dataset containing the inputs. | required |
| cols | List[str] | List of column names to analyze. | required |
Returns:
| Type | Description |
|---|---|
| Dict | Mapping from column names to CategoricalStats objects. |
Source code in glassbox/inspector/statistics.py
AssociationAnalyzer
Analyzes pairwise correlations and associations between features.
build_associations
Compute pairwise correlations and associations across the specified columns.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | Dataset | Input dataset. | required |
| num_cols | List[str] | Numerical columns to inspect with Pearson correlation. | required |
| cat_cols | List[str] | Categorical columns to inspect with Cramér's V. | required |
Returns:
| Type | Description |
|---|---|
| List | A list of CollinearityPair objects containing scores. |
Source code in glassbox/inspector/statistics.py
OutlierDetector
Detects outliers within numerical columns of a dataset.
flag_outliers
Identify outliers for specified columns.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | Dataset | The dataset containing the columns. | required |
| cols | List[str] | A list of column names to check for outliers. | required |
Returns:
| Type | Description |
|---|---|
| Dict | Mapping from column names to OutlierInfo objects. |
Source code in glassbox/inspector/outliers.py
EDAReport (dataclass)
Container for the complete Exploratory Data Analysis report.
to_json
Serialize the entire EDA report to a JSON string.
Returns:
| Type | Description |
|---|---|
| str | JSON representation of the report. |
Source code in glassbox/inspector/report.py
OutlierInfo (dataclass)
Stores outlier bounds and count for a single feature.