Architecture and key principles
Project Structure
badgers/
├── badgers/  # main library package
│   ├── __init__.py
│   ├── core/  # foundation: base classes, pipeline, utils
│   │   ├── __init__.py
│   │   ├── base.py  # GeneratorMixin abstract base class
│   │   ├── pipeline.py  # Pipeline for chaining generators
│   │   ├── utils.py  # probability normalization, RNG helpers
│   │   └── decorators/  # input preprocessing decorators
│   │       ├── __init__.py
│   │       ├── tabular_data.py
│   │       └── time_series.py
│   └── generators/  # data transformation implementations
│       ├── __init__.py
│       ├── tabular_data/  # outliers, drift, imbalance, missingness, noise
│       │   ├── outliers/
│       │   │   ├── distribution_sampling.py
│       │   │   ├── instance_sampling.py
│       │   │   └── low_density_sampling.py
│       │   ├── drift.py
│       │   ├── imbalance.py
│       │   ├── missingness.py
│       │   └── noise.py
│       ├── time_series/  # changepoints, seasons, trends, errors
│       │   ├── changepoints.py
│       │   ├── missingness.py
│       │   ├── noise.py
│       │   ├── outliers.py
│       │   ├── patterns.py
│       │   ├── seasons.py
│       │   ├── transmission_errors.py
│       │   ├── trends.py
│       │   └── utils.py
│       ├── graph/  # graph manipulations
│       │   └── missingness.py
│       ├── text/  # text transformations
│       │   └── typos.py
│       ├── image/  # image processing (stub)
│       └── geolocated_data/  # geospatial (stub)
│
├── benchmarks/  # performance benchmarking framework
│   ├── __init__.py
│   ├── __main__.py  # entry point: python -m benchmarks
│   ├── models.py  # Scenario, GeneratorBenchmark, BenchmarkResult
│   ├── registry.py  # auto-discovers _*.py registrations
│   ├── runner.py  # run_performance() with time/memory measurement
│   ├── cli.py  # run, baseline, compare subcommands
│   ├── comparator.py  # regression detection (>20% time, >30% memory)
│   ├── scenarios/  # reusable data factories
│   │   ├── tabular.py
│   │   ├── time_series.py
│   │   ├── graph.py
│   │   └── text.py
│   └── generators/  # per-generator benchmark registrations
│       ├── tabular_data/
│       │   ├── _drift.py
│       │   ├── _imbalance.py
│       │   ├── _missingness.py
│       │   ├── _noise.py
│       │   └── _outliers.py
│       ├── time_series/
│       │   ├── _changepoints.py
│       │   ├── _missingness.py
│       │   ├── _noise.py
│       │   ├── _outliers.py
│       │   ├── _patterns.py
│       │   ├── _seasons.py
│       │   ├── _transmission_errors.py
│       │   └── _trends.py
│       ├── graph/
│       │   └── _missingness.py
│       └── text/
│           └── _typos.py
│
├── tests/  # pytest test suite (148 tests)
│   ├── conftest.py  # shared fixtures (rng)
│   ├── core/
│   │   ├── test_pipelines.py
│   │   └── test_utils.py
│   ├── benchmarks/
│   │   ├── test_cli.py
│   │   ├── test_comparator.py
│   │   ├── test_models.py
│   │   ├── test_registry.py
│   │   ├── test_runner.py
│   │   └── test_scenarios.py
│   └── generators/
│       ├── tabular_data/
│       │   ├── conftest.py  # tabular_small, tabular_large fixtures
│       │   ├── test_drift.py
│       │   ├── test_imbalance.py
│       │   ├── test_missingness.py
│       │   ├── test_noise.py
│       │   └── test_outliers.py
│       ├── time_series/
│       │   ├── conftest.py  # time_series_sine, time_series_walk fixtures
│       │   ├── test_changepoints.py
│       │   ├── test_missingness.py
│       │   ├── test_noise.py
│       │   ├── test_outliers.py
│       │   ├── test_patterns.py
│       │   ├── test_seasons.py
│       │   ├── test_transmission_errors.py
│       │   ├── test_trends.py
│       │   └── test_utils.py
│       ├── graph/
│       │   ├── conftest.py  # graph_erdos_renyi fixture
│       │   └── test_missingness.py
│       └── text/
│           ├── conftest.py  # text_word_list fixture
│           └── test_typos.py
│
├── docs/  # mkdocs documentation
│   ├── index.md
│   ├── architecture.md
│   ├── benchmarking.md
│   ├── changelog.md
│   ├── getting-started.md
│   ├── tutorials/  # Jupyter notebook tutorials
│   └── superpowers/  # design specs and implementation plans
│
├── experiments/  # ad-hoc experiments and benchmark results
├── pyproject.toml  # project metadata and build config
├── requirements.txt
├── tox.ini  # multi-version test matrix (py38–py314)
└── mkdocs.yml  # documentation site config
Core Module
The core module serves as the foundation of the Badgers framework, providing essential building blocks and infrastructure that other components rely on.
Main Responsibilities:
- Base Classes: Defines the fundamental GeneratorMixin abstract base class that all generators must inherit from, ensuring a consistent interface across the entire system.
- Standardized Interface: Enforces a uniform generate(X, y, **params) method signature that returns transformed data (Xt, yt) for all generators.
- Input Preprocessing: Provides decorator functions (preprocess_inputs) that automatically validate and convert input data to standardized formats (pandas DataFrames/Series).
- Pipeline Infrastructure: Implements the Pipeline class that enables chaining multiple generators together in sequential workflows.
- Utility Functions: Offers helper functions for common operations like probability normalization and random number generation.
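To make the interface and the chaining idea concrete, here is a minimal, standard-library-only sketch. It is not the code in badgers/core: GeneratorSketch, PipelineSketch, AddNoise, DropRows, and the params dictionary layout are all hypothetical stand-ins, and plain lists replace the pandas objects the real decorators produce. Only the generate(X, y, **params) -> (Xt, yt) contract is taken from the description above.

```python
from abc import ABC, abstractmethod
import random


class GeneratorSketch(ABC):
    """Hypothetical stand-in for GeneratorMixin: a single abstract generate()."""

    @abstractmethod
    def generate(self, X, y=None, **params):
        """Return the transformed data as a (Xt, yt) tuple."""


class AddNoise(GeneratorSketch):
    """Toy generator: adds Gaussian noise to every value in X."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)  # seeded so results are reproducible

    def generate(self, X, y=None, noise_std=0.1, **params):
        Xt = [[v + self.rng.gauss(0.0, noise_std) for v in row] for row in X]
        return Xt, y


class DropRows(GeneratorSketch):
    """Toy generator: keeps every `step`-th row of X and y."""

    def generate(self, X, y=None, step=2, **params):
        Xt = X[::step]
        yt = y[::step] if y is not None else None
        return Xt, yt


class PipelineSketch:
    """Hypothetical stand-in for Pipeline: applies generators in order."""

    def __init__(self, steps):
        self.steps = steps  # list of (name, generator) pairs

    def generate(self, X, y=None, params=None):
        params = params or {}
        for name, generator in self.steps:
            # Each step receives the output of the previous step.
            X, y = generator.generate(X, y, **params.get(name, {}))
        return X, y


X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
y = [0, 1, 0, 1]
pipeline = PipelineSketch([("noise", AddNoise(seed=42)), ("drop", DropRows())])
Xt, yt = pipeline.generate(
    X, y, params={"noise": {"noise_std": 0.01}, "drop": {"step": 2}}
)
print(len(Xt), yt)
```

Because every step returns the same (Xt, yt) shape it accepts, the pipeline itself can expose generate() and be nested inside another pipeline.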
Generators Module
The generators module contains the actual implementation of various data transformation algorithms, organized by data type categories.
Main Responsibilities:
- Data Transformation Implementation: Houses concrete implementations of various data generation techniques across different data domains:
  - Tabular data transformations (outliers, drift, imbalance, missingness, noise)
  - Time series modifications (changepoints, seasons, trends, transmission errors)
  - Graph-based manipulations
  - Image processing generators (currently a stub)
  - Text transformation tools
- Domain-Specific Organization: Structures generators by data type categories, making it easy to find and use appropriate transformations for specific data modalities.
- Extensibility: Provides a plug-and-play architecture where new generators can be easily added by following the established GeneratorMixin interface.
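As an illustration of the plug-and-play idea, the sketch below adds a toy text generator by implementing one generate() method. It is an assumption-laden example, not the contents of badgers/generators/text/typos.py: GeneratorBase stands in for GeneratorMixin (whose constructor and helpers are not shown in this document), and SwapLettersSketch and swap_proba are hypothetical names.

```python
from abc import ABC, abstractmethod
import random


class GeneratorBase(ABC):
    """Hypothetical stand-in for GeneratorMixin (the real base class may differ)."""

    @abstractmethod
    def generate(self, X, y=None, **params):
        """Return the transformed data as a (Xt, yt) tuple."""


class SwapLettersSketch(GeneratorBase):
    """Toy text generator: swaps two adjacent inner letters of a word."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)  # seeded for reproducible output

    def generate(self, X, y=None, swap_proba=0.5, **params):
        Xt = []
        for word in X:
            # Words need at least two inner letters; first and last stay in place.
            if len(word) > 3 and self.rng.random() < swap_proba:
                i = self.rng.randrange(1, len(word) - 2)
                word = word[:i] + word[i + 1] + word[i] + word[i + 2:]
            Xt.append(word)
        return Xt, y


words = ["badgers", "generate", "noisy", "data", "to", "test"]
swapped, labels = SwapLettersSketch(seed=1).generate(words, swap_proba=1.0)
print(swapped)
```

Nothing else has to change for such a class to be usable anywhere a generator is expected, since callers rely only on the generate() contract.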
Benchmarks Module
The benchmarks package provides a systematic framework for measuring performance (time/memory) of all generators. It lives outside the main badgers source to avoid coupling.
Architecture (3 Layers)
- Registration Layer (models.py, registry.py, scenarios/, generators/):
  - Scenario dataclasses define reusable data factories for each data type (tabular, time series, graph, text).
  - GeneratorBenchmark ties a generator class to its scenarios.
  - The Registry auto-discovers _*.py registration modules from benchmarks/generators/.
- Execution Layer (runner.py, cli.py):
  - run_performance() measures wall-clock time and peak memory over multiple iterations.
  - CLI provides run, baseline, and compare subcommands.
- Analysis Layer (comparator.py):
  - compare_results() diffs two result sets and flags regressions (>20% time, >30% memory).
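The execution layer's measurement step can be approximated with the standard library alone. The following is a rough sketch of timing a callable and recording its peak memory over several iterations. It is not the actual run_performance() implementation: measure_sketch, its parameters, and the returned dictionary keys are assumptions made for illustration.

```python
import time
import tracemalloc
from statistics import mean


def measure_sketch(func, n_iterations=5):
    """Return mean wall-clock seconds and the largest peak memory over several runs."""
    times, peaks = [], []
    for _ in range(n_iterations):
        tracemalloc.start()
        try:
            start = time.perf_counter()
            func()
            elapsed = time.perf_counter() - start
            # get_traced_memory() returns (current, peak) in bytes.
            _, peak = tracemalloc.get_traced_memory()
        finally:
            tracemalloc.stop()
        times.append(elapsed)
        peaks.append(peak)
    return {"time_s": mean(times), "peak_bytes": max(peaks)}


result = measure_sketch(lambda: [i * i for i in range(100_000)], n_iterations=3)
print(result)
```

Tracing allocations slows the measured code, so a more careful runner might time and trace in separate passes; this sketch keeps them together for brevity.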
Usage
# Run all benchmarks (performance)
python -m benchmarks run
# Filter by generator category
python -m benchmarks run --generators tabular_data.outliers
# Save a baseline for regression detection
python -m benchmarks baseline save --name v1.0
# Compare latest results against a baseline
python -m benchmarks compare --baseline v1.0
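The comparison step boils down to a relative-change check against the thresholds quoted for comparator.py (>20% time, >30% memory). A minimal sketch follows, assuming each result is a dictionary with hypothetical time_s and peak_bytes keys and that baseline values are positive. compare_sketch is an illustrative name, not the library's compare_results().

```python
def compare_sketch(baseline, current, time_threshold=0.20, memory_threshold=0.30):
    """Flag benchmarks whose time or memory grew past the relative thresholds."""
    regressions = []
    for name, base in baseline.items():
        new = current.get(name)
        if new is None:
            continue  # benchmark missing from the current run; nothing to compare
        # Relative change: 0.25 means 25% slower (or larger) than the baseline.
        time_change = (new["time_s"] - base["time_s"]) / base["time_s"]
        memory_change = (new["peak_bytes"] - base["peak_bytes"]) / base["peak_bytes"]
        if time_change > time_threshold or memory_change > memory_threshold:
            regressions.append((name, round(time_change, 3), round(memory_change, 3)))
    return regressions


baseline = {
    "noise/small": {"time_s": 0.10, "peak_bytes": 1000},
    "drift/small": {"time_s": 0.20, "peak_bytes": 2000},
}
current = {
    "noise/small": {"time_s": 0.13, "peak_bytes": 1100},  # about 30% slower
    "drift/small": {"time_s": 0.21, "peak_bytes": 2100},  # about 5% slower
}
regressions = compare_sketch(baseline, current)
print(regressions)
```

Using separate thresholds for time and memory lets the noisier metric have a looser bound, which is presumably why the two limits differ.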
Adding a New Generator to Benchmarks
Create a _<name>.py file in the appropriate benchmarks/generators/<category>/ directory:
from benchmarks.models import Scenario, GeneratorBenchmark
from benchmarks.registry import register

register(GeneratorBenchmark(
    class_name="MyNewGenerator",
    module="badgers.generators.tabular_data.noise",
    scenarios={"small": Scenario(...), "large": Scenario(...)},
))
The registry auto-discovers all _*.py files, so no other changes are needed.
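One plausible way to implement this kind of discovery is to glob for _*.py files and import each one. The sketch below is not the actual benchmarks/registry.py: discover_sketch and the module-level REGISTRATION variable are hypothetical (the real modules call register() instead), and the demo builds a throwaway directory so it can run on its own.

```python
import importlib.util
import tempfile
from pathlib import Path


def discover_sketch(root):
    """Import every _*.py file under root and collect its REGISTRATION value."""
    found = []
    for path in sorted(Path(root).rglob("_*.py")):
        if path.name.startswith("__"):
            continue  # skip dunder files such as __init__.py and __main__.py
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)  # runs the module's top-level code
        found.append(getattr(module, "REGISTRATION", None))
    return found


with tempfile.TemporaryDirectory() as tmp:
    category = Path(tmp) / "tabular_data"
    category.mkdir()
    (category / "__init__.py").write_text("")
    (category / "_noise.py").write_text("REGISTRATION = 'noise benchmark'\n")
    (category / "_drift.py").write_text("REGISTRATION = 'drift benchmark'\n")
    (category / "helpers.py").write_text("REGISTRATION = 'ignored'\n")
    discovered = discover_sketch(tmp)

print(discovered)
```

The leading underscore acts as the opt-in marker: helper modules without it are never imported by the discovery step.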
Tests
Tests live under tests/ and use pytest with function-based tests (no unittest.TestCase classes). The test structure mirrors the source layout:
tests/
├── conftest.py  # shared fixtures (e.g., rng)
├── core/  # tests for badgers.core
├── benchmarks/  # tests for the benchmarks package
└── generators/
    ├── tabular_data/
    │   ├── conftest.py  # tabular-specific fixtures
    │   ├── test_outliers.py
    │   ├── test_noise.py
    │   └── ...
    ├── time_series/
    │   ├── conftest.py  # time-series fixtures (sine, random walk)
    │   └── ...
    ├── graph/
    │   ├── conftest.py  # graph fixtures (Erdős-Rényi)
    │   └── ...
    └── text/
        ├── conftest.py  # text fixtures (word list)
        └── ...
Key Conventions
- Fixtures over classes: Each data type has a conftest.py with reusable fixtures (e.g., tabular_small, time_series_sine, graph_erdos_renyi, text_word_list). Fixtures return (X, y) tuples with deterministic random seeds.
- Flat test functions: Tests are plain test_<subject>__<behavior> functions (double underscore separator), not methods on TestCase classes.
- Generator tests are independent: Generator correctness is tested directly via pytest, not through the benchmark framework's functional checks. This keeps tests fast, focused, and free of benchmark infrastructure coupling.
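The conventions above might look roughly like this in practice. This is a sketch, not a copy of the project's conftest.py or tests: the fixture name tabular_small comes from the list above, but its body, the make_tabular_small helper, and the scale_sketch transformation are hypothetical, and plain lists stand in for the real data structures.

```python
import random

import pytest


def make_tabular_small(seed=0, n_rows=10, n_cols=3):
    """Build a small deterministic (X, y) pair from a seeded RNG."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_cols)] for _ in range(n_rows)]
    y = [rng.randrange(2) for _ in range(n_rows)]
    return X, y


@pytest.fixture
def tabular_small():
    """Fixture in the style of a per-category conftest.py."""
    return make_tabular_small(seed=0)


def scale_sketch(X, y=None, factor=2.0):
    """Hypothetical transformation standing in for a real generator."""
    return [[v * factor for v in row] for row in X], y


def test_scale__preserves_shape(tabular_small):
    # Name follows test_<subject>__<behavior>; pytest injects the fixture by name.
    X, y = tabular_small
    Xt, yt = scale_sketch(X, y)
    assert len(Xt) == len(X)
    assert all(len(rt) == len(r) for rt, r in zip(Xt, X))
    assert yt == y
```

Keeping the data construction in a plain helper and wrapping it in the fixture makes the same deterministic data easy to reuse outside pytest.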
Running Tests
# Run all tests
pytest tests/
# Run only generator tests
pytest tests/generators/
# Run a specific category
pytest tests/generators/tabular_data/ -v