Database Design¶
The database module provides a layered architecture for describing annotation databases and generating dataset records. A shared protocol and base class sit at the top, with dataset-family-specific implementations underneath. Scenario metadata (splits, versions, sampling parameters) is modelled as immutable Pydantic objects so that every database instance is fully hashable and cacheable.
Architecture Overview¶
classDiagram
direction TB
class generate_dataset {
<<Hydra entrypoint>>
build_database()
main()
}
class DatabaseInterface {
<<Protocol>>
database_version
scenarios
cache_path
load_scenario_records()
process_scenario_records()
}
class BaseDatabase {
get_polars_schema()
get_main_database_scenario_data()
get_unique_scenario_data()
process_scenario_records()
}
class scenarios {
DatasetParams
ScenarioData
Scenarios
}
class schemas {
<<package>>
DatasetRecord
DatasetTableSchema
DataModelInterface
LidarFrameDataModel
LidarSourceDataModel
CategoryMappingDataModel
Box3DDataModel
Box3DDatasetSchema
}
class polars {
<<external>>
DataFrame
Schema
}
class ConcreteDatabase {
<<dataset-specific>>
process_scenario_records()
}
class TrainingInference {
<<downstream>>
train()
evaluate()
predict()
}
generate_dataset --> DatabaseInterface : instantiates via Hydra
DatabaseInterface ..> scenarios : uses Scenarios, ScenarioData
DatabaseInterface --> schemas : process_scenario_records()
BaseDatabase ..|> DatabaseInterface : satisfies
ConcreteDatabase --|> BaseDatabase : extends
schemas --> TrainingInference : Sequence[DatasetRecord] consumed by
schemas ..> polars : uses pl.DataType, pl.Schema
Core Components¶
DatabaseInterface¶
DatabaseInterface is the protocol that every database implementation must satisfy. It defines the contract for version metadata, scenario access, and record generation:
class DatabaseInterface(Protocol):
@property
def database_version(self) -> str: ...
@property
def scenarios(self) -> MappingProxyType[str, Scenarios]: ...
def get_unique_scenario_data(self) -> MappingProxyType[str, ScenarioData]: ...
def load_scenario_records(self) -> Sequence[DatasetRecord]: ...
def process_scenario_records(self) -> Sequence[DatasetRecord]: ...
All concrete databases are accessed through this protocol, ensuring downstream code (training, evaluation) never depends on a specific dataset format.
BaseDatabase¶
BaseDatabase provides the shared implementation of DatabaseInterface. It handles initialization from version and paths, caching directory creation, Polars schema retrieval, resolving the main scenario group, and deduplicating scenario data across groups:
class BaseDatabase:
def __init__(
self,
database_version: str,
database_root_path: str,
cache_path: str,
cache_file_prefix_name: str,
num_workers: int,
) -> None:
...
def get_polars_schema(self) -> pl.Schema: ...
def get_main_database_scenario_data(self) -> Scenarios: ...
def get_unique_scenario_data(self) -> Mapping[str, ScenarioData]: ...
def process_scenario_records(self) -> Sequence[DatasetRecord]:
raise NotImplementedError("Subclasses must implement process_scenario_records!")
To add a new dataset family, subclass BaseDatabase and implement process_scenario_records(). See T4Dataset for a concrete example.
Scenarios¶
The scenarios module models scenario metadata as immutable Pydantic objects. DatasetParams captures per-dataset preprocessing parameters, ScenarioData uniquely identifies a single scenario with its version and sampling settings, and Scenarios is the abstract base that concrete implementations extend to parse scenario configs based on a dataset:
class DatasetParams(BaseModel):
dataset_name: str
max_sweeps: int
sample_steps: int
class ScenarioData(BaseModel):
scenario_id: str
scenario_version: str
vehicle_type: str | None = None
location: str | None = None
...
class Scenarios(BaseModel):
version: str
scenario_root_path: Path
dataset_params: Sequence[DatasetParams]
scenario_data: Mapping[SplitType, Sequence[ScenarioData]] | None = None
@model_validator(mode="after")
def build_scenarios(self) -> None:
raise NotImplementedError("Subclasses must implement build_scenarios!")
Schema¶
process_scenario_records() returns Sequence[DatasetRecord] — one frozen Pydantic row per sample/frame. BaseDatabase.get_polars_schema() delegates to DatasetTableSchema so records can be serialized to Parquet via DatasetRecord.to_dictionary().
The schema is defined in the autoware_ml/databases/schemas/ package and covers basic frame metadata, nested LiDAR structs, and annotation fields such as category mapping and 3D boxes. The 3D box payload is modeled by Box3DDataModel with its struct layout defined in Box3DDatasetSchema, and is stored in the top-level boxes_3d list column. See Dataset Schema for the full column layout, nested data models, and extension guide.
Dataset Generation (Hydra Entrypoint)¶
The generate_dataset.py script is the Hydra-based entrypoint that wires everything together. It reads a YAML config, instantiates the configured database class, and triggers record generation:
@hydra.main(version_base=None, config_path=_CONFIG_PATH)
def main(cfg: DictConfig):
database: DatabaseInterface = instantiate(cfg.database)
database.process_scenario_records()
To run dataset generation:
python3 autoware_ml/scripts/generate_dataset.py \
--config-name default_t4dataset_generator \
working_dir=<working_dir> \
data_root_path=<dataset_root_path> \
database.num_workers=32
Configuration is done through YAML files under autoware_ml/configs/generators/. Override any parameter from the command line using Hydra syntax. See Configuration Guide for full details.
Extending the Database¶
| Extension Point | How |
|---|---|
| New dataset family | Subclass BaseDatabase, implement process_scenario_records(), register in a Hydra config |
| New scenario format | Subclass Scenarios, implement build_scenarios() to parse format-specific YAML |
| New schema columns | See Dataset Schema |
Implementation¶
| Path | Description |
|---|---|
autoware_ml/databases/schemas/ |
Dataset schema package — see schemas.md |
autoware_ml/databases/scenarios.py |
ScenarioData, DatasetParams, Scenarios |
autoware_ml/databases/database_interface.py |
DatabaseInterface protocol |
autoware_ml/databases/base_database.py |
Shared BaseDatabase implementation |
autoware_ml/scripts/generate_dataset.py |
Hydra entrypoint for dataset generation |
autoware_ml/configs/generators/ |
YAML configs for dataset generation |