Database Design¶
The database module provides a layered architecture for describing annotation databases and generating dataset records. A shared protocol and base class sit at the top, with dataset-family-specific implementations underneath. Scenario metadata (splits, versions, sampling parameters) is modelled as immutable Pydantic objects so that every database instance is fully hashable and cacheable.
Architecture Overview¶
classDiagram
direction TB
class generate_dataset {
<<Hydra entrypoint>>
build_database()
main()
}
class DatabaseInterface {
<<Protocol>>
database_version
scenarios
cache_path
load_scenario_records()
process_scenario_records()
}
class BaseDatabase {
get_polars_schema()
get_main_database_scenario_data()
get_unique_scenario_data()
process_scenario_records()
}
class scenarios {
DatasetParams
ScenarioData
Scenarios
}
class schemas {
DatasetRecord
DatasetTableSchema
DatasetTableColumn
}
class polars {
<<external>>
DataFrame
Schema
}
class ConcreteDatabase {
<<dataset-specific>>
process_scenario_records()
}
class TrainingInference {
<<downstream>>
train()
evaluate()
predict()
}
generate_dataset --> DatabaseInterface : instantiates via Hydra
DatabaseInterface ..> scenarios : uses Scenarios, ScenarioData
DatabaseInterface --> schemas : process_scenario_records()
BaseDatabase ..|> DatabaseInterface : satisfies
ConcreteDatabase --|> BaseDatabase : extends
schemas --> TrainingInference : Sequence[DatasetRecord] consumed by
schemas ..> polars : uses pl.DataType, pl.Schema
Core Components¶
DatabaseInterface¶
DatabaseInterface is the protocol that every database implementation must satisfy. It defines the contract for version metadata, scenario access, and record generation:
class DatabaseInterface(Protocol):
    @property
    def database_version(self) -> str: ...

    @property
    def scenarios(self) -> MappingProxyType[str, Scenarios]: ...

    def get_unique_scenario_data(self) -> MappingProxyType[str, ScenarioData]: ...

    def load_scenario_records(self) -> Sequence[DatasetRecord]: ...

    def process_scenario_records(self) -> Sequence[DatasetRecord]: ...
All concrete databases are accessed through this protocol, ensuring downstream code (training, evaluation) never depends on a specific dataset format.
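Because downstream code types against the protocol, any object with the right attributes satisfies it structurally; no inheritance is required. A minimal stdlib sketch of this decoupling, where the record type and protocol are pared-down stand-ins for the real DatasetRecord and DatabaseInterface (DummyDatabase and summarize are illustrative names, not part of the codebase):

```python
from dataclasses import dataclass
from typing import Protocol, Sequence


@dataclass(frozen=True)
class DatasetRecord:  # stand-in for schemas.DatasetRecord
    scenario_id: str
    sample_id: str


class DatabaseInterface(Protocol):
    @property
    def database_version(self) -> str: ...

    def process_scenario_records(self) -> Sequence[DatasetRecord]: ...


def summarize(db: DatabaseInterface) -> str:
    """Downstream code: depends only on the protocol, not a dataset format."""
    records = db.process_scenario_records()
    return f"v{db.database_version}: {len(records)} records"


class DummyDatabase:
    """Satisfies the protocol structurally, without subclassing anything."""

    @property
    def database_version(self) -> str:
        return "1.0"

    def process_scenario_records(self) -> Sequence[DatasetRecord]:
        return [DatasetRecord("scene-0001", "sample-0001")]


print(summarize(DummyDatabase()))  # v1.0: 1 records
```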
BaseDatabase¶
BaseDatabase provides the shared implementation of DatabaseInterface. It handles initialization from version and paths, caching directory creation, Polars schema retrieval, resolving the main scenario group, and deduplicating scenario data across groups:
class BaseDatabase:
    def __init__(
        self,
        database_version: str,
        database_root_path: str,
        cache_path: str,
        cache_file_prefix_name: str,
        num_workers: int,
    ) -> None:
        ...

    def get_polars_schema(self) -> pl.Schema: ...

    def get_main_database_scenario_data(self) -> Scenarios: ...

    def get_unique_scenario_data(self) -> Mapping[str, ScenarioData]: ...

    def process_scenario_records(self) -> Sequence[DatasetRecord]:
        raise NotImplementedError("Subclasses must implement process_scenario_records!")
To add a new dataset family, subclass BaseDatabase and implement process_scenario_records(). See T4Dataset for a concrete example.
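A minimal sketch of such a subclass, with BaseDatabase reduced to the relevant contract; the class name, scenario IDs, and path below are purely illustrative, not the real T4Dataset implementation:

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class DatasetRecord:  # stand-in for schemas.DatasetRecord
    scenario_id: str
    sample_id: str
    sample_index: int


class BaseDatabase:
    """Stand-in: only the pieces relevant to subclassing."""

    def __init__(self, database_version: str, database_root_path: str) -> None:
        self.database_version = database_version
        self.database_root_path = database_root_path

    def process_scenario_records(self) -> Sequence[DatasetRecord]:
        raise NotImplementedError("Subclasses must implement process_scenario_records!")


class MyDataset(BaseDatabase):
    """Hypothetical dataset family: one record per sample per scenario."""

    def process_scenario_records(self) -> Sequence[DatasetRecord]:
        records = []
        for scenario_id in ("scene-0001", "scene-0002"):  # normally read from disk
            for index in range(2):
                records.append(
                    DatasetRecord(scenario_id, f"{scenario_id}/{index}", index)
                )
        return records


db = MyDataset("1.0", "/data/my_dataset")
print(len(db.process_scenario_records()))  # 4
```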
Scenarios¶
The scenarios module models scenario metadata as immutable Pydantic objects. DatasetParams captures per-dataset preprocessing parameters, ScenarioData uniquely identifies a single scenario together with its version and sampling settings, and Scenarios is the abstract base that concrete implementations extend to parse dataset-specific scenario configs:
class DatasetParams(BaseModel):
    dataset_name: str
    max_sweeps: int
    sample_steps: int


class ScenarioData(BaseModel):
    scenario_id: str
    scenario_version: str
    vehicle_type: str | None = None
    location: str | None = None
    ...


class Scenarios(BaseModel):
    version: str
    scenario_root_path: Path
    dataset_params: Sequence[DatasetParams]
    scenario_data: Mapping[SplitType, Sequence[ScenarioData]] | None = None

    @model_validator(mode="after")
    def build_scenarios(self) -> None:
        raise NotImplementedError("Subclasses must implement build_scenarios!")
Schema¶
The output schema is defined in schemas.py and consists of two parts:
- DatasetTableSchema: a frozen dataclass whose class-level attributes are DatasetTableColumn named tuples, each pairing a column name with a Polars data type. Call DatasetTableSchema.to_polars_schema() to get a pl.Schema for constructing or validating a Polars DataFrame.
- DatasetRecord: a frozen Pydantic model representing a single row. One record is emitted per sample/frame by process_scenario_records().
class DatasetTableSchema:
    SCENARIO_ID = DatasetTableColumn("scenario_id", pl.String)
    SAMPLE_ID = DatasetTableColumn("sample_id", pl.String)
    SAMPLE_INDEX = DatasetTableColumn("sample_index", pl.Int32)
    LOCATION = DatasetTableColumn("location", pl.String)
    VEHICLE_TYPE = DatasetTableColumn("vehicle_type", pl.String)

    @classmethod
    def to_polars_schema(cls) -> pl.Schema: ...


class DatasetRecord(BaseModel):
    scenario_id: str
    sample_id: str
    sample_index: int
    location: str | None
    vehicle_type: str | None
| Column | Python type | Polars type | Description |
|---|---|---|---|
| scenario_id | str | String | Unique identifier of the driving scenario |
| sample_id | str | String | Unique identifier of the individual sample/frame |
| sample_index | int | Int32 | Zero-based index of the sample within the scenario |
| location | str \| None | String | Geographic location where the data was captured |
| vehicle_type | str \| None | String | Type of vehicle used for data collection |
Both classes are kept in sync: every field in DatasetRecord has a corresponding column in DatasetTableSchema. When adding new annotation fields (e.g. 3D bounding boxes), add entries to both.
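The sync rule can be checked mechanically. A stdlib sketch of such a check, trimmed to three columns and with Polars dtypes replaced by placeholder strings so it runs without polars installed (the real code uses pl.String, pl.Int32, etc.):

```python
from dataclasses import dataclass, fields
from typing import NamedTuple


class DatasetTableColumn(NamedTuple):
    name: str
    dtype: str  # placeholder for a polars DataType


class DatasetTableSchema:
    SCENARIO_ID = DatasetTableColumn("scenario_id", "String")
    SAMPLE_ID = DatasetTableColumn("sample_id", "String")
    SAMPLE_INDEX = DatasetTableColumn("sample_index", "Int32")


@dataclass(frozen=True)
class DatasetRecord:
    scenario_id: str
    sample_id: str
    sample_index: int


# Collect the column names declared on the schema class...
schema_columns = {
    value.name
    for value in vars(DatasetTableSchema).values()
    if isinstance(value, DatasetTableColumn)
}
# ...and the field names of the record model; they must match exactly.
record_fields = {f.name for f in fields(DatasetRecord)}
print(record_fields == schema_columns)  # True
```

Adding a field to one side without the other makes the comparison fail, which is a cheap way to enforce the rule in a unit test.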
Dataset Generation (Hydra Entrypoint)¶
The generate_dataset.py script is the Hydra-based entrypoint that wires everything together. It reads a YAML config, instantiates the configured database class, and triggers record generation:
@hydra.main(version_base=None, config_path=_CONFIG_PATH)
def main(cfg: DictConfig):
    database: DatabaseInterface = instantiate(cfg.database)
    database.process_scenario_records()
To run dataset generation:
python3 autoware_ml/scripts/generate_dataset.py \
    --config-name default_t4dataset_generator \
    working_dir=<working_dir> \
    data_root_path=<dataset_root_path> \
    database.num_workers=32
Configuration is done through YAML files under autoware_ml/configs/generators/. Override any parameter from the command line using Hydra syntax. See Configuration Guide for full details.
Extending the Database¶
| Extension Point | How |
|---|---|
| New dataset family | Subclass BaseDatabase, implement process_scenario_records(), register in a Hydra config |
| New scenario format | Subclass Scenarios, implement build_scenarios() to parse format-specific YAML |
| New schema columns | Add entries to both DatasetTableSchema (Polars type) and DatasetRecord (Pydantic field) |
Implementation¶
| Path | Description |
|---|---|
| autoware_ml/databases/schemas.py | DatasetRecord and DatasetTableSchema |
| autoware_ml/databases/scenarios.py | ScenarioData, DatasetParams, Scenarios |
| autoware_ml/databases/database_interface.py | DatabaseInterface protocol |
| autoware_ml/databases/base_database.py | Shared BaseDatabase implementation |
| autoware_ml/scripts/generate_dataset.py | Hydra entrypoint for dataset generation |
| autoware_ml/configs/generators/ | YAML configs for dataset generation |