Dataset Schema
The output schema lives in the autoware_ml/databases/schemas/ package. It is split into a top-level table definition and reusable nested data models that can be shared across dataset families. Every database implementation returns Sequence[DatasetRecord] from process_scenario_records(), then converts those records into a Polars DataFrame that conforms to DatasetTableSchema and persists it, typically as Parquet.
Base building blocks (base_schemas.py)
DatasetTableColumn - a NamedTuple pairing a column name with a Polars data type.
BaseFieldSchema - base class for nested struct schemas. Subclasses declare DatasetTableColumn attributes and expose to_polars_field_schema() to produce pl.Field definitions for struct columns.
DataModelInterface - abstract interface requiring to_dictionary() and load_from_dictionary() so every Pydantic data model can round-trip through a Polars DataFrame.
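The round-trip contract behind DataModelInterface can be illustrated with a minimal, stdlib-only sketch. It assumes an abstract base exposing the two methods named above and uses a frozen dataclass in place of a Pydantic model so that it runs without third-party packages. ExampleSourceModel and its values are hypothetical stand-ins for a real nested data model, not the package's implementation.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import asdict, dataclass
from typing import Any, Mapping


class DataModelInterface(ABC):
    """Sketch of the interface: a model converts to and from a plain mapping."""

    @abstractmethod
    def to_dictionary(self) -> Mapping[str, Any]: ...

    @classmethod
    @abstractmethod
    def load_from_dictionary(cls, data_model: Mapping[str, Any]) -> DataModelInterface: ...


@dataclass(frozen=True)
class ExampleSourceModel(DataModelInterface):
    """Hypothetical nested model; a dataclass stands in for Pydantic here."""

    channel_name: str
    sensor_token: str

    def to_dictionary(self) -> Mapping[str, Any]:
        # Plain dict keyed by field name, usable as one struct value in a row.
        return asdict(self)

    @classmethod
    def load_from_dictionary(cls, data_model: Mapping[str, Any]) -> ExampleSourceModel:
        # Rebuild the model from a struct value read back from storage.
        return cls(
            channel_name=data_model["channel_name"],
            sensor_token=data_model["sensor_token"],
        )


original = ExampleSourceModel(channel_name="LIDAR_TOP", sensor_token="token-0")
restored = ExampleSourceModel.load_from_dictionary(original.to_dictionary())
```

Because the dataclass is frozen and compares by value, restored equals original, which is the property a DataFrame round trip relies on.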
Top-level table (dataset_schemas.py)
DatasetTableSchema - a frozen dataclass whose class-level attributes are DatasetTableColumn entries. Call DatasetTableSchema.to_polars_schema() to get a pl.Schema for constructing or validating a Polars DataFrame.
DatasetRecord - a frozen Pydantic model (implementing DataModelInterface) representing a single row. One record is emitted per sample/frame by process_scenario_records().
    class DatasetTableSchema:
        # Basic metadata
        SCENARIO_ID = DatasetTableColumn("scenario_id", pl.String)
        SAMPLE_ID = DatasetTableColumn("sample_id", pl.String)
        SAMPLE_INDEX = DatasetTableColumn("sample_index", pl.Int32)
        TIMESTAMP_SECONDS = DatasetTableColumn("timestamp_seconds", pl.Float64)
        LOCATION = DatasetTableColumn("location", pl.String)
        VEHICLE_TYPE = DatasetTableColumn("vehicle_type", pl.String)
        SCENARIO_NAME = DatasetTableColumn("scenario_name", pl.String)

        # Nested sensor data columns
        LIDAR_FRAMES = DatasetTableColumn("lidar_frames", pl.List(pl.Struct(...)))
        LIDAR_SOURCES = DatasetTableColumn("lidar_sources", pl.List(pl.Struct(...)))

        # Annotation fields
        CATEGORY_MAPPING = DatasetTableColumn("category_mapping", pl.Struct(...))
        BOXES_3D = DatasetTableColumn("boxes_3d", pl.List(pl.Struct(...)))

        @classmethod
        def to_polars_schema(cls) -> pl.Schema: ...


    class DatasetRecord(BaseModel, DataModelInterface):
        scenario_id: str
        sample_id: str
        sample_index: int
        timestamp_seconds: float
        location: str | None
        vehicle_type: str | None
        scenario_name: str
        lidar_frames: Sequence[LidarFrameDataModel]
        lidar_sources: Sequence[LidarSourceDataModel] | None
        category_mapping: CategoryMappingDataModel | None
        boxes_3d: Sequence[Box3DDataModel] | None

        def to_dictionary(self) -> Mapping[str, Any]: ...

        @classmethod
        def load_from_dictionary(cls, data_model: Mapping[str, Any]) -> DatasetRecord: ...
Top-level columns
| Column | Python type | Polars type | Description |
| --- | --- | --- | --- |
| scenario_id | str | String | Unique identifier of the driving scenario |
| sample_id | str | String | Unique identifier of the individual sample/frame |
| sample_index | int | Int32 | Zero-based index of the sample within the scenario |
| timestamp_seconds | float | Float64 | Sample timestamp in seconds |
| location | str \| None | String | Geographic location where the data was captured |
| vehicle_type | str \| None | String | Type of vehicle used for data collection |
| scenario_name | str | String | Human-readable name of the scenario scene |
| lidar_frames | Sequence[LidarFrameDataModel] | List(Struct) | Keyframe and sweep LiDAR frame metadata per sample |
| lidar_sources | Sequence[LidarSourceDataModel] \| None | List(Struct) | Per-sensor calibration metadata for LiDAR sources |
| category_mapping | CategoryMappingDataModel \| None | Struct | Mapping between category names and indices |
| boxes_3d | Sequence[Box3DDataModel] \| None | List(Struct) | 3D box annotations for the sample/frame |
lidar_frames struct fields
Each list entry is a LidarFrameDataModel covering one keyframe or sweep:
| Field | Polars type | Description |
| --- | --- | --- |
| lidar_frame_id | String | Sample-data token for this frame |
| lidar_keyframe | Boolean | True for the main keyframe, False for sweeps |
| lidar_sensor_id | String | Calibrated-sensor token |
| lidar_sensor_channel_name | String | LiDAR channel name (e.g. LIDAR_TOP) |
| lidar_timestamp_seconds | Float64 | Frame timestamp in seconds |
| lidar_pointcloud_path | String | Absolute path to the point cloud file |
| lidar_pointcloud_source_path | String | Path to per-point metadata (or null) |
| lidar_pointcloud_num_features | Int32 | Number of features per point (configured on the database) |
| lidar_sensor_to_ego_pose_matrix | Array(Float32, 4x4) | Sensor-to-ego transform |
| lidar_frame_ego_pose_to_global_matrix | Array(Float32, 4x4) | Ego-to-global transform for this frame |
| lidar_sensor_to_lidar_sweep_matrices | Array(Float32, 4x4) | Sensor-to-sweep transform |
| lidar_pointcloud_semantic_mask_path | String | LiDAR segmentation mask path (or null) |
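As an illustration of how the two pose matrices could be chained, the sketch below maps a point from the sensor frame into the global frame. It assumes the common convention that an "A-to-B" matrix acts on column-vector points expressed in frame A; the schema description does not state the convention, so verify it against the data before relying on it. The matrix values are made up.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def matmul4(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two 4x4 matrices stored as row-major nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)] for i in range(4)]


def transform_point(matrix: Matrix, point: Sequence[float]) -> List[float]:
    """Apply a 4x4 homogeneous transform to a 3D point."""
    x, y, z = point
    homogeneous = [x, y, z, 1.0]
    return [sum(matrix[i][k] * homogeneous[k] for k in range(4)) for i in range(3)]


# Hypothetical sensor-to-ego transform: pure translation of (1, 0, 2).
sensor_to_ego = [
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0],
    [0.0, 0.0, 0.0, 1.0],
]
# Hypothetical ego-to-global transform: 90 degree rotation about z, then
# a translation of (10, 20, 0).
ego_to_global = [
    [0.0, -1.0, 0.0, 10.0],
    [1.0, 0.0, 0.0, 20.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

# Under the assumed convention, sensor -> ego is applied first, so it sits
# on the right-hand side of the product.
sensor_to_global = matmul4(ego_to_global, sensor_to_ego)
point_global = transform_point(sensor_to_global, (1.0, 0.0, 0.0))
```

With these values, the sensor-frame point (1, 0, 0) lands at (10, 22, 2) in the global frame.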
lidar_sources struct fields
Each list entry is a LidarSourceDataModel describing one LiDAR sensor in the scene:
| Field | Polars type | Description |
| --- | --- | --- |
| channel_name | String | Sensor channel name |
| sensor_token | String | Sensor token |
| translation | Array(Float32, 3) | Sensor translation vector |
| rotation | Array(Float32, 3x3) | Sensor rotation matrix |
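Consumers often want the translation and rotation fields as a single homogeneous matrix. The helper below is a hypothetical sketch of that packing, not part of the package. It operates on a plain dictionary such as a struct value would yield, and it makes no claim about which frame the extrinsics map into, since the field descriptions do not say.

```python
from typing import List, Sequence


def to_homogeneous(
    rotation: Sequence[Sequence[float]], translation: Sequence[float]
) -> List[List[float]]:
    """Pack a 3x3 rotation and a 3-element translation into a 4x4 matrix."""
    if len(rotation) != 3 or any(len(row) != 3 for row in rotation):
        raise ValueError("rotation must be 3x3")
    if len(translation) != 3:
        raise ValueError("translation must have 3 elements")
    # Each of the first three rows is [rotation row | translation component].
    rows = [list(rotation[i]) + [translation[i]] for i in range(3)]
    rows.append([0.0, 0.0, 0.0, 1.0])
    return rows


# Hypothetical lidar_sources entry with an identity rotation.
source = {
    "channel_name": "LIDAR_TOP",
    "sensor_token": "token-0",
    "translation": [1.0, 0.0, 2.0],
    "rotation": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
}
extrinsic = to_homogeneous(source["rotation"], source["translation"])
```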
category_mapping struct fields
| Field | Polars type | Description |
| --- | --- | --- |
| category_names | List(String) | Ordered list of category names |
| category_indices | List(Int32) | Corresponding category index values |
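Since category_names and category_indices are parallel lists, a reader typically zips them into a lookup. The following is a small sketch of that step; build_category_lookup is a hypothetical helper, and the category names and indices are invented for illustration.

```python
from typing import Dict, Sequence


def build_category_lookup(
    category_names: Sequence[str], category_indices: Sequence[int]
) -> Dict[str, int]:
    """Zip the parallel lists into a name -> index dictionary."""
    if len(category_names) != len(category_indices):
        raise ValueError("category_names and category_indices must have equal length")
    return dict(zip(category_names, category_indices))


# Hypothetical category_mapping struct value.
category_mapping = {
    "category_names": ["car", "truck", "pedestrian"],
    "category_indices": [0, 1, 2],
}
lookup = build_category_lookup(
    category_mapping["category_names"], category_mapping["category_indices"]
)
```

The length check matters because zip silently truncates to the shorter list, which would hide a malformed mapping.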
Nested data models
| Module | Schema class | Data model | Purpose |
| --- | --- | --- | --- |
| lidar_frames.py | LidarFrameDatasetSchema | LidarFrameDataModel | Point cloud paths, poses, and sweep transforms |
| lidar_sources.py | LidarSourceDatasetSchema | LidarSourceDataModel | LiDAR sensor channel name, token, and extrinsics |
| category_mapping.py | CategoryMappingDatasetSchema | CategoryMappingDataModel | Parallel lists of category names and indices |
| box3d_schemas.py | Box3DDatasetSchema | Box3DDataModel | Per-object 3D box parameters and metadata |
| frame_basic_metadata.py | (none) | FrameBasicMetadata | Shared per-frame metadata used during record generation |
boxes_3d struct fields
Each list entry is a Box3DDataModel with the following struct fields:
| Field | Polars type | Description |
| --- | --- | --- |
| box3d_params | Array(Float32, 10) | 3D box vector in Box3DFieldIndex order: (x, y, z, l, w, h, yaw, vx, vy, vz) |
| box3d_instance_id | String | Instance identifier for the box |
| box3d_dataset_label_name | String | Original dataset label name |
| box3d_label_name | String | Normalized training/evaluation label name |
| box3d_label_index | Int32 | Class index of box3d_label_name |
| box3d_num_lidar_pointclouds | Int32 | Number of LiDAR points in the box |
| box3d_num_radar_pointclouds | Int32 | Number of radar points in the box |
| box3d_valid | Boolean | Validity flag for this annotation |
| box3d_attributes | List(String) | Attribute tags associated with this box |
| box3d_coordinate | String | Coordinate frame identifier for the box representation |
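To make the box3d_params layout concrete, the sketch below indexes the 10-element vector through an integer enum. Only the ordering (x, y, z, l, w, h, yaw, vx, vy, vz) comes from the field description; the member names of Box3DFieldIndex shown here are assumed and the real enum may differ. unpack_box3d_params and the sample values are hypothetical.

```python
from enum import IntEnum
from typing import Dict, Sequence


class Box3DFieldIndex(IntEnum):
    """Assumed member names; only the ordering comes from the field description."""

    X = 0
    Y = 1
    Z = 2
    L = 3
    W = 4
    H = 5
    YAW = 6
    VX = 7
    VY = 8
    VZ = 9


def unpack_box3d_params(params: Sequence[float]) -> Dict[str, float]:
    """Map a 10-element box vector to named values, using the enum as the index."""
    if len(params) != len(Box3DFieldIndex):
        raise ValueError(f"expected {len(Box3DFieldIndex)} values, got {len(params)}")
    return {field.name.lower(): params[field] for field in Box3DFieldIndex}


# Hypothetical box values, chosen only for illustration.
params = [1.0, 2.0, 0.5, 4.5, 1.9, 1.6, 0.1, 3.0, 0.0, 0.0]
box = unpack_box3d_params(params)
yaw = params[Box3DFieldIndex.YAW]
```

Indexing through the enum rather than bare integers keeps call sites readable and makes a reordering of the vector a single-place change.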
Each nested module follows the same pattern: a *DatasetSchema class defines the Polars struct layout, and a matching *DataModel Pydantic class implements DataModelInterface for serialization. DatasetRecord.to_dictionary() delegates to these nested models when writing Parquet; DatasetRecord.load_from_dictionary() reconstructs them when reading back.
DatasetTableSchema, DatasetRecord, and every nested schema/data-model pair must be kept in sync. When adding new columns or fields (e.g. 3D bounding boxes), add entries to the relevant *DatasetSchema and *DataModel, then wire the new column into DatasetTableSchema and DatasetRecord.
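One way to guard against drift between a table schema and its record class is a small consistency check, for instance in a unit test. The sketch below is stdlib-only and rests on several assumptions: the dtype is a plain string placeholder instead of a Polars type, dataclasses stand in for Pydantic models, and ExampleTableSchema, ExampleRecord, DriftedRecord, and find_mismatches are hypothetical names.

```python
from dataclasses import dataclass, fields
from typing import List, NamedTuple


class DatasetTableColumn(NamedTuple):
    name: str
    dtype: str  # Placeholder for a Polars data type in this sketch.


class ExampleTableSchema:
    """Hypothetical two-column table schema."""

    SCENARIO_ID = DatasetTableColumn("scenario_id", "String")
    SAMPLE_INDEX = DatasetTableColumn("sample_index", "Int32")


@dataclass(frozen=True)
class ExampleRecord:
    scenario_id: str
    sample_index: int


@dataclass(frozen=True)
class DriftedRecord:
    # Missing sample_index on purpose, to show what the check reports.
    scenario_id: str


def schema_column_names(schema_cls: type) -> List[str]:
    """Collect names of class-level DatasetTableColumn attributes in definition order."""
    return [
        value.name
        for value in vars(schema_cls).values()
        if isinstance(value, DatasetTableColumn)
    ]


def find_mismatches(schema_cls: type, record_cls: type) -> List[str]:
    """Return names that appear in only one of the schema and the record."""
    schema_names = set(schema_column_names(schema_cls))
    record_names = {field.name for field in fields(record_cls)}
    return sorted(schema_names ^ record_names)


in_sync = find_mismatches(ExampleTableSchema, ExampleRecord)
drifted = find_mismatches(ExampleTableSchema, DriftedRecord)
```

Here in_sync is empty, while drifted names the column that exists in the schema but not in the record.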
Extending the schema
| Extension | How |
| --- | --- |
| New top-level column | Add a *DatasetSchema/*DataModel pair (or extend an existing one), then wire the column into DatasetTableSchema and DatasetRecord |
| New struct field | Add matching entries to the relevant *DatasetSchema and *DataModel classes |
See T4Dataset for a concrete example of how these schemas are populated from T4 annotations.
Implementation
| Path | Description |
| --- | --- |
| autoware_ml/databases/schemas/base_schemas.py | DatasetTableColumn, BaseFieldSchema, DataModelInterface |
| autoware_ml/databases/schemas/dataset_schemas.py | DatasetRecord and DatasetTableSchema |
| autoware_ml/databases/schemas/lidar_frames.py | LiDAR frame struct schema and data model |
| autoware_ml/databases/schemas/lidar_sources.py | LiDAR source struct schema and data model |
| autoware_ml/databases/schemas/category_mapping.py | Category mapping struct schema and data model |
| autoware_ml/databases/schemas/box3d_schemas.py | 3D box struct schema and data model |
| autoware_ml/databases/schemas/frame_basic_metadata.py | Shared per-frame metadata model |