Metrics¶
A metric accumulates data over an epoch and produces a scalar report at the end. Metrics are attached to a model from config and run during validation and test. Losses are handled by the model itself, not by metrics.
The design separates two roles:
- A suite (MetricSuite, a torchmetrics.Metric) is a task state-engine. It owns the accumulated state, its cross-GPU reduction, and the per-range dispatch. It does not decide which metrics run.
- A metric (Metric) is a small, self-contained, injectable object. It computes its own numbers from the state the suite builds, and declares which stages it runs in.
Which metrics run, and in which stages, is pure configuration. The suite is just the engine that feeds them.
What runs in each split¶
| Split | Losses | Metrics |
|---|---|---|
| train | logged | not run |
| val | logged | run, metrics whose stages include val |
| test | logged | run, metrics whose stages include test |
| predict | not run | not run |
Each metric declares its stages. The convention is that cheap headline metrics run in both val
and test, while the heavier reporting runs only in test, so validation epochs stay fast. This is
set per metric in config, not in code.
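
The stage convention can be sketched in a few lines. This is a minimal illustration, not the library's code: the `EvalStage` members, `MetricSpec`, and `metrics_for` are assumptions made up for the example, and only the metric names come from the built-in detection suite.

```python
from dataclasses import dataclass
from enum import Enum


class EvalStage(str, Enum):
    """Assumed stage enum; train and predict never run metrics."""

    VAL = "val"
    TEST = "test"


@dataclass(frozen=True)
class MetricSpec:
    """Hypothetical stand-in for a configured metric: a name plus its stages."""

    name: str
    stages: frozenset


def metrics_for(stage, specs):
    """Return the names of the metrics whose declared stages include `stage`."""
    return [spec.name for spec in specs if stage in spec.stages]


specs = [
    # Cheap headline metric: runs in both val and test.
    MetricSpec("MeanAP", frozenset({EvalStage.VAL, EvalStage.TEST})),
    # Heavier reporting: test only, so validation epochs stay fast.
    MetricSpec("HeadingAP", frozenset({EvalStage.TEST})),
    MetricSpec("Nds", frozenset({EvalStage.TEST})),
]

val_metrics = metrics_for(EvalStage.VAL, specs)
test_metrics = metrics_for(EvalStage.TEST, specs)
```

With this selection only `MeanAP` is evaluated during validation, while the test stage evaluates all three.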
Lifecycle¶
A suite runs the standard torchmetrics contract across an epoch.
- update(eval_out) runs once per batch on each GPU. It folds the batch into the suite's state and never talks to other GPUs.
- sync runs once at epoch end, inside compute. torchmetrics combines every GPU's state using the reduction declared for each state. This is the only cross-GPU step.
- compute() runs after sync. It builds a task state once overall and once per range, then asks each stage-applicable metric to evaluate that state and merges their reports.
result(stage) sets the reporting stage and calls compute. The mixin clones a suite per stage
and resets it at epoch start, so each instance reports for exactly one stage.
```mermaid
sequenceDiagram
    participant L as Lightning
    participant M as Model
    participant S as Suite
    participant Me as Metric
    loop each val batch
        L->>M: on_validation_batch_end(outputs, batch)
        M->>M: build_eval_output(batch, outputs)
        M->>S: update(eval_out)
    end
    L->>M: on_validation_epoch_end()
    M->>S: result(stage)
    S->>S: compute() syncs state across GPUs
    S->>S: state_for(range) builds the state
    S->>Me: evaluate(state, stage) for each metric in this stage
    Me-->>S: per metric report
    S-->>M: merged report
    M->>L: log under val/prefix/key
```
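
The per-stage clone and the reset, update, result order can be emulated without torchmetrics. The sketch below is a toy, single-process stand-in: `ToySuite` and its accuracy state are hypothetical, and the real suite additionally syncs its state across GPUs inside `compute`.

```python
import copy


class ToySuite:
    """Toy stand-in for a suite; the real one is a torchmetrics.Metric."""

    def __init__(self):
        self.stage = None
        self.reset()

    def reset(self):
        # Clear the accumulated state at epoch start.
        self.correct = 0
        self.total = 0

    def update(self, eval_out):
        # Fold one batch into local state; no cross-GPU communication here.
        pairs = zip(eval_out["pred"], eval_out["target"])
        self.correct += sum(int(p == t) for p, t in pairs)
        self.total += len(eval_out["target"])

    def compute(self):
        # In the real suite the cross-GPU sync happens before this body runs.
        return {"accuracy": self.correct / self.total}

    def result(self, stage):
        # Set the reporting stage, then compute.
        self.stage = stage
        return self.compute()


template = ToySuite()
# One independent clone per stage, so val and test never share state.
suites = {stage: copy.deepcopy(template) for stage in ("val", "test")}

suites["val"].reset()  # epoch start
for batch in (
    {"pred": [1, 0, 2], "target": [1, 1, 2]},
    {"pred": [0, 0], "target": [0, 1]},
):
    suites["val"].update(batch)

report = suites["val"].result("val")
```

Because each stage has its own clone, updating the val suite leaves the test suite's state untouched.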
State and reduction¶
Each piece of state is registered with add_state(name, default, dist_reduce_fx). The reduce
function is how torchmetrics combines that state across GPUs, chosen per state by what the
quantity is.
| Suite | State | dist_reduce_fx | Why |
|---|---|---|---|
| segmentation | one stacked confusion tensor, shape (ranges+1, C, C) | sum | counts are additive, so the global matrices are the per-rank ones summed |
| detection | per-frame prediction and GT tensors, as list states | None | each frame stays its own list element, so matching stays within a frame after the gather |
A confusion matrix is a bounded sufficient statistic, so segmentation keeps one matrix per range in a single stacked state and derives every metric from it. Detection mAP needs the raw per-frame samples because matching is score-ordered and happens inside each frame, so the detection states are kept as per-frame list elements and gathered with no reduction.
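
The two reduction styles can be compared on plain Python lists. This is a conceptual emulation only: `sum_reduce` and `gather_lists` are hypothetical helpers rather than torchmetrics functions, the confusion layout (rows as targets, columns as predictions) is an assumption, and the real gather may order elements differently.

```python
def sum_reduce(per_rank):
    """Element-wise sum of same-shaped C x C matrices, one per rank."""
    num_classes = len(per_rank[0])
    return [
        [sum(matrix[i][j] for matrix in per_rank) for j in range(num_classes)]
        for i in range(num_classes)
    ]


def gather_lists(per_rank):
    """Concatenate per-rank lists; each frame remains its own element."""
    return [frame for frames in per_rank for frame in frames]


# Segmentation-style state: counts are additive, so ranks are simply summed.
rank0_confusion = [[5, 1], [2, 4]]
rank1_confusion = [[3, 0], [1, 6]]
global_confusion = sum_reduce([rank0_confusion, rank1_confusion])

# Detection-style state: one entry of scored predictions per frame.
rank0_frames = [[("car", 0.9), ("car", 0.4)], [("bus", 0.7)]]
rank1_frames = [[("car", 0.8)]]
all_frames = gather_lists([rank0_frames, rank1_frames])
```

The summed matrix has the same fixed size regardless of how many frames were seen, while the gathered list grows with the dataset but keeps frame boundaries intact for matching.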
Class structure¶
```mermaid
classDiagram
    class TorchMetric["torchmetrics.Metric"] {
        +add_state(name, default, dist_reduce_fx)
        +update(eval_out)
        +compute()
        +reset()
    }
    class MetricSuite {
        +prefix : str
        +_required_keys : tuple[str, ...]
        +components : list[Metric]
        +update(eval_out)*
        +state_for(range)*
        +compute()
        +result(stage)
    }
    class Metric {
        +stages : frozenset[EvalStage]
        +evaluate(state, stage)*
    }
    class TaskSuite["Detection3DMetricSuite / Segmentation3DMetricSuite"]
    class TaskMetric["MeanAP / IoU / ..."]
    TorchMetric <|-- MetricSuite
    MetricSuite <|-- TaskSuite
    Metric <|-- TaskMetric
    MetricSuite o-- "0..*" Metric : runs injected
```
Method marks:
| Mark | Meaning |
|---|---|
| * | abstract, the subclass implements it |
| none | concrete, provided by the base |
A suite implements update and state_for and declares prefix and _required_keys. A metric
implements evaluate and declares stages. The suite holds a list of metrics it was given and
runs each one against the state. Adding a metric means adding a Metric subclass and listing it
in config, never editing the suite.
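
The split between abstract and concrete methods can be sketched with plain abstract base classes. This is an illustration under assumptions: `SuiteSketch` omits the torchmetrics base class, state accumulation, and ranges; the default `stages` value and the toy `CountSuite`, `HitRate`, and `TotalOnTest` classes are invented for the example.

```python
from abc import ABC, abstractmethod


class Metric(ABC):
    """Injectable metric: declares its stages and evaluates a prebuilt state."""

    stages = frozenset({"val", "test"})  # assumed default

    @abstractmethod
    def evaluate(self, state, stage):
        """Return a dict of scalar reports computed from `state`."""


class SuiteSketch(ABC):
    """Engine side: builds state and runs whichever metrics were injected."""

    prefix = ""

    def __init__(self, components):
        self.components = list(components)

    @abstractmethod
    def state_for(self, range_):
        """Build the task state for one range (None meaning overall)."""

    def result(self, stage):
        # Concrete in the base: evaluate every stage-applicable metric, merge reports.
        state = self.state_for(None)
        report = {}
        for metric in self.components:
            if stage in metric.stages:
                report.update(metric.evaluate(state, stage))
        return report


class CountSuite(SuiteSketch):
    prefix = "toy"

    def state_for(self, range_):
        return {"hits": 3, "total": 4}


class HitRate(Metric):
    def evaluate(self, state, stage):
        return {"hit_rate": state["hits"] / state["total"]}


class TotalOnTest(Metric):
    stages = frozenset({"test"})

    def evaluate(self, state, stage):
        return {"total": float(state["total"])}


suite = CountSuite([HitRate(), TotalOnTest()])
val_report = suite.result("val")
test_report = suite.result("test")
```

Adding `TotalOnTest` required no change to `CountSuite`: the suite only iterates over the components it was given.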
Built-in suites and metrics¶
| Suite | prefix | _required_keys | Available metrics |
|---|---|---|---|
| Detection3DMetricSuite | det3d | predictions, gt_boxes, gt_labels | MeanAP, HeadingAP, Nds, TpErrors |
| Segmentation3DMetricSuite | seg3d | seg_pred_labels, seg_target_labels, seg_coord | IoU, Accuracy, PrecisionRecallF1 |
Both suites are range-aware. Configure ranges (radial MetricRange windows) and every key a
metric emits is also emitted per range with a distance suffix, for example
test/seg3d/iou_car_0m_50m or test/det3d/mAP_car_50m_90m. Detection clips boxes per range.
Segmentation keeps one confusion matrix per range and uses seg_coord (per-point xy) to bucket
points.
Keys are logged as {split}/{prefix}/{key}, for example val/det3d/mAP. Checkpoint monitors and
Optuna targets point at these keys directly.
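
The key layout and the radial bucketing can be illustrated as follows. The helpers `range_suffix`, `log_key`, and `bucket_index` are hypothetical; the suffix format is inferred from the example keys above, and treating each window as closed at the minimum and open at the maximum is an assumption, not a documented rule.

```python
import math


def range_suffix(min_distance, max_distance):
    """Distance suffix in the style of the example keys, e.g. 0m_50m."""
    return f"{int(min_distance)}m_{int(max_distance)}m"


def log_key(split, prefix, key, suffix=None):
    """Compose {split}/{prefix}/{key}, with an optional per-range suffix."""
    name = key if suffix is None else f"{key}_{suffix}"
    return f"{split}/{prefix}/{name}"


def bucket_index(x, y, ranges):
    """Index of the first radial window containing the point, else None."""
    distance = math.hypot(x, y)
    for index, (low, high) in enumerate(ranges):
        if low <= distance < high:  # assumed boundary convention
            return index
    return None


ranges = [(0.0, 50.0), (50.0, 90.0)]

overall_key = log_key("val", "det3d", "mAP")
ranged_key = log_key("test", "seg3d", "iou_car", range_suffix(*ranges[0]))
near = bucket_index(3.0, 4.0, ranges)      # 5 m from the origin
far = bucket_index(30.0, 40.0, ranges)     # exactly 50 m from the origin
outside = bucket_index(100.0, 0.0, ranges)
```

A checkpoint monitor would then reference a string such as `overall_key` directly.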
Attaching metrics¶
model.metrics is a list of suites, so a joint segmentation and detection model lists two. Each
suite is given its components (the metrics it runs) and the stages each one runs in. The full
suite lives once in the task base config and reads its range buckets and per-class caps from the
metric_ranges and metric_eval_class_range interpolation variables. A variant retunes the suite
by overriding just those two variables. This indirection is deliberate: model.metrics is a list,
and Hydra replaces a list wholesale rather than merging it, so overriding the suite directly would
mean restating every field.
```yaml
# base config: the suite defined once, reading the tunable bits from variables
metric_eval_class_range: { car: 121.0, truck: 121.0, bus: 121.0, bicycle: 121.0, pedestrian: 121.0 }
metric_ranges:
  - { _target_: autoware_ml.metrics.base.MetricRange, name: 0-50m, min_distance: 0.0, max_distance: 50.0 }
  - { _target_: autoware_ml.metrics.base.MetricRange, name: 50-90m, min_distance: 50.0, max_distance: 90.0 }
model:
  metrics:
    - _target_: autoware_ml.metrics.detection3d.suite.Detection3DMetricSuite
      class_names: ${class_names}
      eval_class_range: ${metric_eval_class_range}
      ranges: ${metric_ranges}
      components:
        - { _target_: autoware_ml.metrics.detection3d.mean_ap.MeanAP, stages: [val, test] }
        - { _target_: autoware_ml.metrics.detection3d.heading_ap.HeadingAP, stages: [test] }
        - { _target_: autoware_ml.metrics.detection3d.nds.Nds, stages: [test] }
        - { _target_: autoware_ml.metrics.detection3d.tp_errors.TpErrors, stages: [test] }
```

```yaml
# variant: retune without restating the suite
metric_eval_class_range: { car: 102.0, pedestrian: 102.0 } # and the rest
```
What a model provides¶
One method. It maps the raw forward outputs to the flat dict the suites read. Model-specific work like box decoding happens here.
```python
class ModelA(BaseModel):
    def build_eval_output(self, batch, outputs):
        # Map raw forward outputs to the flat dict the suites read.
        return {
            "predictions": self.bbox_head.predict(outputs),
            "gt_boxes": batch["gt_boxes"],
            "gt_labels": batch["gt_labels"],
        }
```
The mixin feeds this dict into every attached suite. The model never calls update, compute,
or result.
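
How the eval output might reach each suite can be sketched as below. This is not the mixin's actual code: `RecordingSuite` and `feed_suites` are hypothetical, and checking `_required_keys` before calling `update` is an assumption about how a missing key could be surfaced early.

```python
class RecordingSuite:
    """Hypothetical stand-in for a suite that only records what it was fed."""

    def __init__(self, prefix, required_keys):
        self.prefix = prefix
        self._required_keys = tuple(required_keys)
        self.batches = []

    def update(self, eval_out):
        # Keep only the keys this suite declared; ignore the rest of the dict.
        self.batches.append({key: eval_out[key] for key in self._required_keys})


def feed_suites(suites, eval_out):
    """Send one eval output to every suite, failing early on missing keys."""
    for suite in suites:
        missing = [key for key in suite._required_keys if key not in eval_out]
        if missing:
            raise KeyError(f"{suite.prefix}: eval output is missing {missing}")
        suite.update(eval_out)


det3d = RecordingSuite("det3d", ["predictions", "gt_boxes", "gt_labels"])
eval_out = {
    "predictions": ["box_a"],
    "gt_boxes": ["box_b"],
    "gt_labels": [0],
    "extra": 1,  # unused by the detection suite
}
feed_suites([det3d], eval_out)
```

A joint model would return the union of the keys its suites require, and each suite would read only its own subset.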
Writing a custom metric¶
A metric is the unit of extension. Subclass Metric, declare the stages it runs in (or accept the
default), and read the suite's state. The example adds a per-class accuracy view to segmentation
without touching the suite.
```python
class PerClassAccuracy(Metric):
    def evaluate(self, state, stage):
        # Report recall for every class that has at least one target point.
        return {
            f"acc_class_{i}": float(state.recall[i].item())
            for i in range(state.num_classes)
            if bool(state.has_support[i])
        }
```
Add it to the suite's components list in config and its keys appear under the suite prefix. A new
metric family that needs new state is a new suite, which implements update and state_for.
Distributed runs¶
| Quantity | How it combines across GPUs |
|---|---|
| Loss | Lightning reduces the scalar with sync_dist=True |
| Metric | torchmetrics reduces each state by its dist_reduce_fx, then compute runs once |
Losses are means, so a mean across GPUs is correct. Metrics are not always linear, so each state
declares how it combines and torchmetrics applies it before computing. After sync the state is
identical on every rank, so the result is logged without sync_dist.
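
A small worked example shows why a nonlinear metric cannot simply be averaged across ranks. The numbers are made up, `iou` is a hypothetical helper, and the confusion layout (rows as targets, columns as predictions) is an assumption.

```python
def iou(confusion, cls):
    """IoU of one class from a confusion matrix (rows: target, cols: prediction)."""
    tp = confusion[cls][cls]
    fn = sum(confusion[cls]) - tp
    fp = sum(row[cls] for row in confusion) - tp
    return tp / (tp + fp + fn)


rank0 = [[1, 1], [0, 8]]
rank1 = [[9, 0], [1, 0]]

# Wrong for IoU: average the per-rank results.
mean_of_ious = (iou(rank0, 0) + iou(rank1, 0)) / 2

# What the sum reduction gives: pool the counts first, then compute once.
pooled = [
    [rank0[i][j] + rank1[i][j] for j in range(2)]
    for i in range(2)
]
iou_of_pooled = iou(pooled, 0)
```

Here the per-rank IoUs for class 0 are 1/2 and 9/10, so their mean is 0.7, while the pooled counts give 10/12, roughly 0.833. Reducing the state before computing is what keeps the reported value equal to the single-device result.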
Distributed eval padding
autoware-ml test runs on a single device by default, so there is no padding and the metrics
are exact. Pass --use-config-devices to evaluate on the config's devices. The caveat below
only applies when evaluation runs on more than one device, for example validation during
multi-GPU training or test with --use-config-devices on several GPUs.
Under DDP the validation sampler pads the last batch with repeated frames so the dataset
divides evenly across ranks, which double counts at most world_size - 1 frames. On a normal
validation set this is well under a tenth of a percent and is left uncorrected. A detection
suite could drop the duplicates by frame id, but a segmentation suite cannot, because its
confusion matrix has already pooled the points and a single frame can no longer be removed.
Bit-exact multi-device eval would instead use a non-padding sampler at the datamodule level,
which is out of scope for the metrics.
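
The size of the padding effect is simple arithmetic. The sketch below assumes a sampler that pads to the next multiple of the world size, as PyTorch's `DistributedSampler` does when `drop_last` is false; `padded_duplicates` is a hypothetical helper and the frame count is an invented example.

```python
def padded_duplicates(num_frames, world_size):
    """Number of repeated frames when padding to a multiple of world_size."""
    per_rank = (num_frames + world_size - 1) // world_size  # ceiling division
    return per_rank * world_size - num_frames


num_frames = 30001  # hypothetical validation set size
world_size = 8

duplicates = padded_duplicates(num_frames, world_size)
fraction = duplicates / num_frames
evenly_divisible = padded_duplicates(30000, world_size)
```

With these numbers each rank sees 3751 frames, 7 of the 30008 evaluated frames are repeats, and the double-counted share is about 0.02 percent. A dataset that already divides evenly gets no padding at all.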