# Design
SYNC.TOOLING is designed to ensure that a given distributed system, including ECUs, sensors, and other network equipment, is synchronized correctly.
While software like LinuxPTP and its command-line programs `ptp4l` and `phc2sys` can ensure reliable synchronization, they do not make their diagnostics available to other programs.
Further, equipment like sensors might not support common diagnostics protocols at all, necessitating custom means of ensuring correct synchronization.
## Requirements
SYNC.TOOLING is required to

- req.realtime provide online real-time[^3] diagnostics
  - req.ros to ROS 2 (SYNC.DIAG) and
  - req.web via web interface (SYNC.DOCTOR)
  - req.preexisting for pre-existing setups (e.g. vehicles set up before SYNC.TOOLING became available)
- req.replay provide offline analysis of recorded data
- be shippable as systemd services
- be one-click installable for troubleshooting purposes
- neither raise false positives (e.g. triggering an MRM on a transient fault)
- nor report actual faults too late or not at all
## General Assumptions
In designing this software suite, the following assumptions have been made:
- the time synchronization mechanism is PTPv2[^1]
- all ECUs that participate in PTP time synchronization
  - asm.ptp4l are running `ptp4l` to synchronize with other network devices
  - asm.phc2sys are running `phc2sys` to synchronize their internal clocks (if there are multiple)
  - asm.systemd are running `ptp4l` and `phc2sys` instances as systemd units
  - asm.no-other are not performing any other time synchronization, e.g. using `ptpd` or non-systemd units
- all sensors that participate in PTP provide a way to compare their clock with another one in the system
  - for example, by sending timestamps in their packets, which can then be compared with the receiving ECU's clock
  - asm.nebula sensors without native PMC support are expected to be supported through Nebula
- not all devices that participate in time synchronization are fully observable
  - for example, some devices might not have any diagnostics interfaces
  - some devices might only report status information, but no info on their parent or master PTP instances
- in case of synchronization loss, clocks take multiple seconds[^2] to drift far enough apart to be problematic
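The timestamp-comparison assumption above can be illustrated with a small sketch: a hypothetical helper that contrasts a sensor-reported timestamp with the receiving ECU's clock. The function name and sign convention are illustrative, not part of SYNC.TOOLING.

```python
import time

def measure_sensor_clock_diff(packet_timestamp_ns: int) -> int:
    """Compare a sensor-reported timestamp with the receiving ECU's clock.

    Returns time(ecu) - time(sensor) in nanoseconds; a positive value means
    the sensor's clock is behind the ECU's clock. In practice the receive
    time would be captured as close to packet arrival as possible.
    """
    receive_time_ns = time.time_ns()
    return receive_time_ns - packet_timestamp_ns
```

The resulting difference also contains the packet's transit latency, so a real measurement would have to account for (or bound) that latency.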
## Diagnostics Requirements

Diagnostics must be made available in real-time[^3] to ROS 2 `/diagnostics` in a manner
compatible with the Autoware Diagnostics API.
The diagnostics shall be updated as often as necessary, but in any case faster than the 5s[^2]
deadline imposed above. For the time being, 1s seems to be a good compromise[^4].
As for the actual diagnostics output, it is required that

- for every clock, the status of its synchronization to the grandmaster[^5] is diagnosed
- missing clocks are detected and reported
- cycles and disconnected subgraphs are detected and reported
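The last two checks map directly onto standard graph algorithms. A sketch using NetworkX (the graph analysis library listed in the Tech Stack below); node names and the return shape are invented for illustration:

```python
import networkx as nx

def find_structural_faults(graph: nx.DiGraph) -> dict:
    """Detect cycles and disconnected subgraphs in a sync graph."""
    # A valid synchronization hierarchy is acyclic (rooted at the
    # grandmaster), so any directed cycle is a fault.
    cycles = list(nx.simple_cycles(graph))
    # All clocks should hang together when edge direction is ignored; more
    # than one weakly connected component means some clocks are not
    # synchronized to the rest of the system.
    components = list(nx.weakly_connected_components(graph))
    return {"cycles": cycles, "disconnected": len(components) > 1}

g = nx.DiGraph()
g.add_edge("grandmaster", "ecu1")
g.add_edge("ecu1", "sensor1")
g.add_node("orphan_clock")  # not linked to anything
faults = find_structural_faults(g)
# faults["disconnected"] is True, faults["cycles"] is []
```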
## System Architecture
Given the above requirements and assumptions, the following architecture has been designed:
The Sync Worker instances use the systemd journal according to asm.systemd
to query the status of the `ptp4l` and `phc2sys` instances according to asm.ptp4l
and asm.phc2sys. This indirect communication satisfies req.preexisting.
Assumption asm.no-other eliminates the need for additional monitoring of other,
possibly interfering services.
Workers publish their updates via ROS 2 on the topic `/sync_diag/graph_updates`.
Sensors that participate in PTP synchronization are integrated using Nebula according to
asm.nebula. Nebula, too, publishes its updates on the same topic. This communication
mechanism satisfies req.replay by making record/replay functionality available
through `ros2 bag record` and `ros2 bag play`.
The Sync Master instance subscribes to the above topic and assembles all received information into a graph, which is subsequently used to diagnose the synchronization status of the system as a whole, as well as of every clock within it. The Sync Master provides SYNC.DOCTOR, which satisfies req.web, and SYNC.DIAG, which satisfies req.ros.
## Synchronization Graph

### Graph Structure
The synchronization graph (sync graph) is a directed graph where
- nodes represent clocks
- edges represent real or virtual links between two clocks
Each clock is a hardware clock device that can participate in PTP or PHC2SYS synchronization, such as the clock of a sensor or a network interface, or an ECU's system clock.
A link between two clocks can be
- (real) a PTP synchronization link
- (real) a PHC2SYS synchronization link
- (virtual) a measurement performed by means different from PTP or PHC2SYS
The graph is constructed from `GraphUpdate` messages, which provide pieces of
information about the graph observable by individual workers.
### Clock Naming
Clocks are identified by different names depending on the context:
- PTP uses a MAC address based clock identifier
- PTP4L and PHC2SYS use a PTP clock identifier, interface name or Linux clock device name
- Nebula uses the sensor's frame ID
Since it is not possible in general to observe all aliases of a clock from a single worker,
these identifiers are related to each other by the `ClockAliasUpdate`
update type.
In the user-facing SYNC.DOCTOR and SYNC.DIAG, if a clock has multiple aliases, its most human-readable representation is shown. See `get_most_human_readable_alias` for the specific ordering.
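A simplified sketch of such a preference ranking might look like the following; the concrete order shown here is an illustrative assumption, not the actual ordering implemented in `get_most_human_readable_alias`:

```python
# Hypothetical preference: sensor frame IDs and hostnames read best, raw PTP
# clock identifiers worst. The real ordering is defined in
# get_most_human_readable_alias.
ALIAS_PREFERENCE = ["sensor_id", "system_clock_id", "interface_id",
                    "linux_clock_device_id", "ptp_clock_id"]

def most_human_readable(aliases: list[tuple[str, str]]) -> str:
    """Pick the alias whose kind ranks earliest in ALIAS_PREFERENCE.

    Each alias is a (kind, name) pair,
    e.g. ("ptp_clock_id", "123456.fffe.654321").
    """
    kind, name = min(aliases, key=lambda a: ALIAS_PREFERENCE.index(a[0]))
    return name

print(most_human_readable([
    ("ptp_clock_id", "123456.fffe.654321"),
    ("system_clock_id", "my_host.sys"),
]))  # → my_host.sys
```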
### Statelessness
SYNC.TOOLING is designed to be stateless. The main motivation behind this is to avoid having to
read potentially tens of thousands of lines of the systemd journal in order to begin operations.
Instead, the sync graph operates only on information received in the last `timeout` seconds,
and the workers retransmit information at least once per second.
### Graph Update Types

The synchronization graph is constructed from various types of updates, each representing a mostly atomic piece of information about the system. These updates are sent by workers and other components to the diagnostic master, which assembles them into a coherent graph structure.
#### ClockAliasUpdate
Clock alias updates establish relationships between different identifiers that refer to the same physical clock. This is necessary because different components may refer to the same clock using different naming schemes:
- PTP Clock IDs: MAC address-based identifiers used by the PTP protocol, e.g. `123456.fffe.654321`
- System Clock IDs: Human-readable names like `my_host.sys`
- Interface IDs: Network interface identifiers, e.g. `my_host.eno1`
- Linux Clock Device IDs: Device names like `my_host.ptp0` (`/dev/ptp0`)
- Sensor IDs: Frame IDs used by sensors, e.g. `sensor@lidar/top`
When an alias update is received, the graph combines all nodes representing the same clock and updates all references to use the most human-readable identifier.
Example:

```python
ptp_clock_id = ClockId(ptp_clock_id=PtpClockId(id="123456.fffe.654321"))
system_clock_id = ClockId(system_clock_id=SystemClockId(hostname="my_host"))
alias_update = ClockAliasUpdate(aliases=[ptp_clock_id, system_clock_id])
graph_update = GraphUpdate(clock_alias_update=alias_update)
```

```mermaid
graph TD
    subgraph After
        A2["my_host.ptp2<br>(aliases: 123456.fffe.654321, my_host.ptp2)"]
    end
    subgraph Before
        A1["123456.fffe.654321"] ~~~ B1["my_host.ptp2"]
    end
```
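The node-merging step could be realized with NetworkX's `contracted_nodes`. This is a sketch under the assumption that the graph stores one node per known identifier; SYNC.TOOLING's actual merge logic is not shown here:

```python
import networkx as nx

def merge_aliases(graph: nx.DiGraph, keep: str, drop: str) -> nx.DiGraph:
    """Merge two nodes known to be the same clock, keeping the more
    human-readable identifier and redirecting all edges to it."""
    if drop not in graph:
        return graph
    if keep not in graph:
        # Only one of the two identifiers is known; just rename it.
        return nx.relabel_nodes(graph, {drop: keep})
    # self_loops=False discards edges that would make a clock sync to itself.
    return nx.contracted_nodes(graph, keep, drop, self_loops=False)

g = nx.DiGraph()
g.add_edge("123456.fffe.654321", "other_clock")
g = merge_aliases(g, keep="my_host.ptp2", drop="123456.fffe.654321")
# g now has an edge my_host.ptp2 -> other_clock
```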
#### ClockMasterUpdate
Clock master updates represent PTP master-slave relationships and their reported time offset. These updates include:
- clock_id: The slave clock that is being synchronized
- master: The master clock (optional - if not set, indicates that no master is present)
- master_offset_ns: The offset from master as reported by PTP (ignored if no master is present)
Role in Graph: Creates directed edges labeled as "master" links, representing the synchronization hierarchy.
Example:

```python
slave = ClockId(ptp_clock_id=PtpClockId(id="111111.fffe.111111"))
master = ClockId(ptp_clock_id=PtpClockId(id="222222.fffe.222222"))
master_update = ClockMasterUpdate(clock_id=slave, master=master, master_offset_ns=3)
graph_update = GraphUpdate(clock_master_update=master_update)
```

```mermaid
graph TD
    subgraph After
        A2["clock 1"] -- master (offset: 3ns) --> B2["clock 2"]
    end
    subgraph Before
        A1["clock 1"] ~~~ B1["clock 2"]
    end
```
An update without a master removes any previously recorded master link:

```python
standalone_clock = ClockId(ptp_clock_id=PtpClockId(id="111111.fffe.111111"))
master_update = ClockMasterUpdate(clock_id=standalone_clock)
graph_update = GraphUpdate(clock_master_update=master_update)
```

```mermaid
graph TD
    subgraph After
        A2["clock 1"] ~~~ B2["clock 2"]
    end
    subgraph Before
        A1["clock 1"] -- master (offset: 3ns) --> B1["clock 2"]
    end
```
#### PtpParentUpdate
PTP parent updates establish parent-child relationships in the PTP synchronization tree. These represent the PTP port hierarchy:
- clock_id: The child clock
- parent: The parent PTP port (includes clock ID, port number, and PTP domain)
Role in Graph: Creates directed edges labeled as "ptp_parent" links, representing the PTP port hierarchy. Port number 0 (reserved for internal PTP mechanisms such as local PMC queries and PHC2SYS synchronization) is discarded.
Example:

```python
child = ClockId(ptp_clock_id=PtpClockId(id="111111.fffe.111111"))
parent = ClockId(ptp_clock_id=PtpClockId(id="222222.fffe.222222"))
parent_port = PortId(clock_id=parent, port_number=1, ptp_domain=0)
parent_update = PtpParentUpdate(clock_id=child, parent=parent_port)
graph_update = GraphUpdate(ptp_parent_update=parent_update)
```

```mermaid
graph TD
    subgraph After
        A2["clock_id"] -- ptp_parent (domain: 0, port: 1) --> B2["parent.clock_id"]
    end
    subgraph Before
        A1["clock_id"] ~~~ B1["parent.clock_id"]
    end
```
#### Phc2SysUpdate
PHC2SYS updates represent synchronization relationships between hardware clocks and system clocks via PHC2SYS:
- src: The source hardware clock (e.g., network interface PTP clock)
- dst: The destination system clock
- clock_state: The slave clock state including offset measurements and servo state
Role in Graph: Creates directed edges labeled as "phc2sys" links, representing hardware-to-system clock synchronization.
Example:

```python
src_hw_clock = ClockId(ptp_clock_id=PtpClockId(id="111111.fffe.111111"))
dst_sys_clock = ClockId(system_clock_id=SystemClockId(hostname="my_host"))
clock_state = SlaveClockState(offset_ns=1, servo_state=ServoState.SERVO_LOCKED)
phc2sys_update = Phc2SysUpdate(
    src=src_hw_clock, dst=dst_sys_clock, clock_state=clock_state
)
graph_update = GraphUpdate(phc2sys_update=phc2sys_update)
```

```mermaid
graph TD
    subgraph After
        A2["src"] -- phc2sys (offset: 1ns, servo: locked) --> B2["dst"]
    end
    subgraph Before
        A1["src"] ~~~ B1["dst"]
    end
```
#### ClockDiffMeasurement
Clock difference measurements represent time offset measurements between clocks that are not performed by PTP or PHC2SYS directly. These are typically used for:
- Sensor timestamp comparisons with packet ingress times
- Sanity checks like reading and comparing clock timestamps manually
Fields:
- src: The source clock for the measurement
- dst: The destination clock for the measurement
- diff_ns: The time difference, `time(dst) - time(src)`. Can be negative.
Role in Graph: Creates directed edges labeled as "measurement" links, representing virtual synchronization relationships based on external measurements.
Example:

```python
src = ClockId(ptp_clock_id=PtpClockId(id="111111.fffe.111111"))
dst = ClockId(sensor_id=SensorId(frame_id="lidar/top"))
diff_measurement = ClockDiffMeasurement(src=src, dst=dst, diff_ns=20000)
graph_update = GraphUpdate(clock_diff_measurement=diff_measurement)
```

```mermaid
graph TD
    subgraph After
        A2["src"] -- measurement (diff: 20 μs) --> B2["dst"]
    end
    subgraph Before
        A1["src"] ~~~ B1["dst"]
    end
```
#### PortStateUpdate
Port state updates report the operational state of PTP ports:
- port_id: The PTP port identifier
- port_state: The current state of the port (e.g., LISTENING, MASTER, SLAVE, etc.)
Role in Graph: Stores port state information for diagnostic purposes. Port number 0 (internal PTP mechanisms) is discarded.
Example:

```python
clock = ClockId(ptp_clock_id=PtpClockId(id="111111.fffe.111111"))
port = PortId(clock_id=clock, port_number=1, ptp_domain=0)
port_state_update = PortStateUpdate(port_id=port, port_state=PortState.PS_SLAVE)
graph_update = GraphUpdate(port_state_update=port_state_update)
```

```mermaid
graph TD
    subgraph After
        A2["port<br>PS_SLAVE"]
    end
    subgraph Before
        A1["port<br>PS_LISTENING"]
    end
```
#### SelfReportedClockStateUpdate
Self-reported clock state updates contain status information directly reported by clocks (typically sensors):
States:
- UNSYNCHRONIZED: Clock is not synchronized
- TRACKING: Clock is attempting to synchronize but not yet within tolerance
- LOCKED: Clock is synchronized within tolerance
- LOST: Clock was previously synchronized but has lost synchronization
Role in Graph: Stores self-reported synchronization status for diagnostic evaluation.
Example:

```python
sensor = ClockId(sensor_id=SensorId(frame_id="lidar/top"))
clock_state_update = SelfReportedClockStateUpdate(
    clock_id=sensor, state=SelfReportedClockStateUpdate.State.LOCKED
)
graph_update = GraphUpdate(self_reported_clock_state_update=clock_state_update)
```

```mermaid
graph TD
    subgraph After
        A2["sensor<br>LOCKED"]
    end
    subgraph Before
        A1["sensor<br>UNSYNCHRONIZED"]
    end
```
#### Status Messages
The following update types provide status and error information from PTP and PHC2SYS components:
- Reports warnings and errors from PTP4L instances; includes the affected clock ID and severity level
- Reports warnings and errors specific to PTP ports; includes the affected port ID and severity level
- Reports warnings and errors from PHC2SYS instances; includes the source clock, affected destination clocks, and severity level
Role in Graph: These messages are currently stored but not actively used for graph construction. They provide additional diagnostic context for troubleshooting synchronization issues.
#### Update Processing
When updates are received, the graph ensures consistency by:
- Clock Creation: New clocks referenced in updates are automatically added to the graph
- Alias Resolution: All clock references are updated to use canonical (most human-readable) identifiers
- Edge Management: Multiple edge types can exist between the same pair of clocks (master, ptp_parent, phc2sys, measurement)
- Self-Loop Prevention: Updates that would create self-loops (a clock synchronizing to itself) are ignored
- Port Tracking: Port information is maintained separately from the main graph structure
The graph maintains both the main synchronization structure and metadata about ports and clock states, enabling comprehensive diagnostic analysis of the entire time synchronization system.
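The processing rules above amount to a dispatch over the update kind. In the real system the kind is the Protobuf oneof on `GraphUpdate`; this standalone sketch models it as a plain (kind, payload) pair, and the handler names and dict-based graph are illustrative assumptions:

```python
def _apply_alias(graph: dict, payload: dict) -> None:
    # Record that these identifiers refer to the same physical clock.
    graph.setdefault("aliases", []).append(payload["aliases"])

def _apply_master(graph: dict, payload: dict) -> None:
    slave, master = payload["clock_id"], payload.get("master")
    if master is None or master == slave:
        return  # no master present, or a self-loop: ignore
    # Edges are keyed by clock pair; other edge kinds (ptp_parent, phc2sys,
    # measurement) would coexist under different kinds for the same pair.
    graph.setdefault("edges", {})[(slave, master)] = {
        "kind": "master",
        "offset_ns": payload["master_offset_ns"],
    }

HANDLERS = {
    "clock_alias_update": _apply_alias,
    "clock_master_update": _apply_master,
    # ... one handler per update type described above
}

def apply_update(graph: dict, kind: str, payload: dict) -> None:
    HANDLERS[kind](graph, payload)

graph: dict = {}
apply_update(graph, "clock_master_update",
             {"clock_id": "clock1", "master": "clock2", "master_offset_ns": 3})
# graph["edges"] now contains a "master" edge from clock1 to clock2
```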
## Tech Stack
The following technologies are used:
| Technology | Usage | Rationale |
|---|---|---|
| Python 3.10 | All program logic | Type system, ease of interfacing with, development speed |
| Protobuf | Internal interfaces | Support for sum types (oneof), self-referential data structures (e.g. trees) |
| ROS 2 | Transport layer | Familiarity, no additional network setup necessary |
| ROS 2 | Diagnostics (SYNC.DIAG) | Interoperability with Autoware |
| Flask | Web server (SYNC.DOCTOR) | Fast and simple, other frameworks such as FastAPI would be fine too |
| Apache ECharts | Graph rendering (SYNC.DOCTOR) | Design, smoothness, ease of integration |
| NetworkX | Graph analysis | De-facto standard graph analysis library for Python |
[^1]: Specifically IEEE 1588v2 (PTPv2), IEEE 802.1AS (gPTP), or AUTOSAR EthTSyn (gPTP Automotive Profile)

[^2]: This should be in the order of tens of seconds, but we are, somewhat arbitrarily, defining this as 5s here.

[^3]: Both in the sense of the strict definition (the computations must complete by a certain periodic deadline), and in the sense that diagnostics are live (at most a few seconds out of date). See real-time computing.

[^4]: This allows for momentary faults in communication without raising a diagnostic error. Further, some tools like `pmc` are too slow to operate reliably at a sub-second frequency.

[^5]: The term "grandmaster" is defined in the PTP standard, but the usage here refers to the clock that all other clocks synchronize to, even through means other than PTP (such as PHC2SYS).