AHB Chiplet Communication
Introduction
For small M-class microcontroller SoCs, particularly those built around Arm Cortex-M0, M0+, and M3 processors, AHB is the standard on-chip bus interconnect. AHB is an inherently blocking transfer protocol: a bus manager must receive the response to its current transaction before it can issue the next. This works well for low-throughput, low-latency interconnects within a single die, but becomes problematic when the bus fabric must stretch across chiplets. Read transactions are especially concerning: the read data must return before the bus is relinquished, and if long latencies are involved (multiple hops between chiplets, clock domain crossings, serialisation delays), the entire bus stalls for the duration.
This project, called TideLink, extends a generic AXI chiplet controller built around the open-source Wlink die-to-die link layer, with runtime master/slave role selection and an I2C sideband. It uses Arm XHB500 AHB-to-AXI bridges to interface with the AXI-based inter-chiplet transport. TideLink provides three independent communication paths that share a single die-to-die PHY, each solving a distinct class of chiplet communication problem: a credit-based packet FIFO, a CAM-based address translator, and a Precision Time Protocol (PTP) clock synchronisation engine.
The AHB Blocking-Bus Problem
AHB is a blocking protocol. A manager must receive the response to its current transaction before issuing the next one. For read transactions over a chiplet link, this creates a severe stall:
- The host CPU issues an AHB read. The address phase completes in one cycle.
- The TideLink bridge must hold HREADY low while the request crosses the link to the remote slave, the remote slave processes it, and the read data returns across the link.
- Even a modest 100 ns one-way link latency at 100 MHz AHB introduces a minimum 20-cycle stall per read — and the entire host AHB bus is frozen for the duration.
AHB does define SPLIT and RETRY mechanisms that allow a slave to release the bus during a long-latency response, but these require arbiter support and are absent from Cortex-M bus matrices and virtually all existing Cortex-M peripherals. They are not a practical mitigation.
Writes are less critical — they can be buffered and issued as fire-and-forget — but read performance over a transparent AHB bridge degrades dramatically with link latency. For latency-sensitive or high-bandwidth use cases, transparent AHB bridging is insufficient on its own.
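The scale of the stall can be estimated with a back-of-the-envelope model. The helper below is purely illustrative (not part of TideLink) and deliberately optimistic: it counts only round-trip flight time and ignores serialisation, clock domain crossings, and remote slave access cycles, so real stalls are strictly longer.

```c
#include <assert.h>
#include <stdint.h>

/* Minimum AHB read stall, in bus cycles, for a given one-way link
 * latency. Counts only round-trip flight time; serialisation, CDC,
 * and remote-slave access add further cycles on top of this. */
static uint32_t read_stall_cycles(uint32_t one_way_ns, uint32_t bus_mhz)
{
    uint64_t round_trip_ns = 2ull * one_way_ns;
    /* ceil(round_trip_ns * bus_mhz / 1000) */
    return (uint32_t)((round_trip_ns * bus_mhz + 999ull) / 1000ull);
}
```

At 100 MHz, a 100 ns one-way latency already yields a 20-cycle stall per read, and the host bus can do nothing else in the meantime.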
The Solution: A Three-Path Architecture
The TideLink project addresses the full range of chiplet communication requirements through three independent paths, all sharing a single die-to-die PHY and independently flow-controlled so that traffic on one path cannot starve or be starved by traffic on another.
Path 1 — Transparent AHB Bridge (Control-Plane Traffic)
An XHB500 AHB-to-AXI bridge converts AHB transactions to AXI, which natively supports outstanding (non-blocking) transactions. The AXI chiplet controller carries the transactions over the link. An XHB500 AXI-to-AHB bridge reconstructs AHB on the remote side. A CAM-based address translator remaps local bus addresses to remote address ranges using two independent translation channels, each with 8 programmable match/replace rules that can be configured at runtime via APB.
This path is used for control-plane access: configuration writes, memory-mapped access to remote peripherals, and debugging. Latency is acceptable for these use cases, and the programming model is completely transparent — a CPU or DMA engine on one chiplet issues AHB transactions that are forwarded and executed on the remote chiplet's bus fabric without any software awareness of the link.
Path 2 — Mailbox Packet FIFO (Data-Plane and Latency-Sensitive Traffic)
Rather than bridging AHB reads transparently, TideLink exposes a FIFO mailbox on each chiplet. Software on the sending side constructs a descriptor packet — specifying transaction type, source and destination chiplet IDs, addresses, burst length, and a transaction tag — and writes it word-by-word into a TX aperture. A dedicated Wlink flow-control (FC) node (data_id=0xa1, 48-bit width) carries the words across the link directly into the remote chiplet's receive FIFO. Software on the receiving side is interrupted when a complete packet arrives, pops the descriptor, performs the local transaction, and writes a response packet back through its own TX aperture.
This path eliminates bus stalling entirely. The CPU writes a handful of words to a local peripheral and is immediately free. The bus is never held waiting for a remote response. For read requests, the host CPU writes only a 4-word descriptor and continues executing — the remote CPU performs the local reads and sends the data back asynchronously as a response packet.
Path 3 — PTP Clock Synchronisation (Time-Plane Traffic)
In multi-chiplet systems, a common time reference is essential for coordinating events, timestamping data, and implementing distributed protocols. TideLink integrates a Precision Time Protocol subsystem that synchronises the Precision Hardware Clock (PHC) across chiplets using dedicated Wlink short packets (data_id=0x50 for SYNC, 0x51 for DELAY_REQ). This path bypasses the FC state machine entirely — no credits, replay buffers, or CRC overhead — consuming 67% less link bandwidth than long packets and offering tighter timing characteristics.
All three paths are necessary. The transparent bridge provides simple memory-mapped access with no software overhead. The mailbox provides scalable, bulk, interrupt-driven data movement without AHB bus stalling. The PTP path provides autonomous clock synchronisation. Together they cover the full range of chiplet communication requirements for Cortex-M class systems.
Relationship to Wlink
Wlink is a layered chiplet communication stack originally developed by WAV (now open-source):
- Application layer: Protocol-specific nodes that convert bus transactions into Wlink packets. Wlink natively supports AXI, APB, and TileLink application nodes. AHB is not natively supported — this is TideLink's role.
- Link layer: Flow control (FC state machines), ECC (MIPI CSI/DSI SEC/DED), byte striping across lanes, TX/RX routing.
- PHY layer: Configurable — GPIO, SerDes, Bunch-of-Wires, or custom. Up to 256 asymmetric lanes. TideLink uses 8 GPIO lanes by default.
TideLink extends Wlink with two additional application-layer nodes. The mailbox path adds a dedicated FC node (data_id=0xa1, 48-bit) that provides the streaming valid/ready interface for FIFO data. The PTP path uses Wlink's native short packet mechanism (32-bit, with Hamming SEC/DED ECC) for low-latency timestamp exchange. The regular AHB bridge path uses the existing AXI application nodes within the chiplet controller. The Wlink instance is regenerated from Chisel source with the TideLink-specific FC node configuration.
TideLink wraps Wlink in a generic chiplet controller (axi_chiplet_controller) that adds runtime master/slave role selection via a strap pin and APB register, I2C sideband with independent master and slave cores for out-of-band configuration, and Wlink power-on-reset gating until the role is locked. This allows a single TideLink to serve as either endpoint in an asymmetric chiplet pair.
Architecture Detail
TideLink Top-Level Integration
The top-level module (tidelink_top) presents five AHB slave ports, one AHB master port, an APB configuration port, and dedicated interfaces for the PHC clock domain, chiplet controller role selection, and I2C sideband:
| Port | Direction | Purpose |
|---|---|---|
| ahb_sub | Slave | Regular AHB access to the remote side (via XHB500, address-translated) |
| ahb_tx | Slave | TideLink TX aperture (direct to FC node, same aperture size as the remote RX FIFO) |
| ahb_fifo | Slave | Local RX FIFO data window (pop received packets) |
| ahb_adr | Slave | Address translator configuration |
| ahb_ptp | Slave | PTP TX write port (CPU writes here to trigger PTP short-packet messages) |
| apb | Slave | Unified configuration port (0x0000–0x1FFF: Wlink controller, 0x2000–0x203F: TideLink config + PTP registers) |
| ahb_mng | Master | Incoming transactions from the remote side (via XHB500) |
Additionally, it exposes:
- PHY pads for the die-to-die link (8 GPIO lanes by default).
- PHC clock domain interface — a full bidirectional interface comprising hardware capture trigger and timestamp outputs (to/from the external PHC), free-running PHC time inputs, PPS pulse, and phase-step/frequency-adjust outputs from the servo.
- Role selection — `role_strap_i` (external strap pin), `role_is_master_o`, `role_locked_o`.
- I2C sideband — tristate SCL/SDA pins plus an AXI slave port for CPU-initiated I2C master transactions.
- Five interrupt outputs — `released_credits_irq`, `doorbell_irq`, `packet_committed_irq`, `ptp_irq`, `wlink_irq`.
- Servo status — `servo_locked` output.
- General bus — 32-bit bidirectional interrupt forwarding across the link.
- Scan/DFT — scan mode, clock, shift, chain in/out.
Receive-Side FIFO Subsystem
The receive-side FIFO subsystem (tidelink_fifo_ahb) is the local mailbox buffer on each chiplet. It wraps:
- A 16 KB SRAM backing store with technology-specific implementations (FPGA, ASIC, or generic behavioural RTL).
- A FIFO controller (`tidelink_fifo_ctrl`) that manages circular read/write pointers, packet-boundary framing, and credit counting. The controller uses the first word of each incoming packet as a length field to detect packet boundaries, firing a `packet_committed_irq` interrupt when a complete packet has been received. The controller supports two write sources: standard AHB writes (2-phase protocol) and a direct FC write path that bypasses the AHB bus for single-cycle writes, doubling write throughput from the FC adapter.
- An APB register block (`tidelink_apb_regs`) for configuration (pair base address, credit release threshold), status (overrun, underrun, master error, packet committed), credit accumulators, doorbell, pair credit counter, and pass-through access to PTP, servo, and chiplet controller registers. The register block is organised into five regions:
  - Region 0 — FIFO configuration: pair base address, release threshold, packet word length, credit count, status, doorbell, flush control.
  - Region 1 — Pair-side accumulators: released credits (write-accumulate/read-clear), doorbell response, pair credit counter with consume and enable registers.
  - Region 2 — PTP and hardware sync initiator: PTP control/status/RX payload pass-through, HW sync enable/interval/status.
  - Region 3 — Servo configuration: mode (Grandmaster/Subordinate), PI gains (KP, KI), step threshold, status (locked, last delay, NS_INCR_FRAC), and mailbox write registers for incoming servo timestamps.
  - Region 4 — Chiplet controller register pass-through for Wlink configuration.
- A returner (`tidelink_returner`) — a 3-channel priority-arbitrated AHB master that sends credit-release deltas and doorbell notifications back to the remote side. As the CPU reads data from the FIFO, freed word counts accumulate until they reach a configurable release threshold, at which point a credit delta is returned to the remote sender. The three channels are prioritised: credit release (highest), doorbell response, and reset handshake (lowest).
Credit accounting is handled automatically. The maximum credit count is derived from the SRAM size (4096 words for a 16 KB FIFO). Each packet costs its word length plus one (for the length word itself). Credits are decremented on write and incremented on read in a circular buffer scheme. Setting the release threshold to zero passes credits through immediately for backward compatibility.
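The sender-side view of this accounting can be sketched as a simple software model. The type and function names below are illustrative, not the actual driver API; note that the hardware itself does not guard against consuming more credits than are available (see Known Limitations), so a software check like this one is the only protection.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* 16 KB RX FIFO = 4096 x 32-bit words of credit. */
#define TIDELINK_MAX_CREDITS 4096u

typedef struct { uint32_t credits; } credit_ctr_t;

static void credits_init(credit_ctr_t *c) { c->credits = TIDELINK_MAX_CREDITS; }

/* Consume credits before sending a packet: each packet costs its word
 * length plus one for the length word itself. Returns false if the
 * packet would overrun the remote FIFO. */
static bool credits_consume(credit_ctr_t *c, uint32_t payload_words)
{
    uint32_t cost = payload_words + 1u;   /* +1 for the length word */
    if (cost > c->credits) return false;
    c->credits -= cost;
    return true;
}

/* Apply a credit-release delta sent back by the remote returner. */
static void credits_release(credit_ctr_t *c, uint32_t delta)
{
    c->credits += delta;
}
```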
FC Adapter
The FC adapter (tidelink_fc_adapter) bridges the AHB domain to the Wlink FC node. It handles traffic through a priority-arbitrated TX path and a stateless RX demultiplexer:
Transmit side: The adapter presents a write-only AHB slave (the TX aperture) through which the CPU writes packet words. Each 32-bit AHB write is combined with the AHB address offset to form a 48-bit FC word: 2 bits of packet type, 14 bits of address offset within the 16 KB aperture, and 32 bits of payload. The adapter also intercepts the returner's AHB master writes — credit deltas and doorbells — and re-encodes them as SIDEBAND FC packets on the same FC node, with the returner's target register offset carried in the address field. The PTP servo can also inject FC SIDEBAND packets carrying timestamp data directly to the remote side's mailbox registers. TX priority is: returner (highest) > servo SIDEBAND > TX aperture (lowest).
Receive side: The adapter accepts incoming 48-bit FC words and routes them by packet type: FIFO_DATA words are written directly to the FIFO data window via the direct FC write path (bypassing the AHB bus), and SIDEBAND words are routed to the APB configuration registers (targeting the appropriate mailbox, servo, or controller register). Each FC word is self-describing — it carries its own destination address and routing tag — so the RX path is entirely stateless.
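The 48-bit FC word format described above can be modelled as a pack/unpack pair. The field order shown (type in the top two bits, then the 14-bit aperture offset, then the 32-bit payload) and the type encodings are assumptions for illustration; the RTL defines the authoritative layout.

```c
#include <assert.h>
#include <stdint.h>

/* Packet-type encodings assumed for illustration. */
enum { FC_FIFO_DATA = 0, FC_SIDEBAND = 1 };

/* [47:46] packet type, [45:32] address offset within the 16 KB
 * aperture, [31:0] payload. */
static uint64_t fc_pack(uint32_t type, uint32_t offset, uint32_t payload)
{
    return ((uint64_t)(type & 0x3u) << 46) |
           ((uint64_t)(offset & 0x3FFFu) << 32) |
           (uint64_t)payload;
}

static void fc_unpack(uint64_t word, uint32_t *type, uint32_t *offset,
                      uint32_t *payload)
{
    *type    = (uint32_t)((word >> 46) & 0x3u);
    *offset  = (uint32_t)((word >> 32) & 0x3FFFu);
    *payload = (uint32_t)(word & 0xFFFFFFFFu);
}
```

Because the type and destination offset travel inside every word, the RX demultiplexer can route each word with no per-stream state, which is what makes the receive path stateless.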
Address Translator
The address translator (tidelink_addr_translator) provides APB-configurable address remapping for the transparent AHB bridge path. It contains two independent translation channels, each backed by 8 programmable CAM-based match/replace rules (parameterised via NUM_RULES). Each rule matches on the upper bits of the incoming address and replaces them with a configured output pattern, allowing software to map local address ranges to arbitrary remote address ranges. This CAM-based approach reduces register storage from 2,048 FFs (for a full 256-entry segment table) to approximately 169 FFs per channel, with no reduction in practical flexibility for typical chiplet address maps.
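The rule lookup behaves like a small CAM, which can be modelled behaviourally as follows. The struct layout and function names are illustrative; the RTL parameterises the rule count via NUM_RULES (8 per channel here).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One CAM translation rule: match on the masked upper address bits,
 * replace them with the configured pattern. */
typedef struct {
    uint32_t match;    /* pattern for the upper address bits */
    uint32_t mask;     /* which bits participate in the match */
    uint32_t replace;  /* replacement for the matched bits */
    bool     valid;
} xlat_rule_t;

/* First-hit lookup across the rule set; the untranslated lower bits
 * pass through unchanged. Returns false if no rule matches. */
static bool xlat_lookup(const xlat_rule_t *rules, int n,
                        uint32_t addr, uint32_t *out)
{
    for (int i = 0; i < n; i++) {
        if (rules[i].valid && ((addr & rules[i].mask) == rules[i].match)) {
            *out = rules[i].replace | (addr & ~rules[i].mask);
            return true;
        }
    }
    return false;
}
```

Storing 8 match/mask/replace tuples per channel is what brings the cost down to roughly 169 flip-flops, versus 2,048 for a flat 256-entry segment table.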
Packetisation
TideLink packets are a software convention imposed on the raw FIFO word stream. The first word written to the TX aperture at address offset 0x0000 is a length field specifying the number of words that follow. The next three words form a descriptor header:
- Word 1: Packet type (RD_REQ, WR_REQ, RD_RSP, WR_RSP, ERROR), source and destination chiplet IDs (8-bit each), transaction tag (8-bit), status, and burst type.
- Word 2: 32-bit destination address on the remote chiplet.
- Word 3: Burst length and beat size.
- Words 4+: Data payload (for write requests and read responses).
Hardware is unaware of packet semantics — it transports each 32-bit word independently as a FIFO_DATA FC packet. The receiving CPU reconstructs the packet by reading the length word first, then popping the descriptor and payload from the local RX FIFO.
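Since the packet format is a software convention, the sending driver simply assembles the words before writing them to the TX aperture. The sketch below builds a 4-word RD_REQ descriptor; the bit positions within word 1 (and the omission of the status and burst-type fields) are an assumed layout for illustration, since the exact assignments are set by the TideLink drivers.

```c
#include <assert.h>
#include <stdint.h>

enum pkt_type { RD_REQ, WR_REQ, RD_RSP, WR_RSP, PKT_ERROR };

/* Descriptor word 1: packet type, source/destination chiplet IDs, and
 * transaction tag (status and burst-type fields omitted for brevity). */
static uint32_t desc_word1(enum pkt_type type, uint8_t src_id,
                           uint8_t dst_id, uint8_t tag)
{
    return ((uint32_t)type << 24) | ((uint32_t)src_id << 16) |
           ((uint32_t)dst_id << 8) | tag;
}

/* A read request is the length word plus the 3-word descriptor header;
 * there is no data payload. */
static void build_rd_req(uint32_t desc[4], uint8_t src, uint8_t dst,
                         uint8_t tag, uint32_t remote_addr,
                         uint32_t burst_len)
{
    desc[0] = 3;                                  /* words that follow */
    desc[1] = desc_word1(RD_REQ, src, dst, tag);
    desc[2] = remote_addr;                        /* remote destination */
    desc[3] = burst_len;                          /* beat size omitted */
}
```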
Write and Read Mechanisms
For a write request: The host CPU constructs a WR_REQ packet (descriptor plus data payload) and writes it word-by-word to the TX aperture. Each word is forwarded over the FC node to the remote FIFO. On arrival of the final word, packet_committed_irq fires on the device side. The device CPU pops the descriptor, performs the requested local AHB writes at the destination address, and optionally sends a WR_RSP acknowledgement back.
For a read request: The host CPU writes only the RD_REQ descriptor (4 words, no data payload) to the TX aperture and is immediately free — the AHB bus is never stalled waiting for remote data. The device CPU receives the descriptor, performs local AHB reads at the specified address, constructs an RD_RSP packet containing the descriptor header and read data, and writes it back through the device TX aperture. The RD_RSP traverses the link and arrives in the host RX FIFO, triggering packet_committed_irq on the host. The host CPU then pops the response data. This asynchronous round-trip avoids the AHB blocking-bus problem entirely: the host CPU was free for other work for the full duration of the remote read.
Precision Time Protocol (PTP) Subsystem
Motivation
In multi-chiplet systems, a common time reference is critical. In the reference deployment, one chiplet has Ethernet connectivity and synchronises to an external PTP Grandmaster via standard IEEE 1588. Other chiplets in the system have no direct Ethernet access. TideLink PTP propagates the disciplined time from the Ethernet-connected chiplet (acting as a local Grandmaster) to all other chiplets (Subordinates) over the die-to-die link, creating a two-level PTP hierarchy:
```
External PTP Grandmaster
         │
         │  IEEE 1588 PTP (Ethernet)
         ▼
Chiplet A (Grandmaster)   ◄── Ethernet-connected
         │
         │  TideLink PTP (die-to-die, short packets 0x50/0x51)
         ▼
Chiplet B (Subordinate)   ◄── No Ethernet
```
For multi-hop deployments, TideLink supports cascaded PTP synchronisation: a Subordinate that has converged can act as a Grandmaster to a further chiplet. The PHC_LOCK_GATE_EN parameter gates the hardware sync initiator on an external phc_locked_i signal, ensuring that a mid-chain chiplet does not begin forwarding SYNC messages until its own clock is locked to the upstream source.
Protocol
TideLink PTP implements a simplified two-message clock synchronisation protocol inspired by IEEE 1588. The exchange uses SYNC and DELAY_REQ messages carried as Wlink short packets (32 bits on wire: 8-bit ECC, 16-bit payload, 8-bit data_id). No follow-up messages are required because timestamps are captured in hardware at the exact moment of packet handshake.
The protocol flow is:
- Grandmaster sends SYNC: The PTP module waits for `tx_router_idle` (ensuring no other traffic is in the TX pipeline), then simultaneously asserts `hw_capture` (capturing timestamp t1 in the PHC) and transmits the SYNC short packet.
- Subordinate receives SYNC: The PTP module asserts `hw_capture` on receipt, capturing t2. An interrupt fires.
- Subordinate sends DELAY_REQ: Same idle-gated process, capturing t3.
- Grandmaster receives DELAY_REQ: Captures t4. An interrupt fires.
- Timestamp exchange: t1 and t4 are sent to the Subordinate (via the FC SIDEBAND path or the mailbox FIFO).
- Offset and delay computation: `offset = ((t2 - t1) - (t4 - t3)) / 2`, `delay = ((t2 - t1) + (t4 - t3)) / 2`.
The idle gating on the TX path is critical: by waiting until the Wlink TX router is idle before transmitting, the short packet enters the link layer with deterministic latency, eliminating arbitration jitter on transmit timestamps.
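The final computation step maps directly to code. The sketch below treats timestamps as signed 64-bit nanosecond counts so that a Subordinate running ahead of the Grandmaster (negative offset) is handled correctly; the formulas are exactly those given in the protocol flow above.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { int64_t offset_ns; int64_t delay_ns; } ptp_result_t;

/* Standard delay request-response arithmetic: the symmetric-link
 * assumption lets the one-way delay and clock offset be separated. */
static ptp_result_t ptp_compute(int64_t t1, int64_t t2,
                                int64_t t3, int64_t t4)
{
    ptp_result_t r;
    int64_t ms = t2 - t1;   /* master-to-subordinate: flight + offset */
    int64_t sm = t4 - t3;   /* subordinate-to-master: flight - offset */
    r.offset_ns = (ms - sm) / 2;
    r.delay_ns  = (ms + sm) / 2;
    return r;
}
```

Like IEEE 1588 itself, this assumes the link delay is symmetric in the two directions; the deterministic idle-gated transmit path is what makes that assumption hold tightly on the die-to-die link.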
Hardware Sync Initiator
The PTP module includes a hardware sync initiator that autonomously generates periodic SYNC messages without CPU intervention. It uses the PHC time outputs to determine when to fire, maintains a target timestamp that advances by a configurable interval (matching IEEE 1588 logSyncInterval ranges from 128 Hz to 1/16 Hz), and auto-increments a 16-bit sequence number. The initiator shares the TX path with software-initiated messages and servo-initiated DELAY_REQ messages, with software having priority. When PHC_LOCK_GATE_EN=1, the initiator is gated on phc_locked_i, preventing SYNC emission until the local clock is stable — essential for multi-hop PTP chains.
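The interval programming follows the usual IEEE 1588 convention, where logSyncInterval gives the SYNC period as a power of two in seconds; the function below (a hypothetical helper, not a TideLink register map) shows the arithmetic for the range quoted above.

```c
#include <assert.h>
#include <stdint.h>

/* SYNC period in nanoseconds for a given logSyncInterval: the period
 * is 2^logSyncInterval seconds, so log = -7 gives 128 Hz and log = 4
 * gives 1/16 Hz, matching the supported range. */
static uint64_t sync_interval_ns(int log_sync_interval)
{
    const uint64_t ns_per_s = 1000000000ull;
    if (log_sync_interval >= 0)
        return ns_per_s << log_sync_interval;
    return ns_per_s >> (-log_sync_interval);
}
```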
Autonomous Hardware Servo
For applications requiring clock synchronisation without any CPU intervention, TideLink includes a fully autonomous hardware PTP servo (tidelink_ptp_servo). The servo operates in one of two modes:
- Grandmaster mode: Captures t1/t4 timestamps after each SYNC/DELAY_REQ exchange and sends them to the Subordinate via FC SIDEBAND packets (4 words per timestamp, written directly to the remote side's mailbox registers).
- Subordinate mode: Captures t2/t3 timestamps, receives t1/t4 from the Grandmaster via the SIDEBAND mailbox, computes offset and delay, autonomously triggers DELAY_REQ messages, and adjusts the local PHC.
Clock discipline uses a two-tier approach:
- Large offsets (exceeding a configurable step threshold, or seconds mismatch): Direct phase step via the PHC SET_TIME registers.
- Small offsets: A PI (proportional-integral) controller adjusts the PHC's `NS_INCR_FRAC` register to steer the clock frequency. The proportional and integral gains (KP, KI) are configurable via APB registers, with defaults of approximately 0.7 and 0.3 respectively in Q0.32 fixed-point representation.
The servo multiplication engine is parameterised: iterative mode (32-cycle, small area) or combinational mode (1-cycle, larger area). The servo exposes status registers including the last computed offset, last one-way delay, current NS_INCR_FRAC value, and a servo_locked indicator that is also brought out as a top-level output.
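The two-tier discipline can be sketched in software as follows. This is an illustrative model of the behaviour described above, not the RTL: the names are hypothetical, the gains use the quoted Q0.32 defaults, and offsets below the step threshold are assumed small enough that the 64-bit multiplies cannot overflow.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define KP_Q32 ((uint64_t)(0.7 * 4294967296.0))  /* ~0.7 in Q0.32 */
#define KI_Q32 ((uint64_t)(0.3 * 4294967296.0))  /* ~0.3 in Q0.32 */

typedef struct {
    int64_t integral;        /* accumulated offset, ns */
    int64_t step_threshold;  /* ns; above this, phase-step instead */
} servo_t;

/* Returns the signed frequency correction for NS_INCR_FRAC, or writes
 * a direct phase step to *phase_step (modelling the SET_TIME path)
 * when the offset is too large for frequency steering. */
static int64_t servo_update(servo_t *s, int64_t offset_ns,
                            int64_t *phase_step)
{
    *phase_step = 0;
    if (llabs(offset_ns) > s->step_threshold) {
        *phase_step = offset_ns;   /* tier 1: direct phase step */
        s->integral = 0;           /* restart the integrator */
        return 0;
    }
    s->integral += offset_ns;
    /* Tier 2, PI law: corr = KP*offset + KI*integral (Q0.32 gains). */
    int64_t p = (offset_ns * (int64_t)KP_Q32) >> 32;
    int64_t i = (s->integral * (int64_t)KI_Q32) >> 32;
    return p + i;
}
```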
Clock Domain Crossing
The PHC may operate on a different clock from the AHB system clock. The CDC bridge (tidelink_phc_cdc) synchronises six signal paths between the two domains:
| Path | Direction | Width | Purpose | Mechanism |
|---|---|---|---|---|
| 1 | PHC → AHB | 110-bit | HW capture timestamps | Quasi-static snapshot |
| 2 | PHC → AHB | 78-bit | Free-running PHC time | Handshake snapshot |
| 3 | PHC → AHB | 1-bit | PPS pulse | Toggle-based pulse sync |
| 4 | AHB → PHC | 1-bit | HW capture trigger | Toggle-based pulse sync |
| 5 | AHB → PHC | 79-bit | Phase step command | Data + pulse handshake |
| 6 | AHB → PHC | 33-bit | Frequency adjust | Data + pulse handshake |
The module uses a configurable synchroniser chain depth (minimum 2 stages) and is safe for fully asynchronous clocks. When both clocks are the same (single-clock mode), the module can be bypassed via a BYPASS_CDC parameter, reducing cost from approximately 526 flip-flops to approximately 20.
Generic Chiplet Controller
TideLink wraps the Wlink die-to-die link layer in a generic chiplet controller (axi_chiplet_controller) that adds several integration features beyond raw link-layer transport:
- Runtime master/slave role selection: A strap pin (`role_strap_i`) determines whether the chiplet acts as master or slave. The role is locked at startup and exposed as `role_is_master_o` and `role_locked_o`. Wlink power-on-reset is gated until the role is locked, ensuring deterministic initialisation. Different APB register sets are exposed depending on the selected role.
- I2C sideband: Independent I2C master and slave cores with pin-muxed tristate I/O provide an out-of-band communication channel for configuration, recovery, and boot-time handshaking before the main link is active. The I2C master is accessible via an AXI slave port (`s_i2c_axi_*`).
- D2D reset output: `d2d_reset_o` allows one chiplet to hold the other in reset.
This abstraction allows a single TideLink RTL design to be instantiated identically on both sides of a chiplet link, with the role strap determining which side acts as master and which as slave.
Verification
TideLink has extensive verification infrastructure spanning cocotb (Python-based), UVM (SystemVerilog), formal (X-propagation), CDC (SpyGlass), and lint (Cadence HAL):
- cocotb: 296 tests across 13 environments covering the FIFO controller, returner, APB registers, FC adapter, address translator, iterative multiplier, AHB wrapper, paired system stress, PTP short-packet exchange, PTP servo operation, and full top-level loopback.
- UVM: 8 environments with 51 test files covering FIFO unit tests, FC adapter TX/RX paths, loopback integration, paired system stress (credit exhaustion, reset recovery, sideband stress, mixed traffic), PTP jitter stress characterisation (under concurrent AXI, mailbox, and general bus traffic), PTP convergence analysis (PI servo model, offset/drift/step-change recovery, long-term stability), and multi-hop PTP chain testing (lock propagation, force enable, step recovery).
- Formal: VC Formal X-propagation analysis on 5 modules (FIFO controller, returner, APB registers, FIFO wrapper, top-level FIFO integration).
- CDC: SpyGlass CDC analysis on `tidelink_top` with constraint and waiver files.
- Lint: Cadence HAL lint with standalone and CMSDK-dependent module targets.
- Code coverage (VCS): Line coverage exceeding 92% on the FIFO subsystem, with condition, branch, toggle, and FSM coverage actively tracked across all environments.
- CI/CD: A 9-stage GitLab CI pipeline runs lint, CDC, cocotb regression, UVM regression, C driver tests, synthesis (Design Compiler), coverage merging, and dashboard generation.
Known Limitations
The design has several documented limitations, including: no hardware credit underflow protection (software can write packets larger than available credits, causing counter wrap); a single-packet-in-flight limitation at the FIFO controller level; no AHB error response on FIFO overrun/underrun (errors are flagged in a status register but the bus transfer completes normally); no returner retry mechanism on bus errors; and RX-side PTP jitter from the Wlink receive pipeline that cannot be gated. These are documented in detail with severity classifications and recommended mitigations.