PCI Express (PCIe) is a highly reliable high-speed protocol. Why? Because every layer of the protocol includes mechanisms for error detection, correction, reporting, and recovery. In this part, we focus only on PCIe-specific error handling mechanisms.
1. PCIe Error Types (Based on Severity)
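PCIe classifies errors as correctable or uncorrectable, and splits uncorrectable errors into non-fatal and fatal. That gives three severity categories: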
| Error Type | Impact | System Behavior |
|---|---|---|
| Correctable | Hardware fixes the error automatically | No data loss; possible performance impact |
| Non-Fatal | Affects a transaction, but the link stays up | Software recovery needed |
| Fatal | Link or device is no longer reliable | Requires reset or re-enumeration |
2. Error Classification by Layer
| Layer | Examples of Errors |
|---|---|
| Physical | Signal loss, lane failure |
| Data Link | LCRC error, Replay timeout |
| Transaction | Malformed TLP, Unsupported request |
3. PCIe Error Reporting Features
PCIe uses config space registers to log and report errors:
- Device Status / Control Registers
- AER (Advanced Error Reporting) Capability
4. Advanced Error Reporting (AER)
AER is an optional but widely used feature.
It includes:
- Correctable Error Status
- Uncorrectable Error Status
- Header Log (captures the header of the offending TLP)
- Root Error Status (in Root Ports; aggregates errors reported from downstream)
These registers help software identify exactly what went wrong.
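To make the status registers concrete, here is a minimal sketch of how software might turn a raw AER status value into error names. The bit positions follow the AER register layout as I understand it from the PCIe Base Specification; the function name `decode_status` and the sample values are illustrative assumptions, not part of any real driver API.

```python
# Bit positions in the AER Uncorrectable Error Status register
# (assumed from the PCIe Base Specification; verify against your revision).
UNCORRECTABLE_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
    21: "ACS Violation",
    22: "Uncorrectable Internal Error",
}

# Bit positions in the AER Correctable Error Status register.
CORRECTABLE_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
}


def decode_status(value, bit_names):
    """Return the names of all error bits set in a status register value."""
    return [name for bit, name in sorted(bit_names.items()) if value & (1 << bit)]


# Hypothetical register readings, chosen only for illustration.
print(decode_status(0x00041000, UNCORRECTABLE_BITS))  # ['Poisoned TLP', 'Malformed TLP']
print(decode_status(0x00001040, CORRECTABLE_BITS))    # ['Bad TLP', 'Replay Timer Timeout']
```

On a real system the raw values would come from the device's configuration space rather than from literals.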
5. Transaction Layer Errors
5.1 Malformed TLP
- Invalid header
- Wrong length or format
Action: TLP is dropped, error logged.
5.2 Unsupported Request (UR)
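- Request targets an address or function the completer does not support
Action: For a non-posted request, a completion with UR status is sent back. A posted request has no completion, so the error is only logged.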
5.3 Completion Timeout
- No response received in allotted time
Action: The requester reports a Completion Timeout error (Non-Fatal by default).
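A completion timeout can be pictured as a table of outstanding requests keyed by tag. The sketch below is a simplified model, not how any particular requester is built: the class name `CompletionTracker`, the millisecond timestamps, and the 50 ms timeout are all assumptions chosen for illustration.

```python
class CompletionTracker:
    """Tracks outstanding non-posted requests by tag (illustrative model)."""

    def __init__(self, timeout_ms):
        self.timeout_ms = timeout_ms
        self.outstanding = {}  # tag -> time the request was issued, in ms

    def issue(self, tag, now_ms):
        """Record a request that expects a completion."""
        self.outstanding[tag] = now_ms

    def complete(self, tag):
        """Return True if the completion matches an outstanding request.

        False models an unexpected completion (no matching request)."""
        return self.outstanding.pop(tag, None) is not None

    def expired(self, now_ms):
        """Remove and return tags whose completion did not arrive in time."""
        late = [tag for tag, issued in self.outstanding.items()
                if now_ms - issued >= self.timeout_ms]
        for tag in late:
            del self.outstanding[tag]
        return late


tracker = CompletionTracker(timeout_ms=50)
tracker.issue(tag=1, now_ms=0)
tracker.issue(tag=2, now_ms=40)
tracker.complete(1)                 # completion for tag 1 arrives in time
print(tracker.expired(now_ms=100))  # [2]
```

Each tag returned by `expired` would be reported as a Completion Timeout error.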
6. Data Link Layer Errors
6.1 LCRC Error
Each TLP has a Link CRC (LCRC) to detect corruption.
Process:
- Transmitter sends TLP with LCRC.
- Receiver recalculates the LCRC and compares it with the received value.
- If mismatch → send NAK.
- Transmitter replays TLP from Replay Buffer.
This gives reliable delivery across each individual link.
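The receiver-side check above can be sketched in a few lines. This is a simplified model: `zlib.crc32` stands in for the real LCRC (which has its own bit-ordering rules), the frame layout is reduced to a 2-byte sequence number plus payload, and the function names `add_lcrc` and `receive` are hypothetical.

```python
import zlib


def add_lcrc(seq, payload):
    """Transmitter side: prepend a sequence number and append a 32-bit CRC.

    zlib.crc32 is a stand-in for the real LCRC computation."""
    body = seq.to_bytes(2, "big") + payload
    return body + zlib.crc32(body).to_bytes(4, "big")


def receive(frame):
    """Receiver side: recompute the CRC and answer with ACK or NAK."""
    body, received_crc = frame[:-4], int.from_bytes(frame[-4:], "big")
    if zlib.crc32(body) != received_crc:
        return ("NAK", None)  # corrupted: ask the transmitter to replay
    seq = int.from_bytes(body[:2], "big")
    return ("ACK", seq)       # intact: acknowledge this sequence number


frame = add_lcrc(7, b"memory write")
print(receive(frame))             # ('ACK', 7)

corrupted = bytearray(frame)
corrupted[3] ^= 0x01              # flip one bit "on the wire"
print(receive(bytes(corrupted)))  # ('NAK', None)
```

In the real protocol the NAK carries the sequence number of the last good TLP; the sketch leaves that out to keep the check itself visible.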
6.2 Replay Timer Timeout
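If the transmitter receives neither ACK nor NAK before its replay timer expires → it replays all unacknowledged TLPs and logs a correctable Replay Timer Timeout error.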
7. ACK / NAK Flow (ASCII)
Case 1: TLP arrives intact
Sender ---- TLP ----> Receiver
                         |
                         | LCRC Check OK
                         V
                      ACK Sent

Case 2: TLP is corrupted
Sender ---- TLP ----> Receiver
                         |
                         | LCRC Check FAIL
                         V
                      NAK Sent
Sender replays TLP
8. Replay Buffer
Sender stores recent TLPs in a Replay Buffer until ACK is received.
If NAK or timeout occurs → resend from buffer.
Prevents data loss.
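The buffer behavior can be modeled as an ordered map from sequence number to TLP. The sketch below is an assumption-laden simplification: the class `ReplayBuffer` and its method names are hypothetical, the buffer has no size limit, and sequence-number wraparound is handled only by insertion order.

```python
from collections import OrderedDict

SEQ_MODULO = 4096  # TLP sequence numbers are 12 bits wide


class ReplayBuffer:
    """Holds transmitted TLPs until the link partner acknowledges them."""

    def __init__(self):
        self.pending = OrderedDict()  # seq -> TLP, oldest first
        self.next_seq = 0

    def send(self, tlp):
        """Store a copy of the TLP and return its sequence number."""
        seq = self.next_seq
        self.pending[seq] = tlp
        self.next_seq = (seq + 1) % SEQ_MODULO
        return seq

    def ack(self, seq):
        """An ACK is cumulative: drop every TLP up to and including seq."""
        if seq not in self.pending:
            return
        while self.pending:
            oldest, _ = self.pending.popitem(last=False)
            if oldest == seq:
                break

    def nak(self, last_good_seq):
        """A NAK names the last good TLP; everything after it is replayed."""
        self.ack(last_good_seq)
        return list(self.pending.items())

    def on_replay_timeout(self):
        """No ACK or NAK arrived in time: replay everything still pending."""
        return list(self.pending.items())


buf = ReplayBuffer()
for tlp in ("a", "b", "c"):
    buf.send(tlp)
buf.ack(0)         # TLP 0 is acknowledged and dropped
print(buf.nak(1))  # [(2, 'c')]
```

Keeping the entries in transmission order means a replay always resends TLPs in their original sequence.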
9. ECRC (End-to-End CRC)
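Optional Transaction Layer CRC, carried in the TLP Digest field.
- Protects TLP from end to end (not just hop by hop).
- Complements LCRC rather than replacing it: LCRC is regenerated on every link, so it cannot catch corruption that happens inside a switch, while ECRC can.
- Typically enabled where data integrity matters most (e.g., I/O virtualization, SR-IOV).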
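The difference between hop-by-hop and end-to-end protection shows up when a switch corrupts data internally. The following sketch is a toy model under stated assumptions: `zlib.crc32` stands in for both CRCs, a TLP is reduced to a payload in a dictionary, and `faulty_switch` is a hypothetical device that damages the payload after its ingress check.

```python
import zlib


def originate(payload):
    """The original sender computes the ECRC once, over the payload."""
    return {"payload": payload, "ecrc": zlib.crc32(payload)}


def transmit_on_link(tlp):
    """Each link transmitter attaches a fresh LCRC for that hop only."""
    return dict(tlp, lcrc=zlib.crc32(tlp["payload"]))


def faulty_switch(tlp):
    """Hypothetical switch that corrupts the payload in its internal buffer.

    The ECRC field is forwarded unchanged; the ingress LCRC is discarded."""
    damaged = bytes([tlp["payload"][0] ^ 0xFF]) + tlp["payload"][1:]
    return {"payload": damaged, "ecrc": tlp["ecrc"]}


def check(tlp):
    """Receiver-side verification of both CRCs."""
    crc = zlib.crc32(tlp["payload"])
    return {"lcrc_ok": crc == tlp["lcrc"], "ecrc_ok": crc == tlp["ecrc"]}


hop1 = transmit_on_link(originate(b"\x01\x02\x03\x04"))
print(check(hop1))  # {'lcrc_ok': True, 'ecrc_ok': True}

hop2 = transmit_on_link(faulty_switch(hop1))
print(check(hop2))  # {'lcrc_ok': True, 'ecrc_ok': False}
```

On the second hop the LCRC still passes because the switch computed it over the already damaged data; only the ECRC exposes the problem.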
10. Fatal vs Non-Fatal Errors (Examples)
| Error | Default Type |
|---|---|
| Bad TLP / Bad DLLP (CRC error) | Correctable |
| Replay timeout | Correctable |
| Malformed TLP | Fatal |
| Unsupported Request | Non-Fatal |
| Poisoned TLP | Non-Fatal |
| Surprise Down | Fatal |
| Link electrical failure | Fatal |
| Uncorrectable internal error | Fatal |
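With AER, whether an uncorrectable error counts as Fatal or Non-Fatal is programmable per error through the Uncorrectable Error Severity register, so a table like this shows defaults rather than fixed rules. The sketch below splits a status value using a severity mask. The default mask `0x00462030` is the value I believe the specification defines (Data Link Protocol Error, Surprise Down, Flow Control Protocol Error, Receiver Overflow, Malformed TLP, and Uncorrectable Internal Error marked fatal); treat it as an assumption and check the actual register on your device. The function name is hypothetical.

```python
# Selected bits of the AER Uncorrectable Error Status register.
SURPRISE_DOWN = 1 << 5
POISONED_TLP = 1 << 12
COMPLETION_TIMEOUT = 1 << 14
MALFORMED_TLP = 1 << 18
UNSUPPORTED_REQUEST = 1 << 20

# Assumed default of the Uncorrectable Error Severity register:
# a set bit means the corresponding error is reported as Fatal.
DEFAULT_SEVERITY = 0x00462030


def split_by_severity(status, severity=DEFAULT_SEVERITY):
    """Split set status bits into (fatal_bits, non_fatal_bits)."""
    fatal = status & severity
    non_fatal = status & ~severity
    return fatal, non_fatal


fatal, non_fatal = split_by_severity(MALFORMED_TLP | POISONED_TLP)
print(hex(fatal), hex(non_fatal))  # 0x40000 0x1000

# Software may raise the severity of an error, e.g. treat poison as fatal.
strict = DEFAULT_SEVERITY | POISONED_TLP
print(split_by_severity(POISONED_TLP, strict))  # (4096, 0)
```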
11. Poisoned TLP
A TLP marked as “poisoned” (EP bit set in its header) tells the receiver that the source already knows the data payload is corrupt.
- Receiver detects poison bit.
- Data not used.
- Error logged, no retry.
Used to prevent silent data corruption.
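As a rough illustration of the receiver's side, the sketch below extracts the poison (EP) bit from the first header doubleword and refuses to hand poisoned data to the consumer. The field positions reflect my understanding of the TLP header layout (EP at bit 14, TD at bit 15); the function names and the sample header values are made up for the example.

```python
def parse_dw0(dw0):
    """Decode selected fields from the first doubleword of a TLP header."""
    return {
        "fmt": (dw0 >> 29) & 0x7,       # header format
        "type": (dw0 >> 24) & 0x1F,     # transaction type
        "td": bool(dw0 & (1 << 15)),    # TLP Digest (ECRC) present
        "ep": bool(dw0 & (1 << 14)),    # Error Poisoned
        "length_field": dw0 & 0x3FF,    # payload length field, in doublewords
    }


def consume(dw0, payload):
    """Return the payload only if the TLP is not poisoned.

    A poisoned TLP is discarded (and would be logged); it is not retried."""
    if parse_dw0(dw0)["ep"]:
        return None
    return payload


clean_write = 0x40000001     # memory write, 1 DW of data, EP clear
poisoned_write = 0x40004001  # same header with the EP bit set

print(consume(clean_write, b"\xde\xad\xbe\xef"))     # b'\xde\xad\xbe\xef'
print(consume(poisoned_write, b"\xde\xad\xbe\xef"))  # None
```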
12. Surprise Down Error
Occurs when the link goes down unexpectedly, for example when a device is unplugged without a proper shutdown.
Action: Fatal error → link reset or disable.
13. Hot-Plug Errors
Hot-plug allows adding/removing devices at runtime.
Errors may occur during:
- Power-up
- Link training
- Configuration
Handled by root complex and software.
14. Link Recovery Mechanism
If the link becomes unstable, the Link Training and Status State Machine (LTSSM) enters the Recovery state.
Steps:
- Detect error
- Enter Recovery
- Retrain link (equalization, speed, width)
- Re-enter L0 (normal operation)
If recovery fails → the LTSSM falls back to Detect and the link goes down until it can be trained again from scratch.
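The recovery steps can be traced with a very small state model. This is a heavy simplification of the real LTSSM, which has many more states and substates. The `retrain` callback is a hypothetical stand-in for the actual retraining logic, and the failure path to Detect reflects my reading of what happens when Recovery times out, so treat it as an assumption.

```python
from enum import Enum


class LinkState(Enum):
    L0 = "L0"              # normal operation
    RECOVERY = "Recovery"  # retraining in progress
    DETECT = "Detect"      # link down, looking for a partner


def recover_link(retrain):
    """Return the sequence of states visited after a link error.

    retrain is a callable that returns True if retraining succeeded."""
    trace = [LinkState.L0, LinkState.RECOVERY]
    trace.append(LinkState.L0 if retrain() else LinkState.DETECT)
    return trace


print([s.value for s in recover_link(lambda: True)])   # ['L0', 'Recovery', 'L0']
print([s.value for s in recover_link(lambda: False)])  # ['L0', 'Recovery', 'Detect']
```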
15. Error Logging and Propagation
Errors are captured in:
- Device Registers
- AER Capability
- Root Port Error Status
Software (OS/driver) reads the logs and takes corrective action.
16. System Response to Errors
| Error Type | PCIe Action | OS/Driver Action |
|---|---|---|
| Correctable | Hardware corrects it (e.g., replay) | None required; may log or count |
| Non-Fatal | Link keeps running | Log, software recovery |
| Fatal | Link or device becomes unreliable | Reset device or bus |
17. Key Mechanisms in PCIe Reliability
| Mechanism | Purpose |
|---|---|
| LCRC | Detect TLP corruption on each link |
| ECRC | Detect TLP corruption end to end |
| ACK/NAK | Acknowledge TLPs or request retransmission |
| Replay Buffer | Store TLPs for retry |
| FEC | Correct physical-layer bit errors (PCIe 6.0 and later) |
| LTSSM Recovery | Fix link instability |
| AER | Report detailed errors |
| Poisoned TLP | Avoid using corrupted data |
| Timeouts | Detect lost responses |
18. Example Error Flow (ASCII)
1. Sender -> TLP -> Receiver
2. Receiver detects LCRC error
3. Receiver sends NAK
4. Sender replays TLP
5. Receiver ACKs
6. Transaction completes
19. Example: Fatal Error Flow
1. Link loses signal
2. Physical layer cannot recover
3. LTSSM falls back to Detect and the link stays down
4. Device removed from bus
5. OS initiates re-enumeration or reset

