PCI Express (PCIe) is a highly reliable high-speed interconnect. Why? Because every protocol layer contributes error detection, correction, reporting, or recovery. In this part, we focus only on PCIe-specific error handling mechanisms.


1. PCIe Error Types (Based on Severity)

PCIe classifies errors into three major categories:

| Error Type | Impact | System Behavior |
| --- | --- | --- |
| Correctable | Error is fixed automatically | No system impact |
| Non-Fatal | Affects transaction but link stays up | Software recovery needed |
| Fatal | Link/device failure | Requires reset or re-enumeration |

2. Error Classification by Layer

| Layer | Examples of Errors |
| --- | --- |
| Physical | Signal loss, lane failure |
| Data Link | LCRC error, Replay timeout |
| Transaction | Malformed TLP, Unsupported request |

3. PCIe Error Reporting Features

PCIe uses configuration space registers to log and report errors:

  • Device Status / Device Control Registers (in the PCI Express Capability structure)
  • AER (Advanced Error Reporting) Extended Capability
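
To make the baseline reporting concrete, here is a minimal sketch that decodes the four error bits of the Device Status register. The function name and the sample value are hypothetical, and how the raw 16-bit value is read from configuration space is left out.

```python
# Error bits of the Device Status register (PCI Express Capability structure).
DEVICE_STATUS_BITS = {
    0: "Correctable Error Detected",
    1: "Non-Fatal Error Detected",
    2: "Fatal Error Detected",
    3: "Unsupported Request Detected",
}

def decode_device_status(value):
    """Return the names of the error bits set in a 16-bit Device Status value."""
    return [name for bit, name in DEVICE_STATUS_BITS.items() if value & (1 << bit)]

# Hypothetical register value with bits 0 and 3 set.
print(decode_device_status(0x0009))
# ['Correctable Error Detected', 'Unsupported Request Detected']
```

These bits only say which class of error was seen; the AER registers are needed to find out which specific error it was.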

4. Advanced Error Reporting (AER)

AER is an optional extended capability, but it is widely implemented.

It includes:

  • Correctable Error Status
  • Uncorrectable Error Status
  • Header Log (captures the header of the offending TLP)
  • Root Error Status (in Root Ports; aggregates errors reported from downstream)

These registers help software identify exactly what went wrong.
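
As an illustration of how software might turn raw AER status values into readable names, here is a small sketch. The bit positions follow the AER status register layout, the tables list only a subset of the defined bits, and the helper names and sample values are hypothetical.

```python
# Selected bits of the AER Uncorrectable Error Status register.
UNCORRECTABLE_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP Received",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request Error",
}

# Selected bits of the AER Correctable Error Status register.
CORRECTABLE_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
}

def decode_status(value, bit_names):
    """Return the names of all known bits set in a 32-bit status value."""
    return [name for bit, name in sorted(bit_names.items()) if value & (1 << bit)]

def summarize_aer(uncor_status, cor_status):
    """Decode both AER status registers into lists of error names."""
    return {
        "uncorrectable": decode_status(uncor_status, UNCORRECTABLE_BITS),
        "correctable": decode_status(cor_status, CORRECTABLE_BITS),
    }

# Hypothetical register snapshots: bits 14 and 18 uncorrectable, bit 6 correctable.
print(summarize_aer((1 << 18) | (1 << 14), 1 << 6))
# {'uncorrectable': ['Completion Timeout', 'Malformed TLP'], 'correctable': ['Bad TLP']}
```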


5. Transaction Layer Errors

5.1 Malformed TLP

  • Invalid header
  • Wrong length or format

Action: TLP is dropped and the error is logged as uncorrectable (Fatal by default).

5.2 Unsupported Request (UR)

  • Request type or address not supported by the receiver

Action: For a non-posted request, a completion with UR status is sent back; the error is logged.

5.3 Completion Timeout

  • No completion received within the allotted time

Action: Requester logs a Completion Timeout (Non-Fatal by default).
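
The completion timeout idea can be sketched as a table of outstanding requests keyed by tag. This is only a software model under assumptions: the class name, the microsecond timestamps, and the 50 ms timeout value are all hypothetical, and real hardware tracks this internally.

```python
class CompletionTracker:
    """Tracks outstanding non-posted requests by tag (hypothetical helper)."""

    def __init__(self, timeout_us):
        self.timeout_us = timeout_us
        self.deadlines = {}  # tag -> time (in microseconds) at which the request expires

    def request_sent(self, tag, now_us):
        self.deadlines[tag] = now_us + self.timeout_us

    def completion_received(self, tag):
        # False means the completion matches no outstanding request.
        return self.deadlines.pop(tag, None) is not None

    def expired(self, now_us):
        """Remove and return the tags whose completion did not arrive in time."""
        late = sorted(tag for tag, deadline in self.deadlines.items() if now_us >= deadline)
        for tag in late:
            del self.deadlines[tag]
        return late

tracker = CompletionTracker(timeout_us=50_000)   # assumed timeout value
tracker.request_sent(tag=1, now_us=0)
tracker.request_sent(tag=2, now_us=10_000)
tracker.completion_received(1)                   # tag 1 completes normally
print(tracker.expired(now_us=60_000))            # [2]
```

Each tag returned by expired() corresponds to one Completion Timeout error that the requester would log.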

6. Data Link Layer Errors

6.1 LCRC Error

Each TLP carries a 32-bit Link CRC (LCRC), added by the Data Link Layer, to detect corruption on the link.

Process:

  1. Transmitter sends TLP with LCRC.
  2. Receiver calculates LCRC.
  3. If mismatch → send NAK.
  4. Transmitter replays TLP from Replay Buffer.

This guarantees reliable delivery.
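
The check in steps 1 to 3 can be sketched as follows. Note the assumptions: zlib.crc32 is only a stand-in for the real LCRC (which hardware computes with its own bit-ordering rules), and the frame layout and function names are hypothetical.

```python
import zlib

def make_frame(seq, payload):
    """Transmitter side: append a 32-bit CRC over sequence number and payload."""
    body = seq.to_bytes(2, "big") + payload
    return body + zlib.crc32(body).to_bytes(4, "big")

def receive_frame(frame):
    """Receiver side: recompute the CRC and answer ACK or NAK."""
    body, received_crc = frame[:-4], int.from_bytes(frame[-4:], "big")
    return "ACK" if zlib.crc32(body) == received_crc else "NAK"

frame = make_frame(7, b"\x01\x02\x03\x04")
# Flip one bit in the payload to model corruption on the link.
corrupted = frame[:2] + bytes([frame[2] ^ 0x01]) + frame[3:]

print(receive_frame(frame))      # ACK
print(receive_frame(corrupted))  # NAK
```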

6.2 Replay Timer Timeout

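If neither ACK nor NAK arrives before the replay timer expires → the transmitter replays the unacknowledged TLPs and logs a correctable error.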


7. ACK / NAK Flow (ASCII)

Sender ---- TLP ----> Receiver
             |
             | LCRC Check OK
             V
          ACK Sent

Sender ---- TLP ----> Receiver
             |
             | LCRC Check FAIL
             V
          NAK Sent
Sender replays TLP

8. Replay Buffer

Sender stores recent TLPs in a Replay Buffer until ACK is received.

If NAK or timeout occurs → resend from buffer.

Prevents data loss.
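
A minimal sketch of the replay buffer behavior, assuming a NAK carries the sequence number of the last TLP that was received correctly. The class and method names are hypothetical, and sequence number wraparound is ignored to keep the model short.

```python
from collections import OrderedDict

class ReplayBuffer:
    """Holds transmitted TLPs until they are acknowledged (simplified model)."""

    def __init__(self):
        self.pending = OrderedDict()  # sequence number -> TLP bytes, oldest first

    def send(self, seq, tlp):
        self.pending[seq] = tlp       # keep a copy until it is acknowledged
        return tlp

    def on_ack(self, acked_seq):
        # An ACK covers that TLP and every older one still in the buffer.
        for seq in list(self.pending):
            if seq <= acked_seq:
                del self.pending[seq]

    def on_nak(self, last_good_seq):
        # Everything up to last_good_seq arrived; replay the rest in order.
        self.on_ack(last_good_seq)
        return list(self.pending.values())

    def on_replay_timeout(self):
        # No ACK or NAK arrived in time: replay everything still unacknowledged.
        return list(self.pending.values())

buf = ReplayBuffer()
for seq in range(3):
    buf.send(seq, bytes([seq]))
buf.on_ack(0)                 # TLP 0 is released
print(buf.on_nak(1))          # [b'\x02'] -> only TLP 2 is replayed
```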


9. ECRC (End-to-End CRC)

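Optional Transaction Layer CRC, carried in the TLP Digest at the end of the TLP.

  • Protects TLP from end to end (not just hop by hop).
  • Catches corruption inside switches, which LCRC cannot, because LCRC is checked and regenerated on every link.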
  • Used for critical traffic (e.g., I/O virtualization, SR-IOV).
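
To show why end-to-end coverage matters, here is a toy model in which a switch corrupts a TLP internally and then regenerates a valid LCRC. It is a sketch under assumptions: zlib.crc32 stands in for both CRCs, and the frame layout and function names are hypothetical.

```python
import zlib

def crc(data):
    return zlib.crc32(data).to_bytes(4, "big")

def source_send(payload):
    tlp = payload + crc(payload)   # ECRC is added once, by the source
    return tlp + crc(tlp)          # LCRC protects the first link only

def switch_forward(frame, corrupt=False):
    tlp, lcrc = frame[:-4], frame[-4:]
    if crc(tlp) != lcrc:
        raise ValueError("LCRC error on ingress link")
    if corrupt:
        tlp = bytes([tlp[0] ^ 0xFF]) + tlp[1:]   # fault inside the switch
    return tlp + crc(tlp)          # LCRC is regenerated for the next link

def destination_receive(frame):
    """Return (lcrc_ok, ecrc_ok) as seen by the final receiver."""
    tlp, lcrc = frame[:-4], frame[-4:]
    payload, ecrc = tlp[:-4], tlp[-4:]
    return crc(tlp) == lcrc, crc(payload) == ecrc

print(destination_receive(switch_forward(source_send(b"data"))))
# (True, True)
print(destination_receive(switch_forward(source_send(b"data"), corrupt=True)))
# (True, False)
```

In the second case the LCRC on the last link is perfectly valid, so only the ECRC reveals the corruption.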

10. Fatal vs Non-Fatal Errors (Examples)

| Error | Default Severity |
| --- | --- |
| LCRC error (Bad TLP) | Correctable |
| Replay timeout | Correctable |
| Malformed TLP | Fatal |
| Unsupported Request | Non-Fatal |
| Poisoned TLP | Non-Fatal |
| Surprise Down | Fatal |
| Link electrical failure | Fatal |
| Uncorrectable internal error | Fatal |

11. Poisoned TLP

A TLP marked as “poisoned” (EP bit set in the header) tells the receiver that its data payload is known to be corrupt.

  • Receiver detects poison bit.
  • Data not used.
  • Error logged, no retry.

Used to prevent silent data corruption.
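
The receiver-side handling can be sketched by decoding the first header DWORD, where the EP (poisoned) and TD (digest present) bits live. The function names and the error log list are hypothetical; only a few header fields are decoded.

```python
def parse_dw0(header):
    """Decode selected fields from the first 4 bytes of a TLP header."""
    b0, _b1, b2, b3 = header[:4]
    length = ((b2 & 0x03) << 8) | b3
    return {
        "fmt": b0 >> 5,
        "type": b0 & 0x1F,
        "td": bool(b2 & 0x80),        # TLP Digest (ECRC) present
        "ep": bool(b2 & 0x40),        # payload is poisoned
        "length_dw": length or 1024,  # an encoded length of 0 means 1024 DW
    }

def handle_received_tlp(header, payload, error_log):
    """Return the payload if it is usable, or None if the TLP is poisoned."""
    if parse_dw0(header)["ep"]:
        error_log.append("Poisoned TLP Received")
        return None   # data is discarded; no replay is requested
    return payload

log = []
poisoned_header = bytes([0x40, 0x00, 0x40, 0x01])   # memory write, EP set, 1 DW
print(handle_received_tlp(poisoned_header, b"\xde\xad\xbe\xef", log))  # None
print(log)                                                             # ['Poisoned TLP Received']
```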


12. Surprise Down Error

Occurs when the link goes down unexpectedly, for example when a device is unplugged without proper shutdown.

Action: Fatal error → link reset or disable.


13. Hot-Plug Errors

Hot-plug allows adding/removing devices at runtime.
Errors may occur during:

  • Power-up
  • Link training
  • Configuration

Handled by the root complex and software.


14. Link Recovery Mechanism

If the link becomes unstable, the LTSSM (Link Training and Status State Machine) enters the Recovery state.

Steps:

  1. Detect error
  2. Enter Recovery
  3. Retrain link (equalization, speed, width)
  4. Re-enter L0 (normal operation)

If recovery fails → the LTSSM falls back to Detect and the link goes down.
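
The recovery steps can be modeled as a tiny state machine. This sketch is heavily simplified: it keeps only three LTSSM states, the event names are hypothetical, and it assumes a failed retrain drops the link back to Detect.

```python
def run_link(events):
    """Walk a simplified LTSSM (L0, Recovery, Detect) and return the state trace."""
    state = "L0"
    trace = [state]
    for event in events:
        if state == "L0" and event == "error":
            state = "Recovery"        # steps 1 and 2: detect error, enter Recovery
        elif state == "Recovery" and event == "retrain_ok":
            state = "L0"              # steps 3 and 4: retrain, return to L0
        elif state == "Recovery" and event == "retrain_fail":
            state = "Detect"          # link is down; training starts over
        else:
            continue                  # event is not meaningful in this state
        trace.append(state)
    return trace

print(run_link(["error", "retrain_ok"]))    # ['L0', 'Recovery', 'L0']
print(run_link(["error", "retrain_fail"]))  # ['L0', 'Recovery', 'Detect']
```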


15. Error Logging and Propagation

Errors are captured in:

  • Device Registers
  • AER Capability
  • Root Port Error Status

Software (OS/driver) reads the logs and takes corrective action.


16. System Response to Errors

| Error Type | PCIe Action | OS/Driver Action |
| --- | --- | --- |
| Correctable | Retry or ignore | None |
| Non-Fatal | Continue link | Log, software recovery |
| Fatal | Disable link | Reset device or bus |

17. Key Mechanisms in PCIe Reliability

| Mechanism | Purpose |
| --- | --- |
| LCRC | Detect TLP corruption on each link |
| ECRC | Detect TLP corruption end to end |
| ACK/NAK | Request retransmission |
| Replay Buffer | Store TLPs for retry |
| FEC | Correct physical-layer bit errors (PCIe 6.0 and later) |
| LTSSM Recovery | Fix link instability |
| AER | Report detailed errors |
| Poisoned TLP | Avoid using corrupted data |
| Timeouts | Detect lost responses |

18. Example Error Flow (ASCII)

1. Sender -> TLP -> Receiver
2. Receiver detects LCRC error
3. Receiver sends NAK
4. Sender replays TLP
5. Receiver ACKs
6. Transaction completes

19. Example: Fatal Error Flow

1. Link loses signal
2. Physical layer cannot recover
3. LTSSM falls back to Detect (link down)
4. Device removed from bus
5. OS initiates re-enumeration or reset
