Your first-week FAE,
answering like a ten-year veteran.
Every customer question your team has answered before becomes an answer waiting on your desk — sourced to the engineer who first solved it. Press Option-Space on any screen; the rest is a draft you send.
We’re seeing intermittent PCIe link drops on the 12-node cluster under sustained load after updating to firmware 1.4.2. Logs below.
[ 8423.71] pcieport 0000:40:01.0: AER: Corrected error
[ 8423.71] device [8086:347a] error/Receiver ID
[ 8429.02] pcieport: Link retrain failed, lane width x8 → x4
| Field | Value |
|---|---|
| Board | atlas-rev-C2 |
| Firmware | 1.4.2 |
| Lane width | x16 (negotiated x8) |
Reproduced on our bench rig — the drops start exactly as the link tries to enter L1.2. Fairly sure it’s ASPM-related but I haven’t pinned the regressing change yet. Have we seen this on rev-C2 before?
Your team has already solved this link-drop — root cause was ASPM L1 substate timing.
And right where the email lands.
Open a customer thread and the same intelligence is already there — the matched case, the recommended reply, the cross-source history — without leaving the message. The pane moves through Intel, Tickets and Chat on its own; hover to take over.
Re: PCIe link drops on the 12-node cluster
Hi —
We’re seeing PCIe links drop on the 12-node cluster under sustained load. We’re prepping for Q3 production and this is now a hard blocker for sign-off.
All nodes are on firmware 1.4.1. It reproduces within ~20 minutes at full utilization. Was there a firmware fix for this, and if so what’s the rollout timing? Happy to share the cluster logs.
Thanks,
Maya
Helio Robotics
Thanks Maya — can you confirm the exact firmware build and attach the cluster logs? We’ll check the failure signature against the known PCIe retrain issue and get you a rollout date.
Heads up — PCIe link drops are back on the 12-node cluster under sustained Q3 load testing. Flagging early, before sign-off, in case there’s a known fix.
Helio is evaluating the 12-node cluster for Q3 production. The open thread is a PCIe link drop under sustained load — now a stated blocker. Maya Chen (FAE) has asked twice for a firmware date.
1. Reply: the retrain bug is fixed in firmware 1.4.2. Matches GitHub #1190; resolves their stated blocker.
2. Confirm the 1.4.2 rollout window. Open on Jira FAE-482; Maya has asked twice.
3. Attach the thermal-throttling workaround. Holds the cluster until 1.4.2 ships.
Every account, every signal, every resolution — one place.
Behind the companion is the full workspace: a home that triages what needs a reply, and a customer view that collapses every source into one story. Insight is the unit, not charts or counts.
| Account | Stage | Health | Engagement | Open / Stale | Last | Suggested action |
|---|---|---|---|---|---|---|
| Helio Robotics | Integration | At risk | | 14 / 5 | 1d | 3 blockers >14d on PCIe scale-out: book sync |
| Orbital Dynamics | Integration | At risk | | 9 / 4 | 21d | Silent 21d after perf regression: ESCALATE |
| Arclight AI | Bring-up | Watch | | 6 / 2 | 9d | Activity decaying mid bring-up: check in |
| Vortex Compute | Pre-production | Healthy | | 4 / 0 | 2d | On track: confirm production timeline |
| Nimbus Photonics | Sampling | Watch | | 8 / 1 | 3d | Eval spike on firmware: send 1.4.2 guide |
Stop losing what your team already knows.
Built for semiconductor support teams who refuse to re-diagnose the same failure twice.