Customer intelligence for semiconductor FAEs

Your first-week FAE,
answering like a ten-year veteran.

Every customer question your team has answered before becomes an answer waiting on your desk — sourced to the engineer who first solved it. Press Option-Space on any screen; the rest is a draft you send.

See how it works

github.com/helio-robotics/atlas-fw/issues/482

helio-robotics / atlas-fw

PCIe link drops intermittently after firmware 1.4.2 #482

Open · maya-chen opened this issue 2 days ago · 4 comments · 5 participants

maya-chen commented 2 days ago · edited

We’re seeing intermittent PCIe link drops on the 12-node cluster under sustained load after updating to firmware 1.4.2. Logs below.

Note: Only reproduces under sustained load (>20 min at full utilization).

[ 8423.71] pcieport 0000:40:01.0: AER: Corrected error
[ 8423.71]  device [8086:347a] error/Receiver ID
[ 8429.02] pcieport: Link retrain failed, lane width x8 → x4

Board	atlas-rev-C2
Firmware	1.4.2
Lane width	x16 (negotiated x8)

👍 5 · 👀 3

maya-chen added the bug firmware labels · 2 days ago

darius-ok self-assigned this · yesterday

darius-ok mentioned this in #467 · yesterday

darius-ok commented yesterday · Member

Reproduced on our bench rig — the drops start exactly as the link tries to enter L1.2. Fairly sure it’s ASPM-related but I haven’t pinned the regressing change yet. Have we seen this on rev-C2 before?

Assignees

maya-chen

Labels

bug · regression · pcie · firmware · P1 · needs-triage

Milestone

1.4.x stability

41% complete · 7 open · 5 closed

5 participants

Development

Tighten ASPM L1.2 entry delay #491

Case #PCIE-218 · 94% match

Reading · github.com · ⌥Space

Your team has already solved this link-drop — root cause was ASPM L1 substate timing.

Answering from your team’s cases

[1] Darius Okonkwo · #PCIE-218 · 2024 · GitHub

[2] Priya Nair · field-eng thread · 2023 · Discord

[3] Aria Kerr · 1.4.x stability · 2022 · GitHub

Draft — awaiting your approval

Grounded in: 1.4.3-rc changelog · Case #PCIE-218

Use this reply ⏎

Reads from: Outlook · GitHub · Jira · Discord · Gmail

Inside Outlook

And right where the email lands.

Open a customer thread and the same intelligence is already there — the matched case, the recommended reply, the cross-source history — without leaving the message. The pane moves through Intel, Tickets and Chat on its own; hover to take over.

Outlook

Reply · Reply all · Forward

Re: PCIe link drops on the 12-node cluster

Maya Chen <[email protected]> · Tue 9:14 AM

To: You

Hi —

We’re seeing PCIe links drop on the 12-node cluster under sustained load. We’re prepping for Q3 production and this is now a hard blocker for sign-off.

All nodes are on firmware 1.4.1. It reproduces within ~20 minutes at full utilization. Was there a firmware fix for this, and if so what’s the rollout timing? Happy to share the cluster logs.

Thanks,
Maya

Maya Chen · Field Application Engineer
Helio Robotics

cluster-logs-0624.txt · 84 KB

On Mon, Jun 23, You wrote:

Thanks Maya — can you confirm the exact firmware build and attach the cluster logs? We’ll check the failure signature against the known PCIe retrain issue and get you a rollout date.

On Mon, Jun 23, Maya Chen wrote:

Heads up — PCIe link drops are back on the 12-node cluster under sustained Q3 load testing. Flagging early, before sign-off, in case there’s a known fix.

Synchronize

Helio Robotics

helio-robotics.com

Customer Overview

Helio is evaluating the 12-node cluster for Q3 production. The open thread is a PCIe link drop under sustained load — now a stated blocker. Maya Chen (FAE) has asked twice for a firmware date.

Recommended Next

1. Reply: the retrain bug is fixed in firmware 1.4.2
   Matches GitHub #1190; resolves their stated blocker.
2. Confirm the 1.4.2 rollout window
   Open on Jira FAE-482; Maya has asked twice.
3. Attach the thermal-throttling workaround
   Holds the cluster until 1.4.2 ships.

Similar Engagements

Vortex Compute

Compute infrastructure

Same issue: PCIe link drop

Hardware: PCIe · 12-node

Symptoms: Link drop

Arclight AI

AI accelerators

Same issue: Thermal throttling

Symptoms: Throttling

Software: FW 1.4.1

AI-assisted · verify before sending

The workspace

Every account, every signal, every resolution — one place.

Behind the companion is the full workspace: a home that triages what needs a reply, and a customer view that collapses every source into one story. Insight is the unit, not charts or counts.

Synchronize

Customers

Portfolio cockpit · 24 accounts

At-risk accounts

4 / 24 · ▲ +2 wk

Stale blockers >14d

11 · ▲ +3

Going quiet · 30d

3 · ▬ flat

Account health · book of 24 · 2 at risk · 1 silent >14d · 5 stale blockers

Account	Stage	Health	Open / Stale	Last	Suggested action
Helio Robotics	Integration	At risk	14 / 5	1d	3 blockers >14d on PCIe scale-out — book sync
Orbital Dynamics	Integration	At risk	9 / 4	21d	Silent 21d after perf regression — ESCALATE
Arclight AI	Bring-up	Watch	6 / 2	9d	Activity decaying mid bring-up — check in
Vortex Compute	Pre-production	Healthy	4 / 0	2d	On track — confirm production timeline
Nimbus Photonics	Sampling	Watch	8 / 1	3d	Eval spike on firmware — send 1.4.2 guide

Stop losing what your team already knows.

Built for semiconductor support teams who refuse to re-diagnose the same failure twice.

See how it works