Re: [PATCH] accel/qaic: Add crashdump to Sahara

Jeffrey Hugo <quic_jhugo@xxxxxxxxxxx> · Thu, 19 Sep 2024 09:00:29 -0600

On 9/18/2024 5:41 PM, Konrad Dybcio wrote:
> On 18.09.2024 5:52 PM, Jeffrey Hugo wrote:
>> The Sahara protocol has a crashdump functionality. In the hello
>> exchange, the device can advertise it has a memory dump available for
>> the host to collect. Instead of the device making requests of the host,
>> the host requests data from the device which can be later analyzed.
>>
>> Implement this functionality and utilize the devcoredump framework for
>> handing the dump over to userspace.
>>
>> Similar to how firmware loading in Sahara involves multiple files,
>> crashdump can consist of multiple files for different parts of the dump.
>> Structure these into a single buffer that userspace can parse and
>> extract the original files from.
>>
>> Reviewed-by: Carl Vanderlip <quic_carlv@xxxxxxxxxxx>
>> Signed-off-by: Jeffrey Hugo <quic_jhugo@xxxxxxxxxxx>
>> ---
>
> I gave this a brief read, but.. aren't you dumping however much DRAM the
> AIC100 has (and then some SRAM) onto the host machine without the user
> asking for it (i.e. immediately after the AIC crashes)?

I'm not entirely clear what the concern is.  Too much host RAM usage maybe?

In short, I think the direct answer is yes and no.

We put the dump content in the host RAM and allow the user to decide if 
they want to save it.  The user has 5 minutes to do something with the 
dump, then the devcoredump framework automatically frees the content in 
RAM.  Typically the user would access the sysfs file provided by 
devcoredump, and save the contents to the file system for offline 
processing.
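
As a rough illustration of that userspace step (a sketch only, not part of
the patch): devcoredump exposes each pending dump as
/sys/class/devcoredump/devcd<N>/data, and writing anything to that file
frees the dump without waiting for the timeout.  The function name, its
parameters, and the output file naming below are hypothetical.

```python
import glob
import os
import shutil


def save_devcoredumps(dest_dir, sysfs_root="/sys/class/devcoredump", release=False):
    """Copy every pending devcoredump under sysfs_root into dest_dir.

    Returns the list of files written. If release is True, write to each
    dump's data file afterwards, which asks devcoredump to free it early.
    """
    os.makedirs(dest_dir, exist_ok=True)
    saved = []
    pattern = os.path.join(sysfs_root, "devcd*", "data")
    for data_path in sorted(glob.glob(pattern)):
        # Name the output after the devcd<N> directory (hypothetical scheme).
        instance = os.path.basename(os.path.dirname(data_path))
        out_path = os.path.join(dest_dir, instance + ".bin")
        with open(data_path, "rb") as src, open(out_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        if release:
            # Any write to the data file tells devcoredump to drop the dump.
            with open(data_path, "wb") as ctl:
                ctl.write(b"1")
        saved.append(out_path)
    return saved
```

Calling save_devcoredumps("/var/tmp/dumps") (typically as root) would copy
whatever dumps are pending; the saved file still has to be split into the
individual Sahara dump files by a separate parser.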

A few GPU drivers and several other device drivers do the same.
Panfrost appears to save every BO the user allocated into the dump,
which suggests that a user could create an arbitrarily large dump.

In the case of AIC100, it is technically possible for the entire device
DRAM and SRAM to be offloaded.  It is up to the FW to decide how much of
that is relevant.  The current FW implementation is very selective about
what it includes in the dump, and current dumps are in the 100-200MB
range.

It feels like you are implying the user should somehow request the dump 
by having devcoredump publish something, and then hook into the user's 
reads of the sysfs to go collect the dump.  I worry that means the 
driver would then need to determine when there is no user interested in 
collecting the dump, in order to continue the reboot process.  I expect 
that would be a 5 minute delay (devcoredump deciding there is no user 
interest after 5 minutes).  I worry that users would object to such a 
delay given customer feedback we've had on getting the devices into 
service quickly.

-Jeff