On 17.03.22 at 10:29, Daniel Vetter wrote:
On Thu, Mar 17, 2022 at 08:03:27AM +0100, Christian König wrote:
On 16.03.22 at 16:36, Rob Clark wrote:
[SNIP]
just one point of clarification.. in the msm and i915 case it is
purely for debugging and telemetry (ie. sending crash logs back to
distro for analysis if user has crash reporting enabled).. it isn't
used for triggering any action like killing app or compositor.
By the way, how does msm do its memory management for the devcoredumps?
GFP_NORECLAIM all the way. It's purely best effort.
Ok, good to know that it's as simple as that.
Note that the fancy new plan for i915 discrete gpu is to only support gpu
crash dumps on non-recoverable gpu contexts, i.e. those that do not
continue to the next batch when something bad happens.
This is what vk wants
That's exactly what I've been telling an internal team for a couple of
years now as well. Good to know that this is not totally crazy.
and also what iris now uses (we do context recovery in userspace in
all cases), and non-recoverable contexts greatly simplify the crash dump
gather: Only thing you need to gather is the register state from hw
(before you reset it), all the batchbuffer bo and indirect state bo (in
i915 you can mark which bo to capture in the CS ioctl) can be captured in
a worker later on. Which for non-recoverable context is no issue, since
subsequent batchbuffers won't trample over any of these things.
And that way you can record the crashdump (or at least the big pieces like
all the indirect state stuff) with GFP_KERNEL.
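
A minimal user-space sketch of the two-phase capture described above. All
names here (struct bo, struct crash_dump, capture_before_reset,
capture_in_worker) are hypothetical and not taken from i915 or msm, and
malloc() only stands in for a GFP_KERNEL allocation made outside the reset
path:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define NUM_REGS 4

/* Hypothetical stand-in for a buffer object marked for capture. */
struct bo {
    const uint8_t *data;
    size_t size;
    int refcount;
};

/* The dump itself is allocated ahead of time, so the reset path never
 * has to allocate anything. */
struct crash_dump {
    uint32_t regs[NUM_REGS]; /* filled in before the reset */
    struct bo *bo;           /* reference taken before the reset */
    uint8_t *bo_copy;        /* filled in later by the worker */
    size_t bo_size;
};

/* Phase 1, reset path: copy the register state and pin the bo.
 * No allocation happens here. */
static void capture_before_reset(struct crash_dump *dump,
                                 const uint32_t *hw_regs, struct bo *bo)
{
    memcpy(dump->regs, hw_regs, sizeof(dump->regs));
    bo->refcount++;
    dump->bo = bo;
}

/* Simulated hardware reset: the register state is lost. */
static void reset_hw(uint32_t *hw_regs)
{
    memset(hw_regs, 0, NUM_REGS * sizeof(*hw_regs));
}

/* Phase 2, worker: may allocate and sleep. This is safe only because a
 * non-recoverable context submits nothing further that could overwrite
 * the bo contents. */
static int capture_in_worker(struct crash_dump *dump)
{
    assert(dump->bo != NULL);

    dump->bo_copy = malloc(dump->bo->size);
    if (!dump->bo_copy)
        return -1; /* best effort: caller keeps the register-only dump */

    memcpy(dump->bo_copy, dump->bo->data, dump->bo->size);
    dump->bo_size = dump->bo->size;

    /* Drop the reference taken in phase 1. */
    dump->bo->refcount--;
    dump->bo = NULL;
    return 0;
}
```

The point of the split is that only the memcpy of the register state has
to happen before the reset; everything that needs memory is deferred to
the worker.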
Interesting idea, so basically we initially capture only the state that
the reset would destroy and grab a reference on the killed application,
so we can gather the rest before we clean it up.
Going to keep that in mind as well.
Thanks,
Christian.
msm probably gets it wrong since embedded drivers have far fewer shrinkers
and generally no mmu notifiers going on :-)
I mean it is strictly forbidden to allocate any memory in the GPU reset
path.
I would however *strongly* recommend devcoredump support in other GPU
drivers (i915's thing pre-dates devcoredump by a lot).. I've used it
to debug and fix a couple obscure issues that I was not able to
reproduce by myself.
Yes, completely agree as well.
+1
Cheers, Daniel