On Thu, Mar 17, 2022 at 08:34:21AM -0700, Rob Clark wrote:
> On Thu, Mar 17, 2022 at 2:29 AM Daniel Vetter <daniel@xxxxxxxx> wrote:
> >
> > On Thu, Mar 17, 2022 at 08:03:27AM +0100, Christian König wrote:
> > > On 16.03.22 at 16:36, Rob Clark wrote:
> > > > [SNIP]
> > > > just one point of clarification.. in the msm and i915 case it is
> > > > purely for debugging and telemetry (ie. sending crash logs back to
> > > > distro for analysis if user has crash reporting enabled).. it isn't
> > > > used for triggering any action like killing app or compositor.
> > >
> > > By the way, how does msm do its memory management for the devcoredumps?
> >
> > GFP_NORECLAIM all the way. It's purely best effort.
>
> We do one GEM obj allocation in the snapshot path (the hw has a
> mechanism to snapshot its own state into a gpu buffer.. not sure if
> nice debugging functionality like that is a commentary on the blob
> driver quality, but I'm not complaining)
>
> I suppose we could pre-allocate this buffer up-front.. but it doesn't
> seem like a problem, ie. if allocation fails we just skip snapshotting
> stuff that needs the hw crashdumper. I guess since vram is not
> involved, perhaps that makes the situation a bit more straightforward.

The problem is that you need to allocate with GFP_ATOMIC instead of
GFP_KERNEL, or things go very bad. The scheduler dma-fence annotations
I've had (well, I still have them here) would catch this stuff, but thus
far they've gone nowhere.

> > Note that the fancy new plan for i915 discrete gpu is to only support gpu
> > crash dumps on non-recoverable gpu contexts, i.e. those that do not
> > continue to the next batch when something bad happens. This is what vk
> > wants and also what iris now uses (we do context recovery in userspace in
> > all cases), and non-recoverable contexts greatly simplify the crash dump
> > gathering: The only thing you need to gather is the register state from hw
> > (before you reset it); all the batchbuffer bo and indirect state bo (in
> > i915 you can mark which bo to capture in the CS ioctl) can be captured in
> > a worker later on. Which for non-recoverable contexts is no issue, since
> > subsequent batchbuffers won't trample over any of these things.
> >
> > And that way you can record the crashdump (or at least the big pieces like
> > all the indirect state stuff) with GFP_KERNEL.
> >
> > msm probably gets it wrong since embedded drivers have much less shrinker
> > activity and generally no mmu notifiers going on :-)
>
> Note that the bos associated with the batch are still pinned at this
> point; from the bo lifecycle point of view the batch is still active. So
> from the point of view of the shrinker, there should be no interaction.
> We aren't doing anything with mmu notifiers (yet), so I'm not entirely
> sure offhand what the concern is there.
>
> Currently we just use GFP_KERNEL and bail if allocation fails.

Yeah, you have a simple enough shrinker for this not to be a problem.
The issue is that sooner or later things tend to not stay like that, and
we're trying to have common rules for dma_fence so that everyone
follows them.
-Daniel

> BR,
> -R
>
> > > I mean it is strictly forbidden to allocate any memory in the GPU reset
> > > path.
> > >
> > > > I would however *strongly* recommend devcoredump support in other GPU
> > > > drivers (i915's thing pre-dates devcoredump by a lot).. I've used it
> > > > to debug and fix a couple of obscure issues that I was not able to
> > > > reproduce by myself.
> > >
> > > Yes, completely agree as well.
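
Since we all agree on devcoredump, here's roughly what the above boils
down to in code. This is a purely hypothetical sketch (the foo_* names
are all made up, nothing here is lifted from msm or i915): the reset
path sits inside the dma_fence signalling critical section, so it only
does one small GFP_ATOMIC allocation and simply skips the snapshot if
that fails; everything that may sleep gets pushed out to a worker, where
GFP_KERNEL and dev_coredumpv() are fine again.

#include <linux/device.h>
#include <linux/devcoredump.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>
#include <linux/workqueue.h>

struct foo_crash_dump {
	struct device *dev;		/* device owning the dump */
	void *regs;			/* atomic snapshot of register state */
	size_t regs_len;
	struct work_struct work;	/* deferred devcoredump handover */
};

/* Hypothetical hw access helper, stands in for whatever register
 * readout the hardware actually needs.
 */
static void foo_hw_read_registers(void *buf, size_t len);

/*
 * Called from the job timeout/reset path, i.e. inside the dma_fence
 * signalling critical section: GFP_KERNEL is not allowed here, since
 * direct reclaim may block on the very fences we still have to signal.
 */
static void foo_gpu_crash_capture(struct foo_crash_dump *dump)
{
	/* Small atomic allocation only; on failure just skip the snapshot. */
	dump->regs = kmalloc(dump->regs_len, GFP_ATOMIC);
	if (!dump->regs)
		return;

	foo_hw_read_registers(dump->regs, dump->regs_len);

	/* Anything that may sleep is deferred to process context. */
	schedule_work(&dump->work);
}

static void foo_gpu_crash_worker(struct work_struct *work)
{
	struct foo_crash_dump *dump =
		container_of(work, struct foo_crash_dump, work);
	void *buf;

	/*
	 * Process context: blocking allocations are fine again.
	 * dev_coredumpv() takes ownership of the vmalloc'ed buffer and
	 * frees it once userspace has read (or ignored) the dump.
	 */
	buf = vmalloc(dump->regs_len);
	if (!buf) {
		kfree(dump->regs);
		return;
	}
	memcpy(buf, dump->regs, dump->regs_len);
	kfree(dump->regs);

	dev_coredumpv(dump->dev, buf, dump->regs_len, GFP_KERNEL);
}

Pre-allocating the snapshot buffer at probe time, like Rob suggested,
would avoid even the GFP_ATOMIC allocation, at the cost of pinning that
memory for the lifetime of the device.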
> >
> > +1
> >
> > Cheers, Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch