On Mon, May 30, 2022 at 08:41:20AM +0200, Juergen Gross wrote: > On 25.05.22 20:41, Demi Marie Obenour wrote: > > unmap_grant_pages() currently waits for the pages to no longer be used. > > In https://github.com/QubesOS/qubes-issues/issues/7481, this lead to a > > deadlock against i915: i915 was waiting for gntdev's MMU notifier to > > finish, while gntdev was waiting for i915 to free its pages. I also > > believe this is responsible for various deadlocks I have experienced in > > the past. > > > > Avoid these problems by making unmap_grant_pages async. This requires > > making it return void, as any errors will not be available when the > > function returns. Fortunately, the only use of the return value is a > > WARN_ON(), which can be replaced by a WARN_ON when the error is > > detected. Additionally, a failed call will not prevent further calls > > from being made, but this is harmless. > > > > Because unmap_grant_pages is now async, the grant handle will be sent to > > INVALID_GRANT_HANDLE too late to prevent multiple unmaps of the same > > handle. Instead, a separate bool array is allocated for this purpose. > > This wastes memory, but stuffing this information in padding bytes is > > too fragile. Furthermore, it is necessary to grab a reference to the > > map before making the asynchronous call, and release the reference when > > the call returns. > > I think there is even more syncing needed: > > - In the error path of gntdev_mmap() unmap_grant_pages() is being called and > it is assumed, map is available afterwards again. This should be rather easy > to avoid by adding a counter of active mappings to struct gntdev_grant_map > (number of pages not being unmapped yet). In case this counter is not zero > gntdev_mmap() should bail out early. Is it possible to just unmap the pages directly here? I don’t think there can be any other users of these pages yet. Userspace could race against the unmap and cause a page fault, but that should just cause userspace to get SIGSEGV or SIGBUS. In any case this code can use the sync version; if it gets blocked that’s userspace’s problem. > - gntdev_put_map() is calling unmap_grant_pages() in case the refcount has > dropped to zero. This call can set the refcount to 1 again, so there is > another delay needed before freeing map. I think unmap_grant_pages() should > return in case the count of mapped pages is zero (see above), thus avoiding > to increment the refcount of map if nothing is to be done. This would enable > gntdev_put_map() to just return after the call of unmap_grant_pages() in case > the refcount has been incremented again. I will change this in v3, but I do wonder if gntdev is using the wrong MMU notifier callback. It seems that the appropriate callback is actually release(): if I understand correctly, release() is called precisely when the refcount on the physical page is about to drop to 0, and that is what we want. -- Sincerely, Demi Marie Obenour (she/her/hers) Invisible Things Lab
Attachment:
signature.asc
Description: PGP signature