On 19.12.24 08:02, Kasireddy, Vivek wrote:
Hi Christian,
Subject: Re: [PATCH 0/4] cover-letter: Allow MMIO regions to be exported through dmabuf
I will resend the patch series. I was experiencing issues with my email
client, which inadvertently split the series into two separate emails.
Alternatively, I can also copy the code from the list archive and explain why
that doesn't work:
+void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked)
+{
+	struct vfio_pci_dma_buf *priv;
+	struct vfio_pci_dma_buf *tmp;
+
+	lockdep_assert_held_write(&vdev->memory_lock);
This makes sure that the caller is holding vdev->memory_lock.
+
+	list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
+		if (!dma_buf_try_get(priv->dmabuf))
This here only works because vfio_pci_dma_buf_release() also grabs
vdev->memory_lock and so we are protected against concurrent releases.
+			continue;
+		if (priv->revoked != revoked) {
+			dma_resv_lock(priv->dmabuf->resv, NULL);
+			priv->revoked = revoked;
+			dma_buf_move_notify(priv->dmabuf);
+			dma_resv_unlock(priv->dmabuf->resv);
+		}
+		dma_buf_put(priv->dmabuf);
The problem is now that this might drop the last reference, which in turn
calls vfio_pci_dma_buf_release(), which also tries to grab vdev->memory_lock
and so results in a deadlock.
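For illustration, the feared chain would look roughly like the sketch below.
This is only a hypothetical sketch of the release callback, assuming it takes
memory_lock the same way the caller above does; it is not copied from the
posted series, and the field names (vdev, dmabufs_elm) are taken from the hunk
above:

/*
 * Hypothetical sketch, not the code from the series: if the dma-buf
 * release callback also takes memory_lock, a synchronous call chain
 *
 *   vfio_pci_dma_buf_move()          holds memory_lock for write
 *     -> dma_buf_put()               drops the last reference
 *       -> f_op->release()
 *         -> vfio_pci_dma_buf_release()
 *           -> down_write(&vdev->memory_lock)    deadlock
 *
 * would block on the lock the caller already holds.
 */
static void vfio_pci_dma_buf_release(struct dma_buf *dmabuf)
{
	struct vfio_pci_dma_buf *priv = dmabuf->priv;

	/* Assumed locking, mirroring vfio_pci_dma_buf_move() above. */
	down_write(&priv->vdev->memory_lock);
	list_del_init(&priv->dmabufs_elm);
	up_write(&priv->vdev->memory_lock);

	kfree(priv);
}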
AFAICS, vfio_pci_dma_buf_release() would not be called synchronously after the
last reference is dropped by dma_buf_put(). This is because fput(), which is
called by dma_buf_put(), triggers f_op->release() asynchronously; therefore, a
deadlock is unlikely to occur in this scenario, unless I am overlooking something.
My recollection is that the f_op->release handler is only called
asynchronously if fput() was issued in interrupt context.
Here is the code of fput() from the current master:
void fput(struct file *file)
{
	if (file_ref_put(&file->f_ref)) {
		struct task_struct *task = current;

		if (unlikely(!(file->f_mode & (FMODE_BACKING | FMODE_OPENED)))) {
			file_free(file);
			return;
		}
		if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
			init_task_work(&file->f_task_work, ____fput);
			if (!task_work_add(task, &file->f_task_work, TWA_RESUME))
				return;
			/*
			 * After this task has run exit_task_work(),
			 * task_work_add() will fail. Fall through to delayed
			 * fput to avoid leaking *file.
			 */
		}
		if (llist_add(&file->f_llist, &delayed_fput_list))
			schedule_delayed_work(&delayed_fput_work, 1);
	}
}
IIUC, based on the above code, it looks like there are two ways in which the
f_op->release() handler is triggered from fput():
- via delayed_fput() for kernel threads and code in interrupt context
- via task_work_run() just before the task/process returns to user-mode
The first scenario is definitely asynchronous, as the release() handler is
called from a worker thread. But I think the second case (which is the most
common one and the relevant one for our use-case) can also be considered
asynchronous, because by the time the release() handler runs, the system call
or ioctl in which dma_buf_put() was called has already finished executing.
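To make the second path concrete, below is a minimal sketch (my own
illustration, not taken from the thread) of deferring a callback with
task_work until the current task returns to user-mode, which is the same
mechanism fput() uses above; deferred_release_cb() and
schedule_deferred_release() are hypothetical names, while init_task_work(),
task_work_add() and TWA_RESUME are the real kernel APIs:

#include <linux/sched.h>
#include <linux/task_work.h>
#include <linux/printk.h>

/* Runs from task_work_run() on the syscall-exit path, i.e. after the
 * syscall/ioctl body that scheduled it has finished and dropped its locks. */
static void deferred_release_cb(struct callback_head *cb)
{
	pr_info("deferred release executed on return to user-mode\n");
}

static struct callback_head deferred_release_work;

/* Called from syscall/ioctl context, analogous to fput() above. */
static void schedule_deferred_release(void)
{
	init_task_work(&deferred_release_work, deferred_release_cb);
	/* TWA_RESUME: run the callback when current returns to user-mode. */
	if (task_work_add(current, &deferred_release_work, TWA_RESUME))
		pr_warn("current task is exiting; the callback will not run\n");
}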
Here is a trace from my light testing with the udmabuf driver, where you can
see the release() handler being called by syscall_exit_to_user_mode():
[ 158.464203] Call Trace:
[ 158.466681] <TASK>
[ 158.468815] dump_stack_lvl+0x60/0x80
[ 158.472507] dump_stack+0x14/0x16
[ 158.475853] release_udmabuf+0x2f/0x9f
[ 158.479631] dma_buf_release+0x3c/0x90
[ 158.483408] __dentry_kill+0x8f/0x180
[ 158.487098] dput+0xe7/0x1a0
[ 158.490013] __fput+0x131/0x2b0
[ 158.493178] ____fput+0x19/0x20
[ 158.496352] task_work_run+0x61/0x90
[ 158.499959] syscall_exit_to_user_mode+0x1a4/0x1b0
[ 158.504769] do_syscall_64+0x5b/0x110
[ 158.508458] entry_SYSCALL_64_after_hwframe+0x4b/0x53
And, here is the relevant syscall code (from arch/x86/entry/common.c):
__visible noinstr bool do_syscall_64(struct pt_regs *regs, int nr)
{
	add_random_kstack_offset();
	nr = syscall_enter_from_user_mode(regs, nr);

	instrumentation_begin();

	if (!do_syscall_x64(regs, nr) && !do_syscall_x32(regs, nr) && nr != -1) {
		/* Invalid system call, but still a system call. */
		regs->ax = __x64_sys_ni_syscall(regs);
	}

	instrumentation_end();
	syscall_exit_to_user_mode(regs);
I also confirmed that the release() handler is indeed called after dma_buf_put()
(and not by dma_buf_put()) by adding debug prints before and after dma_buf_put()
and one in the release() handler. Furthermore, I also found that calling close()
on the dmabuf fd in user-space is one scenario in which fput() is called
synchronously. Here is the relevant trace:
[ 302.770910] Call Trace:
[ 302.773389] <TASK>
[ 302.775516] dump_stack_lvl+0x60/0x80
[ 302.779209] dump_stack+0x14/0x16
[ 302.782549] release_udmabuf+0x2f/0x9f
[ 302.786329] dma_buf_release+0x3c/0x90
[ 302.790105] __dentry_kill+0x8f/0x180
[ 302.793789] dput+0xe7/0x1a0
[ 302.796703] __fput+0x131/0x2b0
[ 302.799866] __fput_sync+0x53/0x70
[ 302.803299] __x64_sys_close+0x58/0xc0
[ 302.807076] x64_sys_call+0x126a/0x17d0
[ 302.810938] do_syscall_64+0x4f/0x110
[ 302.814622] entry_SYSCALL_64_after_hwframe+0x4b/0x53
As you can see above, there is indeed a synchronous version of fput() defined
just below fput():
/*
 * synchronous analog of fput(); for kernel threads that might be needed
 * in some umount() (and thus can't use flush_delayed_fput() without
 * risking deadlocks), need to wait for completion of __fput() and know
 * for this specific struct file it won't involve anything that would
 * need them. Use only if you really need it - at the very least,
 * don't blindly convert fput() by kernel thread to that.
 */
void __fput_sync(struct file *file)
{
	if (file_ref_put(&file->f_ref))
		__fput(file);
}
Based on all the above, I think we can conclude that since dma_buf_put()
does not directly (or synchronously) call the f_op->release() handler, a
deadlock is unlikely to occur in the scenario you described.
Yeah, I agree.
Interesting to know; I wasn't aware that the task_work functionality
exists for that use case.
Thanks,
Christian.
Thanks,
Vivek
But it could be that this information is outdated.
Regards,
Christian.
Thanks,
Vivek
+	}
+}
This pattern was suggested multiple times and I had to reject it every time
because the whole idea is just fundamentally broken.
It's really astonishing how people always come up with the same broken pattern.
Regards,
Christian.
Apart from that, I have to reject the addition of dma_buf_try_get(); that is
clearly not something we should do.
Understood. It appears that Vivek has confirmed that his v2 has resolved the
issue. I will follow up with him to determine if he plans to resume his patch,
and if so, I will apply my last patch on top of his updated patch series.
Thanks,
Wei Lin
Thanks,
Christian.