code question about tcmu_try_get_data_page()

Xiaoguang Wang <xiaoguang.wang@xxxxxxxxxxxxxxxxx> · Wed, 9 Mar 2022 16:20:49 +0800

hi Bodo,

I have implemented a zero-copy prototype for tcmu(not sent out yet), it 
reallys
improves throughput in large io size cases, and I met some code question 
that
may need you kind help, thanks in advance. In my prototype, I choose to
unmap pages in tcmu_irqcontrol():
==> tcmu_irqcontrol, which holds cmdr_lock.
====> tcmu_handle_completions
======> tcmu_handle_completion
========> zap_page_range, which needs to hold mm related lock.

But in page fault procedure:
==> tcmu_vma_fault, which I think its caller hold mm related lock.
====> tcmu_try_get_data_page, which tries to hold cmdr_lock.

Then dead lock may happens. In tcmu_try_get_data_page(),
it mainly acts like below:
        mutex_lock(&udev->cmdr_lock);
        page = xa_load(&udev->data_pages, dpi);
        if (likely(page)) {
                mutex_unlock(&udev->cmdr_lock);
                return page;
        }

        /*
         * Userspace messed up and passed in a address not in the
         * data iov passed to it.
         */
        pr_err("Invalid addr to data page mapping (dpi %u) on device %s\n",
               dpi, udev->name);
        mutex_unlock(&udev->cmdr_lock);

I wonder do we really need to hold cmdr_lock here, seems it doesn't
protect any real resources. If cmdr_lock here is used to avoid concurrent
xa_erase(), after releasing cmdr_lock, page still can be removed.
So my question is that can we safely remove this cmdr_lock here? thanks.

Regards,
Xiaoguang Wang