Hi Wang,
Yes, we need the cmdr_lock in tcmu_try_get_data_page.
Reason: if the unmap worker thread gets started and calls
find_free_blocks, this function - while holding cmdr_lock - per tcmu
device checks for pages we can release.
If there are pages to release, it first calls unmap_mapping_range
to deny the access from userspace to these pages. After that it loops
over the data_pages xarray to remove the pages from the array and
free them (tcmu_blocks_release).
If tcmu_try_get_data_page wouldn't lock the cmdr_lock mutex, it
possibly could run between the call to unmap_mapping_range and
tcmu_blocks_release removing the page from xarray and freeing it.
Then we would end up having a freed page mapped to userspace!
Up to now I didn't check, whether we here have the same possible
dead lock you described. If so, we probably can solve the problem
by first removing the pages from xarray, then unmap and finally
free all pages. If we would do so, I think we would no longer need
cmdr_lock in tcmu_try_get_data_page.
But: I'm wondering what exactly you are trying to do.
I assume you want to map the sgl-pages into tcmu data_area just while a
cmd is in the ring, right?
If so, would you do that as a new option, or would you do that
generally?
I'm asking because AFAICS there are at least these problems when mapping
sgls:
- if a command in the cmd ring times out, tcmu sends cmd completion
to target core. That means, you have to unmap the sgl pages.
But userspace does not know about that. So it still is allowed to
access the data space described by cmd's iovecs --> SIGBUS
- if the sgl contains not completely filled pages (probably at least
at the end), you might map old kernel data into userspace.
So you have to check and write zeros.
- AFAICS, there is no guarantee that fabric driver completely fills
all pages of the sgl but the least. When mapping directly, that
directly translates into iovecs having a length of not a multiple of
PAGE_SIZE, not only in the last iovec.
I'm not sure all existing userspace SW can handle that correctly.
- KEEP_BUF feature is based on tcmu owning the buffer in dataspace.
So it would no longer work.
To be honest I'm not sure your changes would make it into tcmu.
Bodo
On 09.03.22 09:20, Xiaoguang Wang wrote:
hi Bodo,
I have implemented a zero-copy prototype for tcmu(not sent out yet), it
reallys
improves throughput in large io size cases, and I met some code question
that
may need you kind help, thanks in advance. In my prototype, I choose to
unmap pages in tcmu_irqcontrol():
==> tcmu_irqcontrol, which holds cmdr_lock.
====> tcmu_handle_completions
======> tcmu_handle_completion
========> zap_page_range, which needs to hold mm related lock.
But in page fault procedure:
==> tcmu_vma_fault, which I think its caller hold mm related lock.
====> tcmu_try_get_data_page, which tries to hold cmdr_lock.
Then dead lock may happens. In tcmu_try_get_data_page(),
it mainly acts like below:
mutex_lock(&udev->cmdr_lock);
page = xa_load(&udev->data_pages, dpi);
if (likely(page)) {
mutex_unlock(&udev->cmdr_lock);
return page;
}
/*
* Userspace messed up and passed in a address not in the
* data iov passed to it.
*/
pr_err("Invalid addr to data page mapping (dpi %u) on device
%s\n",
dpi, udev->name);
mutex_unlock(&udev->cmdr_lock);
I wonder do we really need to hold cmdr_lock here, seems it doesn't
protect any real resources. If cmdr_lock here is used to avoid concurrent
xa_erase(), after releasing cmdr_lock, page still can be removed.
So my question is that can we safely remove this cmdr_lock here? thanks.
Regards,
Xiaoguang Wang