On 09.05.22 20:12, Matthew Wilcox wrote:
On Mon, May 09, 2022 at 09:05:45AM +0300, Dan Carpenter wrote:
If there is no other way to avoid the Smatch warning,
The Smatch warning is not the issue. If we're holding a spinlock and
we call might_sleep() then that generates a stack trace at runtime if
you have CONFIG_DEBUG_ATOMIC_SLEEP enabled. Probably enabling
CONFIG_DEBUG_ATOMIC_SLEEP should just be a standard part of the QC
process.
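
For reference, a minimal, untested illustration of the check in
question; the function and lock names below are made up and have
nothing to do with tcmu:

#include <linux/spinlock.h>
#include <linux/pagemap.h>

static DEFINE_SPINLOCK(demo_lock);

static void demo(struct page *page)
{
        spin_lock(&demo_lock);
        /*
         * lock_page() calls might_sleep(); with CONFIG_DEBUG_ATOMIC_SLEEP=y
         * this prints "BUG: sleeping function called from invalid context"
         * plus a stack trace, because we are still inside the spinlock.
         */
        lock_page(page);
        unlock_page(page);
        spin_unlock(&demo_lock);
}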
Anyway, it sounds like you're just doing the locking to silence a warning in
xarray. Let's ask Matthew if he has a hint.
I suspect that the tcmu code is doing something horrible & wrong again.
It keeps poking at VM internals without understanding what's going
on or why. Anyway ...
If we really are doing something wrong, the long comment in the new
code in tcmu_blocks_release is probably wrong. Can you please give us
a hint as to what our misunderstanding is?
On Sun, May 08, 2022 at 08:03:14PM +0200, Bodo Stroesser wrote:
I'm quite sure that our code does not cause any problems, because
in tcmu we explicitly or implicitly take the xarray's lock only while
holding the so-called cmdr_lock mutex. Also, we take the lock without
disabling irqs or bhs. So I think there is no problem if lock_page()
sleeps while we hold the xarray's lock.
Of course, we wouldn't need the xarray lock at all, but xarray code
forces us to take it to avoid warnings.
You mean "The XArray warns us when we break its locking rules".
Yes.
The point is that tcmu does not use spinlocks but a mutex. So we
unfortunately cannot use the XArray lock alone, but always have to take
tcmu's mutex and, while holding it, additionally (implicitly or
explicitly) take the XArray's lock.
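
Roughly, the pattern looks like this (illustrative sketch only; the
helper name and the dpi index are made up, the locks and fields are
the real tcmu ones):

static void tcmu_erase_one_page(struct tcmu_dev *udev, unsigned long dpi)
{
        mutex_lock(&udev->cmdr_lock);       /* the lock tcmu actually relies on */
        xa_lock(&udev->data_pages);         /* taken only to satisfy the XArray's checks */
        __xa_erase(&udev->data_pages, dpi); /* __xa_*() expects xa_lock to be held */
        xa_unlock(&udev->data_pages);
        mutex_unlock(&udev->cmdr_lock);
}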
In tcmu_blocks_release we use the advanced XArray API to keep the
overhead small. It allows us to lock and unlock only once, before and
after the loop. If there is no other way to avoid the Smatch warning,
we could easily put additional xas_unlock() and xas_lock() calls around
the lock_page()/unlock_page() block.
... then you'd have to call xas_reset(), and you might as well use
xa_for_each().
Yes.
I wrote a mail a few minutes ago saying that we should switch to
xa_for_each_range().
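
Something along these lines, untested (the current code is quoted
below for comparison); xa_erase() takes and drops the XArray lock per
entry, so no explicit locking is left in the function:

static u32 tcmu_blocks_release(struct tcmu_dev *udev, unsigned long first,
                               unsigned long last)
{
        struct page *page;
        unsigned long dpi;
        u32 pages_freed = 0;

        xa_for_each_range(&udev->data_pages, dpi, page,
                          first * udev->data_pages_per_blk,
                          (last + 1) * udev->data_pages_per_blk - 1) {
                xa_erase(&udev->data_pages, dpi);
                /*
                 * Lock/unlock to wait for page faults that are still in
                 * flight on this page (see the comment quoted below).
                 */
                lock_page(page);
                unlock_page(page);
                __free_page(page);
                pages_freed++;
        }

        return pages_freed;
}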
1661 static u32 tcmu_blocks_release(struct tcmu_dev *udev, unsigned long first,
1662 unsigned long last)
1663 {
1664 XA_STATE(xas, &udev->data_pages, first * udev->data_pages_per_blk);
1665 struct page *page;
1666 u32 pages_freed = 0;
1667
1668 xas_lock(&xas);
^^^^^^^^^^^^^^
We take a spinlock here.
1669 xas_for_each(&xas, page, (last + 1) * udev->data_pages_per_blk - 1) {
1670 xas_store(&xas, NULL);
1671 /*
1672 * While reaching here there may be page faults occurring on
1673 * the to-be-released pages. A race condition may occur if
1674 * unmap_mapping_range() is called before page faults on these
1675 * pages have completed; a valid but stale map is created.
1676 *
1677 * If another command subsequently runs and needs to extend
1678 * dbi_thresh, it may reuse the slot corresponding to the
1679 * previous page in data_bitmap. Though we will allocate a new
1680 * page for the slot in data_area, no page fault will happen
1681 * because we have a valid map. Therefore the command's data
1682 * will be lost.
1683 *
1684 * We lock and unlock pages that are to be released to ensure
1685 * all page faults have completed. This way
1686 * unmap_mapping_range() can ensure stale maps are cleanly
1687 * removed.
1688 */
--> 1689 lock_page(page);
^^^^^^^^^^^^^^^
The lock_page() function calls might_sleep() (inside the declaration
block).
1690 unlock_page(page);
1691 __free_page(page);
1692 pages_freed++;
1693 }
1694 xas_unlock(&xas);
There are a number of things you can do. One is to remove pages into
a pagevec until it is full, then xas_unlock(); xas_reset(); lock and
unlock each page and pass the pagevec to __pagevec_release(). Then
xas_lock() and continue the loop.
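
An untested sketch of that first variant; tcmu_release_pvec() is a
made-up helper, everything else follows the current function:

static u32 tcmu_release_pvec(struct pagevec *pvec)
{
        u32 nr = pagevec_count(pvec);
        unsigned int i;

        /*
         * Lock/unlock each page to wait for in-flight page faults,
         * then hand the whole batch back in one go.
         */
        for (i = 0; i < nr; i++) {
                lock_page(pvec->pages[i]);
                unlock_page(pvec->pages[i]);
        }
        __pagevec_release(pvec);
        return nr;
}

static u32 tcmu_blocks_release(struct tcmu_dev *udev, unsigned long first,
                               unsigned long last)
{
        XA_STATE(xas, &udev->data_pages, first * udev->data_pages_per_blk);
        struct pagevec pvec;
        struct page *page;
        u32 pages_freed = 0;

        pagevec_init(&pvec);

        xas_lock(&xas);
        xas_for_each(&xas, page, (last + 1) * udev->data_pages_per_blk - 1) {
                xas_store(&xas, NULL);
                if (!pagevec_add(&pvec, page)) {
                        /*
                         * The pagevec is full: release the batch without
                         * holding the xa_lock, then restart the walk from
                         * the current index.
                         */
                        xas_unlock(&xas);
                        pages_freed += tcmu_release_pvec(&pvec);
                        xas_reset(&xas);
                        xas_lock(&xas);
                }
        }
        xas_unlock(&xas);

        /* Release whatever is left in the final, partial batch. */
        pages_freed += tcmu_release_pvec(&pvec);

        return pages_freed;
}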
Another possibility is to trylock each page; if it fails, put it into
the pagevec (and do the above dance if the pagevec is now full), but
if it succeeds, you can now unlock it and __free_page().
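
With that second variant, the loop body of the sketch above would
become something like this (again untested, reusing the made-up
tcmu_release_pvec() helper):

                xas_store(&xas, NULL);
                if (trylock_page(page)) {
                        /* No page fault in flight: safe to free right away. */
                        unlock_page(page);
                        __free_page(page);
                        pages_freed++;
                } else if (!pagevec_add(&pvec, page)) {
                        /*
                         * Only contended pages are deferred; drain the
                         * batch outside the xa_lock once it fills up.
                         */
                        xas_unlock(&xas);
                        pages_freed += tcmu_release_pvec(&pvec);
                        xas_reset(&xas);
                        xas_lock(&xas);
                }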
Thank you for that hint.
The probability that trylock_page() fails is _very_ low. In the future
we might call tcmu_blocks_release more often. Then we can speed it up
again by using the method you described, even without a need for
batching.
The key to going fast is batching. And that goes for __free_page()
vs __pagevec_release() as much as for walking the XArray.