Re: [PATCH v4] mm/swap: fix race when skipping swapcache

Chris Li <chrisl@xxxxxxxxxx> · Tue, 20 Feb 2024 08:32:13 -0800

On Mon, Feb 19, 2024 at 8:56 PM Kairui Song <ryncsn@xxxxxxxxx> wrote:

>
> Hi Barry,
>
> > it might not be a problem for throughput. but for real-time and tail latency,
> > this hurts. For example, this might increase dropping frames of UI which
> > is an important parameter to evaluate performance :-)
> >
>
> That's a true issue, as Chris mentioned before I think we need to
> think of some clever data struct to solve this more naturally in the
> future, similar issue exists for cached swapin as well and it has been
> there for a while. On the other hand I think maybe applications that
> are extremely latency sensitive should try to avoid swap on fault? A
> swapin could cause other issues like reclaim, throttled or contention
> with many other things, these seem to have a higher chance than this
> race.

Yes, I do think the best long term solution is to have some clever
data structure to solve the synchronization issue and allow racing
threads to make forward progress at the same time.

I have also explored some (failed) synchronization ideas, for example
keeping the run time swap entry refcount separate from the swap_map
count. BTW, zswap entry->refcount behaves like that: it is separate from
the swap entry and tracks the temporary run time usage count held by the
function. However, that idea has its own problem. It needs an xarray to
track the swap entry run time refcount (only stored in the xarray when a
CPU fails to get the SWAP_HAS_CACHE bit). When we are done with the page
fault, we still need to look up the xarray to make sure no racing CPU
has put a refcount into it. That kind of defeats the purpose of avoiding
the swap cache in the first place: we still need to do the xarray lookup
in the normal path.

I came to realize that, while the current fix is not perfect (I still
wish we had a better solution that does not pause the racing CPU), this
patch is better than leaving the data corruption issue unfixed, and the
patch remains relatively simple. Yes, it has latency issues, but that is
still better than data corruption. It also doesn't stop us from coming
up with better solutions later on. If we want to address the
synchronization in a way that does not block other CPUs, it will likely
require a much bigger change.
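
To make the trade-off concrete, here is a minimal userspace sketch of
the "claim the entry, loser backs off" idea. It is only a model, not the
patch itself: the model_* names are hypothetical, and it uses a C11
atomic byte where the kernel updates swap_map under a lock. The only
things taken from the discussion are the SWAP_HAS_CACHE bit and
swapcache_prepare() failing for the CPU that loses the race.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the kernel's SWAP_HAS_CACHE bit. */
#define MODEL_HAS_CACHE 0x40u

/* One byte of state per swap slot, loosely modelled on swap_map[]. */
struct model_slot {
	atomic_uchar map;
};

/* Try to claim the slot: 0 on success, -EEXIST if someone else holds it. */
static int model_swapcache_prepare(struct model_slot *slot)
{
	unsigned char old = atomic_fetch_or(&slot->map, MODEL_HAS_CACHE);

	return (old & MODEL_HAS_CACHE) ? -EEXIST : 0;
}

/* Drop the claim so that a backed-off faulter can make progress. */
static void model_swapcache_clear(struct model_slot *slot)
{
	unsigned char old = atomic_fetch_and(&slot->map,
					     (unsigned char)~MODEL_HAS_CACHE);

	/* Clearing a claim that was never taken would be a bug. */
	assert(old & MODEL_HAS_CACHE);
	(void)old;
}

enum model_fault_result { MODEL_FAULT_DONE, MODEL_FAULT_RETRY };

/*
 * One pass of a cache-skipping fault. The loser makes no progress here:
 * it only reports that the fault has to be retried, which is where the
 * extra latency discussed in this thread comes from.
 */
static enum model_fault_result model_fault(struct model_slot *slot)
{
	if (model_swapcache_prepare(slot))
		return MODEL_FAULT_RETRY;

	/* ... read the folio and install the PTE would go here ... */

	model_swapcache_clear(slot);
	return MODEL_FAULT_DONE;
}
```

The winner's path touches only the one flag and needs no extra lookup,
which is why the fix stays simple. The cost is entirely on the loser,
which has to come back and try again.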

Unless someone has a better suggestion, this seems to be the best one
among the alternatives so far.

Chris

>
> > BTW, I wonder if ying's previous proposal - moving swapcache_prepare()
> > after swap_read_folio() will further help decrease the number?
>
> We can move the swapcache_prepare after folio alloc or cgroup charge,
> but I didn't see an observable change from statistics, for some
> workload the reading is even worse. I think that's mostly due to
> noise, or higher swap out rate since all raced threads will alloc an
> extra folio now. Applications that have many pages swapped out due to
> memory limit are already on the edge of triggering another reclaim, so
> a dozen more folio alloc could just trigger that...
>
> And we can't move it after swap_read_folio()... That's exactly what we
> want to protect.
>