On 2024/11/20 07:44, Yosry Ahmed wrote:
On Tue, Nov 19, 2024 at 2:35 PM Sridhar, Kanchana P
<kanchana.p.sridhar@xxxxxxxxx> wrote:
-----Original Message-----
From: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
Sent: Tuesday, November 19, 2024 11:51 AM
To: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>
Cc: Chengming Zhou <chengming.zhou@xxxxxxxxx>; Johannes Weiner
<hannes@xxxxxxxxxxx>; Nhat Pham <nphamcs@xxxxxxxxx>; linux-
kernel@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx; usamaarif642@xxxxxxxxx;
ryan.roberts@xxxxxxx; 21cnbao@xxxxxxxxx; akpm@xxxxxxxxxxxxxxxxxxxx;
Feghali, Wajdi K <wajdi.k.feghali@xxxxxxxxx>; Gopal, Vinodh
<vinodh.gopal@xxxxxxxxx>
Subject: Re: [PATCH v1] mm: zswap: Fix a potential memory leak in
zswap_decompress().
On Tue, Nov 19, 2024 at 11:42 AM Sridhar, Kanchana P
<kanchana.p.sridhar@xxxxxxxxx> wrote:
-----Original Message-----
From: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
Sent: Tuesday, November 19, 2024 11:27 AM
To: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>
Cc: Chengming Zhou <chengming.zhou@xxxxxxxxx>; Johannes Weiner
<hannes@xxxxxxxxxxx>; Nhat Pham <nphamcs@xxxxxxxxx>; linux-
kernel@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx; usamaarif642@xxxxxxxxx;
ryan.roberts@xxxxxxx; Huang, Ying <ying.huang@xxxxxxxxx>;
21cnbao@xxxxxxxxx; akpm@xxxxxxxxxxxxxxxxxxxx; Feghali, Wajdi K
<wajdi.k.feghali@xxxxxxxxx>; Gopal, Vinodh <vinodh.gopal@xxxxxxxxx>
Subject: Re: [PATCH v1] mm: zswap: Fix a potential memory leak in
zswap_decompress().
On Tue, Nov 19, 2024 at 11:22 AM Sridhar, Kanchana P
<kanchana.p.sridhar@xxxxxxxxx> wrote:
-----Original Message-----
From: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
Sent: Friday, November 15, 2024 1:49 PM
To: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>
Cc: Chengming Zhou <chengming.zhou@xxxxxxxxx>; Johannes Weiner
<hannes@xxxxxxxxxxx>; Nhat Pham <nphamcs@xxxxxxxxx>; linux-
kernel@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx;
usamaarif642@xxxxxxxxx;
ryan.roberts@xxxxxxx; Huang, Ying <ying.huang@xxxxxxxxx>;
21cnbao@xxxxxxxxx; akpm@xxxxxxxxxxxxxxxxxxxx; Feghali, Wajdi K
<wajdi.k.feghali@xxxxxxxxx>; Gopal, Vinodh
<vinodh.gopal@xxxxxxxxx>
Subject: Re: [PATCH v1] mm: zswap: Fix a potential memory leak in
zswap_decompress().
On Fri, Nov 15, 2024 at 1:14 PM Sridhar, Kanchana P
<kanchana.p.sridhar@xxxxxxxxx> wrote:
Hi Chengming,
-----Original Message-----
From: Chengming Zhou <chengming.zhou@xxxxxxxxx>
Sent: Wednesday, November 13, 2024 11:24 PM
To: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>;
Johannes
Weiner
<hannes@xxxxxxxxxxx>
Cc: Nhat Pham <nphamcs@xxxxxxxxx>; Yosry Ahmed
<yosryahmed@xxxxxxxxxx>; linux-kernel@xxxxxxxxxxxxxxx; linux-
mm@xxxxxxxxx; usamaarif642@xxxxxxxxx;
ryan.roberts@xxxxxxx;
Huang,
Ying <ying.huang@xxxxxxxxx>; 21cnbao@xxxxxxxxx; akpm@linux-
foundation.org; Feghali, Wajdi K <wajdi.k.feghali@xxxxxxxxx>;
Gopal,
Vinodh
<vinodh.gopal@xxxxxxxxx>
Subject: Re: [PATCH v1] mm: zswap: Fix a potential memory leak in
zswap_decompress().
Hello,
On 2024/11/14 14:37, Sridhar, Kanchana P wrote:
-----Original Message-----
From: Johannes Weiner <hannes@xxxxxxxxxxx>
Sent: Wednesday, November 13, 2024 9:12 PM
To: Sridhar, Kanchana P <kanchana.p.sridhar@xxxxxxxxx>
Cc: Nhat Pham <nphamcs@xxxxxxxxx>; Yosry Ahmed
<yosryahmed@xxxxxxxxxx>; linux-kernel@xxxxxxxxxxxxxxx;
linux-
mm@xxxxxxxxx; chengming.zhou@xxxxxxxxx;
usamaarif642@xxxxxxxxx;
ryan.roberts@xxxxxxx; Huang, Ying <ying.huang@xxxxxxxxx>;
21cnbao@xxxxxxxxx; akpm@xxxxxxxxxxxxxxxxxxxx; Feghali,
Wajdi K
<wajdi.k.feghali@xxxxxxxxx>; Gopal, Vinodh
<vinodh.gopal@xxxxxxxxx>
Subject: Re: [PATCH v1] mm: zswap: Fix a potential memory
leak in
zswap_decompress().
On Thu, Nov 14, 2024 at 01:56:16AM +0000, Sridhar, Kanchana
P
wrote:
So my question was, can we prevent the migration to a
different
cpu
by relinquishing the mutex lock after this conditional
Holding the mutex doesn't prevent preemption/migration.
Sure, however, is this also applicable to holding the mutex of a
per-
cpu
structure obtained via raw_cpu_ptr()?
Yes, unless you use migration_disable() or cpus_read_lock() to
protect
this section.
Ok.
Would holding the mutex prevent the acomp_ctx of the cpu prior
to
the migration (in the UAF scenario you described) from being
deleted?
No, cpu offline can kick in anytime to free the acomp_ctx->buffer.
If holding the per-cpu acomp_ctx's mutex isn't sufficient to
prevent
the
UAF, I agree, we might need a way to prevent the acomp_ctx
from
being
deleted, e.g. with refcounts as you've suggested, or to not use
the
Right, refcount solution from Johannes is very good IMHO.
acomp_ctx at all for the check, instead use a boolean.
But this is not enough to just avoid using acomp_ctx for the check,
the usage of acomp_ctx inside the mutex is also UAF, since cpu
offline
can kick in anytime to free the acomp_ctx->buffer.
I see. How would the refcounts work? Would this add latency to
zswap
ops? In low memory situations, could the cpu offlining code over-ride
the refcounts?
I think what Johannes meant is that the zswap compress/decompress
paths grab a ref on the acomp_ctx before using it, and the CPU
offlining code only drops the initial ref, and does not free the
buffer directly. The buffer is only freed when the ref drops to zero.
I am not familiar with CPU hotplug, would it be simpler if we have a
wrapper like get_acomp_ctx() that disables migration or calls
cpus_read_lock() before grabbing the per-CPU acomp_ctx? A similar
wrapper, put_acompt_ctx() will be used after we are done using the
acomp_ctx.
Would it be sufficient to add a check for mutex_is_locked() in
zswap_cpu_comp_dead() and if this returns true, to exit without
deleting
the acomp?
I don't think this works. First of all, it's racy. It's possible the
mutex gets locked after we check mutex_is_locked() but before we
delete the acomp_ctx. Also, if we find that the mutex is locked, then
we do nothing and essentially leak the memory.
Yes, this would assume the cpu offlining code retries at some interval,
which could prevent the memory leak.
I am not sure about that, but even so, it wouldn't handle the first
scenario where the mutex gets locked after we check mutex_is_locked().
Second, and probably more important, this only checks if anyone is
currently holding the mutex. What about tasks that may be sleeping
waiting for the mutex to be unlocked? The mutex will be deleted from
under them as well.
Wouldn't this and the race described above, also be issues for the
refcount based approach?
I don't think so, at least if implemented correctly. There are a lot
of examples around the kernel that use RCU + refcounts for such use
cases. I think there are also some examples in kernel docs.
That being said, I am wondering if we can get away with something
simpler like holding the cpus read lock or disabling migration as I
suggested earlier, but I am not quite sure.
Another idea to consider is how zsmalloc avoids this issue through
its use of the local_lock() on the per-cpu mapping area. This disables
preemption from zs_map_object() through zs_unmap_object().
Would changing the acomp_ctx's mutex to a local_lock solve the
problem?
This is similar to disabling migration as I suggested, but disabling
preemption means that we cannot sleep, we spin on a lock instead.
In zswap_compress(), we send the compression request and may sleep
waiting for it. We also make a non-atomic allocation later that may
also sleep but that's less of a problem.
In zswap_decompress() we may also sleep, which is why we sometimes
copy the data into acomp_ctx->buffer and unmap the handle to begin
with.
So I don't think we can just replace the mutex with a lock.
Also, I am wondering if the mutex design already handles cases where
tasks are sleeping, waiting for a mutex that disappears?
I don't believe so. It doesn't make sense for someone to free a mutex
while someone is waiting for it. How would the waiter know if the
memory backing the mutex was freed?
Thanks Yosry, all good points. There would need to be some sort of
arbiter (for e.g., the cpu offlining code) that would reschedule tasks
running on a cpu before shutting it down, which could address
this specific issue. I was thinking these are not problems unique to
zswap's per-cpu acomp_ctx->mutex wrt the offlining?
There are a few reasons why zswap has this problem and other code may
not have it. For example the data structure is dynamically allocated
and is freed during offlining, it wouldn't be a problem if it was
static. Also the fact that we don't disable preemption when accessing
the per-CPU data, as I mentioned earlier, which would prevent the CPU
we are running on from going offline while we access the per-CPU data.
I think we should either:
(a) Use refcounts.
(b) Disable migration.
(c) Hold the CPUs read lock.
I was hoping someone with more knowledge about CPU offlining would
confirm (b) and (c) would work, but I am pretty confident they would.
+1, I also think (b) or (c) should work with my limited knowledge about
CPU offlining :-) And they seem simpler than refcounts implementation.
Thanks.