On 5/24/23 22:04, Ming Lei wrote:
On Wed, May 24, 2023 at 01:28:41PM -0400, Waiman Long wrote:
On 5/24/23 11:43, Waiman Long wrote:
On 5/24/23 00:26, Ming Lei wrote:
On Wed, May 24, 2023 at 12:19:57AM -0400, Waiman Long wrote:
On 5/23/23 23:51, Ming Lei wrote:
As noted by Michal, the blkg_iostat_set's in the lockless list hold a
reference to their blkg to protect against its removal. Those blkg's in
turn hold a reference to the blkcg. When a cgroup is being destroyed,
cgroup_rstat_flush() is only called at css_release_work_fn() which
is called when the blkcg reference count reaches 0. This circular
dependency will prevent blkcg and some blkgs from being freed after
they are made offline.
It is less of a problem if the cgroup to be destroyed also has other
controllers, like memory, that will call cgroup_rstat_flush() and clean
up the reference count. If block is the only controller that uses rstat,
these offline blkcg's and blkg's may never be freed, leaking more and
more memory over time.
To prevent this potential memory leak:
- flush the blkcg per-cpu stats list in __blkg_release(), when no new
stats can be added
- don't grab the bio->bi_blkg reference when adding stats to blkcg's
per-cpu stat list, since all stats are guaranteed to be consumed before
the blkg instance is released; grabbing a blkg reference for stats was
the most fragile part of the original patch
Based on Waiman's patch:
https://lore.kernel.org/linux-block/20221215033132.230023-3-longman@xxxxxxxxxx/
Fixes: 3b8cc6298724 ("blk-cgroup: Optimize blkcg_rstat_flush()")
Cc: Waiman Long <longman@xxxxxxxxxx>
Cc: Tejun Heo <tj@xxxxxxxxxx>
Cc: mkoutny@xxxxxxxx
Cc: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
Signed-off-by: Ming Lei <ming.lei@xxxxxxxxxx>
---
V2:
- remove kernel/cgroup change, and call blkcg_rstat_flush()
to flush stat directly
block/blk-cgroup.c | 29 +++++++++++++++++++++--------
1 file changed, 21 insertions(+), 8 deletions(-)
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 0ce64dd73cfe..ed0eb8896972 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -34,6 +34,8 @@
#include "blk-ioprio.h"
#include "blk-throttle.h"
+static void __blkcg_rstat_flush(struct blkcg *blkcg, int cpu);
+
/*
* blkcg_pol_mutex protects blkcg_policy[] and policy [de]activation.
* blkcg_pol_register_mutex nests outside of it and synchronizes entire
@@ -163,10 +165,21 @@ static void blkg_free(struct blkcg_gq *blkg)
static void __blkg_release(struct rcu_head *rcu)
{
struct blkcg_gq *blkg = container_of(rcu, struct blkcg_gq, rcu_head);
+ struct blkcg *blkcg = blkg->blkcg;
+ int cpu;
#ifdef CONFIG_BLK_CGROUP_PUNT_BIO
WARN_ON(!bio_list_empty(&blkg->async_bios));
#endif
+ /*
+ * Flush all the non-empty percpu lockless lists before releasing
+ * us, given these stats belong to us.
+ *
+ * cgroup locks aren't needed here since __blkcg_rstat_flush just
+ * propagates delta into blkg parent, which is live now.
+ */
+ for_each_possible_cpu(cpu)
+ __blkcg_rstat_flush(blkcg, cpu);
/* release the blkcg and parent blkg refs this blkg has been holding */
css_put(&blkg->blkcg->css);
@@ -951,17 +964,12 @@ static void blkcg_iostat_update(struct blkcg_gq *blkg, struct blkg_iostat *cur,
u64_stats_update_end_irqrestore(&blkg->iostat.sync, flags);
}
-static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
+static void __blkcg_rstat_flush(struct blkcg *blkcg, int cpu)
{
- struct blkcg *blkcg = css_to_blkcg(css);
struct llist_head *lhead = per_cpu_ptr(blkcg->lhead, cpu);
struct llist_node *lnode;
struct blkg_iostat_set *bisc, *next_bisc;
- /* Root-level stats are sourced from system-wide IO stats */
- if (!cgroup_parent(css->cgroup))
- return;
-
rcu_read_lock();
lnode = llist_del_all(lhead);
@@ -991,13 +999,19 @@ static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
if (parent && parent->parent)
blkcg_iostat_update(parent, &blkg->iostat.cur,
&blkg->iostat.last);
- percpu_ref_put(&blkg->refcnt);
}
out:
rcu_read_unlock();
}
+static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
+{
+ /* Root-level stats are sourced from system-wide IO stats */
+ if (cgroup_parent(css->cgroup))
+ __blkcg_rstat_flush(css_to_blkcg(css), cpu);
+}
+
I think it may not be safe to call __blkcg_rstat_flush() directly without
taking the cgroup_rstat_cpu_lock. That is why I added a helper to
kernel/cgroup/rstat.c in my patch to meet the locking requirement.
All stats are removed by llist_del_all(), the local list is iterated,
and then each blkg & its parent is touched in __blkcg_rstat_flush(), so
can you explain a bit why cgroup locks are needed? What would they be
protecting?
You are right. The llist_del_all() call in blkcg_rstat_flush() is atomic,
so it is safe for concurrent execution, which is what the
cgroup_rstat_cpu_lock protects against. That may not be the case for the
rstat callbacks of other controllers. So I will suggest you add a comment
to clarify that point. Other than that, your patch looks good to me.
Reviewed-by: Waiman Long <longman@xxxxxxxxxx>
After some more thought, I need to retract my reviewed-by tag for now.
There is a slight possibility that blkcg_iostat_update() in
blkcg_rstat_flush() can happen concurrently, which would corrupt the
sequence count.
llist_del_all() moves all 'bis' into one local list, and a bis is a
percpu variable of its blkg, so in theory the same bis won't be flushed
at the same time. And one bis should be touched in just one stat flush
code path because of llist_del_all().
So a 'bis' can still be thought of as being flushed in a serialized way.
However, blk_cgroup_bio_start() runs concurrently with blkcg_rstat_flush(),
so once bis->lqueued is cleared in blkcg_rstat_flush(), the same bis could
be added to the percpu llist again and then flushed by __blkcg_rstat_flush()
from blkg_release(). This should be the only window for a concurrent stats
update.
That is what I have in mind. One __blkcg_rstat_flush() can come from
blkg_release() and another one from the regular cgroup_rstat_flush*().
But blkg_release() runs from an RCU callback, so the previous flush has
already completed; still, a new flush can come in, and another blkg's stat
could be added following the same story as above.
One way to avoid that is to synchronize it with the cgroup_rstat_cpu_lock.
Another way is to use bisc->lqueued for synchronization.
I'd avoid the external cgroup lock here.
In that case, you need to move the WRITE_ONCE(bisc->lqueued, false) in
blkcg_rstat_flush() to the end, after all the blkcg_iostat_update() calls,
using smp_store_release(), and replace the READ_ONCE(bis->lqueued) check
in blk_cgroup_bio_start() with smp_load_acquire().
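Something like the following untested sketch, just to illustrate the
ordering I mean (the rest of the loop body stays as in your patch):

	/* blk_cgroup_bio_start(): acquire pairs with the release below */
	if (!smp_load_acquire(&bis->lqueued)) {
		struct llist_head *lhead = this_cpu_ptr(blkcg->lhead);

		llist_add(&bis->lnode, lhead);
		WRITE_ONCE(bis->lqueued, true);
	}

	/* __blkcg_rstat_flush(): clear lqueued only after both updates */
	llist_for_each_entry_safe(bisc, next_bisc, lnode, lnode) {
		struct blkcg_gq *blkg = bisc->blkg;
		struct blkcg_gq *parent = blkg->parent;

		/* ... fetch bisc->cur into a local 'cur' as before ... */

		blkcg_iostat_update(blkg, &cur, &bisc->last);
		if (parent && parent->parent)
			blkcg_iostat_update(parent, &blkg->iostat.cur,
					    &blkg->iostat.last);

		/* publish the updates before this bisc can be requeued */
		smp_store_release(&bisc->lqueued, false);
	}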
This way looks doable, but I guess it still can't avoid concurrent updates
on the parent's stat, such as when __blkcg_rstat_flush() from blkg_release()
is in progress, another sibling blkg's bis is added, and meanwhile
blkcg_rstat_flush() is called.
I realized that the use of cgroup_rstat_cpu_lock or the alternative was
not safe enough to prevent concurrent parent blkg rstat updates.
Another way is to add a blkcg->stat_lock covering __blkcg_rstat_flush();
what do you think of this approach?
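Something like the following untested sketch, where stat_lock would be a
new raw_spinlock_t added to struct blkcg (raw, since the flush may run
under the cgroup_rstat_cpu_lock):

static void __blkcg_rstat_flush(struct blkcg *blkcg, int cpu)
{
	struct llist_head *lhead = per_cpu_ptr(blkcg->lhead, cpu);
	struct llist_node *lnode;
	struct blkg_iostat_set *bisc, *next_bisc;
	unsigned long flags;

	rcu_read_lock();

	lnode = llist_del_all(lhead);
	if (!lnode)
		goto out;

	/*
	 * Serialize the whole walk so that propagation into a shared
	 * parent blkg can't interleave between two flushers.
	 */
	raw_spin_lock_irqsave(&blkcg->stat_lock, flags);
	llist_for_each_entry_safe(bisc, next_bisc, lnode, lnode) {
		/* ... same per-bisc propagation as in the patch ... */
	}
	raw_spin_unlock_irqrestore(&blkcg->stat_lock, flags);
out:
	rcu_read_unlock();
}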
I am thinking of adding a raw spinlock into blkg and taking it when doing
blkcg_iostat_update(). This can guarantee that there are no concurrent
updates to the rstat data. It has to be a raw spinlock as it will be taken
under the cgroup_rstat_cpu_lock raw spinlock.
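For example, an untested sketch (iostat_lock is just a placeholder name
for the new field in struct blkcg_gq, to be initialized in blkg_alloc()):

static void blkcg_iostat_update(struct blkcg_gq *blkg, struct blkg_iostat *cur,
				struct blkg_iostat *last)
{
	struct blkg_iostat delta;
	unsigned long flags;

	/*
	 * The per-blkg raw spinlock serializes concurrent updaters, e.g.
	 * a flush from blkg_release() racing with cgroup_rstat_flush().
	 * It must be raw since we may already hold the
	 * cgroup_rstat_cpu_lock raw spinlock here.
	 */
	raw_spin_lock_irqsave(&blkg->iostat_lock, flags);
	u64_stats_update_begin(&blkg->iostat.sync);
	blkg_iostat_set(&delta, cur);
	blkg_iostat_sub(&delta, last);
	blkg_iostat_add(&blkg->iostat.cur, &delta);
	blkg_iostat_add(last, &delta);
	u64_stats_update_end(&blkg->iostat.sync);
	raw_spin_unlock_irqrestore(&blkg->iostat_lock, flags);
}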
The use of u64_stats_fetch_begin()/u64_stats_fetch_retry() when retrieving
the rstat data from a bisc can happen concurrently with an update without
further synchronization. However, there can be at most one updater at any
time.
Cheers,
Longman
Thanks,
Ming