Re: [PATCH V2] blk-cgroup: Flush stats before releasing blkcg_gq

On 5/24/23 11:43, Waiman Long wrote:
On 5/24/23 00:26, Ming Lei wrote:
On Wed, May 24, 2023 at 12:19:57AM -0400, Waiman Long wrote:
On 5/23/23 23:51, Ming Lei wrote:
As noted by Michal, the blkg_iostat_set's in the lockless list hold
reference to blkg's to protect against their removal. Those blkg's
hold reference to blkcg. When a cgroup is being destroyed,
cgroup_rstat_flush() is only called at css_release_work_fn() which
is called when the blkcg reference count reaches 0. This circular
dependency will prevent blkcg and some blkgs from being freed after
they are made offline.

It is less a problem if the cgroup to be destroyed also has other
controllers like memory that will call cgroup_rstat_flush() which will
clean up the reference count. If block is the only controller that uses
rstat, these offline blkcg and blkgs may never be freed leaking more
and more memory over time.

To prevent this potential memory leak:

- flush blkcg per-cpu stats list in __blkg_release(), when no new stat
can be added

- don't grab bio->bi_blkg reference when adding the stats into blkcg's
per-cpu stat list since all stats are guaranteed to be consumed before
releasing blkg instance, and grabbing blkg reference for stats was the
most fragile part of original patch

Based on Waiman's patch:

https://lore.kernel.org/linux-block/20221215033132.230023-3-longman@xxxxxxxxxx/

Fixes: 3b8cc6298724 ("blk-cgroup: Optimize blkcg_rstat_flush()")
Cc: Waiman Long <longman@xxxxxxxxxx>
Cc: Tejun Heo <tj@xxxxxxxxxx>
Cc: mkoutny@xxxxxxxx
Cc: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
Signed-off-by: Ming Lei <ming.lei@xxxxxxxxxx>
---
V2:
    - remove kernel/cgroup change, and call blkcg_rstat_flush()
    to flush stat directly

   block/blk-cgroup.c | 29 +++++++++++++++++++++--------
   1 file changed, 21 insertions(+), 8 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 0ce64dd73cfe..ed0eb8896972 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -34,6 +34,8 @@
   #include "blk-ioprio.h"
   #include "blk-throttle.h"
+static void __blkcg_rstat_flush(struct blkcg *blkcg, int cpu);
+
   /*
    * blkcg_pol_mutex protects blkcg_policy[] and policy [de]activation.
    * blkcg_pol_register_mutex nests outside of it and synchronizes entire
@@ -163,10 +165,21 @@ static void blkg_free(struct blkcg_gq *blkg)
   static void __blkg_release(struct rcu_head *rcu)
   {
       struct blkcg_gq *blkg = container_of(rcu, struct blkcg_gq, rcu_head);
+    struct blkcg *blkcg = blkg->blkcg;
+    int cpu;
   #ifdef CONFIG_BLK_CGROUP_PUNT_BIO
       WARN_ON(!bio_list_empty(&blkg->async_bios));
   #endif
+    /*
+     * Flush all the non-empty percpu lockless lists before releasing
+     * us, given these stats belong to us.
+     *
+     * cgroup locks aren't needed here since __blkcg_rstat_flush just
+     * propagates delta into blkg parent, which is live now.
+     */
+    for_each_possible_cpu(cpu)
+        __blkcg_rstat_flush(blkcg, cpu);
       /* release the blkcg and parent blkg refs this blkg has been holding */
       css_put(&blkg->blkcg->css);
@@ -951,17 +964,12 @@ static void blkcg_iostat_update(struct blkcg_gq *blkg, struct blkg_iostat *cur,
u64_stats_update_end_irqrestore(&blkg->iostat.sync, flags);
   }
-static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
+static void __blkcg_rstat_flush(struct blkcg *blkcg, int cpu)
   {
-    struct blkcg *blkcg = css_to_blkcg(css);
       struct llist_head *lhead = per_cpu_ptr(blkcg->lhead, cpu);
       struct llist_node *lnode;
       struct blkg_iostat_set *bisc, *next_bisc;
-    /* Root-level stats are sourced from system-wide IO stats */
-    if (!cgroup_parent(css->cgroup))
-        return;
-
       rcu_read_lock();
       lnode = llist_del_all(lhead);
@@ -991,13 +999,19 @@ static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
           if (parent && parent->parent)
               blkcg_iostat_update(parent, &blkg->iostat.cur,
                           &blkg->iostat.last);
-        percpu_ref_put(&blkg->refcnt);
       }
   out:
       rcu_read_unlock();
   }
+static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
+{
+    /* Root-level stats are sourced from system-wide IO stats */
+    if (cgroup_parent(css->cgroup))
+        __blkcg_rstat_flush(css_to_blkcg(css), cpu);
+}
+
I think it may not be safe to call __blkcg_rstat_flush() directly without
taking the cgroup_rstat_cpu_lock. That is why I added a helper to
kernel/cgroup/rstat.c in my patch to meet the locking requirement.
All stats are removed by llist_del_all(), the local list is iterated, and
each blkg and its parent are touched in __blkcg_rstat_flush(), so can you
explain a bit why cgroup locks are needed? What would they be protecting?

You are right. The llist_del_all() call in blkcg_rstat_flush() is atomic, so it is safe for concurrent execution, which is what the cgroup_rstat_cpu_lock protects against. That may not be the case for the rstat callbacks of other controllers, so I suggest you add a comment to clarify that point. Other than that, your patch looks good to me.

Reviewed-by: Waiman Long <longman@xxxxxxxxxx>

After some more thought, I need to retract my Reviewed-by tag for now. There is a slight possibility that blkcg_iostat_update() calls in blkcg_rstat_flush() can happen concurrently, which will corrupt the sequence count. One way to avoid that is to synchronize via the cgroup_rstat_cpu_lock. Another way is to use bisc->lqueued for synchronization. In that case, you need to move the WRITE_ONCE(bisc->lqueued, false) in blkcg_rstat_flush() to the end, after all the blkcg_iostat_update() calls, using smp_store_release(), and replace the READ_ONCE(bisc->lqueued) check in blk_cgroup_bio_start() with smp_load_acquire().

Cheers,
Longman



