On 10/6/22 06:11, Hillf Danton wrote:
On 4 Oct 2022 11:17:48 -0400 Waiman Long <longman@xxxxxxxxxx>
For a system with many CPUs and block devices, the time to do
blkcg_rstat_flush() from cgroup_rstat_flush() can be rather long. It
can be especially problematic as interrupts are disabled during the flush.
It was reported that it might take seconds to complete in some extreme
cases, leading to hard-lockup messages.
As it is likely that not all the percpu blkg_iostat_set's have been
updated since the last flush, those stale blkg_iostat_set's don't need
to be flushed in this case. This patch optimizes blkcg_rstat_flush()
by keeping a lockless list of recently updated blkg_iostat_set's in a
newly added percpu blkcg->lhead pointer.
The blkg_iostat_set is added to a sentinel lockless list on the update
side in blk_cgroup_bio_start(). It is removed from the sentinel lockless
list when flushed in blkcg_rstat_flush(). Due to racing, it is possible
that blkg_iostat_set's in the lockless list may have no new IO stats to
be flushed, but that is OK.
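Roughly, the layout being described looks like the following (an
illustrative sketch only, not the actual patch; the helper name
sketch_flush_one_cpu is made up, while blkcg->lhead,
blk_cgroup_bio_start() and blkcg_rstat_flush() are the names used above):

#include <linux/llist.h>
#include <linux/percpu.h>

struct blkcg {
        /* ... existing fields ... */
        struct llist_head __percpu *lhead;  /* blkg_iostat_set's updated since the last flush */
};

/*
 * Flush side, as called from blkcg_rstat_flush() for one CPU: only the
 * blkg_iostat_set's queued on this CPU since the last flush are walked,
 * instead of every blkg on the system.
 */
static void sketch_flush_one_cpu(struct blkcg *blkcg, int cpu)
{
        struct llist_node *lnode = llist_del_all(per_cpu_ptr(blkcg->lhead, cpu));
        struct blkg_iostat_set *bis, *next;

        if (!lnode)
                return;         /* nothing was updated on this CPU */

        llist_for_each_entry_safe(bis, next, lnode, lnode) {
                /* fold bis->cur - bis->last into the parent's stats here */
        }
}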
So it is likely that another flag, updated when the bis is added to/deleted
from the llist, can cut 1/3 off without raising the risk of making your
patch overcomplicated.
struct blkg_iostat_set {
        struct u64_stats_sync           sync;
+       struct llist_node               lnode;
+       struct blkcg_gq                 *blkg;
+       atomic_t                        queued;
        struct blkg_iostat              cur;
        struct blkg_iostat              last;
};
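With such a flag, the update side in blk_cgroup_bio_start() could then be
gated roughly as below (an untested sketch of the suggestion; the queued
and lnode fields are the ones added above, blkcg->lhead is from the patch
description, and atomic_xchg() is just one way to do the test-and-set):

        /*
         * After updating bis->cur under u64_stats_update_*():
         * atomic_xchg() returns the old value, so only the first update
         * since the last flush actually adds the node to the llist, and
         * a node is never added twice.
         */
        if (!atomic_xchg(&bis->queued, 1))
                llist_add(&bis->lnode, this_cpu_ptr(blkcg->lhead));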
Yes, by introducing a flag to record the lockless list state, it is
possible to just use the current llist implementation. Maybe I can rework
it without the sentinel variant for now and post a separate llist patch
for that later on.
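Something along those lines, e.g. on the flush side (a rough sketch, not
the final patch):

        struct llist_node *lnode = llist_del_all(per_cpu_ptr(blkcg->lhead, cpu));
        struct blkg_iostat_set *bis, *next;

        llist_for_each_entry_safe(bis, next, lnode, lnode) {
                /*
                 * The node is already off the list thanks to llist_del_all()
                 * above.  Clear the flag before folding the stats so that a
                 * racing updater re-queues the bis for the next flush instead
                 * of seeing it as still pending and skipping the add.
                 */
                atomic_set(&bis->queued, 0);
                /* ... propagate bis->cur - bis->last to the parent ... */
        }

That way the plain llist_add()/llist_del_all() from <linux/llist.h> is
enough and no sentinel node is needed to tell whether a bis is already
on the list.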
Cheers,
Longman