On Thu, Feb 16 2017, Shaohua Li wrote:
> On Thu, Feb 16, 2017 at 03:39:03PM +1100, Neil Brown wrote:
>> The 'writes_pending' counter is used to determine when the
>> array is stable so that it can be marked in the superblock
>> as "Clean".  Consequently it needs to be updated frequently
>> but only checked for zero occasionally.  Recent changes to
>> raid5 cause the count to be updated even more often - once
>> per 4K rather than once per bio.  This provided
>> justification for making the updates more efficient.
>>
>> So we replace the atomic counter with a per-cpu array of
>> 'long' counters.  Incrementing and decrementing is normally
>> much cheaper; testing for zero is more expensive.
>>
>> To meaningfully be able to test for zero we need to be able
>> to block further updates.  This is done by forcing the
>> "increment" step to take a spinlock in the rare case that
>> another thread is checking if the count is zero.  This is
>> done using a new field: "checkers".  "checkers" is the
>> number of threads that are currently checking whether the
>> count is zero.  It is usually 0, occasionally 1, and it is
>> not impossible that it could be higher, though this would
>> be rare.
>>
>> If, within an rcu_read_lock()ed section, "checkers" is seen
>> to be zero, then the local-cpu counter can be incremented
>> freely.  If "checkers" is not zero, mddev->lock must be
>> taken before the increment is allowed.  A decrement is
>> always allowed.
>>
>> To test for zero, a thread must increment "checkers", call
>> synchronize_rcu(), then take mddev->lock.  Once this is
>> done no new increments can happen.  A thread may choose to
>> perform a quick test-for-zero by summing all the counters
>> without holding a lock.  If this is non-zero, then the
>> total count is non-zero, or was non-zero very recently, so
>> it is safe to assume that it isn't zero.  If the quick
>> check does report a zero sum, then it is worth performing
>> the full locking protocol.
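[Editor's note: a rough user-space sketch of the increment/check protocol quoted above. A pthread mutex stands in for mddev->lock, a fixed array for the per-cpu counters, and the synchronize_rcu() step is only marked as a comment; all names are illustrative, not the actual md code.]

```c
#include <assert.h>
#include <pthread.h>

#define NR_CPUS 4

struct writes_pending {
	long counters[NR_CPUS];	/* per-"cpu" counts; only the sum matters */
	int checkers;		/* threads currently checking for zero */
	pthread_mutex_t lock;	/* stands in for mddev->lock */
};

/* Fast path: if no checker is active, bump the local counter freely
 * (in the kernel this read of ->checkers happens under
 * rcu_read_lock()).  Otherwise take the lock so checkers see a
 * stable total. */
static void inc_pending(struct writes_pending *wp, int cpu)
{
	if (wp->checkers == 0) {
		wp->counters[cpu]++;
	} else {
		pthread_mutex_lock(&wp->lock);
		wp->counters[cpu]++;
		pthread_mutex_unlock(&wp->lock);
	}
}

/* A decrement is always allowed without the lock. */
static void dec_pending(struct writes_pending *wp, int cpu)
{
	wp->counters[cpu]--;
}

/* Lockless "quick" sum: may be stale, but a non-zero result means the
 * count is, or very recently was, non-zero. */
static long quick_sum(struct writes_pending *wp)
{
	long sum = 0;

	for (int i = 0; i < NR_CPUS; i++)
		sum += wp->counters[i];
	return sum;
}

/* Full test: raise "checkers", wait for in-flight lock-free
 * increments to finish, then sum under the lock while new increments
 * are forced onto the locked path. */
static int is_zero(struct writes_pending *wp)
{
	int zero;

	pthread_mutex_lock(&wp->lock);
	wp->checkers++;
	pthread_mutex_unlock(&wp->lock);
	/* in the kernel: synchronize_rcu() here */
	pthread_mutex_lock(&wp->lock);
	zero = (quick_sum(wp) == 0);
	wp->checkers--;
	pthread_mutex_unlock(&wp->lock);
	return zero;
}
```

Note that an individual per-cpu entry can go negative when a request is submitted on one cpu and completed on another; the scheme only ever relies on the total.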
>>
>> When the counter is decremented, it is no longer possible
>> to immediately test if the result is zero (as
>> atomic_dec_and_test() did).  We don't even really want to
>> perform the "quick" test, as that sums over all cpus and is
>> work that will most often bring no benefit.
>>
>> In the "safemode==2" case, when we want to mark the array
>> as "clean" immediately when there are no writes, we perform
>> the quick test anyway, and possibly wake the md thread to
>> do the full test.  "safemode==2" is only used during
>> shutdown, so the cost is not problematic.
>>
>> When safemode!=2 we always set the timer, rather than only
>> when the counter reaches zero.
>>
>> If mod_timer() is called to set the timeout to the value it
>> already has, mod_timer() has low overhead with no atomic
>> operations.  So at worst it will have a noticeable cost
>> once per jiffy.  To further reduce the overhead, we round
>> the requested delay to a multiple of ->safemode_delay.
>> This might slightly increase the delay until the timer
>> fires, but will reduce the overhead of calling mod_timer()
>> significantly.  If lots of requests are completing, the
>> timer will be updated every 200 milliseconds (by default)
>> and never fire.  When it does eventually fire, it will
>> schedule the md thread to perform the full test for
>> writes_pending==0, and this is quite likely to find '0'.
>>
>> Signed-off-by: NeilBrown <neilb@xxxxxxxx>
>
> This sounds like a good place to use percpu-refcount.  In
> set_in_sync, we switch it to atomic, read it, then switch it
> back to percpu.

I did have a quick look at percpu-refcount and it didn't seem
suitable.  There is no interface to set the counter to atomic and
wait for that to complete.  There is also no interface to perform
a lockless sum - it is only expected to report '0' if the count
has been zero for a little while.  I might be able to make do
without those, or possibly enhance percpu-refcount.  I'll have a
look - thanks.

NeilBrown
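[Editor's note: the timer-rounding trick quoted above can be illustrated with a small helper. The function name and exact rounding are assumptions, not the md code, which operates on jiffies.]

```c
#include <assert.h>

/* Round a timer expiry (in jiffy-like units) up to the next multiple
 * of safemode_delay.  Successive calls within the same
 * safemode_delay-sized window then compute the identical expiry, so
 * mod_timer() takes its cheap "timeout unchanged" path instead of
 * rearming the timer on every completing request. */
static unsigned long rounded_timeout(unsigned long now,
				     unsigned long safemode_delay)
{
	unsigned long expiry = now + safemode_delay;

	if (expiry % safemode_delay)
		expiry += safemode_delay - (expiry % safemode_delay);
	return expiry;
}
```

With the default 200 ms delay, every completion arriving in the same window asks for the same expiry, so the timer is effectively updated at most once per window and never fires while writes keep completing.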