Re: [PATCH 2/3] bcache: avoid unnecessary soft lockup in kworker update_writeback_rate()

Jens Axboe <axboe@xxxxxxxxx> · Fri, 27 May 2022 18:00:30 -0600

On 5/27/22 11:05 AM, colyli wrote:
> ? 2022-05-27 23:49?Jens Axboe ???
>> On 5/27/22 9:28 AM, Coly Li wrote:
>>> diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
>>> index d138a2d73240..c51671abe74e 100644
>>> --- a/drivers/md/bcache/writeback.c
>>> +++ b/drivers/md/bcache/writeback.c
>>> @@ -214,6 +214,7 @@ static void update_writeback_rate(struct work_struct *work)
>>>                           struct cached_dev,
>>>                           writeback_rate_update);
>>>      struct cache_set *c = dc->disk.c;
>>> +    bool contention = false;
>>>
>>>      /*
>>>       * should check BCACHE_DEV_RATE_DW_RUNNING before calling
>>> @@ -243,13 +244,41 @@ static void update_writeback_rate(struct work_struct *work)
>>>           * in maximum writeback rate number(s).
>>>           */
>>>          if (!set_at_max_writeback_rate(c, dc)) {
>>> -            down_read(&dc->writeback_lock);
>>> -            __update_writeback_rate(dc);
>>> -            update_gc_after_writeback(c);
>>> -            up_read(&dc->writeback_lock);
>>> +            /*
>>> +             * When contention happens on dc->writeback_lock with
>>> +             * the writeback thread, this kwork may be blocked for
>>> +             * very long time if there are too many dirty data to
>>> +             * writeback, and kerne message will complain a (bogus)
>>> +             * software lockup kernel message. To avoid potential
>>> +             * starving, if down_read_trylock() fails, writeback
>>> +             * rate updating will be skipped for dc->retry_max times
>>> +             * at most while delay this worker a bit longer time.
>>> +             * If dc->retry_max times are tried and the trylock
>>> +             * still fails, then call down_read() to wait for
>>> +             * dc->writeback_lock.
>>> +             */
>>> +            if (!down_read_trylock((&dc->writeback_lock))) {
>>> +                contention = true;
>>> +                dc->retry_nr++;
>>> +                if (dc->retry_nr > dc->retry_max)
>>> +                    down_read(&dc->writeback_lock);
>>> +            }
>>> +
>>> +            if (!contention || dc->retry_nr > dc->retry_max) {
>>> +                __update_writeback_rate(dc);
>>> +                update_gc_after_writeback(c);
>>> +                up_read(&dc->writeback_lock);
>>> +                dc->retry_nr = 0;
>>> +            }
>>>          }
>>>      }
>>
> 
> Hi Jens,
> 
> Thanks for looking into this :-)
> 
>> This is really not very pretty. First of all, why bother with storing a
>> max retry value in there? Doesn't seem like it'd ever be different per
> 
> It is because the probability of the lock contention on
> dc->writeback_lock depends on the I/O speed backing device. From my
> observation during the tests, for fast backing device with larger
> cache device, its writeback thread may work harder to flush more dirty
> data to backing device, the lock contention happens more and longer,
> so the writeback rate update kworker has to wait longer time before
> acquires dc->writeback_lock. So its dc->retry_max should be larger
> then slow backing device.
> 
> Therefore I'd like to have a tunable per-backing-device retry_max. And
> the syses interface will be added when users/customers want it. The
> use case is from SAP HANA users, I have report that they observe the
> soft lockup warning for dc->writeback_lock contention and worry about
> whether data is corrupted (indeed, of course not).

The initial patch has 5 as the default, and there are no sysfs knobs. If
you ever need a sysfs knob, by all means make it a variable and make it
configurable too. But don't do it upfront where the '5' suitabled named
would do the job.

>> diff --git a/drivers/md/bcache/writeback.c b/drivers/md/bcache/writeback.c
>> index 9ee0005874cd..cbc01372c7a1 100644
>> --- a/drivers/md/bcache/writeback.c
>> +++ b/drivers/md/bcache/writeback.c
>> @@ -235,19 +235,27 @@ static void update_writeback_rate(struct
>> work_struct *work)
>>          return;
>>      }
>>
>> -    if (atomic_read(&dc->has_dirty) && dc->writeback_percent) {
>> +    if (atomic_read(&dc->has_dirty) && dc->writeback_percent &&
>> +        !set_at_max_writeback_rate(c, dc)) {
>>          /*
>>           * If the whole cache set is idle, set_at_max_writeback_rate()
>>           * will set writeback rate to a max number. Then it is
>>           * unncessary to update writeback rate for an idle cache set
>>           * in maximum writeback rate number(s).
>>           */
>> -        if (!set_at_max_writeback_rate(c, dc)) {
> 
> The reason I didn't place '!set_at_max_writeback_rate' with other items in
> previous if() was for the above code comment. If I moved it to previous
> if() without other items, I was not comfortable to place the code comments
> neither before or after the if() check. So I used a separated if() check for
> '!set_at_max_writeback_rate'.
> 
> From your change, it seems placing the code comments behind is fine (or
> better), can I understand in this way? I try to learn and follow your way
> to handle such code comments situation.

Just put it higher up if you want, it doesn't really matter, or leave it
where it is.

>>              __update_writeback_rate(dc);
>>              update_gc_after_writeback(c);
>>              up_read(&dc->writeback_lock);
>> -        }
>> +        } while (0);
> 
> Aha, this is cool! I never though of using do{}while(0) and break in
> such a genius way! Sure I will use this, thanks for the hint :-)
> 
> After you reply my defense of dc->retry_max, and the question of code
> comments location, I will update and test the patch again, and
> re-sbumit to you.
> 
> Thanks for your constructive suggestion, especially the do{}while(0)
> part!

I would do something similar to my change and drop the 'dc' addition for
the max retries as it, by definition, can only be one value right now.
For all I know, you'll never need to change it again, and then you're
just wasting memory and making the code harder to read by putting it in
dc instead of just having this define.

-- 
Jens Axboe