On 8/22/18 10:42 AM, van der Linden, Frank wrote:
> On 8/22/18 7:27 AM, Jens Axboe wrote:
>> On 8/22/18 6:54 AM, Holger Hoffstätte wrote:
>>> On 08/22/18 06:10, Jens Axboe wrote:
>>>> [...]
>>>> If you have time, please look at the 3 patches I posted earlier today.
>>>> Those are for mainline, so should be OK :-)
>>>
>>> I'm just playing along at home, but with those 3 I get repeatable
>>> hangs & writeback not starting at all - curiously *only* on my btrfs
>>> device; for inexplicable reasons some other devices with ext4/xfs
>>> flush properly. Yes, that surprised me too, but it's repeatable.
>>>
>>> Now this may or may not have something to do with some of my
>>> in-testing patches for btrfs itself, but if I remove those 3 wbt
>>> fixes, everything is golden again. Not eager to repeat since it
>>> hangs sync & requires a hard reboot.. :(
>>>
>>> Just thought you'd like to know.
>>
>> Thanks, that's very useful info! I'll see if I can reproduce that.
>>
> I think it might be useful to give a dump of what we discussed before
> this patch was sent; there was a little more to it than what's in the
> description.
>
> We saw hangs and heavy lock contention in the wbt code under a
> specific workload, on XFS. Crash dump analysis showed the following
> issues:
>
> 1) wbt_done uses wake_up_all, which causes a thundering herd
> 2) __wbt_wait sets up a wait queue entry with the auto remove wake
>    function (via DEFINE_WAIT), which caused two problems:
>    * combined with the use of wake_up_all, the wait queue would
>      essentially be randomly reordered for tasks that did not get to
>      run
>    * the waitqueue_active check in may_queue was not valid with the
>      auto remove function, which could let incoming tasks with
>      requests overtake existing requests
>
> 1) was fixed by using a plain wake_up
> 2) was fixed by keeping tasks on the queue until they could break out
>    of the wait loop in __wbt_wait
>
> The random reordering, causing task starvation in __wbt_wait, was the
> main problem. Simply not using the auto remove wait function, e.g.
> *only* changing DEFINE_WAIT(wait) to DEFINE_WAIT_FUNC(wait,
> default_wake_function), fixed the hang / task starvation issue in our
> tests. But there was still more lock contention than there should
> have been, so we also changed wake_up_all to wake_up.
>
> It might be useful to run your tests with only the DEFINE_WAIT change
> I describe above added to the original code, to see if that still has
> any problems. That would give a good data point on whether any
> remaining issues are due to missed wakeups or not.
>
> There is the issue of making forward progress, or at least making it
> fast enough. With the changes as they stand now, you could come up
> with a scenario where the throttling limit is hit, but then is
> raised. Since there might still be a wait queue, you could end up
> putting each incoming task to sleep, even though it's not needed.
>
> One way to guarantee that the wait queue clears up as fast as
> possible, without resorting to wake_up_all, is to use wake_up_nr,
> where the number of tasks to wake up is (limit - inflight).
>
> Also, to avoid having tasks go back to sleep in the loop, you could
> do what you already proposed - always sleep at most once, and
> unconditionally proceed after waking up. To avoid incoming tasks
> overtaking the ones that are being woken up, you could have wbt_done
> increment inflight, effectively reserving a spot for the tasks that
> are about to be woken up.
>
> Another thing I thought about was recording the number of waiters in
> the wait queue and modifying the check from (inflight < limit) to
> (inflight < (limit - nwaiters)), no longer using any waitqueue_active
> checks at all.
>
> The condition checks are of course complicated by the fact that
> condition manipulation is not always done under the same lock (e.g.
> wbt_wait can be called with a NULL lock).
>
> So, these are just some of the things to consider here - maybe
> there's nothing in there that you hadn't already considered, but I
> thought it'd be useful to summarize them.
>
> Thanks for looking into this!
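For anyone following along, the change described above boils down to
roughly the following. This is a simplified sketch of the wait loop,
not the actual blk-wbt.c code; the may_queue() arguments are
approximate and the lock drop around io_schedule() is elided:

        static void __wbt_wait(struct rq_wb *rwb, struct rq_wait *rqw,
                               unsigned long rw)
        {
                /*
                 * DEFINE_WAIT_FUNC() with default_wake_function keeps
                 * the entry on the waitqueue after a wakeup, unlike
                 * DEFINE_WAIT()'s autoremove_wake_function.  That
                 * preserves FIFO order across wakeups and keeps the
                 * waitqueue_active() check in may_queue() meaningful.
                 */
                DEFINE_WAIT_FUNC(wait, default_wake_function);

                if (may_queue(rwb, rqw, &wait, rw))
                        return;

                do {
                        prepare_to_wait_exclusive(&rqw->wait, &wait,
                                                  TASK_UNINTERRUPTIBLE);

                        if (may_queue(rwb, rqw, &wait, rw))
                                break;

                        io_schedule();
                } while (1);

                finish_wait(&rqw->wait, &wait);
        }

On the completion side, wbt_done() then does a plain wake_up(&rqw->wait)
instead of wake_up_all(), so only one waiter is woken per completed
request.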
It turned out to be an unrelated problem with rq reordering in blk-mq,
which mainline doesn't have. So I think the above change is safe and
fine, but we definitely still want the extra change of NOT allowing a
queue token for the initial loop inside __wbt_wait() when we have
current sleepers on the queue. Without that, the initial check in
__wbt_wait() is not useful at all.

-- 
Jens Axboe
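To make that point about the initial check concrete, here is a rough
sketch of what "no queue token while there are sleepers" could look
like. This is only an illustration, not the actual patch;
atomic_inc_below() and get_limit() are used in the spirit of the
existing blk-wbt.c helpers:

        static bool may_queue(struct rq_wb *rwb, struct rq_wait *rqw,
                              struct wait_queue_entry *wait,
                              unsigned long rw)
        {
                /*
                 * If there are current sleepers on the queue and our
                 * entry is not the one at its head, refuse a token.  A
                 * new arrival doing the initial check in __wbt_wait()
                 * is not on the queue at all, so it can never overtake
                 * the sleepers.
                 */
                if (waitqueue_active(&rqw->wait) &&
                    rqw->wait.head.next != &wait->entry)
                        return false;

                /* otherwise grab a token if still under the limit */
                return atomic_inc_below(&rqw->inflight,
                                        get_limit(rwb, rw));
        }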