On 8/22/18 10:42 AM, van der Linden, Frank wrote:
> On 8/22/18 7:27 AM, Jens Axboe wrote:
>> On 8/22/18 6:54 AM, Holger Hoffstätte wrote:
>>> On 08/22/18 06:10, Jens Axboe wrote:
>>>> [...]
>>>> If you have time, please look at the 3 patches I posted earlier today.
>>>> Those are for mainline, so should be OK :-)
>>>
>>> I'm just playing along at home, but with those 3 I get repeatable
>>> hangs & writeback not starting at all - curiously *only* on my btrfs
>>> device; for inexplicable reasons some other devices with ext4/xfs
>>> flush properly. Yes, that surprised me too, but it's repeatable.
>>>
>>> Now this may or may not have something to do with some of my
>>> in-testing patches for btrfs itself, but if I remove those 3 wbt
>>> fixes, everything is golden again. Not eager to repeat since it
>>> hangs sync & requires a hard reboot.. :(
>>>
>>> Just thought you'd like to know.
>>
>> Thanks, that's very useful info! I'll see if I can reproduce that.
>>
> I think it might be useful to give a dump of what we discussed before
> this patch was sent; there was a little more to it than what's in the
> description.
>
> We saw hangs and heavy lock contention in the wbt code under a
> specific workload, on XFS. Crash dump analysis showed the following
> issues:
>
> 1) wbt_done uses wake_up_all, which causes a thundering herd
> 2) __wbt_wait sets up a wait queue entry with the auto remove wake
>    function (via DEFINE_WAIT), which caused two problems:
>    * combined with the use of wake_up_all, the wait queue would
>      essentially be randomly reordered for tasks that did not get to
>      run
>    * the waitqueue_active check in may_queue was not valid with the
>      auto remove function, which could let incoming tasks with
>      requests overtake existing requests
>
> 1) was fixed by using a plain wake_up
> 2) was fixed by keeping tasks on the queue until they could break out
>    of the wait loop in __wbt_wait
>
> The random reordering, causing task starvation in __wbt_wait, was the
> main problem. Simply not using the auto remove wait function, e.g.
> *only* changing DEFINE_WAIT(wait) to DEFINE_WAIT_FUNC(wait,
> default_wake_function), fixed the hang / task starvation issue in our
> tests. But there was still more lock contention than there should
> have been, so we also changed wake_up_all to wake_up.
>
> It might be useful to run your tests with only the DEFINE_WAIT change
> I describe above added to the original code, to see if that still has
> any problems. That would give a good data point on whether any
> remaining issues are due to missed wakeups or not.
>
> There is the issue of making forward progress, or at least making it
> fast enough. With the changes as they stand now, you could come up
> with a scenario where the throttling limit is hit, but then is
> raised. Since there might still be a wait queue, you could end up
> putting each incoming task to sleep, even though it's not needed.
>
> One way to guarantee that the wait queue clears up as fast as
> possible, without resorting to wake_up_all, is to use wake_up_nr,
> where the number of tasks to wake up is (limit - inflight).
>
> Also, to avoid having tasks go back to sleep in the loop, you could
> do what you already proposed - always sleep at most once, and
> unconditionally proceed after waking up. To avoid incoming tasks
> overtaking the ones that are being woken up, you could have wbt_done
> increment inflight, effectively reserving a spot for the tasks that
> are about to be woken up.
>
> Another thing I thought about was recording the number of waiters in
> the wait queue and modifying the check from (inflight < limit) to
> (inflight < (limit - nwaiters)), no longer using any waitqueue_active
> checks at all.
>
> The condition checks are of course complicated by the fact that
> condition manipulation is not always done under the same lock (e.g.
> wbt_wait can be called with a NULL lock).
>
> So, these are just some of the things to consider here - maybe
> there's nothing in there that you hadn't already considered, but I
> thought it'd be useful to summarize them.
>
> Thanks for looking into this!
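For anyone following along, the change described above boils down to
roughly the following. This is a simplified sketch of the wait loop,
not the actual blk-wbt.c code; the may_queue() arguments are
approximate and the lock drop around io_schedule() is elided:

        static void __wbt_wait(struct rq_wb *rwb, struct rq_wait *rqw,
                               unsigned long rw)
        {
                /*
                 * DEFINE_WAIT_FUNC() with default_wake_function keeps
                 * the entry on the waitqueue after a wakeup, unlike
                 * DEFINE_WAIT()'s autoremove_wake_function.  That
                 * preserves FIFO order across wakeups and keeps the
                 * waitqueue_active() check in may_queue() meaningful.
                 */
                DEFINE_WAIT_FUNC(wait, default_wake_function);

                if (may_queue(rwb, rqw, &wait, rw))
                        return;

                do {
                        prepare_to_wait_exclusive(&rqw->wait, &wait,
                                                  TASK_UNINTERRUPTIBLE);

                        if (may_queue(rwb, rqw, &wait, rw))
                                break;

                        io_schedule();
                } while (1);

                finish_wait(&rqw->wait, &wait);
        }

On the completion side, wbt_done() then does a plain wake_up(&rqw->wait)
instead of wake_up_all(), so only one waiter is woken per completed
request.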
It turned out to be an unrelated problem with rq reordering in blk-mq,
which mainline doesn't have. So I think the above change is safe and
fine, but we definitely still want the extra change of NOT allowing a
queue token for the initial loop inside __wbt_wait() when we have
current sleepers on the queue. Without that, the initial check in
__wbt_wait() is not useful at all.

-- 
Jens Axboe
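To make that point about the initial check concrete, here is a rough
sketch of what "no queue token while there are sleepers" could look
like. This is only an illustration, not the actual patch;
atomic_inc_below() and get_limit() are used in the spirit of the
existing blk-wbt.c helpers:

        static bool may_queue(struct rq_wb *rwb, struct rq_wait *rqw,
                              struct wait_queue_entry *wait,
                              unsigned long rw)
        {
                /*
                 * If there are current sleepers on the queue and our
                 * entry is not the one at its head, refuse a token.  A
                 * new arrival doing the initial check in __wbt_wait()
                 * is not on the queue at all, so it can never overtake
                 * the sleepers.
                 */
                if (waitqueue_active(&rqw->wait) &&
                    rqw->wait.head.next != &wait->entry)
                        return false;

                /* otherwise grab a token if still under the limit */
                return atomic_inc_below(&rqw->inflight,
                                        get_limit(rwb, rw));
        }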