On 2022/06/17 1:21, Jan Kara wrote:
Hello!

I've been debugging some customer reports of tasks hanging (forever) waiting for free tags when in fact all tags are free. After looking into it for some time I think I know what is happening. First, keep in mind that it concerns a device which uses shared tags. There are 127 tags available and the number of active queues using those tags is easily 40 or more, so the number of tags available to each device is rather small. Now I'm not sure how batched wakeups can ever work in such a situation, but maybe I'm missing something.

Take for example a situation where two tags are available for a device and both are currently in use. A process comes into blk_mq_get_tag(), wants to allocate a tag and goes to sleep. How can it ever be woken up if wake_batch is 4? If the two IOs complete, sbitmap will get two wakeups, but that's not enough to trigger the batched wakeup and really wake the waiter...

Even if we have, say, 4 tags available, so that in theory there should be enough wakeups to fill the batch, the following problem can occur. 4 tags are in use, and two processes come to blk_mq_get_tag() and sleep, one on wait queue 0, one on wait queue 1. Now four IOs complete, so sbitmap_queue_wake_up() gets called 4 times, and the fourth call decrements wait_cnt to 0, so it ends up calling wake_up_nr(wq0, 4). Fine, one of the waiters is woken up, but the other one is still sleeping in wq1 and there are not enough wakeups left to fill its batch and wake it up. This is essentially because we have lost three wakeups on wq0: it didn't have enough waiters to wake.
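As a sanity check of that last scenario, here is a tiny userspace model of the batching arithmetic (this is not the kernel sbitmap code; wake_batch = 4, the two wait queues and the single sleeper per queue are taken from the example above, everything else is a simplifying assumption):

#include <stdio.h>

#define WAKE_BATCH	4

struct wq_model {
	int wait_cnt;		/* completions still needed to trigger a wakeup */
	int nr_sleepers;	/* tasks currently sleeping on this queue */
};

/* wq0 and wq1 from the example, one sleeper each */
static struct wq_model wq[2] = {
	{ WAKE_BATCH, 1 },
	{ WAKE_BATCH, 1 },
};

/* Rough model of sbitmap_queue_wake_up(): each freed tag decrements the
 * chosen queue's wait_cnt; only when it reaches zero are up to
 * WAKE_BATCH tasks woken and the counter reset. */
static void tag_freed(int i)
{
	if (--wq[i].wait_cnt > 0)
		return;
	int woken = wq[i].nr_sleepers < WAKE_BATCH ? wq[i].nr_sleepers : WAKE_BATCH;
	wq[i].nr_sleepers -= woken;
	wq[i].wait_cnt = WAKE_BATCH;
	printf("wq%d: woke %d task(s), wasted %d wakeup(s)\n",
	       i, woken, WAKE_BATCH - woken);
}

int main(void)
{
	/* The four completions all land on wq0, as in the example. */
	for (int i = 0; i < 4; i++)
		tag_freed(0);
	printf("wq1: %d sleeper(s) left, still needs %d completion(s)\n",
	       wq[1].nr_sleepers, wq[1].wait_cnt);
	return 0;
}

Running it shows wq0 waking one task and wasting three wakeups, while wq1 keeps its sleeper with wait_cnt still at 4, which is exactly the lost-wakeup situation described above.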
Hi,

From what I see, if tags are shared between multiple devices, wake_batch should make sure that all waiters are woken up. For example: there are 64 tags in total, shared between two devices, so wake_batch is 4 (if both devices are active). If there are waiters, that means at least 32 tags are grabbed, so the 8 wait queues are each guaranteed to be woken up at least once after those 32 tags are freed.
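To spell out that arithmetic, a minimal sketch (the formula below is a simplification of what blk-mq/sbitmap actually compute for shared tags; it is only meant to show why 32 freed tags are enough to fill a batch on all 8 wait queues, with SBQ_WAIT_QUEUES = 8 being the sbitmap constant):

#include <stdio.h>

#define SBQ_WAIT_QUEUES	8

int main(void)
{
	unsigned int total_tags = 64;
	unsigned int active_devices = 2;

	/* fair share of tags one active device may use */
	unsigned int per_device = total_tags / active_devices;	/* 32 */

	/* batch size: per-device share spread over the wait queues */
	unsigned int wake_batch = per_device / SBQ_WAIT_QUEUES;	/* 4 */

	/*
	 * A task only sleeps in blk_mq_get_tag() once its device has its
	 * whole share in flight, so at least per_device completions (and
	 * hence that many wakeup events) are still to come.
	 */
	unsigned int full_batches = per_device / wake_batch;	/* 8 */

	printf("wake_batch=%u, %u pending completions -> %u full batches for %u wait queues\n",
	       wake_batch, per_device, full_batches, SBQ_WAIT_QUEUES);
	return 0;
}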
Finally, sbitmap_queue_wake_up() is racy: if two of them race together, they can end up decrementing the wait_cnt of a wq which does not have any process queued, which again effectively leads to lost wakeups and possibly indefinitely sleeping tasks.
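A deterministic re-enactment of one bad interleaving of two such calls (single-threaded on purpose, with simplified structures; the "check for a sleeper, then decrement" ordering is the assumption being illustrated, not a copy of the real code):

#include <stdio.h>
#include <stdbool.h>

struct wq_model {
	int wait_cnt;
	int nr_sleepers;
};

int main(void)
{
	struct wq_model wq = { .wait_cnt = 1, .nr_sleepers = 1 };

	/* Caller A: sees a sleeper on wq and picks it ... */
	bool a_picked = wq.nr_sleepers > 0;

	/* ... but caller B runs to completion in between: it also sees
	 * the sleeper, decrements wait_cnt to zero, wakes the task and
	 * resets the batch counter. */
	if (wq.nr_sleepers > 0 && --wq.wait_cnt == 0) {
		wq.nr_sleepers = 0;	/* the only waiter is gone */
		wq.wait_cnt = 4;	/* batch reset to wake_batch */
	}

	/* Caller A resumes and decrements the counter of a queue that no
	 * longer has anyone queued: that completion is spent for nothing. */
	if (a_picked)
		wq.wait_cnt--;

	printf("sleepers=%d wait_cnt=%d -> one wakeup consumed with nobody to wake\n",
	       wq.nr_sleepers, wq.wait_cnt);
	return 0;
}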
BTW, I do think this implementation has some problems in concurrent scenarios, as described in the following patch: https://lore.kernel.org/lkml/20220415101053.554495-4-yukuai3@xxxxxxxxxx/
Now this all looks so broken that I'm unsure whether I'm not missing something fundamental. Also I'm not quite sure how to fix all this without basically destroying the batched wakeup feature (but I haven't put too much thought into this yet). Comments?

Honza