On 11/11/21 9:19 AM, Jens Axboe wrote:
> On 11/11/21 8:29 AM, Jens Axboe wrote:
>> On 11/11/21 7:58 AM, Jens Axboe wrote:
>>> On 11/11/21 7:30 AM, Jens Axboe wrote:
>>>> On 11/10/21 11:52 PM, Daniel Black wrote:
>>>>>> Would it be possible to turn this into a full reproducer script?
>>>>>> Something that someone that knows nothing about mysqld/mariadb can just
>>>>>> run and have it reproduce. If I install the 10.6 packages from above,
>>>>>> then it doesn't seem to use io_uring or be linked against liburing.
>>>>>
>>>>> Sorry Jens.
>>>>>
>>>>> Hope containers are ok.
>>>>
>>>> Don't think I have a way to run that, don't even know what podman is
>>>> and nor does my distro. I'll google a bit and see if I can get this
>>>> running.
>>>>
>>>> I'm fine building from source and running from there, as long as I
>>>> know what to do. Would that make it any easier? It definitely would
>>>> for me :-)
>>>
>>> The podman approach seemed to work, and I was able to run all three
>>> steps. Didn't see any hangs. I'm going to try again dropping down
>>> the innodb pool size (box only has 32G of RAM).
>>>
>>> The storage can do a lot more than 5k IOPS, I'm going to try ramping
>>> that up.
>>>
>>> Does your reproducer box have multiple NUMA nodes, or is it a single
>>> socket/node box?
>>
>> Doesn't seem to reproduce for me on current -git. What file system are
>> you using?
>
> I seem to be able to hit it with ext4, guessing it has more cases that
> punt to buffered IO. As I initially suspected, I think this is a race
> with buffered file write hashing. I have a debug patch that just turns
> a regular non-numa box into multi nodes, may or may not be needed to
> hit this, but I definitely can now. Looks like this:
>
> Node7 DUMP
>   index=0, nr_w=1, max=128, r=0, f=1, h=0
>     w=ffff8f5e8b8470c0, hashed=1/0, flags=2
>     w=ffff8f5e95a9b8c0, hashed=1/0, flags=2
>   index=1, nr_w=0, max=127877, r=0, f=0, h=0
>   free_list
>     worker=ffff8f5eaf2e0540
>   all_list
>     worker=ffff8f5eaf2e0540
>
> where we see node7 in this case having two work items pending, but the
> worker state is stalled on hash.
>
> The hash logic was rewritten as part of the io-wq worker threads being
> changed for 5.11 iirc, which is why that was my initial suspicion here.
>
> I'll take a look at this and make a test patch. Looks like you are able
> to test self-built kernels, is that correct?

Can you try with this patch? It's against -git, but it will apply to
5.15 as well.

diff --git a/fs/io-wq.c b/fs/io-wq.c
index afd955d53db9..7917b8866dcc 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -423,9 +423,10 @@ static inline unsigned int io_get_work_hash(struct io_wq_work *work)
 	return work->flags >> IO_WQ_HASH_SHIFT;
 }
 
-static void io_wait_on_hash(struct io_wqe *wqe, unsigned int hash)
+static bool io_wait_on_hash(struct io_wqe *wqe, unsigned int hash)
 {
 	struct io_wq *wq = wqe->wq;
+	bool ret = false;
 
 	spin_lock_irq(&wq->hash->wait.lock);
 	if (list_empty(&wqe->wait.entry)) {
@@ -433,9 +434,11 @@ static void io_wait_on_hash(struct io_wqe *wqe, unsigned int hash)
 		if (!test_bit(hash, &wq->hash->map)) {
 			__set_current_state(TASK_RUNNING);
 			list_del_init(&wqe->wait.entry);
+			ret = true;
 		}
 	}
 	spin_unlock_irq(&wq->hash->wait.lock);
+	return ret;
 }
 
 static struct io_wq_work *io_get_next_work(struct io_wqe_acct *acct,
@@ -447,6 +450,7 @@ static struct io_wq_work *io_get_next_work(struct io_wqe_acct *acct,
 	unsigned int stall_hash = -1U;
 	struct io_wqe *wqe = worker->wqe;
 
+retry:
 	wq_list_for_each(node, prev, &acct->work_list) {
 		unsigned int hash;
 
@@ -475,14 +479,18 @@ static struct io_wq_work *io_get_next_work(struct io_wqe_acct *acct,
 	}
 
 	if (stall_hash != -1U) {
+		bool do_retry;
+
 		/*
 		 * Set this before dropping the lock to avoid racing with new
 		 * work being added and clearing the stalled bit.
 		 */
		set_bit(IO_ACCT_STALLED_BIT, &acct->flags);
 		raw_spin_unlock(&wqe->lock);
-		io_wait_on_hash(wqe, stall_hash);
+		do_retry = io_wait_on_hash(wqe, stall_hash);
 		raw_spin_lock(&wqe->lock);
+		if (do_retry)
+			goto retry;
 	}
 
 	return NULL;

-- 
Jens Axboe
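
For illustration only (not part of the thread above): a minimal user-space
sketch of the retry pattern the patch introduces. Before parking on a busy
hash bucket, the worker re-checks the condition under the wait lock and, if
it has already cleared, rescans its work list instead of reporting it
stalled. The names here (wait_on_hash, hash_busy, work_done) are invented
for the example and only stand in for io-wq's wq->hash->map bit and per-acct
work list; the real patch operates on struct io_wqe and the wq->hash->wait
waitqueue.

/* build: cc -pthread -o sketch sketch.c */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t wait_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  wait_cond = PTHREAD_COND_INITIALIZER;
static bool hash_busy = true;   /* stands in for test_bit(hash, &wq->hash->map) */
static bool work_done;          /* stands in for runnable (non-hashed) work */

/*
 * Like the patched io_wait_on_hash(): re-check the busy condition under the
 * wait lock and tell the caller whether it should rescan the work list
 * instead of treating the queue as stalled.
 */
static bool wait_on_hash(void)
{
	bool retry = false;

	pthread_mutex_lock(&wait_lock);
	if (!hash_busy)
		retry = true;                   /* cleared while we raced in */
	else
		pthread_cond_wait(&wait_cond, &wait_lock);
	pthread_mutex_unlock(&wait_lock);
	return retry;
}

static void *worker(void *arg)
{
	(void)arg;
retry:
	/* "scan the work list": here, just check whether work became runnable */
	pthread_mutex_lock(&wait_lock);
	if (work_done) {
		pthread_mutex_unlock(&wait_lock);
		printf("worker: found runnable work\n");
		return NULL;
	}
	pthread_mutex_unlock(&wait_lock);

	/*
	 * Everything hashed to a busy bucket: wait on the hash. If the bit
	 * was already clear, rescan immediately (the do_retry/goto retry the
	 * patch adds); without that, the worker could park with work queued.
	 */
	if (wait_on_hash())
		goto retry;
	goto retry;                             /* woken by the hash owner: rescan */
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, worker, NULL);
	sleep(1);

	/* "hash owner" finishes: clear the bit, queue runnable work, wake waiters */
	pthread_mutex_lock(&wait_lock);
	hash_busy = false;
	work_done = true;
	pthread_cond_broadcast(&wait_cond);
	pthread_mutex_unlock(&wait_lock);

	pthread_join(t, NULL);
	return 0;
}

Without the re-check, the worker can mark the account stalled and go idle
even though the hash it was waiting on is already free, which would match
the dump above: hashed work still pending while the only worker sits on the
free_list.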