On 11/11/21 10:28 AM, Jens Axboe wrote:
> On 11/11/21 9:55 AM, Jens Axboe wrote:
>> On 11/11/21 9:19 AM, Jens Axboe wrote:
>>> On 11/11/21 8:29 AM, Jens Axboe wrote:
>>>> On 11/11/21 7:58 AM, Jens Axboe wrote:
>>>>> On 11/11/21 7:30 AM, Jens Axboe wrote:
>>>>>> On 11/10/21 11:52 PM, Daniel Black wrote:
>>>>>>>> Would it be possible to turn this into a full reproducer script?
>>>>>>>> Something that someone that knows nothing about mysqld/mariadb can just
>>>>>>>> run and have it reproduce. If I install the 10.6 packages from above,
>>>>>>>> then it doesn't seem to use io_uring or be linked against liburing.
>>>>>>>
>>>>>>> Sorry Jens.
>>>>>>>
>>>>>>> Hope containers are ok.
>>>>>>
>>>>>> Don't think I have a way to run that, don't even know what podman is
>>>>>> and nor does my distro. I'll google a bit and see if I can get this
>>>>>> running.
>>>>>>
>>>>>> I'm fine building from source and running from there, as long as I
>>>>>> know what to do. Would that make it any easier? It definitely would
>>>>>> for me :-)
>>>>>
>>>>> The podman approach seemed to work, and I was able to run all three
>>>>> steps. Didn't see any hangs. I'm going to try again dropping down
>>>>> the innodb pool size (box only has 32G of RAM).
>>>>>
>>>>> The storage can do a lot more than 5k IOPS, I'm going to try ramping
>>>>> that up.
>>>>>
>>>>> Does your reproducer box have multiple NUMA nodes, or is it a single
>>>>> socket/node box?
>>>>
>>>> Doesn't seem to reproduce for me on current -git. What file system are
>>>> you using?
>>>
>>> I seem to be able to hit it with ext4, guessing it has more cases that
>>> punt to buffered IO. As I initially suspected, I think this is a race
>>> with buffered file write hashing. I have a debug patch that just turns
>>> a regular non-NUMA box into multi nodes, which may or may not be needed
>>> to hit this, but I definitely can now.
>>> Looks like this:
>>>
>>> Node7 DUMP
>>> index=0, nr_w=1, max=128, r=0, f=1, h=0
>>>   w=ffff8f5e8b8470c0, hashed=1/0, flags=2
>>>   w=ffff8f5e95a9b8c0, hashed=1/0, flags=2
>>> index=1, nr_w=0, max=127877, r=0, f=0, h=0
>>> free_list
>>>   worker=ffff8f5eaf2e0540
>>> all_list
>>>   worker=ffff8f5eaf2e0540
>>>
>>> where we see node7 in this case having two work items pending, but the
>>> worker state is stalled on hash.
>>>
>>> The hash logic was rewritten as part of the io-wq worker threads being
>>> changed for 5.11 iirc, which is why that was my initial suspicion here.
>>>
>>> I'll take a look at this and make a test patch. Looks like you are able
>>> to test self-built kernels, is that correct?
>>
>> Can you try with this patch? It's against -git, but it will apply to
>> 5.15 as well.
>
> I think that one covered one potential gap, but I just managed to
> reproduce a stall even with it. So hang on testing that one, I'll send
> you something more complete when I have confidence in it.

Alright, give this one a go if you can. Against -git, but will apply to
5.15 as well.
diff --git a/fs/io-wq.c b/fs/io-wq.c
index afd955d53db9..88202de519f6 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -423,9 +423,10 @@ static inline unsigned int io_get_work_hash(struct io_wq_work *work)
 	return work->flags >> IO_WQ_HASH_SHIFT;
 }
 
-static void io_wait_on_hash(struct io_wqe *wqe, unsigned int hash)
+static bool io_wait_on_hash(struct io_wqe *wqe, unsigned int hash)
 {
 	struct io_wq *wq = wqe->wq;
+	bool ret = false;
 
 	spin_lock_irq(&wq->hash->wait.lock);
 	if (list_empty(&wqe->wait.entry)) {
@@ -433,9 +434,11 @@ static void io_wait_on_hash(struct io_wqe *wqe, unsigned int hash)
 		if (!test_bit(hash, &wq->hash->map)) {
 			__set_current_state(TASK_RUNNING);
 			list_del_init(&wqe->wait.entry);
+			ret = true;
 		}
 	}
 	spin_unlock_irq(&wq->hash->wait.lock);
+	return ret;
 }
 
 static struct io_wq_work *io_get_next_work(struct io_wqe_acct *acct,
@@ -475,14 +478,21 @@ static struct io_wq_work *io_get_next_work(struct io_wqe_acct *acct,
 	}
 
 	if (stall_hash != -1U) {
+		bool unstalled;
+
 		/*
 		 * Set this before dropping the lock to avoid racing with new
 		 * work being added and clearing the stalled bit.
 		 */
 		set_bit(IO_ACCT_STALLED_BIT, &acct->flags);
 		raw_spin_unlock(&wqe->lock);
-		io_wait_on_hash(wqe, stall_hash);
+		unstalled = io_wait_on_hash(wqe, stall_hash);
 		raw_spin_lock(&wqe->lock);
+		if (unstalled) {
+			clear_bit(IO_ACCT_STALLED_BIT, &acct->flags);
+			if (wq_has_sleeper(&wqe->wq->hash->wait))
+				wake_up(&wqe->wq->hash->wait);
+		}
 	}
 
 	return NULL;
@@ -564,8 +574,11 @@ static void io_worker_handle_work(struct io_worker *worker)
 				io_wqe_enqueue(wqe, linked);
 
 			if (hash != -1U && !next_hashed) {
+				/* serialize hash clear with wake_up() */
+				spin_lock_irq(&wq->hash->wait.lock);
 				clear_bit(hash, &wq->hash->map);
 				clear_bit(IO_ACCT_STALLED_BIT, &acct->flags);
+				spin_unlock_irq(&wq->hash->wait.lock);
 				if (wq_has_sleeper(&wq->hash->wait))
 					wake_up(&wq->hash->wait);
 				raw_spin_lock(&wqe->lock);

-- 
Jens Axboe