On 11/11/21 10:28 AM, Jens Axboe wrote:
> On 11/11/21 9:55 AM, Jens Axboe wrote:
>> On 11/11/21 9:19 AM, Jens Axboe wrote:
>>> On 11/11/21 8:29 AM, Jens Axboe wrote:
>>>> On 11/11/21 7:58 AM, Jens Axboe wrote:
>>>>> On 11/11/21 7:30 AM, Jens Axboe wrote:
>>>>>> On 11/10/21 11:52 PM, Daniel Black wrote:
>>>>>>>> Would it be possible to turn this into a full reproducer script?
>>>>>>>> Something that someone that knows nothing about mysqld/mariadb can just
>>>>>>>> run and have it reproduce. If I install the 10.6 packages from above,
>>>>>>>> then it doesn't seem to use io_uring or be linked against liburing.
>>>>>>>
>>>>>>> Sorry Jens.
>>>>>>>
>>>>>>> Hope containers are ok.
>>>>>>
>>>>>> Don't think I have a way to run that, don't even know what podman is
>>>>>> and nor does my distro. I'll google a bit and see if I can get this
>>>>>> running.
>>>>>>
>>>>>> I'm fine building from source and running from there, as long as I
>>>>>> know what to do. Would that make it any easier? It definitely would
>>>>>> for me :-)
>>>>>
>>>>> The podman approach seemed to work, and I was able to run all three
>>>>> steps. Didn't see any hangs. I'm going to try again dropping down
>>>>> the innodb pool size (box only has 32G of RAM).
>>>>>
>>>>> The storage can do a lot more than 5k IOPS, I'm going to try ramping
>>>>> that up.
>>>>>
>>>>> Does your reproducer box have multiple NUMA nodes, or is it a single
>>>>> socket/node box?
>>>>
>>>> Doesn't seem to reproduce for me on current -git. What file system are
>>>> you using?
>>>
>>> I seem to be able to hit it with ext4, guessing it has more cases that
>>> punt to buffered IO. As I initially suspected, I think this is a race
>>> with buffered file write hashing. I have a debug patch that just turns
>>> a regular non-NUMA box into multi nodes, which may or may not be needed
>>> to hit this, but I definitely can now.
>>> Looks like this:
>>>
>>> Node7 DUMP
>>> index=0, nr_w=1, max=128, r=0, f=1, h=0
>>>   w=ffff8f5e8b8470c0, hashed=1/0, flags=2
>>>   w=ffff8f5e95a9b8c0, hashed=1/0, flags=2
>>> index=1, nr_w=0, max=127877, r=0, f=0, h=0
>>> free_list
>>>   worker=ffff8f5eaf2e0540
>>> all_list
>>>   worker=ffff8f5eaf2e0540
>>>
>>> where we see node7 in this case having two work items pending, but the
>>> worker state is stalled on hash.
>>>
>>> The hash logic was rewritten as part of the io-wq worker threads being
>>> changed for 5.11 iirc, which is why that was my initial suspicion here.
>>>
>>> I'll take a look at this and make a test patch. Looks like you are able
>>> to test self-built kernels, is that correct?
>>
>> Can you try with this patch? It's against -git, but it will apply to
>> 5.15 as well.
>
> I think that one covered one potential gap, but I just managed to
> reproduce a stall even with it. So hang on testing that one, I'll send
> you something more complete when I have confidence in it.

Alright, give this one a go if you can. Against -git, but will apply to
5.15 as well.
diff --git a/fs/io-wq.c b/fs/io-wq.c
index afd955d53db9..88202de519f6 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -423,9 +423,10 @@ static inline unsigned int io_get_work_hash(struct io_wq_work *work)
 	return work->flags >> IO_WQ_HASH_SHIFT;
 }
 
-static void io_wait_on_hash(struct io_wqe *wqe, unsigned int hash)
+static bool io_wait_on_hash(struct io_wqe *wqe, unsigned int hash)
 {
 	struct io_wq *wq = wqe->wq;
+	bool ret = false;
 
 	spin_lock_irq(&wq->hash->wait.lock);
 	if (list_empty(&wqe->wait.entry)) {
@@ -433,9 +434,11 @@ static void io_wait_on_hash(struct io_wqe *wqe, unsigned int hash)
 		if (!test_bit(hash, &wq->hash->map)) {
 			__set_current_state(TASK_RUNNING);
 			list_del_init(&wqe->wait.entry);
+			ret = true;
 		}
 	}
 	spin_unlock_irq(&wq->hash->wait.lock);
+	return ret;
 }
 
 static struct io_wq_work *io_get_next_work(struct io_wqe_acct *acct,
@@ -475,14 +478,21 @@ static struct io_wq_work *io_get_next_work(struct io_wqe_acct *acct,
 	}
 
 	if (stall_hash != -1U) {
+		bool unstalled;
+
 		/*
 		 * Set this before dropping the lock to avoid racing with new
 		 * work being added and clearing the stalled bit.
 		 */
 		set_bit(IO_ACCT_STALLED_BIT, &acct->flags);
 		raw_spin_unlock(&wqe->lock);
-		io_wait_on_hash(wqe, stall_hash);
+		unstalled = io_wait_on_hash(wqe, stall_hash);
 		raw_spin_lock(&wqe->lock);
+		if (unstalled) {
+			clear_bit(IO_ACCT_STALLED_BIT, &acct->flags);
+			if (wq_has_sleeper(&wqe->wq->hash->wait))
+				wake_up(&wqe->wq->hash->wait);
+		}
 	}
 
 	return NULL;
@@ -564,8 +574,11 @@ static void io_worker_handle_work(struct io_worker *worker)
 				io_wqe_enqueue(wqe, linked);
 
 			if (hash != -1U && !next_hashed) {
+				/* serialize hash clear with wake_up() */
+				spin_lock_irq(&wq->hash->wait.lock);
 				clear_bit(hash, &wq->hash->map);
 				clear_bit(IO_ACCT_STALLED_BIT, &acct->flags);
+				spin_unlock_irq(&wq->hash->wait.lock);
 				if (wq_has_sleeper(&wq->hash->wait))
 					wake_up(&wq->hash->wait);
 				raw_spin_lock(&wqe->lock);

-- 
Jens Axboe