Re: uring regression - lost write request

Jens Axboe <axboe@xxxxxxxxx> · Thu, 11 Nov 2021 09:19:58 -0700

On 11/11/21 8:29 AM, Jens Axboe wrote:
> On 11/11/21 7:58 AM, Jens Axboe wrote:
>> On 11/11/21 7:30 AM, Jens Axboe wrote:
>>> On 11/10/21 11:52 PM, Daniel Black wrote:
>>>>> Would it be possible to turn this into a full reproducer script?
>>>>> Something that someone that knows nothing about mysqld/mariadb can just
>>>>> run and have it reproduce. If I install the 10.6 packages from above,
>>>>> then it doesn't seem to use io_uring or be linked against liburing.
>>>>
>>>> Sorry Jens.
>>>>
>>>> Hope containers are ok.
>>>
>>> Don't think I have a way to run that, don't even know what podman is
>>> and nor does my distro. I'll google a bit and see if I can get this
>>> running.
>>>
>>> I'm fine building from source and running from there, as long as I
>>> know what to do. Would that make it any easier? It definitely would
>>> for me :-)
>>
>> The podman approach seemed to work, and I was able to run all three
>> steps. Didn't see any hangs. I'm going to try again dropping down
>> the innodb pool size (box only has 32G of RAM).
>>
>> The storage can do a lot more than 5k IOPS, I'm going to try ramping
>> that up.
>>
>> Does your reproducer box have multiple NUMA nodes, or is it a single
>> socket/nod box?
> 
> Doesn't seem to reproduce for me on current -git. What file system are
> you using?

I seem to be able to hit it with ext4, guessing it has more cases that
punt to buffered IO. As I initially suspected, I think this is a race
with buffered file write hashing. I have a debug patch that just turns
a regular non-numa box into multi nodes, may or may not be needed be
needed to hit this, but I definitely can now. Looks like this:

Node7 DUMP                                                                      
index=0, nr_w=1, max=128, r=0, f=1, h=0                                         
  w=ffff8f5e8b8470c0, hashed=1/0, flags=2                                       
  w=ffff8f5e95a9b8c0, hashed=1/0, flags=2                                       
index=1, nr_w=0, max=127877, r=0, f=0, h=0                                      
free_list                                                                       
  worker=ffff8f5eaf2e0540                                                       
all_list                                                                        
  worker=ffff8f5eaf2e0540

where we seed node7 in this case having two work items pending, but the
worker state is stalled on hash.

The hash logic was rewritten as part of the io-wq worker threads being
changed for 5.11 iirc, which is why that was my initial suspicion here.

I'll take a look at this and make a test patch. Looks like you are able
to test self-built kernels, is that correct?

-- 
Jens Axboe