Re: blktests with zbd/006 ZNS triggers a possible false positive RCU stall

Damien Le Moal <damien.lemoal@xxxxxxxxxxxxxxxxxx> · Wed, 27 Apr 2022 17:39:46 +0900

On 4/27/22 16:41, Klaus Jensen wrote:
> On Apr 27 05:08, Shinichiro Kawasaki wrote:
>> On Apr 21, 2022 / 11:00, Luis Chamberlain wrote:
>>> On Wed, Apr 20, 2022 at 05:54:29AM +0000, Shinichiro Kawasaki wrote:
>>>> On Apr 14, 2022 / 15:02, Luis Chamberlain wrote:
>>>>> Hey folks,
>>>>>
>>>>> While enhancing kdevops [0] to embrace automation of testing with
>>>>> blktests for ZNS I ended up spotting a possible false positive RCU stall
>>>>> when running zbd/006 after zbd/005. The curious thing though is that
>>>>> this possible RCU stall is only possible when using the qemu
>>>>> ZNS drive, not when using nbd. In so far as kdevops is concerned
>>>>> it creates ZNS drives for you when you enable the config option
>>>>> CONFIG_QEMU_ENABLE_NVME_ZNS=y. So picking any of the ZNS drives
>>>>> suffices. When configuring blktests you can just enable the zbd
>>>>> guest, so only a pair of guests are reated the zbd guest and the
>>>>> respective development guest, zbd-dev guest. When using
>>>>> CONFIG_KDEVOPS_HOSTS_PREFIX="linux517" this means you end up with
>>>>> just two guests:
>>>>>
>>>>>   * linux517-blktests-zbd
>>>>>   * linux517-blktests-zbd-dev
>>>>>
>>>>> The RCU stall can be triggered easily as follows:
>>>>>
>>>>> make menuconfig # make sure to enable CONFIG_QEMU_ENABLE_NVME_ZNS=y and blktests
>>>>> make
>>>>> make bringup # bring up guests
>>>>> make linux # build and boot into v5.17-rc7
>>>>> make blktests # build and install blktests
>>>>>
>>>>> Now let's ssh to the guest while leaving a console attached
>>>>> with `sudo virsh vagrant_linux517-blktests-zbd` in a window:
>>>>>
>>>>> ssh linux517-blktests-zbd
>>>>> sudo su -
>>>>> cd /usr/local/blktests
>>>>> export TEST_DEVS=/dev/nvme9n1
>>>>> i=0; while true; do ./check zbd/005 zbd/006; if [[ $? -ne 0 ]]; then echo "BAD at $i"; break; else echo GOOOD $i ; fi; let i=$i+1; done;
>>>>>
>>>>> The above should never fail, but you should eventually see an RCU
>>>>> stall candidate on the console. The full details can be observed on the
>>>>> gist [1] but for completeness I list some of it below. It may be a false
>>>>> positive at this point, not sure.
>>>>>
>>>>> [493272.711271] run blktests zbd/005 at 2022-04-14 20:03:22
>>>>> [493305.769531] run blktests zbd/006 at 2022-04-14 20:03:55
>>>>> [493336.979482] nvme nvme9: I/O 192 QID 5 timeout, aborting
>>>>> [493336.981666] nvme nvme9: Abort status: 0x0
>>>>> [493367.699440] nvme nvme9: I/O 192 QID 5 timeout, reset controller
>>>>> [493388.819341] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
>>>>
>>>> Hello Luis,
>>>>
>>>> I run blktests zbd group on several QEMU ZNS emulation devices for every rcX
>>>> kernel releases. But, I have not ever observed the symptom above. Now I'm
>>>> repeating zbd/005 and zbd/006 using v5.18-rc3 and a QEMU ZNS device, and do
>>>> not observe the symptom so far, after 400 times repeat.
>>>
>>> Did you try v5.17-rc7 ?
>>
>> I hadn't tried it. Then I tried v5.17-rc7 and observed the same symptom.
>>
>>>
>>>> I would like to run the test using same ZNS set up as yours. Can you share how
>>>> your ZNS device is set up? I would like to know device size and QEMU -device
>>>> options, such as zoned.zone_size or zoned.max_active.
>>>
>>> It is as easy as the above make commands, and follow up login commands.
>>
>> I managed to run kdevops on my machine, and saw the I/O timeout and abort
>> messages. Using similar QEMU ZNS set up as kdevops, the messages were recreated
>> in my test environment also (the reset controller message and RCU relegated
>> error were not observed).
>>
> 
> Can you extract the relevant part of the QEMU parameters? I tried to
> reproduce this, but could not with a 10G, neither with discard=on or
> off, qcow2 or raw.
> 
>> [  214.134083][ T1028] run blktests zbd/005 at 2022-04-22 21:29:54
>> [  246.383978][ T1142] run blktests zbd/006 at 2022-04-22 21:30:26
>> [  276.784284][  T386] nvme nvme6: I/O 494 QID 4 timeout, aborting
>> [  276.788391][    C0] nvme nvme6: Abort status: 0x0
>>
>> The conditions to recreate the I/O timeout error are as follows:
>>
>> - Larger size of QEMU ZNS drive (10GB)
>>     - I use QEMU ZNS drives with 1GB size for my test runs. With this smaller
>>       size, the I/O timeout is not observed.
>>
>> - Issue zone reset command for all zones (with 'blkzone reset' command) just
>>   after zbd/005 completion to the drive.
>>     - The test case zbd/006 calls the zone reset command. It's enough to repeat
>>       zbd/005 and zone reset command to recreate the I/O timeout.
>>     - When 10 seconds sleep is added between zbd/005 run and zone reset command,
>>       the I/O timeout was not observed.
>>     - The data write pattern of zbd/005 looks important. Simple dd command to
>>       fill the device before 'blkzone reset' did not recreate the I/O timeout.
>>
>> I dug into QEMU code and found that it takes long time to complete zone reset
>> command with all zones flag. It takes more than 30 seconds and looks triggering
>> the I/O timeout in the block layer. The QEMU calls fallocate punch hole to the
>> backend file for each zone, so that data of each zone is zero cleared. Each
>> fallocate call is quick but between the calls, 0.7 second delay was observed
>> often. I guess some fsync or fdatasync operation would be running and causing
>> the delay.
>>
> 
> QEMU uses a write zeroes for zone reset. This is because of the
> requirement that block in empty zones must be considered deallocated.
> 
> When the drive is configured with `discard=on`, these write zeroes
> *should* turn into discards. However, I also tested with discard=off and
> I could not reproduce it.
> 
> It might make sense to force QEMU to use a discard for zone reset in all
> cases, and then change the reported DLFEAT appropriately, since we
> cannot guarantee zeroes then.

Why not punch a hole in the backing store file with fallocate() with mode
set to FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE ? That would be way
faster than a write zeroes which potentially actually do the writes,
leading to large command processing times. Reading in a hole in a file is
guaranteed to return zeroes, at least on Linux.

If the backingstore is a block device, then sure, write zeroes is the only
solution. Discard should be used with caution since that is a hint only
and some drives may actually do nothing.

> 
>> In other words, QEMU ZNS zone reset for all zones is so slow depending on the
>> ZNS drive's size and status. Performance improvement of zone reset is desired in
>> QEMU. I will seek for the chance to work on it.
>>
> 
> Currently, each zone is a separate discard/write zero call. It would be
> fair to special case all zones and do it in much larger chunks.

Yep, for a backing file, a full file fallocate(FALLOC_FL_PUNCH_HOLE) would
do nicely. Or truncate(0) + truncate(storage size) would do too.

Since resets are always all zones or one zone, special optimized handling
of the reset all case will definitely have huge benefits for that command.

-- 
Damien Le Moal
Western Digital Research