On 2020-07-28 15:52, Jason Dillaman wrote:
> On Tue, Jul 28, 2020 at 9:44 AM Johannes Naab
> <johannes.naab@xxxxxxxxxxxxxxxx> wrote:
>>
>> On 2020-07-28 14:49, Jason Dillaman wrote:
>>>> VM in libvirt with:
>>>> <pre>
>>>> <disk type='network' device='disk'>
>>>>   <driver name='qemu' type='raw' discard='unmap'/>
>>>>   <source protocol='rbd' name='pool/disk' index='4'>
>>>>     <!-- omitted -->
>>>>   </source>
>>>>   <iotune>
>>>>     <read_bytes_sec>209715200</read_bytes_sec>
>>>>     <write_bytes_sec>209715200</write_bytes_sec>
>>>>     <read_iops_sec>5000</read_iops_sec>
>>>>     <write_iops_sec>5000</write_iops_sec>
>>>>     <read_bytes_sec_max>314572800</read_bytes_sec_max>
>>>>     <write_bytes_sec_max>314572800</write_bytes_sec_max>
>>>>     <read_iops_sec_max>7500</read_iops_sec_max>
>>>>     <write_iops_sec_max>7500</write_iops_sec_max>
>>>>     <read_bytes_sec_max_length>60</read_bytes_sec_max_length>
>>>>     <write_bytes_sec_max_length>60</write_bytes_sec_max_length>
>>>>     <read_iops_sec_max_length>60</read_iops_sec_max_length>
>>>>     <write_iops_sec_max_length>60</write_iops_sec_max_length>
>>>>   </iotune>
>>>> </disk>
>>>> </pre>
>>>>
>>>> workload:
>>>> <pre>
>>>> fio --rw=write --name=test --size=10M
>>>> timeout 30s fio --rw=write --name=test --size=20G
>>>> timeout 3m fio --rw=write --name=test --size=20G --direct=1
>>>> timeout 1m fio --rw=randrw --name=test --size=20G --direct=1
>>>> timeout 10s fio --numjobs=8 --rw=randrw --name=test --size=1G --direct=1
>>>> # the backtraces are then observed while the following command is running
>>>> fio --ioengine=libaio --iodepth=16 --numjobs=8 --rw=randrw --name=test --size=1G --direct=1
>>>> </pre>
>>>
>>> I'm not sure I understand this workload. Are you running these 6 "fio"
>>> processes sequentially or concurrently? Does it only crash on that
>>> last one? Do you have "exclusive-lock" enabled on the image, since
>>> "--numjobs 8" would cause lots of lock fighting if it was enabled?
>>
>> The workload is a virtual machine with the above libvirt device
>> configuration. Within that virtual machine, the workload is run
>> sequentially (as the script crash.sh) on the xfs-formatted device.
>>
>> I.e. librbd/ceph should only see the one qemu process, which is then
>> running the workload.
>>
>> Only the last fio invocation causes the problems.
>> When skipping some (I did not test it exhaustively) of the fio
>> invocations, the crash is no longer reliably triggered.
>
> Hmm, all those crash backtraces are in
> "AioCompletion::complete_event_socket", but QEMU does not have any
> code that utilizes the event socket notification system. AFAIK, only
> the fio librbd engine has integrated with that callback system.
>

The host is an Ubuntu 20.04 with minor backports in libvirt
(6.0.0-0ubuntu8.1) and qemu (1:4.2-3ubuntu6.3) for specific CPU IDs, and
the ceph.com librbd1.

Upon further testing, changing the libvirt device configuration to:

> <driver name='qemu' type='raw' cache='none' io='native' discard='unmap'/>

(adding cache='none' and io='native'), the crash has so far not
resurfaced.

Based on my understanding, cache='writeback' and io='threads' are the
defaults when not otherwise configured. However, I do not yet fully
understand the dependencies between those options. Are the libvirt
<driver cache='...'> and librbd caches distinct, or do they refer to the
same cache
(http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-March/008486.html)?

>>> Are all the crashes seg faults? They all seem to hint that the
>>> internal ImageCtx instance was destroyed somehow while there was still
>>> in-flight IO.
>>> If the crashes appeared during the "timeout XYZ fio ..." calls,
>>> I would think it's highly likely that "fio" is incorrectly
>>> closing the RBD image while there was still in-flight IO via its
>>> signal handler.
>>
>> They are all segfaults of the qemu process, captured on the host system.
>> librbd should not see any image open/close during the workload run
>> within the VM.
>> The `timeout` is used to approximate the initial (manual) workload
>> generation, which caused a crash.
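
For completeness, a minimal sketch of what the crash.sh script mentioned
above could look like inside the VM. The shebang, the mount point
/mnt/test, and the cd guard are assumptions, not taken from the report;
the fio invocations themselves are the ones listed above.

<pre>
#!/bin/sh
# Run the workload steps sequentially on the xfs-formatted test device,
# approximating the original manual session with `timeout`.
cd /mnt/test || exit 1   # assumed mount point of the xfs device
fio --rw=write --name=test --size=10M
timeout 30s fio --rw=write --name=test --size=20G
timeout 3m fio --rw=write --name=test --size=20G --direct=1
timeout 1m fio --rw=randrw --name=test --size=20G --direct=1
timeout 10s fio --numjobs=8 --rw=randrw --name=test --size=1G --direct=1
# the qemu segfault on the host is observed while this command is running
fio --ioengine=libaio --iodepth=16 --numjobs=8 --rw=randrw --name=test --size=1G --direct=1
</pre>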