On 2020-07-28 14:49, Jason Dillaman wrote:
>> VM in libvirt with:
>> <pre>
>> <disk type='network' device='disk'>
>>   <driver name='qemu' type='raw' discard='unmap'/>
>>   <source protocol='rbd' name='pool/disk' index='4'>
>>     <!-- omitted -->
>>   </source>
>>   <iotune>
>>     <read_bytes_sec>209715200</read_bytes_sec>
>>     <write_bytes_sec>209715200</write_bytes_sec>
>>     <read_iops_sec>5000</read_iops_sec>
>>     <write_iops_sec>5000</write_iops_sec>
>>     <read_bytes_sec_max>314572800</read_bytes_sec_max>
>>     <write_bytes_sec_max>314572800</write_bytes_sec_max>
>>     <read_iops_sec_max>7500</read_iops_sec_max>
>>     <write_iops_sec_max>7500</write_iops_sec_max>
>>     <read_bytes_sec_max_length>60</read_bytes_sec_max_length>
>>     <write_bytes_sec_max_length>60</write_bytes_sec_max_length>
>>     <read_iops_sec_max_length>60</read_iops_sec_max_length>
>>     <write_iops_sec_max_length>60</write_iops_sec_max_length>
>>   </iotune>
>> </disk>
>> </pre>
>>
>> workload:
>> <pre>
>> fio --rw=write --name=test --size=10M
>> timeout 30s fio --rw=write --name=test --size=20G
>> timeout 3m fio --rw=write --name=test --size=20G --direct=1
>> timeout 1m fio --rw=randrw --name=test --size=20G --direct=1
>> timeout 10s fio --numjobs=8 --rw=randrw --name=test --size=1G --direct=1
>> # the backtraces are then observed while the following command is running
>> fio --ioengine=libaio --iodepth=16 --numjobs=8 --rw=randrw --name=test --size=1G --direct=1
>> </pre>
>
> I'm not sure I understand this workload. Are you running these 6 "fio"
> processes sequentially or concurrently? Does it only crash on that
> last one? Do you have "exclusive-lock" enabled on the image since
> "--numjobs 8" would cause lots of lock fighting if it was enabled.

The workload is a virtual machine with the above libvirt device configuration. Within that virtual machine, the fio invocations are run sequentially (as a script, crash.sh) on the xfs-formatted device; i.e. librbd/ceph should only see the one qemu process, which is running the workload. Only the last fio invocation causes the problems. When some of the fio invocations are skipped (I did not test this exhaustively), the crash is no longer reliably triggered. A sketch of the script is included at the end of this message.

> Are all the crashes seg faults? They all seem to hint that the
> internal ImageCtx instance was destroyed somehow while there was still
> in-flight IO. If the crashes appeared during the "timeout XYZ fio ..."
> calls, I would think it's highly likely that "fio" is incorrectly
> closing the RBD image while there was still in-flight IO via its
> signal handler.

They are all segfaults of the qemu process, captured on the host system. librbd should not see any image open/close during the workload run inside the VM. The `timeout` calls are used to approximate the initial (manual) workload generation, which caused a crash.
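For reference, a minimal sketch of what crash.sh does inside the guest, assuming the fio invocations quoted above are executed sequentially from the mount point of the xfs-formatted test device (the /mnt/test path is an assumption, not the actual path used):

<pre>
#!/bin/sh
# crash.sh -- sketch only: runs the reported fio invocations sequentially
# inside the guest. /mnt/test stands in for the mount point of the
# xfs-formatted, RBD-backed disk (assumed path).
set -x
cd /mnt/test || exit 1

fio --rw=write --name=test --size=10M
timeout 30s fio --rw=write --name=test --size=20G
timeout 3m fio --rw=write --name=test --size=20G --direct=1
timeout 1m fio --rw=randrw --name=test --size=20G --direct=1
timeout 10s fio --numjobs=8 --rw=randrw --name=test --size=1G --direct=1
# the qemu segfaults are observed on the host while this last run is active
fio --ioengine=libaio --iodepth=16 --numjobs=8 --rw=randrw --name=test --size=1G --direct=1
</pre>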