Re: repeatable crash in librbd1

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue, Jul 28, 2020 at 7:19 AM Johannes Naab
<johannes.naab@xxxxxxxxxxxxxxxx> wrote:
>
> Hi,
>
> we observe crashes in librbd1 on specific workloads in virtual machines
> on Ubuntu 20.04 hosts with librbd1=15.2.4-1focal.
>
> The changes in
> https://github.com/ceph/ceph/commit/50694f790245ca90a3b8a644da7b128a7a148cc6
> could be related, but do not easily apply against v15.2.4.

I don't think that commit is related. That commit fixed an issue in
the python-rbd test suite where the "wait_for_complete_and_cb" API
method could return before the callback was invoked when running with
multiple IO threads (Octopus is locked to a single thread).

> We have collected several backtraces, using a reliable local
> reproducer.
>
> (This is a resent of the message stuck in the moderation queue since yesterday.)
>
> Best regards
> Johannes
>
>
> VM in libvirt with:
> <pre>
>     <disk type='network' device='disk'>
>       <driver name='qemu' type='raw' discard='unmap'/>
>       <source protocol='rbd' name='pool/disk' index='4'>
>         <!-- omitted -->
>       </source>
>       <iotune>
>         <read_bytes_sec>209715200</read_bytes_sec>
>         <write_bytes_sec>209715200</write_bytes_sec>
>         <read_iops_sec>5000</read_iops_sec>
>         <write_iops_sec>5000</write_iops_sec>
>         <read_bytes_sec_max>314572800</read_bytes_sec_max>
>         <write_bytes_sec_max>314572800</write_bytes_sec_max>
>         <read_iops_sec_max>7500</read_iops_sec_max>
>         <write_iops_sec_max>7500</write_iops_sec_max>
>         <read_bytes_sec_max_length>60</read_bytes_sec_max_length>
>         <write_bytes_sec_max_length>60</write_bytes_sec_max_length>
>         <read_iops_sec_max_length>60</read_iops_sec_max_length>
>         <write_iops_sec_max_length>60</write_iops_sec_max_length>
>       </iotune>
>     </disk>
> </pre>
>
> workload:
> <pre>
> fio --rw=write --name=test --size=10M
> timeout 30s fio --rw=write --name=test --size=20G
> timeout 3m fio --rw=write --name=test --size=20G --direct=1
> timeout 1m fio --rw=randrw --name=test --size=20G --direct=1
> timeout 10s fio --numjobs=8 --rw=randrw --name=test --size=1G --direct=1
> # the backtraces are then observed while the following command is running
> fio --ioengine=libaio --iodepth=16 --numjobs=8 --rw=randrw --name=test --size=1G --direct=1

I'm not sure I understand this workload. Are you running these 6 "fio"
processes sequentially or concurrently? Does it only crash on that
last one? Do you have "exclusive-lock" enabled on the image since
"--numjobs 8" would cause lots of lock fighting if it was enabled.

Are all the crashes seg faults? They all seem to hint that the
internal ImageCtx instance was destroyed somehow while there was still
in-flight IO. If the crashes appeared during the "timeout XYZ fio ..."
calls, I would think it's highly likely that "fio" is incorrectly
closing the RBD image while there was still in-flight IO via its
signal handler.

> </pre>
>
>
> observed stack traces
>
> three times:
> <pre>
> #0  librbd::io::AioCompletion::complete_event_socket (this=this@entry=0x557f633e9400) at ./src/common/event_socket.h:32
> #1  0x00007ffb9740ba34 in librbd::io::AioCompletion::complete_external_callback (this=this@entry=0x557f633f7600) at ./src/librbd/io/AioCompletion.cc:262
> #2  0x00007ffb9740ce98 in librbd::io::AioCompletion::complete (this=0x557f633f7600) at ./src/librbd/io/AioCompletion.cc:104
> #3  0x00007ffb9740d1a0 in librbd::io::AioCompletion::complete_request (this=0x557f633f7600, r=r@entry=4096) at ./src/librbd/io/AioCompletion.cc:229
> #4  0x00007ffb9742fdca in librbd::io::ReadResult::C_ObjectReadRequest::finish (this=0x7ffb68364a60, r=4096) at ./src/librbd/io/ReadResult.cc:155
> #5  0x00007ffb9728334d in Context::complete (this=0x7ffb68364a60, r=<optimized out>) at ./src/include/Context.h:77
> #6  0x00007ffb9742c7d9 in librbd::io::ObjectDispatchSpec::C_Dispatcher::finish (this=0x7ffb6832bed0, r=<optimized out>) at ./src/librbd/io/ObjectDispatchSpec.cc:32
> #7  0x00007ffb9742c735 in librbd::io::ObjectDispatchSpec::C_Dispatcher::complete (this=<optimized out>, r=<optimized out>) at ./src/librbd/io/ObjectDispatchSpec.cc:23
> #8  0x00007ffb9755e942 in librbd::io::ObjectRequest<librbd::ImageCtx>::finish (this=this@entry=0x7ffb683394a0, r=r@entry=0) at ./src/include/Context.h:78
> #9  0x00007ffb97562e3b in librbd::io::ObjectReadRequest<librbd::ImageCtx>::handle_read_object (this=0x7ffb683394a0, r=0) at ./src/log/SubsystemMap.h:72
> #10 0x00007ffb970f4177 in librados::C_AioComplete::finish (this=0x7ffb68362e30, r=<optimized out>) at ./src/librados/AioCompletionImpl.h:140
> #11 0x00007ffb970afd1d in Context::complete (this=0x7ffb68362e30, r=<optimized out>) at ./src/include/Context.h:77
> #12 0x00007ffb8765791d in Finisher::finisher_thread_entry (this=0x557f63387a30) at ./src/common/Finisher.cc:66
> #13 0x00007ffb9946f609 in start_thread (arg=<optimized out>) at pthread_create.c:477
> #14 0x00007ffb99396103 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
> </pre>
>
> once:
> <pre>
> #0  0x00007fe7a972169d in std::atomic<boost::lockfree::detail::tagged_ptr<boost::lockfree::queue<librbd::io::AioCompletion*, boost::lockfree::allocator<std::allocator<void> > >::node> >::load (__m=std::memory_order_acquire, this=<optimized out>) at /usr/include/c++/9/atomic:250
> #1  boost::lockfree::queue<librbd::io::AioCompletion*, boost::lockfree::allocator<std::allocator<void> > >::do_push<false> (t=<optimized out>, this=<optimized out>) at ./obj-x86_64-linux-gnu/boost/include/boost/lockfree/queue.hpp:311
> #2  boost::lockfree::queue<librbd::io::AioCompletion*, boost::lockfree::allocator<std::allocator<void> > >::push (t=<optimized out>, this=<optimized out>) at ./obj-x86_64-linux-gnu/boost/include/boost/lockfree/queue.hpp:280
> #3  librbd::io::AioCompletion::complete_event_socket (this=this@entry=0x5613db630440) at ./src/librbd/io/AioCompletion.cc:276
> #4  0x00007fe7a9721a34 in librbd::io::AioCompletion::complete_external_callback (this=this@entry=0x5613dbbfa2c0) at ./src/librbd/io/AioCompletion.cc:262
> #5  0x00007fe7a9722e98 in librbd::io::AioCompletion::complete (this=0x5613dbbfa2c0) at ./src/librbd/io/AioCompletion.cc:104
> #6  0x00007fe7a97231a0 in librbd::io::AioCompletion::complete_request (this=0x5613dbbfa2c0, r=r@entry=4096) at ./src/librbd/io/AioCompletion.cc:229
> #7  0x00007fe7a9745dca in librbd::io::ReadResult::C_ObjectReadRequest::finish (this=0x5613dc6a5c90, r=4096) at ./src/librbd/io/ReadResult.cc:155
> #8  0x00007fe7a959934d in Context::complete (this=0x5613dc6a5c90, r=<optimized out>) at ./src/include/Context.h:77
> #9  0x00007fe7a97427d9 in librbd::io::ObjectDispatchSpec::C_Dispatcher::finish (this=0x5613db8bae40, r=<optimized out>) at ./src/librbd/io/ObjectDispatchSpec.cc:32
> #10 0x00007fe7a9742735 in librbd::io::ObjectDispatchSpec::C_Dispatcher::complete (this=<optimized out>, r=<optimized out>) at ./src/librbd/io/ObjectDispatchSpec.cc:23
> #11 0x00007fe7a9874942 in librbd::io::ObjectRequest<librbd::ImageCtx>::finish (this=this@entry=0x5613dbbfc900, r=r@entry=0) at ./src/include/Context.h:78
> #12 0x00007fe7a9878e3b in librbd::io::ObjectReadRequest<librbd::ImageCtx>::handle_read_object (this=0x5613dbbfc900, r=0) at ./src/log/SubsystemMap.h:72
> #13 0x00007fe7a940a177 in librados::C_AioComplete::finish (this=0x5613db85e6d0, r=<optimized out>) at ./src/librados/AioCompletionImpl.h:140
> #14 0x00007fe7a93c5d1d in Context::complete (this=0x5613db85e6d0, r=<optimized out>) at ./src/include/Context.h:77
> #15 0x00007fe79b65791d in Finisher::finisher_thread_entry (this=0x5613db369ee0) at ./src/common/Finisher.cc:66
> #16 0x00007fe7ab788609 in start_thread (arg=<optimized out>) at pthread_create.c:477
> #17 0x00007fe7ab6af103 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
> </pre>
>
> twice:
> <pre>
> #0  boost::lockfree::detail::tagged_ptr<boost::lockfree::detail::freelist_stack<boost::lockfree::queue<librbd::io::AioCompletion*, boost::lockfree::allocator<std::allocator<void> > >::node, std::allocator<boost::lockfree::queue<librbd::io::AioCompletion*, boost::lockfree::allocator<std::allocator<void> > >::node> >::freelist_node>::extract_ptr (i=<error reading variable>) at ./obj-x86_64-linux-gnu/boost/include/boost/lockfree/detail/tagged_ptr_ptrcompression.hpp:113
> #1  boost::lockfree::detail::tagged_ptr<boost::lockfree::detail::freelist_stack<boost::lockfree::queue<librbd::io::AioCompletion*, boost::lockfree::allocator<std::allocator<void> > >::node, std::allocator<boost::lockfree::queue<librbd::io::AioCompletion*, boost::lockfree::allocator<std::allocator<void> > >::node> >::freelist_node>::get_ptr (this=0xfffff40ee9c0) at ./obj-x86_64-linux-gnu/boost/include/boost/lockfree/detail/tagged_ptr_ptrcompression.hpp:115
> #2  boost::lockfree::detail::freelist_stack<boost::lockfree::queue<librbd::io::AioCompletion*, boost::lockfree::allocator<std::allocator<void> > >::node, std::allocator<boost::lockfree::queue<librbd::io::AioCompletion*, boost::lockfree::allocator<std::allocator<void> > >::node> >::allocate_impl<false> (this=<optimized out>) at ./obj-x86_64-linux-gnu/boost/include/boost/lockfree/detail/freelist.hpp:187
> #3  boost::lockfree::detail::freelist_stack<boost::lockfree::queue<librbd::io::AioCompletion*, boost::lockfree::allocator<std::allocator<void> > >::node, std::allocator<boost::lockfree::queue<librbd::io::AioCompletion*, boost::lockfree::allocator<std::allocator<void> > >::node> >::allocate<true, false> (this=<optimized out>) at ./obj-x86_64-linux-gnu/boost/include/boost/lockfree/detail/freelist.hpp:168
> #4  boost::lockfree::detail::freelist_stack<boost::lockfree::queue<librbd::io::AioCompletion*, boost::lockfree::allocator<std::allocator<void> > >::node, std::allocator<boost::lockfree::queue<librbd::io::AioCompletion*, boost::lockfree::allocator<std::allocator<void> > >::node> >::construct<true, false, librbd::io::AioCompletion*, boost::lockfree::queue<librbd::io::AioCompletion*, boost::lockfree::allocator<std::allocator<void> > >::node*> (arg2=<optimized out>, arg1=<optimized out>, this=<optimized out>) at ./obj-x86_64-linux-gnu/boost/include/boost/lockfree/detail/freelist.hpp:100
> #5  boost::lockfree::queue<librbd::io::AioCompletion*, boost::lockfree::allocator<std::allocator<void> > >::do_push<false> (t=<optimized out>, this=<optimized out>) at ./obj-x86_64-linux-gnu/boost/include/boost/lockfree/queue.hpp:302
> #6  boost::lockfree::queue<librbd::io::AioCompletion*, boost::lockfree::allocator<std::allocator<void> > >::push (t=<optimized out>, this=<optimized out>) at ./obj-x86_64-linux-gnu/boost/include/boost/lockfree/queue.hpp:280
> #7  librbd::io::AioCompletion::complete_event_socket (this=this@entry=0x560f5e10c000) at ./src/librbd/io/AioCompletion.cc:276
> #8  0x00007f8c8bcada34 in librbd::io::AioCompletion::complete_external_callback (this=this@entry=0x560f5edd5d30) at ./src/librbd/io/AioCompletion.cc:262
> #9  0x00007f8c8bcaee98 in librbd::io::AioCompletion::complete (this=0x560f5edd5d30) at ./src/librbd/io/AioCompletion.cc:104
> #10 0x00007f8c8bcaf1a0 in librbd::io::AioCompletion::complete_request (this=0x560f5edd5d30, r=r@entry=4096) at ./src/librbd/io/AioCompletion.cc:229
> #11 0x00007f8c8bcd1dca in librbd::io::ReadResult::C_ObjectReadRequest::finish (this=0x7f8bc4342280, r=4096) at ./src/librbd/io/ReadResult.cc:155
> #12 0x00007f8c8bb2534d in Context::complete (this=0x7f8bc4342280, r=<optimized out>) at ./src/include/Context.h:77
> #13 0x00007f8c8bcce7d9 in librbd::io::ObjectDispatchSpec::C_Dispatcher::finish (this=0x7f8bc4326820, r=<optimized out>) at ./src/librbd/io/ObjectDispatchSpec.cc:32
> #14 0x00007f8c8bcce735 in librbd::io::ObjectDispatchSpec::C_Dispatcher::complete (this=<optimized out>, r=<optimized out>) at ./src/librbd/io/ObjectDispatchSpec.cc:23
> #15 0x00007f8c8be00942 in librbd::io::ObjectRequest<librbd::ImageCtx>::finish (this=this@entry=0x7f8bc42b2a30, r=r@entry=0) at ./src/include/Context.h:78
> #16 0x00007f8c8be04e3b in librbd::io::ObjectReadRequest<librbd::ImageCtx>::handle_read_object (this=0x7f8bc42b2a30, r=0) at ./src/log/SubsystemMap.h:72
> #17 0x00007f8c8b996177 in librados::C_AioComplete::finish (this=0x7f8bc43423b0, r=<optimized out>) at ./src/librados/AioCompletionImpl.h:140
> #18 0x00007f8c8b951d1d in Context::complete (this=0x7f8bc43423b0, r=<optimized out>) at ./src/include/Context.h:77
> #19 0x00007f8c82f4291d in Finisher::finisher_thread_entry (this=0x560f5f44b180) at ./src/common/Finisher.cc:66
> #20 0x00007f8c91e1a609 in start_thread (arg=<optimized out>) at pthread_create.c:477
> #21 0x00007f8c91d41103 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
> </pre>
>
> once:
> <pre>
> #0  boost::lockfree::detail::tagged_ptr<boost::lockfree::detail::freelist_stack<boost::lockfree::queue<librbd::io::AioCompletion*, boost::lockfree::allocator<std::allocator<void> > >::node, std::allocator<boost::lockfree::queue<librbd::io::AioCompletion*, boost::lockfree::allocator<std::allocator<void> > >::node> >::freelist_node>::extract_ptr (i=<error reading variable>) at ./obj-x86_64-linux-gnu/boost/include/boost/lockfree/detail/tagged_ptr_ptrcompression.hpp:113
> #1  boost::lockfree::detail::tagged_ptr<boost::lockfree::detail::freelist_stack<boost::lockfree::queue<librbd::io::AioCompletion*, boost::lockfree::allocator<std::allocator<void> > >::node, std::allocator<boost::lockfree::queue<librbd::io::AioCompletion*, boost::lockfree::allocator<std::allocator<void> > >::node> >::freelist_node>::get_ptr (this=0xfffff40ee9c0) at ./obj-x86_64-linux-gnu/boost/include/boost/lockfree/detail/tagged_ptr_ptrcompression.hpp:115
> #2  boost::lockfree::detail::freelist_stack<boost::lockfree::queue<librbd::io::AioCompletion*, boost::lockfree::allocator<std::allocator<void> > >::node, std::allocator<boost::lockfree::queue<librbd::io::AioCompletion*, boost::lockfree::allocator<std::allocator<void> > >::node> >::allocate_impl<false> (this=<optimized out>) at ./obj-x86_64-linux-gnu/boost/include/boost/lockfree/detail/freelist.hpp:187
> #3  boost::lockfree::detail::freelist_stack<boost::lockfree::queue<librbd::io::AioCompletion*, boost::lockfree::allocator<std::allocator<void> > >::node, std::allocator<boost::lockfree::queue<librbd::io::AioCompletion*, boost::lockfree::allocator<std::allocator<void> > >::node> >::allocate<true, false> (this=<optimized out>) at ./obj-x86_64-linux-gnu/boost/include/boost/lockfree/detail/freelist.hpp:168
> #4  boost::lockfree::detail::freelist_stack<boost::lockfree::queue<librbd::io::AioCompletion*, boost::lockfree::allocator<std::allocator<void> > >::node, std::allocator<boost::lockfree::queue<librbd::io::AioCompletion*, boost::lockfree::allocator<std::allocator<void> > >::node> >::construct<true, false, librbd::io::AioCompletion*, boost::lockfree::queue<librbd::io::AioCompletion*, boost::lockfree::allocator<std::allocator<void> > >::node*> (arg2=<optimized out>, arg1=<optimized out>, this=<optimized out>) at ./obj-x86_64-linux-gnu/boost/include/boost/lockfree/detail/freelist.hpp:100
> #5  boost::lockfree::queue<librbd::io::AioCompletion*, boost::lockfree::allocator<std::allocator<void> > >::do_push<false> (t=<optimized out>, this=<optimized out>) at ./obj-x86_64-linux-gnu/boost/include/boost/lockfree/queue.hpp:302
> #6  boost::lockfree::queue<librbd::io::AioCompletion*, boost::lockfree::allocator<std::allocator<void> > >::push (t=<optimized out>, this=<optimized out>) at ./obj-x86_64-linux-gnu/boost/include/boost/lockfree/queue.hpp:280
> #7  librbd::io::AioCompletion::complete_event_socket (this=this@entry=0x55d1b89113c0) at ./src/librbd/io/AioCompletion.cc:276
> #8  0x00007f275f6d7a34 in librbd::io::AioCompletion::complete_external_callback (this=this@entry=0x55d1b8a7a0d0) at ./src/librbd/io/AioCompletion.cc:262
> #9  0x00007f275f6d8e98 in librbd::io::AioCompletion::complete (this=0x55d1b8a7a0d0) at ./src/librbd/io/AioCompletion.cc:104
> #10 0x00007f275f6d91a0 in librbd::io::AioCompletion::complete_request (this=0x55d1b8a7a0d0, r=<optimized out>) at ./src/librbd/io/AioCompletion.cc:229
> #11 0x00007f275f54f34d in Context::complete (this=0x7f268833f850, r=<optimized out>) at ./src/include/Context.h:77
> #12 0x00007f274f6b647b in ThreadPool::worker (this=0x55d1b9d04860, wt=<optimized out>) at ./src/common/WorkQueue.cc:118
> #13 0x00007f274f6b7545 in ThreadPool::WorkThread::entry (this=<optimized out>) at ./src/common/WorkQueue.h:466
> #14 0x00007f276173e609 in start_thread (arg=<optimized out>) at pthread_create.c:477
> #15 0x00007f2761665103 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
> </pre>
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>


-- 
Jason
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



[Index of Archives]     [Information on CEPH]     [Linux Filesystem Development]     [Ceph Development]     [Ceph Large]     [Ceph Dev]     [Linux USB Development]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [xfs]


  Powered by Linux