Hello all,

We are also facing this problem and would like to upgrade our clients to the specific release. @Jason, can you point us to the respective commit and the point release that contains the fix?

Thanks in advance for your help.

Best regards,
Michael

On 18.09.20, 15:12, "Lomayani S. Laizer" <lomlaizer@xxxxxxxxx> wrote:

Hello Jason,
I can confirm this release fixes the crashes. There has not been a single crash for the past 4 days.

On Mon, Sep 14, 2020 at 2:55 PM Jason Dillaman <jdillama@xxxxxxxxxx> wrote:
> On Mon, Sep 14, 2020 at 5:13 AM Lomayani S. Laizer <lomlaizer@xxxxxxxxx> wrote:
> >
> > Hello,
> > Last week I got time to try to debug the crashes of these VMs.
> >
> > The log below includes the rados debug output which I left out last time:
> >
> > https://storage.habari.co.tz/index.php/s/AQEJ7tQS7epC4Zn
> >
> > I have observed the following with these settings in OpenStack and Ceph:
> >
> > - disk_cachemodes="network=writeback" set in OpenStack and rbd_cache_policy = writearound set in Ceph -- crashes occur
> > - disk_cachemodes="network=writeback" set in OpenStack and rbd_cache_policy = writeback set in Ceph -- no crashes
> > - disk_cachemodes="none" set in OpenStack and rbd_cache_policy = writearound set in Ceph -- no crashes
> > - disk_cachemodes="none" set in OpenStack and rbd_cache_policy = writeback set in Ceph -- crashes occur
> >
> > (A sketch of where each of these settings lives follows below.)
> >
> > Is disk_cachemodes="network=writeback" no longer recommended in Octopus? I see it has been left out of the new Octopus documentation:
> >
> > https://ceph.readthedocs.io/en/latest/rbd/rbd-openstack/
>
> Can you try the latest development release of Octopus [1]? A librbd
> crash fix has been sitting in that branch for about a month now and
> will be included in the next point release.
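(For anyone reproducing the four combinations above, a rough sketch of where each setting lives. The file paths and section names are the usual ones for a Nova compute node with RBD-backed disks, not details taken from this thread -- adjust for your own deployment.)

    # /etc/nova/nova.conf on the compute node
    [libvirt]
    # toggled between "none" and "network=writeback" in the tests above
    disk_cachemodes = network=writeback

    # /etc/ceph/ceph.conf on the compute node (read by librbd inside the qemu process)
    [client]
    rbd cache = true
    # toggled between "writeback" and "writearound" in the tests above;
    # Octopus defaults to writearound
    rbd cache policy = writeback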
> > > > Hello,
> > > >
> > > > Below is the full debug log of the 2 minutes before the crash of the virtual machine. It can be downloaded from the URL below:
> > > >
> > > > https://storage.habari.co.tz/index.php/s/31eCwZbOoRTMpcU
> > >
> > > This log has rbd debug output, but not rados :(
> > >
> > > I guess you'll need to try and capture a coredump if you can't get a backtrace.
> > >
> > > I'd also suggest opening a tracker in case one of the rbd devs has any ideas on this, or has seen something similar. Without a backtrace or core it will be impossible to definitively identify the issue though.
> > >
> > > +1 to needing the backtrace. I don't see any indications of a problem in that log.
> > > >
> > > > apport.log:
> > > >
> > > > Wed May 13 09:35:30 2020: host pid 4440 crashed in a separate mount namespace, ignoring
> > > >
> > > > kernel.log:
> > > >
> > > > May 13 09:35:30 compute5 kernel: [123071.373217] fn-radosclient[4485]: segfault at 0 ip 00007f4c8c85d7ed sp 00007f4c66ffc470 error 4 in librbd.so.1.12.0[7f4c8c65a000+5cb000]
> > > > May 13 09:35:30 compute5 kernel: [123071.373228] Code: 8d 44 24 08 48 81 c3 d8 3e 00 00 49 21 f9 48 c1 e8 30 83 c0 01 48 c1 e0 30 48 89 02 48 8b 03 48 89 04 24 48 8b 34 24 48 21 fe <48> 8b 06 48 89 44 24 08 48 8b 44 24 08 48 8b 0b 48 21 f8 48 39 0c
> > > > May 13 09:35:33 compute5 kernel: [123074.832700] brqa72d845b-e9: port 1(tap33511c4d-2c) entered disabled state
> > > > May 13 09:35:33 compute5 kernel: [123074.838520] device tap33511c4d-2c left promiscuous mode
> > > > May 13 09:35:33 compute5 kernel: [123074.838527] brqa72d845b-e9: port 1(tap33511c4d-2c) entered disabled state
> > > >
> > > > syslog:
> > > >
> > > > May 13 09:35:30 compute5 kernel: [123071.373217] fn-radosclient[4485]: segfault at 0 ip 00007f4c8c85d7ed sp 00007f4c66ffc470 error 4 in librbd.so.1.12.0[7f4c8c65a000+5cb000]
> > > > May 13 09:35:30 compute5 kernel: [123071.373228] Code: 8d 44 24 08 48 81 c3 d8 3e 00 00 49 21 f9 48 c1 e8 30 83 c0 01 48 c1 e0 30 48 89 02 48 8b 03 48 89 04 24 48 8b 34 24 48 21 fe <48> 8b 06 48 89 44 24 08 48 8b 44 24 08 48 8b 0b 48 21 f8 48 39 0c
> > > > May 13 09:35:30 compute5 libvirtd[1844]: internal error: End of file from qemu monitor
> > > > May 13 09:35:33 compute5 systemd-networkd[1326]: tap33511c4d-2c: Link DOWN
> > > > May 13 09:35:33 compute5 systemd-networkd[1326]: tap33511c4d-2c: Lost carrier
> > > > May 13 09:35:33 compute5 kernel: [123074.832700] brqa72d845b-e9: port 1(tap33511c4d-2c) entered disabled state
> > > > May 13 09:35:33 compute5 kernel: [123074.838520] device tap33511c4d-2c left promiscuous mode
> > > > May 13 09:35:33 compute5 kernel: [123074.838527] brqa72d845b-e9: port 1(tap33511c4d-2c) entered disabled state
> > > > May 13 09:35:33 compute5 networkd-dispatcher[1614]: Failed to request link: No such device
> > > >
> > > > On Fri, May 8, 2020 at 5:40 AM Brad Hubbard <bhubbard@xxxxxxxxxx> wrote:
> > > > >
> > > > > On Fri, May 8, 2020 at 12:10 PM Lomayani S. Laizer <lomlaizer@xxxxxxxxx> wrote:
> > > > > >
> > > > > > Hello,
> > > > > > On my side, at the point of the VM crash the logs below appear. At the moment my debug level is set to 10; I will raise it to 20 for full debug. These crashes are random and so far happen on very busy VMs. Downgrading the clients on the host to Nautilus makes the crashes disappear.
> > > > >
> > > > > You could try adding debug_rados as well, but you may get a very large log, so keep an eye on things.
> > > > >
> > > > > > Qemu is not shutting down in general, because the other VMs on the same host continue working.
> > > > >
> > > > > A process can not reliably continue after encountering a segfault, so the qemu-kvm process must be ending and therefore it should be possible to capture a coredump with the right configuration.
> > > > >
> > > > > In the following example, if you were to search for pid 6060 you would find it is no longer running.
> > > > >
> > > > > > > > [ 7682.233684] fn-radosclient[6060]: segfault at 2b19 ip 00007f8165cc0a50 sp 00007f81397f6490 error 4 in librbd.so.1.12.0[7f8165ab4000+537000]
> > > > >
> > > > > Without a backtrace at a minimum it may be very difficult to work out what's going on with certainty. If you open a tracker for the issue though, maybe one of the devs specialising in rbd will have some feedback.
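(A minimal sketch of one way to follow this advice -- raising client-side logging and letting the qemu process dump core. The paths, the libvirtd unit name and the use of a systemd drop-in are assumptions for an Ubuntu 18.04 compute node, not details from this thread; adjust for your environment.)

    # /etc/ceph/ceph.conf on the compute node. librbd runs inside qemu, so these
    # only take effect for VMs (re)started afterwards. The log path must be
    # writable by the user the qemu processes run as.
    [client]
    debug rbd = 20
    debug rados = 20
    log file = /var/log/ceph/qemu-guest-$pid.log

    # Allow coredumps from qemu (spawned by libvirtd) and keep apport out of the way
    mkdir -p /var/crash /etc/systemd/system/libvirtd.service.d
    sysctl -w kernel.core_pattern=/var/crash/core.%e.%p.%t
    printf '[Service]\nLimitCORE=infinity\n' > /etc/systemd/system/libvirtd.service.d/core.conf
    systemctl daemon-reload && systemctl restart libvirtd   # already-running guests keep their old limit

    # After the next crash, with librbd/qemu debug symbols installed:
    gdb /usr/bin/qemu-system-x86_64 /var/crash/core.<name>.<pid>.<timestamp> \
        -ex 'thread apply all bt full' -ex quit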
> > > > > >
> > > > > > 2020-05-07T13:02:12.121+0300 7f88d57fa700 10 librbd::io::ReadResult: 0x7f88c80bfbf0 finish: got {} for [0,24576] bl 24576
> > > > > > 2020-05-07T13:02:12.193+0300 7f88d57fa700 10 librbd::io::ReadResult: 0x7f88c80f9330 finish: C_ObjectReadRequest: r=0
> > > > > > 2020-05-07T13:02:12.193+0300 7f88d57fa700 10 librbd::io::ReadResult: 0x7f88c80f9330 finish: got {} for [0,16384] bl 16384
> > > > > > 2020-05-07T13:02:28.694+0300 7f890ba90500 10 librbd::ImageState: 0x5569b5da9bb0 0x5569b5da9bb0 send_close_unlock
> > > > > > 2020-05-07T13:02:28.694+0300 7f890ba90500 10 librbd::ImageState: 0x5569b5da9bb0 0x5569b5da9bb0 send_close_unlock
> > > > > > 2020-05-07T13:02:28.694+0300 7f890ba90500 10 librbd::image::CloseRequest: 0x7f88c8175fd0 send_block_image_watcher
> > > > > > 2020-05-07T13:02:28.694+0300 7f890ba90500 10 librbd::ImageWatcher: 0x7f88c400dfe0 block_notifies
> > > > > > 2020-05-07T13:02:28.694+0300 7f890ba90500 5 librbd::Watcher: 0x7f88c400dfe0 block_notifies: blocked_count=1
> > > > > > 2020-05-07T13:02:28.694+0300 7f890ba90500 10 librbd::image::CloseRequest: 0x7f88c8175fd0 handle_block_image_watcher: r=0
> > > > > > 2020-05-07T13:02:28.694+0300 7f890ba90500 10 librbd::image::CloseRequest: 0x7f88c8175fd0 send_shut_down_update_watchers
> > > > > > 2020-05-07T13:02:28.694+0300 7f88d4ff9700 10 librbd::image::CloseRequest: 0x7f88c8175fd0 handle_shut_down_update_watchers: r=0
> > > > > > 2020-05-07T13:02:28.694+0300 7f88d4ff9700 10 librbd::image::CloseRequest: 0x7f88c8175fd0 send_shut_down_io_queue
> > > > > > 2020-05-07T13:02:28.694+0300 7f88d4ff9700 5 librbd::io::ImageRequestWQ: 0x7f88e8001570 shut_down: shut_down: in_flight=0
> > > > > > 2020-05-07T13:02:28.694+0300 7f88d4ff9700 10 librbd::image::CloseRequest: 0x7f88c8175fd0 handle_shut_down_io_queue: r=0
> > > > > > 2020-05-07T13:02:28.694+0300 7f88d4ff9700 10 librbd::image::CloseRequest: 0x7f88c8175fd0 send_shut_down_exclusive_lock
> > > > > > 2020-05-07T13:02:28.694+0300 7f88d4ff9700 10 librbd::ExclusiveLock: 0x7f88c4011ba0 shut_down
> > > > > > 2020-05-07T13:02:28.694+0300 7f88d4ff9700 10 librbd::ManagedLock: 0x7f88c4011bb8 shut_down:
> > > > > > 2020-05-07T13:02:28.694+0300 7f88d4ff9700 10 librbd::ManagedLock: 0x7f88c4011bb8 send_shutdown:
> > > > > > 2020-05-07T13:02:28.694+0300 7f88d4ff9700 10 librbd::ManagedLock: 0x7f88c4011bb8 send_shutdown_release:
> > > > > > 2020-05-07T13:02:28.694+0300 7f88d4ff9700 10 librbd::ExclusiveLock: 0x7f88c4011ba0 pre_release_lock_handler
> > > > > > 2020-05-07T13:02:28.694+0300 7f88d4ff9700 10 librbd::exclusive_lock::PreReleaseRequest: 0x7f88c80b6020 send_cancel_op_requests:
> > > > > > 2020-05-07T13:02:28.694+0300 7f88d4ff9700 10 librbd::exclusive_lock::PreReleaseRequest: 0x7f88c80b6020 handle_cancel_op_requests: r=0
> > > > > > 2020-05-07T13:02:28.694+0300 7f88d4ff9700 10 librbd::exclusive_lock::PreReleaseRequest: 0x7f88c80b6020 send_block_writes:
> > > > > > 2020-05-07T13:02:28.698+0300 7f88d4ff9700 5 librbd::io::ImageRequestWQ: 0x7f88e8001570 block_writes: 0x5569b5e1ffd0, num=1
> > > > > > 2020-05-07T13:02:28.698+0300 7f88d4ff9700 10 librbd::exclusive_lock::PreReleaseRequest: 0x7f88c80b6020 handle_block_writes: r=0
> > > > > > 2020-05-07T13:02:28.698+0300 7f88d4ff9700 10 librbd::exclusive_lock::PreReleaseRequest: 0x7f88c80b6020 send_wait_for_ops:
> > > > > > 2020-05-07T13:02:28.698+0300 7f88d4ff9700 10 librbd::exclusive_lock::PreReleaseRequest: 0x7f88c80b6020 handle_wait_for_ops:
> > > > > > 2020-05-07T13:02:28.698+0300 7f88d4ff9700 10 librbd::exclusive_lock::PreReleaseRequest: 0x7f88c80b6020 send_invalidate_cache:
> > > > > > 2020-05-07T13:02:28.698+0300 7f88d4ff9700 5 librbd::io::ObjectDispatcher: 0x5569b5dab700 invalidate_cache:
> > > > > > 2020-05-07T13:02:28.698+0300 7f88d4ff9700 10 librbd::exclusive_lock::PreReleaseRequest: 0x7f88c80b6020 handle_invalidate_cache: r=0
> > > > > > 2020-05-07T13:02:28.698+0300 7f88d4ff9700 10 librbd::exclusive_lock::PreReleaseRequest: 0x7f88c80b6020 send_flush_notifies:
> > > > > > 2020-05-07T13:02:28.698+0300 7f88d4ff9700 10 librbd::exclusive_lock::PreReleaseRequest: 0x7f88c80b6020 handle_flush_notifies:
> > > > > > 2020-05-07T13:02:28.698+0300 7f88d4ff9700 10 librbd::exclusive_lock::PreReleaseRequest: 0x7f88c80b6020 send_close_object_map:
> > > > > > 2020-05-07T13:02:28.698+0300 7f88d4ff9700 10 librbd::object_map::UnlockRequest: 0x7f88c807a450 send_unlock: oid=rbd_object_map.2f18f2a67fad72
> > > > > > 2020-05-07T13:02:28.702+0300 7f88d57fa700 10 librbd::object_map::UnlockRequest: 0x7f88c807a450 handle_unlock: r=0
> > > > > > 2020-05-07T13:02:28.702+0300 7f88d57fa700 10 librbd::exclusive_lock::PreReleaseRequest: 0x7f88c80b6020 handle_close_object_map: r=0
> > > > > > 2020-05-07T13:02:28.702+0300 7f88d57fa700 10 librbd::exclusive_lock::PreReleaseRequest: 0x7f88c80b6020 send_unlock:
> > > > > > 2020-05-07T13:02:28.702+0300 7f88d4ff9700 10 librbd::ManagedLock: 0x7f88c4011bb8 handle_shutdown_pre_release: r=0
> > > > > > 2020-05-07T13:02:28.702+0300 7f88d4ff9700 10 librbd::managed_lock::ReleaseRequest: 0x7f88c80b68a0 send_unlock: entity=client.58292796, cookie=auto 140225447738256
> > > > > > 2020-05-07T13:02:28.702+0300 7f88d57fa700 10 librbd::managed_lock::ReleaseRequest: 0x7f88c80b68a0 handle_unlock: r=0
> > > > > > 2020-05-07T13:02:28.702+0300 7f88d4ff9700 10 librbd::ExclusiveLock: 0x7f88c4011ba0 post_release_lock_handler: r=0 shutting_down=1
> > > > > > 2020-05-07T13:02:28.702+0300 7f88d4ff9700 5 librbd::io::ImageRequestWQ: 0x7f88e8001570 unblock_writes: 0x5569b5e1ffd0, num=0
> > > > > > 2020-05-07T13:02:28.702+0300 7f88d4ff9700 10 librbd::ImageWatcher: 0x7f88c400dfe0 notify released lock
> > > > > > 2020-05-07T13:02:28.702+0300 7f88d4ff9700 10 librbd::ImageWatcher: 0x7f88c400dfe0 current lock owner: [0,0]
> > > > > > 2020-05-07T13:02:28.702+0300 7f88d4ff9700 10 librbd::ManagedLock: 0x7f88c4011bb8 handle_shutdown_post_release: r=0
> > > > > > 2020-05-07T13:02:28.702+0300 7f88d4ff9700 10 librbd::ManagedLock: 0x7f88c4011bb8 wait_for_tracked_ops: r=0
> > > > > > 2020-05-07T13:02:28.702+0300 7f88d4ff9700 10 librbd::ManagedLock: 0x7f88c4011bb8 complete_shutdown: r=0
> > > > > > 2020-05-07T13:02:28.702+0300 7f88d4ff9700 10 librbd::image::CloseRequest: 0x7f88c8175fd0 handle_shut_down_exclusive_lock: r=0
> > > > > > 2020-05-07T13:02:28.702+0300 7f88d4ff9700 10 librbd::image::CloseRequest: 0x7f88c8175fd0 send_unregister_image_watcher
> > > > > > 2020-05-07T13:02:28.702+0300 7f88d4ff9700 10 librbd::ImageWatcher: 0x7f88c400dfe0 unregistering image watcher
> > > > > > 2020-05-07T13:02:28.702+0300 7f88d4ff9700 10 librbd::Watcher: 0x7f88c400dfe0 unregister_watch:
> > > > > > 2020-05-07T13:02:28.702+0300 7f88d57fa700 5 librbd::Watcher: 0x7f88c400dfe0 notifications_blocked: blocked=1
> > > > > > 2020-05-07T13:02:28.706+0300 7f88ceffd700 10 librbd::image::CloseRequest: 0x7f88c8175fd0 handle_unregister_image_watcher: r=0
> > > > > > 2020-05-07T13:02:28.706+0300 7f88ceffd700 10 librbd::image::CloseRequest: 0x7f88c8175fd0 send_flush_readahead
> > > > > > 2020-05-07T13:02:28.706+0300 7f88d4ff9700 10 librbd::image::CloseRequest: 0x7f88c8175fd0 handle_flush_readahead: r=0
> > > > > > 2020-05-07T13:02:28.706+0300 7f88d4ff9700 10 librbd::image::CloseRequest: 0x7f88c8175fd0 send_shut_down_object_dispatcher
> > > > > > 2020-05-07T13:02:28.706+0300 7f88d4ff9700 5 librbd::io::ObjectDispatcher: 0x5569b5dab700 shut_down:
> > > > > > 2020-05-07T13:02:28.706+0300 7f88d4ff9700 5 librbd::io::ObjectDispatch: 0x5569b5ee8360 shut_down:
> > > > > > 2020-05-07T13:02:28.706+0300 7f88d4ff9700 5 librbd::io::SimpleSchedulerObjectDispatch: 0x7f88c4013ce0 shut_down:
> > > > > > 2020-05-07T13:02:28.706+0300 7f88d4ff9700 5 librbd::cache::WriteAroundObjectDispatch: 0x7f88c8003780 shut_down:
> > > > > > 2020-05-07T13:02:28.706+0300 7f88d4ff9700 10 librbd::image::CloseRequest: 0x7f88c8175fd0 handle_shut_down_object_dispatcher: r=0
> > > > > > 2020-05-07T13:02:28.706+0300 7f88d4ff9700 10 librbd::image::CloseRequest: 0x7f88c8175fd0 send_flush_op_work_queue
> > > > > > 2020-05-07T13:02:28.706+0300 7f88d4ff9700 10 librbd::image::CloseRequest: 0x7f88c8175fd0 handle_flush_op_work_queue: r=0
> > > > > > 2020-05-07T13:02:28.706+0300 7f88d4ff9700 10 librbd::image::CloseRequest: 0x7f88c8175fd0 handle_flush_image_watcher: r=0
> > > > > > 2020-05-07T13:02:28.706+0300 7f88d4ff9700 10 librbd::ImageState: 0x5569b5da9bb0 0x5569b5da9bb0 handle_close: r=0
> > > > > >
> > > > > > On Fri, May 8, 2020 at 12:40 AM Brad Hubbard <bhubbard@xxxxxxxxxx> wrote:
> > > > > > >
> > > > > > > On Fri, May 8, 2020 at 3:42 AM Erwin Lubbers <erwin@xxxxxxxxxxx> wrote:
> > > > > > > >
> > > > > > > > Hi,
> > > > > > > > Did anyone find a way to resolve this problem? I'm seeing the same thing on a clean Octopus Ceph installation on Ubuntu 18 with an Octopus-compiled KVM server running on CentOS 7.8. The KVM machine shows:
> > > > > > > >
> > > > > > > > [ 7682.233684] fn-radosclient[6060]: segfault at 2b19 ip 00007f8165cc0a50 sp 00007f81397f6490 error 4 in librbd.so.1.12.0[7f8165ab4000+537000]
> > > > > > >
> > > > > > > Are you able to either capture a backtrace from a coredump, or set up logging and hopefully capture a backtrace that way?
> > > > > > >
> > > > > > > > Ceph has been healthy and stable for a few weeks, and I did not get these messages while running on KVM compiled with the Luminous libraries.
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > Erwin
> > > > > > >
> > > > > > > --
> > > > > > > Cheers,
> > > > > > > Brad
>
> [1] https://shaman.ceph.com/repos/ceph/octopus/
>
> --
> Jason

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx