We have several OSDs that are crashing on start. We are running Nautilus 14.2.16; here is the relevant bit of the log:

   -10> 2021-01-08 14:52:38.800 7feec5f27c00 20 bluefs _read left 0xf1000 len 0x1000
    -9> 2021-01-08 14:52:38.800 7feec5f27c00 20 bluefs _read got 4096
    -8> 2021-01-08 14:52:38.800 7feec5f27c00 10 bluefs _replay 0x10f000: txn(seq 5972194 len 0x27 crc 0x9783dfc6)
    -7> 2021-01-08 14:52:38.800 7feec5f27c00 20 bluefs _replay 0x10f000: op_file_update file(ino 68757 size 0x6041f mtime 2021-01-07 21:25:57.793664 allocated 100000 extents [1:0x17e00000~100000])
    -6> 2021-01-08 14:52:38.800 7feec5f27c00 10 bluefs _read h 0x55abfb20a3c0 0x110000~1000 from file(ino 1 size 0x110000 mtime 0.000000 allocated 500000 extents [1:0x481700000~100000,1:0x7ad00000~400000])
    -5> 2021-01-08 14:52:38.800 7feec5f27c00 20 bluefs _read left 0xf0000 len 0x1000
    -4> 2021-01-08 14:52:38.800 7feec5f27c00 20 bluefs _read got 4096
    -3> 2021-01-08 14:52:38.800 7feec5f27c00 10 bluefs _replay 0x110000: txn(seq 5972195 len 0x116 crc 0xec6cec7)
    -2> 2021-01-08 14:52:38.800 7feec5f27c00 20 bluefs _replay 0x110000: op_dir_link db/109668.log to 68759
    -1> 2021-01-08 14:52:38.804 7feec5f27c00 -1 /build/ceph-14.2.16/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_replay(bool, bool)' thread 7feec5f27c00 time 2021-01-08 14:52:38.802560
/build/ceph-14.2.16/src/os/bluestore/BlueFS.cc: 1029: FAILED ceph_assert(file->fnode.ino)

 ceph version 14.2.16 (762032d6f509d5e7ee7dc008d80fe9c87086603c) nautilus (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x55abefa51fba]
 2: (ceph::__ceph_assertf_fail(char const*, char const*, int, char const*, char const*, ...)+0) [0x55abefa52195]
 3: (BlueFS::_replay(bool, bool)+0x4fa5) [0x55abf006f8f5]
 4: (BlueFS::mount()+0x229) [0x55abf006fd79]
 5: (BlueStore::_open_bluefs(bool)+0x78) [0x55abeff57958]
 6: (BlueStore::_open_db(bool, bool, bool)+0x8a3) [0x55abeff58e63]
 7: (BlueStore::_open_db_and_around(bool)+0x44) [0x55abeff6a1a4]
 8: (BlueStore::_mount(bool, bool)+0x584) [0x55abeffc0b64]
 9: (OSD::init()+0x3f3) [0x55abefb01db3]
 10: (main()+0x5214) [0x55abefa5acf4]
 11: (__libc_start_main()+0xe7) [0x7feec279cbf7]
 12: (_start()+0x2a) [0x55abefa8c72a]

     0> 2021-01-08 14:52:38.808 7feec5f27c00 -1 *** Caught signal (Aborted) **
 in thread 7feec5f27c00 thread_name:ceph-osd

 ceph version 14.2.16 (762032d6f509d5e7ee7dc008d80fe9c87086603c) nautilus (stable)

It seems like a version of this: https://tracker.ceph.com/issues/45519, and maybe this one: https://tracker.ceph.com/issues/21087. I haven't been able to get them to start with the stupid allocator.
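If I'm reading the trace right, the replay dies on the op_dir_link for ino 68759 because the replayed BlueFS log apparently never contained an op_file_update creating that ino, so ceph_assert(file->fnode.ino) fires. For what it's worth, this is roughly how I've been forcing the stupid allocator for an offline check with ceph-bluestore-tool before trying to start the daemon again; the OSD id and paths below are just an example for one of the affected OSDs, and I'm assuming the tool honours CEPH_ARGS for config overrides:

# offline fsck with the stupid allocator forced; OSD id and paths are examples
CEPH_ARGS="--bluestore-allocator=stupid --bluefs-allocator=stupid" \
ceph-bluestore-tool fsck \
  --path /var/lib/ceph/osd/ceph-12 \
  --log-file /var/log/ceph/osd.12-fsck.log --log-level 20

That's the same pair of allocator options as in the ceph.conf below, just applied to the offline tool.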
Here is the ceph.conf I've been using to try to get them to start; a thread seemed to indicate that increasing bluefs_max_log_runway would help.

[global]
fsid = <redacted>
mon initial members = <redacted>
public_network = 10.210.20.0/22
mon_host = 10.210.20.21,10.210.20.22,10.210.20.23
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
#kernel doesn't support all features, so disable this
#rbd default features = 3

[osd]
#set snap trim priority to lowest, 1
osd_snap_trim_priority = 1
osd_recovery_op_priority = 8
osd-max-backfills = 16
osd-max-backfills = 1
#keep allocator commented out normally; defaults to bitmap but
#may need stupid
#inspired by https://tracker.ceph.com/issues/45519
bluestore_allocator = stupid
bluefs_allocator = stupid
debug_bluefs = 20/20
#set to 3x default 4194304
bluefs_max_log_runway = 12582912

The cluster is having issues and it is urgent that we get these OSDs back up. The DB is on a shared NVMe device (81G) and the data disk is a 2.2 TB 2.5-inch enterprise drive. I'd be very grateful for any assistance.

Best,
Will
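P.S. In case the exact invocation matters: the crash above reproduces by starting one of the affected OSDs in the foreground, roughly like this (the OSD id and output path are just examples; the allocator and runway settings come from the ceph.conf above rather than the command line):

# run the OSD in the foreground, log to stderr, and capture the output
ceph-osd -d -i 12 --setuser ceph --setgroup ceph \
  --debug-bluefs 20/20 \
  2>&1 | tee /root/osd.12-start.log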