Hello everyone,

I have been running Ceph for the last 2 years and it has been a great experience so far, but yesterday I started encountering some strange issues. All OSDs are part of an erasure-coded pool with k=8, m=2 and a host failure domain. Neither yesterday's nor today's crashing OSDs show any symptoms of physical problems: there are no errors in their SMART data, and the drives can be read without any I/O errors.
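For context, by "no I/O errors" I mean that checks along these lines come back clean on every affected drive (/dev/sdX is just a placeholder for the OSD's data device):

    # SMART health and error log - no reallocated/pending sectors, no logged media errors
    smartctl -a /dev/sdX
    # full sequential read of the device - completes without a single read error
    dd if=/dev/sdX of=/dev/null bs=1M status=progress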
Yesterday the whole cluster was restarted and 2 OSDs in one server would not come back up, while all the other OSDs were fine. Since most PGs still had 10/10 shards available and a few had 9/10, I wasn't very worried, so I wiped the disks and started recovery. Both drives were crashing with the logs below.

OSD 244:

2021-09-12T07:44:06.705+0000 7f6539879f00 0 set uid:gid to 64045:64045 (ceph:ceph)
2021-09-12T07:44:06.705+0000 7f6539879f00 0 ceph version 16.2.5 (9b9dd76e12f1907fe5dcc0c1fadadbb784022a42) pacific (stable), process ceph-osd, pid 18954
2021-09-12T07:44:06.705+0000 7f6539879f00 0 pidfile_write: ignore empty --pid-file
2021-09-12T07:44:07.281+0000 7f6539879f00 0 starting osd.244 osd_data /var/lib/ceph/osd/ceph-244 /var/lib/ceph/osd/ceph-244/journal
2021-09-12T07:44:07.297+0000 7f6539879f00 0 load: jerasure load: lrc load: isa
2021-09-12T07:44:07.877+0000 7f6539879f00 0 osd.244:0.OSDShard using op scheduler ClassedOpQueueScheduler(queue=WeightedPriorityQueue, cutoff=196)
2021-09-12T07:44:08.193+0000 7f6539879f00 0 osd.244:1.OSDShard using op scheduler ClassedOpQueueScheduler(queue=WeightedPriorityQueue, cutoff=196)
2021-09-12T07:44:08.481+0000 7f6539879f00 0 osd.244:2.OSDShard using op scheduler ClassedOpQueueScheduler(queue=WeightedPriorityQueue, cutoff=196)
2021-09-12T07:44:08.785+0000 7f6539879f00 0 osd.244:3.OSDShard using op scheduler ClassedOpQueueScheduler(queue=WeightedPriorityQueue, cutoff=196)
2021-09-12T07:44:09.117+0000 7f6539879f00 0 osd.244:4.OSDShard using op scheduler ClassedOpQueueScheduler(queue=WeightedPriorityQueue, cutoff=196)
2021-09-12T07:44:09.133+0000 7f6539879f00 0 bluestore(/var/lib/ceph/osd/ceph-244) _open_db_and_around read-only:0 repair:0
2021-09-12T07:44:09.393+0000 7f6539879f00 -1 bluestore(/var/lib/ceph/osd/ceph-244) _open_db erroring opening db:
2021-09-12T07:44:09.925+0000 7f6539879f00 -1 osd.244 0 OSD:init: unable to mount object store
2021-09-12T07:44:09.925+0000 7f6539879f00 -1 ** ERROR: osd init failed: (5) Input/output error

OSD 229:

2021-09-12T07:44:10.461+0000 7f22575b7f00 0 set uid:gid to 64045:64045 (ceph:ceph)
2021-09-12T07:44:10.461+0000 7f22575b7f00 0 ceph version 16.2.5 (9b9dd76e12f1907fe5dcc0c1fadadbb784022a42) pacific (stable), process ceph-osd, pid 19144
2021-09-12T07:44:10.461+0000 7f22575b7f00 0 pidfile_write: ignore empty --pid-file
2021-09-12T07:44:11.041+0000 7f22575b7f00 0 starting osd.229 osd_data /var/lib/ceph/osd/ceph-229 /var/lib/ceph/osd/ceph-229/journal
2021-09-12T07:44:11.053+0000 7f22575b7f00 0 load: jerasure load: lrc load: isa
2021-09-12T07:44:11.641+0000 7f22575b7f00 0 osd.229:0.OSDShard using op scheduler ClassedOpQueueScheduler(queue=WeightedPriorityQueue, cutoff=196)
2021-09-12T07:44:11.925+0000 7f22575b7f00 0 osd.229:1.OSDShard using op scheduler ClassedOpQueueScheduler(queue=WeightedPriorityQueue, cutoff=196)
2021-09-12T07:44:12.225+0000 7f22575b7f00 0 osd.229:2.OSDShard using op scheduler ClassedOpQueueScheduler(queue=WeightedPriorityQueue, cutoff=196)
2021-09-12T07:44:12.521+0000 7f22575b7f00 0 osd.229:3.OSDShard using op scheduler ClassedOpQueueScheduler(queue=WeightedPriorityQueue, cutoff=196)
2021-09-12T07:44:12.813+0000 7f22575b7f00 0 osd.229:4.OSDShard using op scheduler ClassedOpQueueScheduler(queue=WeightedPriorityQueue, cutoff=196)
2021-09-12T07:44:12.813+0000 7f22575b7f00 0 bluestore(/var/lib/ceph/osd/ceph-229) _open_db_and_around read-only:0 repair:0
2021-09-12T07:44:13.089+0000 7f22575b7f00 -1 bluestore(/var/lib/ceph/osd/ceph-229) _open_db erroring opening db:
2021-09-12T07:44:13.613+0000 7f22575b7f00 -1 osd.229 0 OSD:init: unable to mount object store
2021-09-12T07:44:13.613+0000 7f22575b7f00 -1 ** ERROR: osd init failed: (5) Input/output error

Today 2 different hosts had to be rebooted. After each reboot, both had a single crashing OSD: one with the same error as the previous ones, the other with a different error. Logs below.

OSD 201

Before reboot:

2021-09-13T03:53:13.086+0000 7fcbe12dc700 -1 received signal: Terminated from /sbin/init (PID: 1) UID: 0
2021-09-13T03:53:13.086+0000 7fcbe12dc700 -1 osd.201 478669 *** Got signal Terminated ***
2021-09-13T03:53:13.086+0000 7fcbe12dc700 -1 osd.201 478669 *** Immediate shutdown (osd_fast_shutdown=true) ***

And starting it up again...

2021-09-13T04:07:28.252+0000 7f9d47dacf00 0 set uid:gid to 64045:64045 (ceph:ceph)
2021-09-13T04:07:28.252+0000 7f9d47dacf00 0 ceph version 16.2.5 (9b9dd76e12f1907fe5dcc0c1fadadbb784022a42) pacific (stable), process ceph-osd, pid 15288
2021-09-13T04:07:28.252+0000 7f9d47dacf00 0 pidfile_write: ignore empty --pid-file
2021-09-13T04:07:28.824+0000 7f9d47dacf00 0 starting osd.201 osd_data /var/lib/ceph/osd/ceph-201 /var/lib/ceph/osd/ceph-201/journal
2021-09-13T04:07:28.836+0000 7f9d47dacf00 0 load: jerasure load: lrc load: isa
2021-09-13T04:07:29.481+0000 7f9d47dacf00 0 osd.201:0.OSDShard using op scheduler ClassedOpQueueScheduler(queue=WeightedPriorityQueue, cutoff=196)
2021-09-13T04:07:29.789+0000 7f9d47dacf00 0 osd.201:1.OSDShard using op scheduler ClassedOpQueueScheduler(queue=WeightedPriorityQueue, cutoff=196)
2021-09-13T04:07:30.133+0000 7f9d47dacf00 0 osd.201:2.OSDShard using op scheduler ClassedOpQueueScheduler(queue=WeightedPriorityQueue, cutoff=196)
2021-09-13T04:07:30.497+0000 7f9d47dacf00 0 osd.201:3.OSDShard using op scheduler ClassedOpQueueScheduler(queue=WeightedPriorityQueue, cutoff=196)
2021-09-13T04:07:30.837+0000 7f9d47dacf00 0 osd.201:4.OSDShard using op scheduler ClassedOpQueueScheduler(queue=WeightedPriorityQueue, cutoff=196)
2021-09-13T04:07:30.837+0000 7f9d47dacf00 0 bluestore(/var/lib/ceph/osd/ceph-201) _open_db_and_around read-only:0 repair:0
2021-09-13T04:07:30.981+0000 7f9d47dacf00 -1 bluestore(/var/lib/ceph/osd/ceph-201) _open_db erroring opening db:
2021-09-13T04:07:31.417+0000 7f9d47dacf00 -1 osd.201 0 OSD:init: unable to mount object store
2021-09-13T04:07:31.417+0000 7f9d47dacf00 -1 ** ERROR: osd init failed: (5) Input/output error

Log for OSD 201 with debug_bdev, debug_bluefs and debug_bluestore = 20:
https://drive.google.com/file/d/1wb2WivtLmQvcFM6389UgD020GpZ-S0pE/view
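(In case it matters for reading that log: the extra verbosity was enabled roughly like this on the OSD host before retrying the start; putting it under [osd] in ceph.conf is just how I keep it, shown here as an example.)

    [osd]
        debug_bdev = 20
        debug_bluefs = 20
        debug_bluestore = 20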
OSD 240

Before reboot:

2021-09-13T10:33:20.229+0000 7f8c82744700 -1 received signal: Terminated from /sbin/init (PID: 1) UID: 0
2021-09-13T10:33:20.229+0000 7f8c82744700 -1 osd.240 479171 *** Got signal Terminated ***
2021-09-13T10:33:20.229+0000 7f8c82744700 -1 osd.240 479171 *** Immediate shutdown (osd_fast_shutdown=true) ***

And starting it up again...

2021-09-13T11:05:19.567+0000 7f9c0e887f00 0 set uid:gid to 64045:64045 (ceph:ceph)
2021-09-13T11:05:19.567+0000 7f9c0e887f00 0 ceph version 16.2.5 (9b9dd76e12f1907fe5dcc0c1fadadbb784022a42) pacific (stable), process ceph-osd, pid 3865926
2021-09-13T11:05:19.567+0000 7f9c0e887f00 0 pidfile_write: ignore empty --pid-file
2021-09-13T11:05:20.135+0000 7f9c0e887f00 0 starting osd.240 osd_data /var/lib/ceph/osd/ceph-240 /var/lib/ceph/osd/ceph-240/journal
2021-09-13T11:05:20.203+0000 7f9c0e887f00 0 load: jerasure load: lrc load: isa
2021-09-13T11:05:20.835+0000 7f9c0e887f00 0 osd.240:0.OSDShard using op scheduler ClassedOpQueueScheduler(queue=WeightedPriorityQueue, cutoff=196)
2021-09-13T11:05:21.171+0000 7f9c0e887f00 0 osd.240:1.OSDShard using op scheduler ClassedOpQueueScheduler(queue=WeightedPriorityQueue, cutoff=196)
2021-09-13T11:05:21.483+0000 7f9c0e887f00 0 osd.240:2.OSDShard using op scheduler ClassedOpQueueScheduler(queue=WeightedPriorityQueue, cutoff=196)
2021-09-13T11:05:21.835+0000 7f9c0e887f00 0 osd.240:3.OSDShard using op scheduler ClassedOpQueueScheduler(queue=WeightedPriorityQueue, cutoff=196)
2021-09-13T11:05:22.179+0000 7f9c0e887f00 0 osd.240:4.OSDShard using op scheduler ClassedOpQueueScheduler(queue=WeightedPriorityQueue, cutoff=196)
2021-09-13T11:05:22.179+0000 7f9c0e887f00 0 bluestore(/var/lib/ceph/osd/ceph-240) _open_db_and_around read-only:0 repair:0
2021-09-13T11:05:34.807+0000 7f9c0e887f00 -1 bluestore(/var/lib/ceph/osd/ceph-240) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0x5aaa1569, expected 0x3fa7277d, device location [0x10000~1000], logical extent 0x0~1000, object #-1:7b3f43c4:::osd_superblock:0#
2021-09-13T11:05:34.807+0000 7f9c0e887f00 -1 bluestore(/var/lib/ceph/osd/ceph-240) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0x5aaa1569, expected 0x3fa7277d, device location [0x10000~1000], logical extent 0x0~1000, object #-1:7b3f43c4:::osd_superblock:0#
2021-09-13T11:05:34.807+0000 7f9c0e887f00 -1 bluestore(/var/lib/ceph/osd/ceph-240) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0x5aaa1569, expected 0x3fa7277d, device location [0x10000~1000], logical extent 0x0~1000, object #-1:7b3f43c4:::osd_superblock:0#
2021-09-13T11:05:34.819+0000 7f9c0e887f00 -1 bluestore(/var/lib/ceph/osd/ceph-240) _verify_csum bad crc32c/0x1000 checksum at blob offset 0x0, got 0x5aaa1569, expected 0x3fa7277d, device location [0x10000~1000], logical extent 0x0~1000, object #-1:7b3f43c4:::osd_superblock:0#
2021-09-13T11:05:34.819+0000 7f9c0e887f00 -1 osd.240 0 OSD::init() : unable to read osd superblock
2021-09-13T11:05:34.819+0000 7f9c01d8a700 0 bluestore(/var/lib/ceph/osd/ceph-240) allocation stats probe 0: cnt: 0 frags: 0 size: 0
2021-09-13T11:05:34.819+0000 7f9c01d8a700 0 bluestore(/var/lib/ceph/osd/ceph-240) probe -1: 0, 0, 0
2021-09-13T11:05:34.819+0000 7f9c01d8a700 0 bluestore(/var/lib/ceph/osd/ceph-240) probe -2: 0, 0, 0
2021-09-13T11:05:34.819+0000 7f9c01d8a700 0 bluestore(/var/lib/ceph/osd/ceph-240) probe -4: 0, 0, 0
2021-09-13T11:05:34.819+0000 7f9c01d8a700 0 bluestore(/var/lib/ceph/osd/ceph-240) probe -8: 0, 0, 0
2021-09-13T11:05:34.819+0000 7f9c01d8a700 0 bluestore(/var/lib/ceph/osd/ceph-240) probe -16: 0, 0, 0
2021-09-13T11:05:34.819+0000 7f9c01d8a700 0 bluestore(/var/lib/ceph/osd/ceph-240) ------------
2021-09-13T11:05:35.399+0000 7f9c0e887f00 -1 ** ERROR: osd init failed: (22) Invalid argument

Log for OSD 240 with debug_bdev, debug_bluefs and debug_bluestore = 20:
https://drive.google.com/file/d/1pjgXBBubwr14SMDDwQXX8C9pTWz6hYb6/view

I believe that managing to start either OSD 201 or OSD 240 should resolve all of the down PGs. I have tried ceph-bluestore-tool fsck and repair, which did not work. I have also tried exporting the PG shards using ceph-objectstore-tool, without success.
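In case the exact invocations matter, they were roughly of this form (OSD 201 shown as the example; <pgid> stands for each affected placement group shard):

    # fsck / repair of the BlueStore instance while the OSD is stopped
    ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-201
    ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-201
    # attempt to export a PG shard from the stopped OSD
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-201 --pgid <pgid> --op export --file /tmp/<pgid>.export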
Thanks for taking the time to read, and I hope someone can help me out.

Best regards,
Kári

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx