Hello all,

I hope you can help me with some very strange problems which arose suddenly today. I tried to search, including in this mailing list, but could not find anything relevant.

At some point today, without any action on my side, I noticed that some OSDs in my production cluster would go down and never come back up. I am on Luminous 12.2.13, CentOS 7, kernel 3.10; my setup is non-standard in that the OSD disks are served off a SAN (which is definitely OK now, although I cannot exclude some glitch). I tried rebooting the OSD servers a few times, ran "activate --all", and added bluestore_ignore_data_csum=true to the [osd] section of ceph.conf... the number of "down" OSDs changed for a while but now seems rather stable.
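Just so there is no ambiguity about what I changed (and in case I got the syntax wrong somewhere), this is the exact addition to ceph.conf, after which I re-ran the "activate --all" step on each OSD server:

    [osd]
    bluestore_ignore_data_csum = true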
There are actually two classes of problems (a bit more detail below):

- ERROR: osd init failed: (5) Input/output error
- failed to load OSD map for epoch 141282, got 0 bytes

*First problem*

This affects 50 OSDs (all disks of this kind, on all but one server). These OSDs are reserved for object storage but I am not using them yet, so in principle I could recreate them; still, I would like to understand what the problem is and how to solve it, for future reference. Here is what I see in the logs:

.....
2020-05-21 21:17:48.661348 7fa2e9a95ec0 1 bluefs add_block_device bdev 1 path /var/lib/ceph/osd/cephpa1-72/block size 14.5TiB
2020-05-21 21:17:48.661428 7fa2e9a95ec0 1 bluefs mount
2020-05-21 21:17:48.662040 7fa2e9a95ec0 1 bluefs _init_alloc id 1 alloc_size 0x10000 size 0xe83a3400000
2020-05-21 21:52:43.858464 7fa2e9a95ec0 -1 bluefs mount failed to replay log: (5) Input/output error
2020-05-21 21:52:43.858589 7fa2e9a95ec0 1 fbmap_alloc 0x55c6bba92e00 shutdown
2020-05-21 21:52:43.858728 7fa2e9a95ec0 -1 bluestore(/var/lib/ceph/osd/cephpa1-72) _open_db failed bluefs mount: (5) Input/output error
2020-05-21 21:52:43.858790 7fa2e9a95ec0 1 bdev(0x55c6bbdb6600 /var/lib/ceph/osd/cephpa1-72/block) close
2020-05-21 21:52:44.103536 7fa2e9a95ec0 1 bdev(0x55c6bbdb8600 /var/lib/ceph/osd/cephpa1-72/block) close
2020-05-21 21:52:44.352899 7fa2e9a95ec0 -1 osd.72 0 OSD:init: unable to mount object store
2020-05-21 21:52:44.352956 7fa2e9a95ec0 -1  ** ERROR: osd init failed: (5) Input/output error
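One thing I have been wondering about, but have not tried yet, is whether an offline consistency check on one of these (not yet used) OSDs would tell us anything more, along the lines of:

    # just an idea at this point, not yet executed; osd.72 is one of the affected OSDs
    ceph-bluestore-tool fsck --path /var/lib/ceph/osd/cephpa1-72

Is that safe/useful to run while BlueFS refuses to replay its log?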
*Second problem*

This affects 11 OSDs which I use *in production* for Cinder block storage; all PGs in that pool currently look OK. Here is an excerpt from the logs:

.....
    -5> 2020-05-21 20:52:06.756469 7fd2ccc19ec0 0 _get_class not permitted to load kvs
    -4> 2020-05-21 20:52:06.759686 7fd2ccc19ec0 1 <cls>/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.13/rpm/el7/BUILD/ceph-12.2.13/src/cls/rgw/cls_rgw.cc:3869: Loaded rgw class!
    -3> 2020-05-21 20:52:06.760021 7fd2ccc19ec0 1 <cls>/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.13/rpm/el7/BUILD/ceph-12.2.13/src/cls/log/cls_log.cc:299: Loaded log class!
    -2> 2020-05-21 20:52:06.760730 7fd2ccc19ec0 1 <cls>/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.13/rpm/el7/BUILD/ceph-12.2.13/src/cls/replica_log/cls_replica_log.cc:135: Loaded replica log class!
    -1> 2020-05-21 20:52:06.760873 7fd2ccc19ec0 -1 osd.63 0 failed to load OSD map for epoch 141282, got 0 bytes
     0> 2020-05-21 20:52:06.763277 7fd2ccc19ec0 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.13/rpm/el7/BUILD/ceph-12.2.13/src/osd/OSD.h: In function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7fd2ccc19ec0 time 2020-05-21 20:52:06.760916
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.13/rpm/el7/BUILD/ceph-12.2.13/src/osd/OSD.h: 994: FAILED assert(ret)

Does anyone have any idea how I could fix these problems, or what I could do to try and shed some light on them? Also, what could have caused them, and is there some magic configuration flag I could use to protect my cluster in the future?

Thanks a lot for your help!

  Fulvio