Hi Fulvio,

The symptom of several OSDs all asserting at the same time in
OSDMap::get_map really sounds like this bug:
https://tracker.ceph.com/issues/39525

lz4 compression is buggy on CentOS 7 and Ubuntu 18.04 -- you need to
disable compression or use a different algorithm. Mimic and Nautilus
will get a workaround, but it is not planned to be backported to
Luminous.

-- Dan

On Thu, May 21, 2020 at 11:18 PM Fulvio Galeazzi <fulvio.galeazzi@xxxxxxx> wrote:
>
> Hello all,
>    hope you can help me with some very strange problems that arose
> suddenly today. I tried searching, including in this mailing list, but
> could not find anything relevant.
>
> At some point today, without any action on my side, I noticed that some
> OSDs in my production cluster would go down and never come back up.
> I am on Luminous 12.2.13, CentOS 7, kernel 3.10. My setup is
> non-standard, as the OSD disks are served off a SAN (which is definitely
> OK now, although I cannot exclude an earlier glitch).
> I tried rebooting the OSD servers a few times, ran "activate --all", and
> added bluestore_ignore_data_csum=true to the [osd] section of
> ceph.conf... The number of "down" OSDs changed for a while but now
> seems rather stable.
>
> There are actually two classes of problems (more details below):
>   - ERROR: osd init failed: (5) Input/output error
>   - failed to load OSD map for epoch 141282, got 0 bytes
>
> *First problem*
> This affects 50 OSDs (all disks of this kind, on all but one server).
> These OSDs are reserved for object storage, but since I am not yet
> using them I could in principle recreate them. Still, I would be
> interested in understanding what the problem is and how to solve it,
> for future reference.
> Here is what I see in the logs:
> .....
> 2020-05-21 21:17:48.661348 7fa2e9a95ec0  1 bluefs add_block_device bdev 1 path /var/lib/ceph/osd/cephpa1-72/block size 14.5TiB
> 2020-05-21 21:17:48.661428 7fa2e9a95ec0  1 bluefs mount
> 2020-05-21 21:17:48.662040 7fa2e9a95ec0  1 bluefs _init_alloc id 1 alloc_size 0x10000 size 0xe83a3400000
> 2020-05-21 21:52:43.858464 7fa2e9a95ec0 -1 bluefs mount failed to replay log: (5) Input/output error
> 2020-05-21 21:52:43.858589 7fa2e9a95ec0  1 fbmap_alloc 0x55c6bba92e00 shutdown
> 2020-05-21 21:52:43.858728 7fa2e9a95ec0 -1 bluestore(/var/lib/ceph/osd/cephpa1-72) _open_db failed bluefs mount: (5) Input/output error
> 2020-05-21 21:52:43.858790 7fa2e9a95ec0  1 bdev(0x55c6bbdb6600 /var/lib/ceph/osd/cephpa1-72/block) close
> 2020-05-21 21:52:44.103536 7fa2e9a95ec0  1 bdev(0x55c6bbdb8600 /var/lib/ceph/osd/cephpa1-72/block) close
> 2020-05-21 21:52:44.352899 7fa2e9a95ec0 -1 osd.72 0 OSD:init: unable to mount object store
> 2020-05-21 21:52:44.352956 7fa2e9a95ec0 -1 ** ERROR: osd init failed: (5) Input/output error
>
> *Second problem*
> This affects 11 OSDs, which I use *in production* for Cinder block
> storage; it looks like all PGs for this pool are currently OK.
> Here is an excerpt from the logs:
> .....
>     -5> 2020-05-21 20:52:06.756469 7fd2ccc19ec0  0 _get_class not permitted to load kvs
>     -4> 2020-05-21 20:52:06.759686 7fd2ccc19ec0  1 <cls> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.13/rpm/el7/BUILD/ceph-12.2.13/src/cls/rgw/cls_rgw.cc:3869: Loaded rgw class!
>     -3> 2020-05-21 20:52:06.760021 7fd2ccc19ec0  1 <cls> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.13/rpm/el7/BUILD/ceph-12.2.13/src/cls/log/cls_log.cc:299: Loaded log class!
>     -2> 2020-05-21 20:52:06.760730 7fd2ccc19ec0  1 <cls> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.13/rpm/el7/BUILD/ceph-12.2.13/src/cls/replica_log/cls_replica_log.cc:135: Loaded replica log class!
>     -1> 2020-05-21 20:52:06.760873 7fd2ccc19ec0 -1 osd.63 0 failed to load OSD map for epoch 141282, got 0 bytes
>      0> 2020-05-21 20:52:06.763277 7fd2ccc19ec0 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.13/rpm/el7/BUILD/ceph-12.2.13/src/osd/OSD.h: In function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7fd2ccc19ec0 time 2020-05-21 20:52:06.760916
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.13/rpm/el7/BUILD/ceph-12.2.13/src/osd/OSD.h: 994: FAILED assert(ret)
>
> Has anyone any idea how I could fix these problems, or what I could do
> to try to shed some light on them? Also, what caused them, and is there
> some magic configuration flag I could use to protect my cluster?
>
> Thanks a lot for your help!
>
>            Fulvio
>
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
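For readers who hit the same assert and want to follow Dan's advice to move away
from lz4: the sketch below shows one way to check which compression algorithm the
OSDs and pools are using on a Luminous cluster and to switch it off or to another
algorithm. These are the generic BlueStore/pool compression knobs
(bluestore_compression_algorithm, bluestore_compression_mode, and the per-pool
compression_algorithm / compression_mode options); whether they are the exact
setting implicated in https://tracker.ceph.com/issues/39525 should be confirmed
against the tracker, so treat this as a starting point rather than the definitive
workaround. The osd.63 id is taken from the log above and <pool-name> is a
placeholder.

  # Inspect the compression settings an OSD is actually running with
  # (run on the host where that OSD lives; uses the admin socket).
  ceph daemon osd.63 config show | grep compression

  # Inspect per-pool compression overrides (these error out if never set).
  ceph osd pool get <pool-name> compression_algorithm
  ceph osd pool get <pool-name> compression_mode

  # Either switch the pool to a different algorithm, e.g. snappy ...
  ceph osd pool set <pool-name> compression_algorithm snappy
  # ... or disable compression for new writes on that pool entirely.
  ceph osd pool set <pool-name> compression_mode none

  # Cluster-wide defaults go in ceph.conf under [osd] (OSD restart required):
  #   bluestore_compression_algorithm = snappy
  #   bluestore_compression_mode = none

Note that changing these options only affects newly written data; anything already
compressed on disk stays as it is until rewritten.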