Several OSDs won't come up. Worried about complete data loss

Hi,
I've recently upgraded from Nautilus 14.2.2 to 14.2.6, and I've also been adding some new OSDs to the cluster. It looks as though either the new backplane has power issues or the new RAID card has bad memory: several newish, known-good drives were bounced out of their JBOD configs (which I know is bad practice; work is underway to replace the RAID card with an HBA). The cephfs pool is erasure-coded k=4, m=2, and the rbd pool is replicated with size 3.
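For reference, the pool settings and the set of affected OSDs can be confirmed with the standard CLI (stock commands, nothing cluster-specific):

    ceph health detail        # which PGs are degraded/down and why
    ceph osd tree             # which OSDs are down, and on which host
    ceph osd pool ls detail   # shows the k=4,m=2 EC profile and size=3 replication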

    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 2: (()+0x4dddd7) [0x55ed69e03dd7]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 3: (BlueStore::_upgrade_super()+0x52b) [0x55ed6a32968b]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 4: (BlueStore::_mount(bool, bool)+0x5d3) [0x55ed6a3692a3]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 5: (OSD::init()+0x321) [0x55ed69f08521]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 6: (main()+0x195b) [0x55ed69e6945b]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 7: (__libc_start_main()+0xf5) [0x7f929de4d505]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 8: (()+0x578be5) [0x55ed69e9ebe5]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 0> 2020-01-22 14:35:49.324 7f92a2012a80 -1 *** Caught signal (Aborted) **
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: in thread 7f92a2012a80 thread_name:ceph-osd
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable)
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 1: (()+0xf5f0) [0x7f929f06d5f0]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 2: (gsignal()+0x37) [0x7f929de61337]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 3: (abort()+0x148) [0x7f929de62a28]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x199) [0x55ed69e03c5e]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 5: (()+0x4dddd7) [0x55ed69e03dd7]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 6: (BlueStore::_upgrade_super()+0x52b) [0x55ed6a32968b]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 7: (BlueStore::_mount(bool, bool)+0x5d3) [0x55ed6a3692a3]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 8: (OSD::init()+0x321) [0x55ed69f08521]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 9: (main()+0x195b) [0x55ed69e6945b]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 10: (__libc_start_main()+0xf5) [0x7f929de4d505]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 11: (()+0x578be5) [0x55ed69e9ebe5]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: -10> 2020-01-22 14:35:49.291 7f92a2012a80 -1 rocksdb: Corruption: missing start of fragmented record(2)
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: -9> 2020-01-22 14:35:49.291 7f92a2012a80 -1 bluestore(/var/lib/ceph/osd/ceph-26) _open_db erroring opening db:
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: -1> 2020-01-22 14:35:49.320 7f92a2012a80 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.6/rpm/el7/BUILD/ceph-14.2.6/src/os/bluest
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.6/rpm/el7/BUILD/ceph-14.2.6/src/os/bluestore/BlueStore.cc: 10135: FAILED ceph_assert(
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable)
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14a) [0x55ed69e03c0f]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 2: (()+0x4dddd7) [0x55ed69e03dd7]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 3: (BlueStore::_upgrade_super()+0x52b) [0x55ed6a32968b]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 4: (BlueStore::_mount(bool, bool)+0x5d3) [0x55ed6a3692a3]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 5: (OSD::init()+0x321) [0x55ed69f08521]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 6: (main()+0x195b) [0x55ed69e6945b]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 7: (__libc_start_main()+0xf5) [0x7f929de4d505]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 8: (()+0x578be5) [0x55ed69e9ebe5]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 0> 2020-01-22 14:35:49.324 7f92a2012a80 -1 *** Caught signal (Aborted) **
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: in thread 7f92a2012a80 thread_name:ceph-osd
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: ceph version 14.2.6 (f0aa067ac7a02ee46ea48aa26c6e298b5ea272e9) nautilus (stable)
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 1: (()+0xf5f0) [0x7f929f06d5f0]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 2: (gsignal()+0x37) [0x7f929de61337]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 3: (abort()+0x148) [0x7f929de62a28]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x199) [0x55ed69e03c5e]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 5: (()+0x4dddd7) [0x55ed69e03dd7]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 6: (BlueStore::_upgrade_super()+0x52b) [0x55ed6a32968b]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 7: (BlueStore::_mount(bool, bool)+0x5d3) [0x55ed6a3692a3]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 8: (OSD::init()+0x321) [0x55ed69f08521]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 9: (main()+0x195b) [0x55ed69e6945b]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 10: (__libc_start_main()+0xf5) [0x7f929de4d505]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: 11: (()+0x578be5) [0x55ed69e9ebe5]
    Jan 22 14:35:49 kvm2.mordor.local ceph-osd[95924]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
    Jan 22 14:35:49 kvm2.mordor.local systemd[1]: ceph-osd@26.service: main process exited, code=killed, status=6/ABRT
    Jan 22 14:35:49 kvm2.mordor.local systemd[1]: Unit ceph-osd@26.service entered failed state.
    Jan 22 14:35:49 kvm2.mordor.local systemd[1]: ceph-osd@26.service failed.
    Jan 22 14:35:49 kvm2.mordor.local systemd[1]: ceph-osd@26.service holdoff time over, scheduling restart.
    Jan 22 14:35:49 kvm2.mordor.local systemd[1]: Stopped Ceph object storage daemon osd.26.

I've attempted to run fsck and repair on these OSDs, but that errors out as well:

    [root@kvm2 ~]# ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-20 --deep 1
    2020-01-22 14:31:34.346 7f0399e4bc00 -1 rocksdb: Corruption: missing start of fragmented record(2)
    2020-01-22 14:31:34.346 7f0399e4bc00 -1 bluestore(/var/lib/ceph/osd/ceph-20) _open_db erroring opening db:
    error from fsck: (5) Input/output error
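If a more detailed fsck log would help, my understanding from the ceph-bluestore-tool man page is that it can write debug output to a file, and that the BlueFS files (the RocksDB directory) can be exported for offline inspection. Untested on my end, and the paths below are just my OSD:

    # Deep fsck with debug logging written to a file:
    ceph-bluestore-tool fsck --deep 1 --path /var/lib/ceph/osd/ceph-20 \
        -l /tmp/ceph-20-fsck.log --log-level 20

    # Export the BlueFS files so the RocksDB contents can be inspected offline:
    ceph-bluestore-tool bluefs-export --path /var/lib/ceph/osd/ceph-20 \
        --out-dir /tmp/ceph-20-bluefs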

I'd really like not to have to start over. I have several backups, but I doubt everything is up to date and complete. I'm sure I can get more verbose logs if necessary, though I'm not sure how to enable them :/
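My best guess, from the docs, would be to bump the debug levels in ceph.conf on the OSD host and then restart the failing daemon (corrections welcome if there's a better way):

    [osd]
    debug bluestore = 20
    debug bluefs = 20
    debug rocksdb = 20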

--

Justin Engwer
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
