Hi, I am facing a problem where an OSD won't start after being moved to a new
node running 12.2.10 (the old node had 12.2.8).
One node of my cluster failed and I tried to move its 3 OSDs to a new
node. 2 of the 3 OSDs started and are running fine at the moment
(backfilling is still in progress), but one OSD just won't start, with
the following error in the logs (I'm writing mostly to find out whether this
is a bug or whether I have done something wrong):
2018-11-27 19:44:38.013454 7fba0d35fd80 -1
bluestore(/var/lib/ceph/osd/ceph-1) _verify_csum bad crc32c/0x1000
checksum at blob offset 0x0, got 0xb1a184d1, expected 0xb682fc52, device
location [0x10000~1000], logical extent 0x0~1000, object
#-1:7b3f43c4:::osd_superblock:0#
2018-11-27 19:44:38.013501 7fba0d35fd80 -1 osd.1 0 OSD::init() : unable
to read osd superblock
2018-11-27 19:44:38.013511 7fba0d35fd80 1
bluestore(/var/lib/ceph/osd/ceph-1) umount
2018-11-27 19:44:38.065478 7fba0d35fd80 1 stupidalloc 0x0x55ebb04c3f80
shutdown
2018-11-27 19:44:38.077261 7fba0d35fd80 1 freelist shutdown
2018-11-27 19:44:38.077316 7fba0d35fd80 4 rocksdb:
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.10/rpm/el7/BUILD/ceph-12.2.10/src/rocksdb/db/db_impl.cc:217]
Shutdown: canceling all background work
2018-11-27 19:44:38.077982 7fba0d35fd80 4 rocksdb:
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.10/rpm/el7/BUILD/ceph-12.2.10/src/rocksdb/db/db_impl.cc:343]
Shutdown complete
2018-11-27 19:44:38.107923 7fba0d35fd80 1 bluefs umount
2018-11-27 19:44:38.108248 7fba0d35fd80 1 stupidalloc 0x0x55ebb01cddc0
shutdown
2018-11-27 19:44:38.108302 7fba0d35fd80 1 bdev(0x55ebb01cf800
/var/lib/ceph/osd/ceph-1/block) close
2018-11-27 19:44:38.362984 7fba0d35fd80 1 bdev(0x55ebb01cf600
/var/lib/ceph/osd/ceph-1/block) close
2018-11-27 19:44:38.470791 7fba0d35fd80 -1 ** ERROR: osd init failed:
(22) Invalid argument
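In case it matters: before deciding anything I was planning to run a
consistency check on the stopped OSD, to see whether the corruption is limited
to the superblock. This is just the check I have in mind, assuming
ceph-bluestore-tool works against this deployment's data path:

# the OSD is already down since it fails to start, so the store is not in use;
# metadata consistency check of the BlueStore instance backing osd.1
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-1

A deep fsck (which also reads and verifies object data) would take much longer
on this disk, so I would only try that if the plain fsck comes back clean.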
My cluster has too many mixed versions; I hadn't realized that the
versions change when running a yum update, and right now I have the
following situation (right after the output below is a sketch of the yum
pinning I am considering to avoid this in the future):

ceph versions
{
    "mon": {
        "ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous (stable)": 1,
        "ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)": 2
    },
    "mgr": {
        "ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous (stable)": 1
    },
    "osd": {
        "ceph version 12.2.10 (177915764b752804194937482a39e95e0ca3de94) luminous (stable)": 2,
        "ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous (stable)": 18,
        "ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)": 27,
        "ceph version 12.2.9 (9e300932ef8a8916fb3fda78c58691a6ab0f4217) luminous (stable)": 1
    },
    "mds": {},
    "overall": {
        "ceph version 12.2.10 (177915764b752804194937482a39e95e0ca3de94) luminous (stable)": 2,
        "ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous (stable)": 20,
        "ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0) luminous (stable)": 29,
        "ceph version 12.2.9 (9e300932ef8a8916fb3fda78c58691a6ab0f4217) luminous (stable)": 1
    }
}
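(This is the yum pinning I mentioned above. It is only a sketch and assumes
the yum-plugin-versionlock package is available on these CentOS 7 nodes; the
package names are the standard ceph RPMs.)

# install the versionlock plugin if it is not already present
yum install -y yum-plugin-versionlock

# lock the currently installed ceph packages so a plain 'yum update' leaves them alone
yum versionlock add ceph ceph-base ceph-common ceph-osd ceph-mon ceph-mgr

# list the locks, and clear them later when deliberately upgrading the whole cluster
yum versionlock list
yum versionlock clear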
Is there an easy way to get the OSD working again? I am thinking about
waiting for the backfill/recovery to finish, then upgrading all nodes to
12.2.10, and, if the OSD still doesn't come up, recreating it (a rough sketch
of the recreate steps I have in mind is below).
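For the record, this is roughly the recreate procedure I would follow for
osd.1 if it comes to that (a sketch only; the device name is an example, and I
am assuming the data on that disk is expendable because the other replicas are
healthy):

# confirm that removing osd.1 will not take the last copy of any PG offline
ceph osd safe-to-destroy osd.1

# remove the OSD from the CRUSH map, auth database and OSD map in one step
ceph osd purge 1 --yes-i-really-mean-it

# wipe the old device and create a fresh BlueStore OSD on it (example device name)
ceph-volume lvm zap /dev/sdX
ceph-volume lvm create --data /dev/sdX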
Regards,
Cassiano Pilipavicius.