Re: OSD won't start after moving to a new node with ceph 12.2.10

This is *probably* unrelated to the upgrade, as it's complaining about
data corruption at a very early stage (earlier than the point where the
bug related to the 12.2.9 issues would trigger).
So this might just be a coincidence with a bad disk.
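
If you want to rule out the disk first, something along these lines
should show obvious media errors (a rough sketch; /dev/sdX is just a
placeholder for whatever device backs osd.1, and the fsck needs the
OSD to be stopped):

  ceph-volume lvm list                        # map osd.1 to its physical device
  smartctl -a /dev/sdX                        # reallocated/pending sectors, SMART error log
  dmesg | grep -i -e 'sdX' -e 'i/o error'     # kernel-level read errors on that device
  ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-1   # see how widespread the corruption is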

That being said: you are running a 12.2.9 OSD, and you probably should
not upgrade to 12.2.10, especially while a backfill is running.
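
Once the cluster is back to active+clean, bringing everything to one
version node by node is the usual approach; a rough sketch (the exact
package and service names depend on your setup, this assumes the
CentOS/yum environment visible in your log paths):

  ceph -s                      # wait until backfill/recovery has finished
  ceph osd set noout           # avoid rebalancing while OSDs restart
  yum update ceph              # on one node at a time
  systemctl restart ceph-osd.target
  ceph osd unset noout         # once all nodes are done
  ceph versions                # confirm everything reports 12.2.10

If that one OSD still refuses to start afterwards, recreating it is
usually simpler than trying to repair a corrupted superblock; roughly,
assuming osd.1 and again a placeholder device:

  ceph osd out 1
  ceph osd purge 1 --yes-i-really-mean-it
  ceph-volume lvm zap /dev/sdX
  ceph-volume lvm create --data /dev/sdX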

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Tue, 27 Nov 2018 at 23:04, Cassiano Pilipavicius
<cassiano@xxxxxxxxxxx> wrote:
>
> Hi, I am facing a problem where an OSD won't start after moving to a new
> node with 12.2.10 (the old one has 12.2.8).
>
> One node of my cluster failed and I tried to move 3 OSDs to a new
> node. 2 of the 3 OSDs have started and are running fine at the moment
> (backfilling is still in progress), but one of the OSDs just doesn't start,
> with the following error in the logs (writing mostly to try to find out whether
> this is a bug or whether I have done something wrong):
>
> 2018-11-27 19:44:38.013454 7fba0d35fd80 -1
> bluestore(/var/lib/ceph/osd/ceph-1) _verify_csum bad crc32c/0x1000
> checksum at blob offset 0x0, got 0xb1a184d1, expected 0xb682fc52, device
> location [0x10000~1000], logical extent 0x0~1000, object
> #-1:7b3f43c4:::osd_superblock:0#
> 2018-11-27 19:44:38.013501 7fba0d35fd80 -1 osd.1 0 OSD::init() : unable
> to read osd superblock
> 2018-11-27 19:44:38.013511 7fba0d35fd80  1
> bluestore(/var/lib/ceph/osd/ceph-1) umount
> 2018-11-27 19:44:38.065478 7fba0d35fd80  1 stupidalloc 0x0x55ebb04c3f80
> shutdown
> 2018-11-27 19:44:38.077261 7fba0d35fd80  1 freelist shutdown
> 2018-11-27 19:44:38.077316 7fba0d35fd80  4 rocksdb:
> [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.10/rpm/el7/BUILD/ceph-12.2.10/src/rocksdb/db/db_impl.cc:217]
> Shutdown: canceling all background work
> 2018-11-27 19:44:38.077982 7fba0d35fd80  4 rocksdb:
> [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.10/rpm/el7/BUILD/ceph-12.2.10/src/rocksdb/db/db_impl.cc:343]
> Shutdown complete
> 2018-11-27 19:44:38.107923 7fba0d35fd80  1 bluefs umount
> 2018-11-27 19:44:38.108248 7fba0d35fd80  1 stupidalloc 0x0x55ebb01cddc0
> shutdown
> 2018-11-27 19:44:38.108302 7fba0d35fd80  1 bdev(0x55ebb01cf800
> /var/lib/ceph/osd/ceph-1/block) close
> 2018-11-27 19:44:38.362984 7fba0d35fd80  1 bdev(0x55ebb01cf600
> /var/lib/ceph/osd/ceph-1/block) close
> 2018-11-27 19:44:38.470791 7fba0d35fd80 -1  ** ERROR: osd init failed:
> (22) Invalid argument
>
> My cluster has too many mixed versions; I hadn't realized that the
> version changes when running a yum update, and right now I have the
> following situation (output of ceph versions):
> {
>      "mon": {
>          "ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5)
> luminous (stable)": 1,
>          "ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0)
> luminous (stable)": 2
>      },
>      "mgr": {
>          "ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5)
> luminous (stable)": 1
>      },
>      "osd": {
>          "ceph version 12.2.10
> (177915764b752804194937482a39e95e0ca3de94) luminous (stable)": 2,
>          "ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5)
> luminous (stable)": 18,
>          "ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0)
> luminous (stable)": 27,
>          "ceph version 12.2.9 (9e300932ef8a8916fb3fda78c58691a6ab0f4217)
> luminous (stable)": 1
>      },
>      "mds": {},
>      "overall": {
>          "ceph version 12.2.10
> (177915764b752804194937482a39e95e0ca3de94) luminous (stable)": 2,
>          "ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5)
> luminous (stable)": 20,
>          "ceph version 12.2.8 (ae699615bac534ea496ee965ac6192cb7e0e07c0)
> luminous (stable)": 29,
>          "ceph version 12.2.9 (9e300932ef8a8916fb3fda78c58691a6ab0f4217)
> luminous (stable)": 1
>      }
> }
>
> Is there an easy way to get the OSD working again? I am thinking about
> waiting for the backfill/recovery to finish, then upgrading all nodes to
> 12.2.10 and, if the OSD doesn't come up, recreating it.
>
> Regards,
> Cassiano Pilipavicius.
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



