A good starting point to debug/fix this would be to extract the osdmap from one of the dead OSDs:

    ceph-objectstore-tool --op get-osdmap --data-path /var/lib/ceph/osd/...

Then try to run osdmaptool on that osdmap to see if it also crashes; set some --debug options (I don't know which ones off the top of my head). Does it also crash? How does it differ from the map retrieved with "ceph osd getmap"?

You can also write an osdmap back with "--op set-osdmap". Does it help to set the osdmap retrieved by "ceph osd getmap"?

Paul

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Wed, Aug 14, 2019 at 7:59 AM Troy Ablan <tablan@xxxxxxxxx> wrote:
>
> I've opened a tracker issue at https://tracker.ceph.com/issues/41240
>
> Background: Cluster of 13 hosts, 5 of which contain 14 SSD OSDs between
> them. There are 409 HDDs in the cluster as well.
>
> The SSDs contain the RGW index and log pools, and some smaller pools.
> The HDDs contain all other pools, including the RGW data pool.
>
> The RGW instance contains just over 1 billion objects across about 65k
> buckets. I don't know of any action on the cluster that would have
> caused this. There have been no changes to the crush map in months, but
> HDDs were added a couple of weeks ago and backfilling is still in
> progress, though in the home stretch.
>
> I don't know what I can do at this point, though something points to the
> osdmap on these OSDs being wrong and/or corrupted. A log excerpt from the
> crash is included below. All of the OSD logs I checked look very similar.
>
> 2019-08-13 18:09:52.913 7f76484e9d80  4 rocksdb: [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.6/rpm/el7/BUILD/ceph-13.2.6/src/rocksdb/db/version_set.cc:3362] Recovered from manifest file:db/MANIFEST-245361 succeeded, manifest_file_number is 245361, next_file_number is 245364, last_sequence is 6066685646, log_number is 0, prev_log_number is 0, max_column_family is 0, deleted_log_number is 245359
>
> 2019-08-13 18:09:52.913 7f76484e9d80  4 rocksdb: [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.6/rpm/el7/BUILD/ceph-13.2.6/src/rocksdb/db/version_set.cc:3370] Column family [default] (ID 0), log number is 245360
>
> 2019-08-13 18:09:52.918 7f76484e9d80  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1565719792920682, "job": 1, "event": "recovery_started", "log_files": [245362]}
> 2019-08-13 18:09:52.918 7f76484e9d80  4 rocksdb: [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.6/rpm/el7/BUILD/ceph-13.2.6/src/rocksdb/db/db_impl_open.cc:551] Recovering log #245362 mode 0
> 2019-08-13 18:09:52.919 7f76484e9d80  4 rocksdb: [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.6/rpm/el7/BUILD/ceph-13.2.6/src/rocksdb/db/version_set.cc:2863] Creating manifest 245364
>
> 2019-08-13 18:09:52.933 7f76484e9d80  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1565719792935329, "job": 1, "event": "recovery_finished"}
> 2019-08-13 18:09:52.951 7f76484e9d80  4 rocksdb: [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.6/rpm/el7/BUILD/ceph-13.2.6/src/rocksdb/db/db_impl_open.cc:1218] DB pointer 0x56445a6c8000
> 2019-08-13 18:09:52.951 7f76484e9d80  1 bluestore(/var/lib/ceph/osd/ceph-46) _open_db opened rocksdb path db options compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152
> 2019-08-13 18:09:52.964 7f76484e9d80  1 freelist init
> 2019-08-13 18:09:52.976 7f76484e9d80  1 bluestore(/var/lib/ceph/osd/ceph-46) _open_alloc opening allocation metadata
> 2019-08-13 18:09:53.119 7f76484e9d80  1 bluestore(/var/lib/ceph/osd/ceph-46) _open_alloc loaded 926 GiB in 13292 extents
> 2019-08-13 18:09:53.133 7f76484e9d80 -1 *** Caught signal (Aborted) **
>  in thread 7f76484e9d80 thread_name:ceph-osd
>
>  ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)
>  1: (()+0xf5d0) [0x7f763c4455d0]
>  2: (gsignal()+0x37) [0x7f763b466207]
>  3: (abort()+0x148) [0x7f763b4678f8]
>  4: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f763bd757d5]
>  5: (()+0x5e746) [0x7f763bd73746]
>  6: (()+0x5e773) [0x7f763bd73773]
>  7: (__cxa_rethrow()+0x49) [0x7f763bd739e9]
>  8: (CrushWrapper::decode(ceph::buffer::list::iterator&)+0x18b8) [0x7f763fcb48d8]
>  9: (OSDMap::decode(ceph::buffer::list::iterator&)+0x4ad) [0x7f763fa924ad]
>  10: (OSDMap::decode(ceph::buffer::list&)+0x31) [0x7f763fa94db1]
>  11: (OSDService::try_get_map(unsigned int)+0x4f8) [0x5644576e1e08]
>  12: (OSDService::get_map(unsigned int)+0x1e) [0x564457743dae]
>  13: (OSD::init()+0x1d32) [0x5644576ef982]
>  14: (main()+0x23a3) [0x5644575cc7a3]
>  15: (__libc_start_main()+0xf5) [0x7f763b4523d5]
>  16: (()+0x385900) [0x5644576a4900]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
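Paul's suggested steps could be sketched as a shell session. This is only a sketch: the OSD id (osd.46, taken from the log excerpt), the /tmp file names, and the epoch handling are assumptions you must adapt; the OSD daemon must be stopped before touching its store, and Paul left the exact --debug options unspecified, so none are shown.

```shell
# 1. Extract the osdmap stored on the dead OSD (daemon must be stopped).
#    osd.46 and the /tmp paths are placeholders for illustration.
ceph-objectstore-tool --op get-osdmap \
    --data-path /var/lib/ceph/osd/ceph-46 \
    --file /tmp/osdmap.from-osd

# 2. See whether osdmaptool can decode it, or crashes the same way.
osdmaptool --print /tmp/osdmap.from-osd

# 3. Fetch the map the monitors have, for comparison (ideally the same
#    epoch the OSD crashed on), and diff the two binary blobs.
ceph osd getmap -o /tmp/osdmap.from-mon
cmp /tmp/osdmap.from-osd /tmp/osdmap.from-mon

# 4. If the on-disk copy turns out to be corrupt, try writing the
#    monitors' copy back into the OSD's store, then start the OSD.
ceph-objectstore-tool --op set-osdmap \
    --data-path /var/lib/ceph/osd/ceph-46 \
    --file /tmp/osdmap.from-mon
```

If cmp reports a difference, the on-disk map is the prime suspect; given that the backtrace dies inside CrushWrapper::decode, a corrupt crush section in the stored map would fit.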