Hi,

We have seen several issues (mailed about earlier to this list) after the upgrade to Mimic 13.2.8. We decided to downgrade the OSD servers to 13.2.6 to check if the issues would disappear. However, we ran into issues with that as well ...

We have been using the bitmap allocator since Luminous 12.2.12 to combat latency issues on the OSDs, and also used it successfully on Mimic 13.2.6:

  bluestore_allocator = bitmap
  bluefs_allocator = bitmap

When downgrading to 13.2.6 we hit the following assert:

2019-12-27 14:14:16.409 7f2ed2dcce00 1 bluefs add_block_device bdev 1 path /var/lib/ceph/osd/ceph-0/block size 3.5 TiB
2019-12-27 14:14:16.409 7f2ed2dcce00 1 bluefs mount
2019-12-27 14:14:16.413 7f2ed2dcce00 -1 /build/ceph-13.2.6/src/os/bluestore/fastbmap_allocator_impl.h: In function 'void AllocatorLevel02<T>::_mark_allocated(uint64_t, uint64_t) [with L1 = AllocatorLevel01Loose; uint64_t = long unsigned int]' thread 7f2ed2dcce00 time 2019-12-27 14:14:16.414793
/build/ceph-13.2.6/src/os/bluestore/fastbmap_allocator_impl.h: 749: FAILED assert(available >= allocated)

 ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14e) [0x7f2eca0a497e]
 2: (()+0x2fab07) [0x7f2eca0a4b07]
 3: (BitmapAllocator::init_rm_free(unsigned long, unsigned long)+0x44d) [0xc91dbd]
 4: (BlueFS::mount()+0x260) [0xc6e6c0]
 5: (BlueStore::_open_db(bool, bool)+0x17cd) [0xb8f50d]
 6: (BlueStore::_mount(bool, bool)+0x4b7) [0xbbfb77]
 7: (OSD::init()+0x295) [0x761fc5]
 8: (main()+0x367b) [0x64f23b]
 9: (__libc_start_main()+0xf0) [0x7f2ec7c3f830]
 10: (_start()+0x29) [0x718929]

We upgraded the node back to 13.2.8 again, which started without issues.

We did do a "downgrade test" on a test cluster beforehand, and that cluster did not suffer from this issue. It turned out that the test cluster was not using the bitmap allocator (how to check which allocator an OSD is actually running with is sketched below). After enabling the bitmap allocator there on a 13.2.6 node (one that had previously been downgraded but had never run with the bitmap allocator) and restarting it, the node came online just fine. However, an upgrade to 13.2.8 with the bitmap allocator enabled, followed by a downgrade back to 13.2.6, would trigger the same assert.
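In case it is useful to others: which allocator an OSD is actually running with can be queried through its admin socket on the node hosting that OSD. A minimal sketch, assuming osd.0 (the OSD id is just an example):

  ceph daemon osd.0 config get bluestore_allocator
  ceph daemon osd.0 config get bluefs_allocator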
Switching back to the default (stupid) allocator again would work (initially) for 2 out of 3 OSDs. One OSD would fail right away with rocksdb corruption:

2019-12-27 15:10:50.945 7fc77fbcbe00 20 osd.6 1952 register_pg 2.16 0x990c800
2019-12-27 15:10:50.945 7fc77fbcbe00 10 osd.6:2._attach_pg 2.16 0x990c800
2019-12-27 15:10:50.945 7fc77fbcbe00 10 osd.6 1952 pgid 2.0 coll 2.0_head
2019-12-27 15:10:50.945 7fc77fbcbe00 10 osd.6 1952 _make_pg 2.0
2019-12-27 15:10:50.945 7fc77fbcbe00 5 osd.6 pg_epoch: 1952 pg[2.0(unlocked)] enter Initial
2019-12-27 15:10:50.945 7fc77fbcbe00 20 osd.6 pg_epoch: 1952 pg[2.0(unlocked)] enter NotTrimming
2019-12-27 15:10:50.945 7fc77fbcbe00 -1 abort: Corruption: block checksum mismatch: expected 1122551773, got 2333355710 in db/000397.sst offset 57741 size 4044
2019-12-27 15:10:50.949 7fc77fbcbe00 -1 *** Caught signal (Aborted) **
 in thread 7fc77fbcbe00 thread_name:ceph-osd

 ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)
 1: (()+0x11390) [0x7fc775520390]
 2: (gsignal()+0x38) [0x7fc774a53428]
 3: (abort()+0x16a) [0x7fc774a5502a]
 4: (RocksDBStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, ceph::buffer::list*)+0x4a8) [0xbff498]
 5: (BlueStore::omap_get_values(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ghobject_t const&, std::set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::list, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::list> > >*)+0x201) [0xb852e1]
 6: (PG::read_info(ObjectStore*, spg_t, coll_t const&, pg_info_t&, PastIntervals&, unsigned char&)+0x16b) [0x7ecc8b]
 7: (PG::read_state(ObjectStore*)+0x56) [0x81aff6]
 8: (OSD::load_pgs()+0x566) [0x759516]
 9: (OSD::init()+0xcd3) [0x762a03]
 10: (main()+0x367b) [0x64f23b]
 11: (__libc_start_main()+0xf0) [0x7fc774a3e830]
 12: (_start()+0x29) [0x718929]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

And after a restart we got rocksdb messages like this:

...
2019-12-27 15:11:32.322 7fd1c0fbbe00 4 rocksdb: [/build/ceph-13.2.6/src/rocksdb/db/version_set.cc:3088] Recovering from manifest file: MANIFEST-000402
...
...
 -352> 2019-12-27 15:11:11.598 7fa0f0558e00 -1 abort: Corruption: Bad table magic number: expected 9863518390377041911, found 11124 in db/000397.sst
...

We then set osd.6 out.
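(Side note: such an OSD can also be checked offline with ceph-bluestore-tool before deciding what to do with it. A minimal sketch, assuming osd.6 on the default data path, with the daemon stopped; this is just illustrative, not something we base the report above on:

  systemctl stop ceph-osd@6
  ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-6
)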
After osd.6 had been set out, osd.7 crashed after a while (while backfilling) and would then fail to restart again with the following message:

2019-12-27 15:27:10.833 7f1bb4701e00 4 rocksdb: [/build/ceph-13.2.6/src/rocksdb/db/db_impl.cc:252] Shutdown: canceling all background work
2019-12-27 15:27:10.833 7f1bb4701e00 4 rocksdb: [/build/ceph-13.2.6/src/rocksdb/db/db_impl.cc:397] Shutdown complete
2019-12-27 15:27:10.833 7f1bb4701e00 -1 rocksdb: Corruption: CURRENT file does not end with newline
2019-12-27 15:27:10.833 7f1bb4701e00 -1 bluestore(/var/lib/ceph/osd/ceph-7) _open_db erroring opening db:
2019-12-27 15:27:10.833 7f1bb4701e00 1 bluefs umount
2019-12-27 15:27:10.833 7f1bb4701e00 1 stupidalloc 0x0x325aee0 shutdown
2019-12-27 15:27:10.833 7f1bb4701e00 1 bdev(0x380a380 /var/lib/ceph/osd/ceph-7/block) close
2019-12-27 15:27:11.093 7f1bb4701e00 1 bdev(0x380a000 /var/lib/ceph/osd/ceph-7/block) close
2019-12-27 15:27:11.345 7f1bb4701e00 -1 osd.7 0 OSD:init: unable to mount object store
2019-12-27 15:27:11.345 7f1bb4701e00 -1  ** ERROR: osd init failed: (5) Input/output error

A restart / reboot of the node did not help.

For those of you still running 13.2.6: I would not recommend upgrading to 13.2.8 (at least not for the storage nodes; mon / mds still seem to work fine).

Does the bitmap allocator modify the OSD's on-disk data in some way? Are you supposed to be able to switch between different allocators?

Thanks,

Stefan

--
| BIT BV  https://www.bit.nl/        Kamer van Koophandel 09090351
| GPG: 0xD14839C6                    +31 318 648 688 / info@xxxxxx