Yay! I've managed to get the cluster up and running again. Deep scrub is
still fixing inconsistencies.

I had to do a depth-first search through the tree of startup errors. My
procedure was the one already described: find and delete the PGs from the
OSDs which trigger the assertion. I've created a script to automate the
seek-and-destroy of the broken PGs (a rough sketch of what it does per PG
is inlined further down, after the crash log):

https://gist.github.com/TheJJ/c6be62e612ac4782bd0aa279d8c82197

Cheers
JJ

On 04/10/2018 18.29, Jonas Jelten wrote:
> Hello!
>
> Unfortunately, our single-node "cluster" with 11 OSDs is broken because some OSDs crash when they start peering.
> I'm on Ubuntu 18.04 with Ceph Mimic (13.2.2).
>
> The problem was induced when RAM filled up and OSD processes then crashed because of memory allocation failures.
>
> No weird commands (e.g. force_create_pg) were used on this cluster and it was set up with 13.2.1 initially.
> The affected pool seems to be a replicated pool with size=3 and min_size=2 (which haven't been changed).
>
> Crash log of osd.4 (only the crashed thread):
>
> 99424: -1577> 2018-10-04 13:40:11.024 7f3838417700 10 log is not dirty
> 99425: -1576> 2018-10-04 13:40:11.024 7f3838417700 10 osd.4 1433 queue_want_up_thru want 1433 <= queued 1433, currently 1426
> 99427: -1574> 2018-10-04 13:40:11.024 7f3838417700 20 osd.4 op_wq(3) _process 3.8 to_process <> waiting <>
> waiting_peering {}
> 99428: -1573> 2018-10-04 13:40:11.024 7f3838417700 20 osd.4 op_wq(3) _process OpQueueItem(3.8 PGPeeringEvent(epoch_sent:
> 1433 epoch_requested: 1433 MNotifyRec 3.8 from 2 notify: (query:1433 sent:1433 3.8( v 866'122691 (569'119300,866'122691]
> local-lis/les=1401/1402 n=54053 ec=126/126 lis/c 1401/859 les/c/f 1402/860/0 1433/1433/1433)) features:
> 0x3ffddff8ffa4fffb ([859,1432] intervals=([1213,1215] acting 0,2),([1308,1311] acting 4,10),([1401,1403] acting
> 2,10),([1426,1428] acting 2,4)) +create_info) prio 255 cost 10 e1433) queued
> 99430: -1571> 2018-10-04 13:40:11.024 7f3838417700 20 osd.4 op_wq(3) _process 3.8 to_process <OpQueueItem(3.8
> PGPeeringEvent(epoch_sent: 1433 epoch_requested: 1433 MNotifyRec 3.8 from 2 notify: (query:1433 sent:1433 3.8( v
> 866'122691 (569'119300,866'122691] local-lis/les=1401/1402 n=54053 ec=126/126 lis/c 1401/859 les/c/f 1402/860/0
> 1433/1433/1433)) features: 0x3ffddff8ffa4fffb ([859,1432] intervals=([1213,1215] acting 0,2),([1308,1311] acting
> 4,10),([1401,1403] acting 2,10),([1426,1428] acting 2,4)) +create_info) prio 255 cost 10 e1433)> waiting <>
> waiting_peering {}
> 99433: -1568> 2018-10-04 13:40:11.024 7f3838417700 20 osd.4 op_wq(3) _process OpQueueItem(3.8 PGPeeringEvent(epoch_sent:
> 1433 epoch_requested: 1433 MNotifyRec 3.8 from 2 notify: (query:1433 sent:1433 3.8( v 866'122691 (569'119300,866'122691]
> local-lis/les=1401/1402 n=54053 ec=126/126 lis/c 1401/859 les/c/f 1402/860/0 1433/1433/1433)) features:
> 0x3ffddff8ffa4fffb ([859,1432] intervals=([1213,1215] acting 0,2),([1308,1311] acting 4,10),([1401,1403] acting
> 2,10),([1426,1428] acting 2,4)) +create_info) prio 255 cost 10 e1433) pg 0x56013bc87400
> 99437: -1564> 2018-10-04 13:40:11.024 7f3838417700 10 osd.4 pg_epoch: 1433 pg[3.8( v 866'127774 (866'124700,866'127774]
> local-lis/les=859/860 n=56570 ec=126/126 lis/c 1401/859 les/c/f 1402/860/0 1433/1433/1433) [4,2] r=0 lpr=1433
> pi=[859,1433)/4 crt=866'127774 lcod 0'0 mlcod 0'0 peering mbc={}] do_peering_event: epoch_sent: 1433 epoch_requested:
> 1433 MNotifyRec 3.8 from 2 notify: (query:1433 sent:1433 3.8( v 866'122691 (569'119300,866'122691]
> local-lis/les=1401/1402 n=54053 ec=126/126 lis/c 1401/859 les/c/f 1402/860/0 1433/1433/1433)) features:
> 0x3ffddff8ffa4fffb ([859,1432] intervals=([1213,1215] acting 0,2),([1308,1311] acting 4,10),([1401,1403] acting
> 2,10),([1426,1428] acting 2,4)) +create_info
> 99440: -1561> 2018-10-04 13:40:11.024 7f3838417700 7 osd.4 pg_epoch: 1433 pg[3.8( v 866'127774 (866'124700,866'127774]
> local-lis/les=859/860 n=56570 ec=126/126 lis/c 1401/859 les/c/f 1402/860/0 1433/1433/1433) [4,2] r=0 lpr=1433
> pi=[859,1433)/4 crt=866'127774 lcod 0'0 mlcod 0'0 peering mbc={}] state<Started/Primary>: handle_pg_notify from osd.2
> 99444: -1557> 2018-10-04 13:40:11.024 7f3838417700 10 osd.4 pg_epoch: 1433 pg[3.8( v 866'127774 (866'124700,866'127774]
> local-lis/les=859/860 n=56570 ec=126/126 lis/c 1401/859 les/c/f 1402/860/0 1433/1433/1433) [4,2] r=0 lpr=1433
> pi=[859,1433)/4 crt=866'127774 lcod 0'0 mlcod 0'0 peering mbc={}] got dup osd.2 info 3.8( v 866'122691
> (569'119300,866'122691] local-lis/les=1401/1402 n=54053 ec=126/126 lis/c 1401/859 les/c/f 1402/860/0 1433/1433/1433),
> identical to ours
> 99445: -1556> 2018-10-04 13:40:11.024 7f3838417700 10 log is not dirty
> 99446: -1555> 2018-10-04 13:40:11.024 7f3838417700 10 osd.4 1433 queue_want_up_thru want 1433 <= queued 1433, currently 1426
> 99448: -1553> 2018-10-04 13:40:11.024 7f3838417700 20 osd.4 op_wq(3) _process 3.8 to_process <> waiting <>
> waiting_peering {}
> 99450: -1551> 2018-10-04 13:40:11.024 7f3838417700 20 osd.4 op_wq(3) _process OpQueueItem(3.8 PGPeeringEvent(epoch_sent:
> 1433 epoch_requested: 1433 MLogRec from 2 +create_info) prio 255 cost 10 e1433) queued
> 99456: -1545> 2018-10-04 13:40:11.024 7f3838417700 20 osd.4 op_wq(3) _process 3.8 to_process <OpQueueItem(3.8
> PGPeeringEvent(epoch_sent: 1433 epoch_requested: 1433 MLogRec from 2 +create_info) prio 255 cost 10 e1433)> waiting <>
> waiting_peering {}
> 99458: -1543> 2018-10-04 13:40:11.024 7f3838417700 20 osd.4 op_wq(3) _process OpQueueItem(3.8 PGPeeringEvent(epoch_sent:
> 1433 epoch_requested: 1433 MLogRec from 2 +create_info) prio 255 cost 10 e1433) pg 0x56013bc87400
> 99461: -1540> 2018-10-04 13:40:11.024 7f3838417700 10 osd.4 pg_epoch: 1433 pg[3.8( v 866'127774 (866'124700,866'127774]
> local-lis/les=859/860 n=56570 ec=126/126 lis/c 1401/859 les/c/f 1402/860/0 1433/1433/1433) [4,2] r=0 lpr=1433
> pi=[859,1433)/4 crt=866'127774 lcod 0'0 mlcod 0'0 peering mbc={}] do_peering_event: epoch_sent: 1433 epoch_requested:
> 1433 MLogRec from 2 +create_info
> 99465: -1536> 2018-10-04 13:40:11.024 7f3838417700 10 osd.4 pg_epoch: 1433 pg[3.8( v 866'127774 (866'124700,866'127774]
> local-lis/les=859/860 n=56570 ec=126/126 lis/c 1401/859 les/c/f 1402/860/0 1433/1433/1433) [4,2] r=0 lpr=1433
> pi=[859,1433)/4 crt=866'127774 lcod 0'0 mlcod 0'0 peering mbc={}] state<Started/Primary/Peering/GetLog>: GetLog:
> received master log from osd2
> 99467: -1534> 2018-10-04 13:40:11.024 7f3838417700 10 osd.4 pg_epoch: 1433 pg[3.8( v 866'127774 (866'124700,866'127774]
> local-lis/les=859/860 n=56570 ec=126/126 lis/c 1401/859 les/c/f 1402/860/0 1433/1433/1433) [4,2] r=0 lpr=1433
> pi=[859,1433)/4 crt=866'127774 lcod 0'0 mlcod 0'0 peering mbc={}] state<Started/Primary/Peering/GetLog>: leaving GetLog
> 99473: -1529> 2018-10-04 13:40:11.024 7f3838417700 10 osd.4 pg_epoch: 1433 pg[3.8( v 866'127774 (866'124700,866'127774]
> local-lis/les=859/860 n=56570 ec=126/126 lis/c 1401/859 les/c/f 1402/860/0 1433/1433/1433) [4,2] r=0 lpr=1433
> pi=[859,1433)/4 crt=866'127774 lcod 0'0 mlcod 0'0 peering mbc={}] state<Started/Primary/Peering/GetLog>: processing
> master log
> 99477: -1525> 2018-10-04 13:40:11.024 7f3838417700 10 osd.4 pg_epoch: 1433 pg[3.8( v 866'127774 (866'124700,866'127774]
> local-lis/les=859/860 n=56570 ec=126/126 lis/c 1401/859 les/c/f 1402/860/0 1433/1433/1433) [4,2] r=0 lpr=1433
> pi=[859,1433)/4 crt=866'127774 lcod 0'0 mlcod 0'0 peering mbc={}] proc_master_log for osd.2:
> log((866'122691,866'122691], crt=866'122691) missing(0 may_include_deletes = 1)
> 99480: -1522> 2018-10-04 13:40:11.024 7f3838417700 10 merge_log log((866'122691,866'122691], crt=866'122691) from osd.2
> into log((866'124700,866'127774], crt=866'127774)
> 101018: -1> 2018-10-04 13:40:11.028 7f3838417700 -1 /home/jj/cephsrc/ceph-13.2.2/src/osd/PGLog.cc: In function 'void
> PGLog::merge_log(pg_info_t&, pg_log_t&, pg_shard_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)' thread
> 7f3838417700 time 2018-10-04 13:40:11.027763
> 101109:2018-10-04 13:40:11.084 7f3838417700 -1 *** Caught signal (Aborted) **
> -1> 2018-10-04 13:40:11.028 7f3838417700 -1 /home/jj/cephsrc/ceph-13.2.2/src/osd/PGLog.cc: In function 'void
> PGLog::merge_log(pg_info_t&, pg_log_t&, pg_shard_t, pg_info_t&, PGLog::LogEntryHandler*, bool
> /home/jj/cephsrc/ceph-13.2.2/src/osd/PGLog.cc: 345: FAILED assert(log.head >= olog.tail && olog.head >= log.tail)
>
> ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)
> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x7f385abdc822]
> 2: (()+0x26c9e7) [0x7f385abdc9e7]
> 3: (PGLog::merge_log(pg_info_t&, pg_log_t&, pg_shard_t, pg_info_t&, PGLog::LogEntryHandler*, bool&, bool&)+0x1ad0)
> [0x5601185c7f50]
> 4: (PG::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, pg_shard_t)+0x64) [0x560118525194]
> 5: (PG::proc_master_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, pg_missing_set<false>&, pg_shard_t)+0x8b)
> [0x560118566abb]
> 6: (PG::RecoveryState::GetLog::react(PG::RecoveryState::GotLog const&)+0xab) [0x560118566e4b]
> 7: (boost::statechart::simple_state<PG::RecoveryState::GetLog, PG::RecoveryState::Peering, boost::mpl::list<mpl_::na,
> mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl
> 8: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial,
> std::allocator<void>, boost::statechart::null_exception_translator>::process_queued_events()+0xb3) [0x56011858
> 9: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial,
> std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_ba
> 10: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PG::RecoveryCtx*)+0x143) [0x560118573a53]
> 11: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0xcf)
> [0x5601184b450f]
> 12: (PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x50) [0x560118720820]
> 13: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x598) [0x5601184c4148]
> 14: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x46e) [0x7f385abe175e]
> 15: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7f385abe37e0]
> 16: (()+0x76db) [0x7f38592bf6db]
> 17: (clone()+0x3f) [0x7f385827188f]
>
> So it seems the contact to osd.2 for pg 3.8 leads to the crash.
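
The assert fires because the two pg logs no longer overlap: osd.4 still has
(866'124700,866'127774] while the copy announced by osd.2 ends at 866'122691,
so olog.head >= log.tail does not hold and merge_log() aborts. Pairs like this
are what my script hunts down. Per broken PG it boils down to roughly the
following (only a sketch, assuming BlueStore and the default paths; the OSD
id, pgid and file names are examples taken from this thread, the gist linked
above is what I actually ran):

  # find the pgid the asserting OSD trips over; with debug logging it shows
  # up in the merge_log / proc_master_log lines right before the abort
  # (assuming the default log location)
  grep -B 5 'FAILED assert(log.head' /var/log/ceph/ceph-osd.4.log

  # the OSD must be stopped before touching its store
  systemctl stop ceph-osd@4

  # always keep an export of the PG before removing anything
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-4 \
      --pgid 3.8 --op export --file /root/pg-backup/3.8-from-osd.4.export

  # drop the copy that triggers the assert so the OSD can start and peer again
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-4 \
      --pgid 3.8 --op remove --force

  systemctl start ceph-osd@4

The export is the safety net in case dropping that particular copy turns out
to be the wrong call.
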
> pool 3 'ssn_root' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 last_change 147
> flags hashpspool stripe_width 0 application cephfs
>
> The pg dump for 3.8:
> 3.8 56570 0 0 0 0 0 3074 3074
> down 2018-10-04 18:21:59.450788 866'127774 2554:129697 [4] 4
> [4] 4 541'94186 2018-10-02 21:57:31.049984 427'34480 2018-09-22
> 18:09:40.485861 0
>
> When I stop osd.2, it (osd.4) crashes with:
> -20> 2018-10-04 18:20:20.624 7f0168bef700 10 osd.4 pg_epoch: 2540 pg[3.8( v 866'127774 (866'124700,866'127774]
> local-lis/les=859/860 n=56570 ec=126/126 lis/c 1401/859 les/c/f 1402/860/0 2540/2540/2537) [4,10] r=0 lpr=2540
> pi=[859,2540)/4 crt=866'127774 lcod 0'0 mlcod 0'0 peering mbc={}] proc_master_log for osd.10:
> log((866'122691,866'122691], crt=866'122691) missing(382 may_include_deletes = 1)
> -19> 2018-10-04 18:20:20.624 7f0168bef700 10 merge_log log((866'122691,866'122691], crt=866'122691) from osd.10 into
> log((866'124700,866'127774], crt=866'127774)
> [same traceback]
>
> So I stop osd.10. Then osd.4 keeps running.
>
> pg 3.8 query says this:
>     "recovery_state": [
>         {
>             "name": "Started/Primary/Peering/Down",
>             "enter_time": "2018-10-04 18:21:59.450774",
>             "comment": "not enough up instances of this PG to go active"
>         },
>         {
>             "name": "Started/Primary/Peering",
>             "enter_time": "2018-10-04 18:21:59.450460",
>             "past_intervals": [
>                 {
>                     "first": "859",
>                     "last": "2550",
>                     "all_participants": [
>                         {
>                             "osd": 0
>                         },
>                         {
>                             "osd": 2
>                         },
>                         {
>                             "osd": 4
>                         },
>                         {
>                             "osd": 10
>                         }
>                     ],
>             [...]
>             "probing_osds": [
>                 "4"
>             ],
>             "blocked": "peering is blocked due to down osds",
>             "down_osds_we_would_probe": [
>                 0,
>                 2,
>                 10
>             ],
>             "peering_blocked_by": [
>                 {
>                     "osd": 2,
>                     "current_lost_at": 0,
>                     "comment": "starting or marking this osd lost may let us proceed"
>                 },
>                 {
>                     "osd": 10,
>                     "current_lost_at": 0,
>                     "comment": "starting or marking this osd lost may let us proceed"
>                 }
>             ]
>
>
> What shall I do? :)
>
> Cheers,
> JJ
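
To answer my own question above: after the seek-and-destroy the cluster is
up again and deep scrub is sorting out the remaining inconsistencies. For
anyone who ends up in the same "peering is blocked due to down osds" state,
the options the query output points at look roughly like this (again only a
sketch with example IDs from this thread; marking an OSD lost throws data
away and is a last resort):

  # see what a PG is still waiting for
  ceph pg 3.8 query

  # a previously exported copy can be imported into a (stopped) OSD again
  # if it turns out to be needed after all
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-4 \
      --pgid 3.8 --op import --file /root/pg-backup/3.8-from-osd.4.export

  # last resort if a down OSD really will never come back
  ceph osd lost 2 --yes-i-really-mean-it

  # afterwards, let scrubbing find and fix object-level damage
  ceph pg deep-scrub 3.8
  ceph pg repair 3.8
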