On Thu, Jul 19, 2018 at 12:47 PM, Troy Ablan <tablan@xxxxxxxxx> wrote:
>
>
> On 07/18/2018 06:37 PM, Brad Hubbard wrote:
>> On Thu, Jul 19, 2018 at 2:48 AM, Troy Ablan <tablan@xxxxxxxxx> wrote:
>>>
>>>
>>> On 07/17/2018 11:14 PM, Brad Hubbard wrote:
>>>>
>>>> On Wed, Jul 18, 2018 at 2:57 AM, Troy Ablan <tablan@xxxxxxxxx> wrote:
>>>>>
>>>>> I was on 12.2.5 for a couple of weeks and started randomly seeing
>>>>> corruption, moved to 12.2.6 via yum update on Sunday, and all hell broke
>>>>> loose. I panicked and moved to Mimic, and when that didn't solve the
>>>>> problem, only then did I start to root around in the mailing list archives.
>>>>>
>>>>> It appears I can't downgrade OSDs back to Luminous now that 12.2.7 is
>>>>> out, but I'm unsure how to proceed now that the damaged cluster is
>>>>> running under Mimic. Is there anything I can do to get the cluster back
>>>>> online and objects readable?
>>>>
>>>> That depends on what the specific problem is. Can you provide some
>>>> data that fills in the blanks around "randomly seeing corruption"?
>>>>
>>> Thanks for the reply, Brad. I have a feeling that almost all of this stems
>>> from the time the cluster spent running 12.2.6. When booting VMs that use
>>> rbd as a backing store, they typically get I/O errors during boot and cannot
>>> read critical parts of the image. I also get similar errors if I try to rbd
>>> export most of the images. CephFS is also not started, as ceph -s indicates
>>> damage.
>>>
>>> Many of the OSDs have been crashing and restarting as I've tried to rbd
>>> export good versions of images (from older snapshots). Here's one
>>> particular crash:
>>>
>>> 2018-07-18 15:52:15.809 7fcbaab77700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.0/rpm/el7/BUILD/ceph-13.2.0/src/os/bluestore/BlueStore.h: In function 'void BlueStore::SharedBlobSet::remove_last(BlueStore::SharedBlob*)' thread 7fcbaab77700 time 2018-07-18 15:52:15.750916
>>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.0/rpm/el7/BUILD/ceph-13.2.0/src/os/bluestore/BlueStore.h: 455: FAILED assert(sb->nref == 0)
>>>
>>> ceph version 13.2.0 (79a10589f1f80dfe21e8f9794365ed98143071c4) mimic (stable)
>>> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0xff) [0x7fcbc197a53f]
>>> 2: (()+0x286727) [0x7fcbc197a727]
>>> 3: (BlueStore::SharedBlob::put()+0x1da) [0x5641f39181ca]
>>> 4: (std::_Rb_tree<boost::intrusive_ptr<BlueStore::SharedBlob>, boost::intrusive_ptr<BlueStore::SharedBlob>, std::_Identity<boost::intrusive_ptr<BlueStore::SharedBlob> >, std::less<boost::intrusive_ptr<BlueStore::SharedBlob> >, std::allocator<boost::intrusive_ptr<BlueStore::SharedBlob> > >::_M_erase(std::_Rb_tree_node<boost::intrusive_ptr<BlueStore::SharedBlob> >*)+0x2d) [0x5641f3977cfd]
>>> 5: (std::_Rb_tree<boost::intrusive_ptr<BlueStore::SharedBlob>, boost::intrusive_ptr<BlueStore::SharedBlob>, std::_Identity<boost::intrusive_ptr<BlueStore::SharedBlob> >, std::less<boost::intrusive_ptr<BlueStore::SharedBlob> >, std::allocator<boost::intrusive_ptr<BlueStore::SharedBlob> > >::_M_erase(std::_Rb_tree_node<boost::intrusive_ptr<BlueStore::SharedBlob> >*)+0x1b) [0x5641f3977ceb]
>>> 6: (std::_Rb_tree<boost::intrusive_ptr<BlueStore::SharedBlob>, boost::intrusive_ptr<BlueStore::SharedBlob>, std::_Identity<boost::intrusive_ptr<BlueStore::SharedBlob> >, std::less<boost::intrusive_ptr<BlueStore::SharedBlob> >, std::allocator<boost::intrusive_ptr<BlueStore::SharedBlob> > >::_M_erase(std::_Rb_tree_node<boost::intrusive_ptr<BlueStore::SharedBlob> >*)+0x1b) [0x5641f3977ceb]
>>> 7: (std::_Rb_tree<boost::intrusive_ptr<BlueStore::SharedBlob>, boost::intrusive_ptr<BlueStore::SharedBlob>, std::_Identity<boost::intrusive_ptr<BlueStore::SharedBlob> >, std::less<boost::intrusive_ptr<BlueStore::SharedBlob> >, std::allocator<boost::intrusive_ptr<BlueStore::SharedBlob> > >::_M_erase(std::_Rb_tree_node<boost::intrusive_ptr<BlueStore::SharedBlob> >*)+0x1b) [0x5641f3977ceb]
>>> 8: (BlueStore::TransContext::~TransContext()+0xf7) [0x5641f3979297]
>>> 9: (BlueStore::_txc_finish(BlueStore::TransContext*)+0x610) [0x5641f391c9b0]
>>> 10: (BlueStore::_txc_state_proc(BlueStore::TransContext*)+0x9a) [0x5641f392a38a]
>>> 11: (BlueStore::_kv_finalize_thread()+0x41e) [0x5641f392b3be]
>>> 12: (BlueStore::KVFinalizeThread::entry()+0xd) [0x5641f397d85d]
>>> 13: (()+0x7e25) [0x7fcbbe4d2e25]
>>> 14: (clone()+0x6d) [0x7fcbbd5c3bad]
>>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>>
>>>
>>> Here's the output of ceph -s that might fill in some configuration
>>> questions. Since OSDs are continually restarting if I try to put load
>>> on the cluster, it seems to be churning a bit. That's why I set nodown
>>> for now.
>>>
>>>   cluster:
>>>     id:     b2873c9a-5539-4c76-ac4a-a6c9829bfed2
>>>     health: HEALTH_ERR
>>>             1 filesystem is degraded
>>>             1 filesystem is offline
>>>             1 mds daemon damaged
>>>             nodown,noscrub,nodeep-scrub flag(s) set
>>>             9 scrub errors
>>>             Reduced data availability: 61 pgs inactive, 56 pgs peering, 4 pgs stale
>>>             Possible data damage: 3 pgs inconsistent
>>>             16 slow requests are blocked > 32 sec
>>>             26 stuck requests are blocked > 4096 sec
>>>
>>>   services:
>>>     mon: 5 daemons, quorum a,b,c,d,e
>>>     mgr: a(active), standbys: b, d, e, c
>>>     mds: lcs-0/1/1 up , 2 up:standby, 1 damaged
>>>     osd: 34 osds: 34 up, 34 in
>>>          flags nodown,noscrub,nodeep-scrub
>>>
>>>   data:
>>>     pools:   15 pools, 640 pgs
>>>     objects: 9.73 M objects, 13 TiB
>>>     usage:   24 TiB used, 55 TiB / 79 TiB avail
>>>     pgs:     23.438% pgs not active
>>>              487 active+clean
>>>              73 peering
>>>              70 activating
>>>              5 stale+peering
>>>              3 active+clean+inconsistent
>>>              2 stale+activating
>>>
>>>   io:
>>>     client: 1.3 KiB/s wr, 0 op/s rd, 0 op/s wr
>>>
>>>
>>> If there's any other information I can provide that can help point to the
>>> problem, I'd be glad to share.
>>
>> If you leave the cluster to recover, what point does it get to (ceph -s output)?
>>
>
> OK, I stopped client activity as much as I could, and watched OSDs still
> crashing, so I set noout and stopped all OSDs. I then brought up the
> ones that were crashing, and they came up fine in isolation. But they
> crashed once I brought enough of their peers up.
>
> I could have sworn that prior to this, there was a point where I had all
> 640 pgs active on Mimic.
>
> Here's the best state I've been able to get (the most active PGs).
>
>   cluster:
>     id:     b2873c9a-5539-4c76-ac4a-a6c9829bfed2
>     health: HEALTH_ERR
>             1 filesystem is degraded
>             1 filesystem is offline
>             1 mds daemon damaged
>             nodown,noout,noscrub,nodeep-scrub flag(s) set
>             8 scrub errors
>             Reduced data availability: 25 pgs inactive, 18 pgs peering
>             Possible data damage: 2 pgs inconsistent
>             Degraded data redundancy: 152/75371544 objects degraded (0.000%), 4 pgs degraded
>             3 slow requests are blocked > 32 sec
>
>   services:
>     mon: 5 daemons, quorum a,b,c,d,e
>     mgr: a(active), standbys: b, d, e, c
>     mds: lcs-0/1/1 up , 2 up:standby, 1 damaged
>     osd: 34 osds: 34 up, 34 in; 1 remapped pgs
>          flags nodown,noout,noscrub,nodeep-scrub
>
>   data:
>     pools:   15 pools, 640 pgs
>     objects: 9.73 M objects, 13 TiB
>     usage:   24 TiB used, 55 TiB / 79 TiB avail
>     pgs:     5.156% pgs not active
>              152/75371544 objects degraded (0.000%)
>              603 active+clean
>              18 peering
>              13 activating
>              2 active+clean+inconsistent
>              2 active+recovery_wait+degraded
>              1 activating+degraded
>              1 activating+undersized+degraded+remapped
>
>
> These crashes seem to happen without FAILED assert messages, but rather
> just an abort() signal. FWIW, all of the crashing OSDs are on SSD.
> They have a writeback cache pool as well as their own pools outright.
>
> 2018-07-19 02:23:04.463 7fa5f483c700 1 osd.18 pg_epoch: 30273 pg[18.7s4( empty local-lis/les=30242/30244 n=0 ec=12913/12913 lis/c 30242/30242 les/c/f 30244/30244/0 30273/30273/30222) [30,3,7,4,18,26,16,28]p30(0) r=4 lpr=30273 pi=[30242,30273)/1 crt=0'0 unknown NOTIFY mbc={}] start_peering_interval up [30,3,7,4,18,26,2147483647,28] -> [30,3,7,4,18,26,16,28], acting [30,3,7,4,18,26,2147483647,28] -> [30,3,7,4,18,26,16,28], acting_primary 30(0) -> 30, up_primary 30(0) -> 30, role 4 -> 4, features acting 2305244844817448955 upacting 2305244844817448955
> 2018-07-19 02:23:04.464 7fa5f483c700 1 osd.18 pg_epoch: 30273 pg[18.7s4( empty local-lis/les=30242/30244 n=0 ec=12913/12913 lis/c 30242/30242 les/c/f 30244/30244/0 30273/30273/30222) [30,3,7,4,18,26,16,28]p30(0) r=4 lpr=30273 pi=[30242,30273)/1 crt=0'0 unknown NOTIFY mbc={}] state<Start>: transitioning to Stray
> 2018-07-19 02:23:04.465 7fa5f383a700 1 osd.18 pg_epoch: 30273 pg[6.es3( v 22653'1514806 (18839'1511721,22653'1514806] local-lis/les=30196/30198 n=5993 ec=778/778 lis/c 30196/30196 les/c/f 30198/30198/0 30273/30273/30222) [30,1,21,18,28,16,33,13]p30(0) r=3 lpr=30273 pi=[30196,30273)/1 crt=22635'1514805 lcod 0'0 unknown NOTIFY mbc={} ps=16639] start_peering_interval up [30,1,21,18,28,2147483647,33,13] -> [30,1,21,18,28,16,33,13], acting [30,1,21,18,28,2147483647,33,13] -> [30,1,21,18,28,16,33,13], acting_primary 30(0) -> 30, up_primary 30(0) -> 30, role 3 -> 3, features acting 2305244844817448955 upacting 2305244844817448955
> 2018-07-19 02:23:04.465 7fa5f383a700 1 osd.18 pg_epoch: 30273 pg[6.es3( v 22653'1514806 (18839'1511721,22653'1514806] local-lis/les=30196/30198 n=5993 ec=778/778 lis/c 30196/30196 les/c/f 30198/30198/0 30273/30273/30222) [30,1,21,18,28,16,33,13]p30(0) r=3 lpr=30273 pi=[30196,30273)/1 crt=22635'1514805 lcod 0'0 unknown NOTIFY mbc={} ps=16639] state<Start>: transitioning to Stray
> 2018-07-19 02:23:04.822 7fa5f483c700 -1 *** Caught signal (Aborted) **
> in thread 7fa5f483c700 thread_name:tp_osd_tp
>
> ceph version 13.2.0 (79a10589f1f80dfe21e8f9794365ed98143071c4) mimic (stable)
> 1: (()+0x8e1870) [0x56345079c870]
> 2: (()+0xf6d0) [0x7fa6141b76d0]
> 3: (gsignal()+0x37) [0x7fa6131d8277]
> 4: (abort()+0x148) [0x7fa6131d9968]
> 5: (BlueStore::_wctx_finish(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>, BlueStore::WriteContext*, std::set<BlueStore::SharedBlob*, std::less<BlueStore::SharedBlob*>, std::allocator<BlueStore::SharedBlob*> >*)+0xdea) [0x563450687cda]
> 6: (BlueStore::_do_truncate(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>, unsigned long, std::set<BlueStore::SharedBlob*, std::less<BlueStore::SharedBlob*>, std::allocator<BlueStore::SharedBlob*> >*)+0x13d) [0x563450696ead]
> 7: (BlueStore::_do_remove(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>)+0xbf) [0x56345069767f]
> 8: (BlueStore::_remove(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&)+0x60) [0x563450698e50]
> 9: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0x105d) [0x5634506a366d]
> 10: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x519) [0x5634506a57b9]
> 11: (ObjectStore::queue_transaction(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ObjectStore::Transaction&&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x80) [0x5634502b68d0]
> 12: (OSD::dispatch_context_transaction(PG::RecoveryCtx&, PG*, ThreadPool::TPHandle*)+0x58) [0x56345024d788]
> 13: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0xfe) [0x56345027b23e]
> 14: (PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x50) [0x5634504d2820]
> 15: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x592) [0x563450285e02]
> 16: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x3d3) [0x7fa61765d333]
> 17: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7fa61765df20]
> 18: (()+0x7e25) [0x7fa6141afe25]
> 19: (clone()+0x6d) [0x7fa6132a0bad]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
> I'm on IRC (as MooingLemur) if more real-time communication would help :)

Sure, I'll try to contact you there.

In the meantime could you open up a tracker showing the crash stack
trace above and a brief description of the current situation and the
events leading up to it? Could you also get a debug log of one of these
crashes with "debug bluestore = 20" and, ideally, a coredump?
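Something along these lines is roughly what I mean, using osd.18 only
because it's the one in the log above; the systemd/CentOS 7 bits are
assumptions on my part, so adjust to match your deployment:

  # Turn up BlueStore logging on the crashing OSD (logs grow quickly at 20);
  # output lands in /var/log/ceph/ceph-osd.18.log on that host.
  ceph tell osd.18 injectargs '--debug_bluestore 20/20'

  # Or persist it in /etc/ceph/ceph.conf on that host so it survives restarts:
  #   [osd]
  #   debug bluestore = 20

  # Let the systemd-managed daemon dump core: add "[Service]" and
  # "LimitCORE=infinity" in the override, then restart the OSD.
  systemctl edit ceph-osd@18
  systemctl restart ceph-osd@18

  # Make sure cores land somewhere with space (the directory must exist),
  # or use whatever abrt/systemd-coredump setup you already have.
  sysctl -w kernel.core_pattern=/var/crash/core.%e.%p.%t

Whatever equivalent works in your environment is fine, of course.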
>
> I much appreciate the help. I feel like a doofus for panicking and
> upgrading to Mimic rather than finding the mailing list posts, being
> cautious, and waiting.
>
> -Troy

--
Cheers,
Brad
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com