On Thu, Jul 19, 2018 at 12:47 PM, Troy Ablan <tablan@xxxxxxxxx> wrote:
>
>
> On 07/18/2018 06:37 PM, Brad Hubbard wrote:
>> On Thu, Jul 19, 2018 at 2:48 AM, Troy Ablan <tablan@xxxxxxxxx> wrote:
>>>
>>>
>>> On 07/17/2018 11:14 PM, Brad Hubbard wrote:
>>>>
>>>> On Wed, Jul 18, 2018 at 2:57 AM, Troy Ablan <tablan@xxxxxxxxx> wrote:
>>>>>
>>>>> I was on 12.2.5 for a couple of weeks and started randomly seeing
>>>>> corruption, moved to 12.2.6 via yum update on Sunday, and all hell broke
>>>>> loose. I panicked and moved to Mimic, and when that didn't solve the
>>>>> problem, only then did I start to root around in the mailing list archives.
>>>>>
>>>>> It appears I can't downgrade OSDs back to Luminous now that 12.2.7 is
>>>>> out, but I'm unsure how to proceed now that the damaged cluster is
>>>>> running under Mimic. Is there anything I can do to get the cluster back
>>>>> online and objects readable?
>>>>
>>>> That depends on what the specific problem is. Can you provide some
>>>> data that fills in the blanks around "randomly seeing corruption"?
>>>>
>>> Thanks for the reply, Brad. I have a feeling that almost all of this stems
>>> from the time the cluster spent running 12.2.6. When booting VMs that use
>>> rbd as a backing store, they typically get I/O errors during boot and cannot
>>> read critical parts of the image. I also get similar errors if I try to rbd
>>> export most of the images. CephFS is also not started, as ceph -s indicates
>>> damage.
>>>
>>> Many of the OSDs have been crashing and restarting as I've tried to rbd
>>> export good versions of images (from older snapshots). Here's one
>>> particular crash:
>>>
>>> 2018-07-18 15:52:15.809 7fcbaab77700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.0/rpm/el7/BUILD/ceph-13.2.0/src/os/bluestore/BlueStore.h: In function 'void BlueStore::SharedBlobSet::remove_last(BlueStore::SharedBlob*)' thread 7fcbaab77700 time 2018-07-18 15:52:15.750916
>>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.0/rpm/el7/BUILD/ceph-13.2.0/src/os/bluestore/BlueStore.h: 455: FAILED assert(sb->nref == 0)
>>>
>>> ceph version 13.2.0 (79a10589f1f80dfe21e8f9794365ed98143071c4) mimic (stable)
>>> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0xff) [0x7fcbc197a53f]
>>> 2: (()+0x286727) [0x7fcbc197a727]
>>> 3: (BlueStore::SharedBlob::put()+0x1da) [0x5641f39181ca]
>>> 4: (std::_Rb_tree<boost::intrusive_ptr<BlueStore::SharedBlob>, boost::intrusive_ptr<BlueStore::SharedBlob>, std::_Identity<boost::intrusive_ptr<BlueStore::SharedBlob> >, std::less<boost::intrusive_ptr<BlueStore::SharedBlob> >, std::allocator<boost::intrusive_ptr<BlueStore::SharedBlob> > >::_M_erase(std::_Rb_tree_node<boost::intrusive_ptr<BlueStore::SharedBlob> >*)+0x2d) [0x5641f3977cfd]
>>> 5: (std::_Rb_tree<boost::intrusive_ptr<BlueStore::SharedBlob>, boost::intrusive_ptr<BlueStore::SharedBlob>, std::_Identity<boost::intrusive_ptr<BlueStore::SharedBlob> >, std::less<boost::intrusive_ptr<BlueStore::SharedBlob> >, std::allocator<boost::intrusive_ptr<BlueStore::SharedBlob> > >::_M_erase(std::_Rb_tree_node<boost::intrusive_ptr<BlueStore::SharedBlob> >*)+0x1b) [0x5641f3977ceb]
>>> 6: (std::_Rb_tree<boost::intrusive_ptr<BlueStore::SharedBlob>, boost::intrusive_ptr<BlueStore::SharedBlob>, std::_Identity<boost::intrusive_ptr<BlueStore::SharedBlob> >, std::less<boost::intrusive_ptr<BlueStore::SharedBlob> >, std::allocator<boost::intrusive_ptr<BlueStore::SharedBlob> > >::_M_erase(std::_Rb_tree_node<boost::intrusive_ptr<BlueStore::SharedBlob> >*)+0x1b) [0x5641f3977ceb]
>>> 7: (std::_Rb_tree<boost::intrusive_ptr<BlueStore::SharedBlob>, boost::intrusive_ptr<BlueStore::SharedBlob>, std::_Identity<boost::intrusive_ptr<BlueStore::SharedBlob> >, std::less<boost::intrusive_ptr<BlueStore::SharedBlob> >, std::allocator<boost::intrusive_ptr<BlueStore::SharedBlob> > >::_M_erase(std::_Rb_tree_node<boost::intrusive_ptr<BlueStore::SharedBlob> >*)+0x1b) [0x5641f3977ceb]
>>> 8: (BlueStore::TransContext::~TransContext()+0xf7) [0x5641f3979297]
>>> 9: (BlueStore::_txc_finish(BlueStore::TransContext*)+0x610) [0x5641f391c9b0]
>>> 10: (BlueStore::_txc_state_proc(BlueStore::TransContext*)+0x9a) [0x5641f392a38a]
>>> 11: (BlueStore::_kv_finalize_thread()+0x41e) [0x5641f392b3be]
>>> 12: (BlueStore::KVFinalizeThread::entry()+0xd) [0x5641f397d85d]
>>> 13: (()+0x7e25) [0x7fcbbe4d2e25]
>>> 14: (clone()+0x6d) [0x7fcbbd5c3bad]
>>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>>
>>>
>>> Here's the output of ceph -s that might fill in some configuration
>>> questions. Since OSDs are continually restarting if I try to put load
>>> on the cluster, it seems to be churning a bit. That's why I set nodown
>>> for now.
>>>
>>>   cluster:
>>>     id:     b2873c9a-5539-4c76-ac4a-a6c9829bfed2
>>>     health: HEALTH_ERR
>>>             1 filesystem is degraded
>>>             1 filesystem is offline
>>>             1 mds daemon damaged
>>>             nodown,noscrub,nodeep-scrub flag(s) set
>>>             9 scrub errors
>>>             Reduced data availability: 61 pgs inactive, 56 pgs peering, 4 pgs stale
>>>             Possible data damage: 3 pgs inconsistent
>>>             16 slow requests are blocked > 32 sec
>>>             26 stuck requests are blocked > 4096 sec
>>>
>>>   services:
>>>     mon: 5 daemons, quorum a,b,c,d,e
>>>     mgr: a(active), standbys: b, d, e, c
>>>     mds: lcs-0/1/1 up , 2 up:standby, 1 damaged
>>>     osd: 34 osds: 34 up, 34 in
>>>          flags nodown,noscrub,nodeep-scrub
>>>
>>>   data:
>>>     pools:   15 pools, 640 pgs
>>>     objects: 9.73 M objects, 13 TiB
>>>     usage:   24 TiB used, 55 TiB / 79 TiB avail
>>>     pgs:     23.438% pgs not active
>>>              487 active+clean
>>>              73 peering
>>>              70 activating
>>>              5 stale+peering
>>>              3 active+clean+inconsistent
>>>              2 stale+activating
>>>
>>>   io:
>>>     client: 1.3 KiB/s wr, 0 op/s rd, 0 op/s wr
>>>
>>>
>>> If there's any other information I can provide that can help point to the
>>> problem, I'd be glad to share.
>>
>> If you leave the cluster to recover, what point does it get to (ceph -s output)?
>>
>
> OK, I stopped client activity as much as I could, and watched OSDs still
> crashing, so I set noout and stopped all OSDs. I then brought up the
> ones that were crashing, and they came up fine in isolation. But they
> crashed once I brought enough of their peers up.
>
> I could have sworn that prior to this, there was a point where I had all
> 640 pgs active on Mimic.
>
> Here's the best state I've been able to get (the most active PGs).
>
>   cluster:
>     id:     b2873c9a-5539-4c76-ac4a-a6c9829bfed2
>     health: HEALTH_ERR
>             1 filesystem is degraded
>             1 filesystem is offline
>             1 mds daemon damaged
>             nodown,noout,noscrub,nodeep-scrub flag(s) set
>             8 scrub errors
>             Reduced data availability: 25 pgs inactive, 18 pgs peering
>             Possible data damage: 2 pgs inconsistent
>             Degraded data redundancy: 152/75371544 objects degraded (0.000%), 4 pgs degraded
>             3 slow requests are blocked > 32 sec
>
>   services:
>     mon: 5 daemons, quorum a,b,c,d,e
>     mgr: a(active), standbys: b, d, e, c
>     mds: lcs-0/1/1 up , 2 up:standby, 1 damaged
>     osd: 34 osds: 34 up, 34 in; 1 remapped pgs
>          flags nodown,noout,noscrub,nodeep-scrub
>
>   data:
>     pools:   15 pools, 640 pgs
>     objects: 9.73 M objects, 13 TiB
>     usage:   24 TiB used, 55 TiB / 79 TiB avail
>     pgs:     5.156% pgs not active
>              152/75371544 objects degraded (0.000%)
>              603 active+clean
>              18 peering
>              13 activating
>              2 active+clean+inconsistent
>              2 active+recovery_wait+degraded
>              1 activating+degraded
>              1 activating+undersized+degraded+remapped
>
>
> These crashes seem to happen without FAILED assert messages, but rather
> just an abort() signal. FWIW, all of the crashing OSDs are on SSD.
> They have a writeback cache pool as well as their own pools outright.
>
> 2018-07-19 02:23:04.463 7fa5f483c700 1 osd.18 pg_epoch: 30273 pg[18.7s4( empty local-lis/les=30242/30244 n=0 ec=12913/12913 lis/c 30242/30242 les/c/f 30244/30244/0 30273/30273/30222) [30,3,7,4,18,26,16,28]p30(0) r=4 lpr=30273 pi=[30242,30273)/1 crt=0'0 unknown NOTIFY mbc={}] start_peering_interval up [30,3,7,4,18,26,2147483647,28] -> [30,3,7,4,18,26,16,28], acting [30,3,7,4,18,26,2147483647,28] -> [30,3,7,4,18,26,16,28], acting_primary 30(0) -> 30, up_primary 30(0) -> 30, role 4 -> 4, features acting 2305244844817448955 upacting 2305244844817448955
> 2018-07-19 02:23:04.464 7fa5f483c700 1 osd.18 pg_epoch: 30273 pg[18.7s4( empty local-lis/les=30242/30244 n=0 ec=12913/12913 lis/c 30242/30242 les/c/f 30244/30244/0 30273/30273/30222) [30,3,7,4,18,26,16,28]p30(0) r=4 lpr=30273 pi=[30242,30273)/1 crt=0'0 unknown NOTIFY mbc={}] state<Start>: transitioning to Stray
> 2018-07-19 02:23:04.465 7fa5f383a700 1 osd.18 pg_epoch: 30273 pg[6.es3( v 22653'1514806 (18839'1511721,22653'1514806] local-lis/les=30196/30198 n=5993 ec=778/778 lis/c 30196/30196 les/c/f 30198/30198/0 30273/30273/30222) [30,1,21,18,28,16,33,13]p30(0) r=3 lpr=30273 pi=[30196,30273)/1 crt=22635'1514805 lcod 0'0 unknown NOTIFY mbc={} ps=16639] start_peering_interval up [30,1,21,18,28,2147483647,33,13] -> [30,1,21,18,28,16,33,13], acting [30,1,21,18,28,2147483647,33,13] -> [30,1,21,18,28,16,33,13], acting_primary 30(0) -> 30, up_primary 30(0) -> 30, role 3 -> 3, features acting 2305244844817448955 upacting 2305244844817448955
> 2018-07-19 02:23:04.465 7fa5f383a700 1 osd.18 pg_epoch: 30273 pg[6.es3( v 22653'1514806 (18839'1511721,22653'1514806] local-lis/les=30196/30198 n=5993 ec=778/778 lis/c 30196/30196 les/c/f 30198/30198/0 30273/30273/30222) [30,1,21,18,28,16,33,13]p30(0) r=3 lpr=30273 pi=[30196,30273)/1 crt=22635'1514805 lcod 0'0 unknown NOTIFY mbc={} ps=16639] state<Start>: transitioning to Stray
> 2018-07-19 02:23:04.822 7fa5f483c700 -1 *** Caught signal (Aborted) **
> in thread 7fa5f483c700 thread_name:tp_osd_tp
>
> ceph version 13.2.0 (79a10589f1f80dfe21e8f9794365ed98143071c4) mimic (stable)
> 1: (()+0x8e1870) [0x56345079c870]
> 2: (()+0xf6d0) [0x7fa6141b76d0]
> 3: (gsignal()+0x37) [0x7fa6131d8277]
> 4: (abort()+0x148) [0x7fa6131d9968]
> 5: (BlueStore::_wctx_finish(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>, BlueStore::WriteContext*, std::set<BlueStore::SharedBlob*, std::less<BlueStore::SharedBlob*>, std::allocator<BlueStore::SharedBlob*> >*)+0xdea) [0x563450687cda]
> 6: (BlueStore::_do_truncate(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>, unsigned long, std::set<BlueStore::SharedBlob*, std::less<BlueStore::SharedBlob*>, std::allocator<BlueStore::SharedBlob*> >*)+0x13d) [0x563450696ead]
> 7: (BlueStore::_do_remove(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>)+0xbf) [0x56345069767f]
> 8: (BlueStore::_remove(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&)+0x60) [0x563450698e50]
> 9: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0x105d) [0x5634506a366d]
> 10: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x519) [0x5634506a57b9]
> 11: (ObjectStore::queue_transaction(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ObjectStore::Transaction&&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x80) [0x5634502b68d0]
> 12: (OSD::dispatch_context_transaction(PG::RecoveryCtx&, PG*, ThreadPool::TPHandle*)+0x58) [0x56345024d788]
> 13: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0xfe) [0x56345027b23e]
> 14: (PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x50) [0x5634504d2820]
> 15: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x592) [0x563450285e02]
> 16: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x3d3) [0x7fa61765d333]
> 17: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7fa61765df20]
> 18: (()+0x7e25) [0x7fa6141afe25]
> 19: (clone()+0x6d) [0x7fa6132a0bad]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
> I'm on IRC (as MooingLemur) if more real-time communication would help :)

Sure, I'll try to contact you there.

In the meantime could you open up a tracker showing the crash stack
trace above and a brief description of the current situation and the
events leading up to it? Could you also get a debug log of one of these
crashes with "debug bluestore = 20" and, ideally, a coredump?
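Something along these lines is roughly what I mean, using osd.18 only
because it's the one in the log above; the systemd/CentOS 7 bits are
assumptions on my part, so adjust to match your deployment:

  # Turn up BlueStore logging on the crashing OSD (logs grow quickly at 20);
  # output lands in /var/log/ceph/ceph-osd.18.log on that host.
  ceph tell osd.18 injectargs '--debug_bluestore 20/20'

  # Or persist it in /etc/ceph/ceph.conf on that host so it survives restarts:
  #   [osd]
  #   debug bluestore = 20

  # Let the systemd-managed daemon dump core: add "[Service]" and
  # "LimitCORE=infinity" in the override, then restart the OSD.
  systemctl edit ceph-osd@18
  systemctl restart ceph-osd@18

  # Make sure cores land somewhere with space (the directory must exist),
  # or use whatever abrt/systemd-coredump setup you already have.
  sysctl -w kernel.core_pattern=/var/crash/core.%e.%p.%t

Whatever equivalent works in your environment is fine, of course.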
>
> I much appreciate the help. I feel like a doofus for panicking and
> upgrading to Mimic rather than finding the mailing list posts, being
> cautious, and waiting.
>
> -Troy

--
Cheers,
Brad
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com