On Thu, Jul 19, 2018 at 2:48 AM, Troy Ablan <tablan@xxxxxxxxx> wrote:
>
> On 07/17/2018 11:14 PM, Brad Hubbard wrote:
>>
>> On Wed, Jul 18, 2018 at 2:57 AM, Troy Ablan <tablan@xxxxxxxxx> wrote:
>>>
>>> I was on 12.2.5 for a couple of weeks and started randomly seeing
>>> corruption, moved to 12.2.6 via yum update on Sunday, and all hell broke
>>> loose. I panicked and moved to Mimic, and when that didn't solve the
>>> problem, only then did I start to root around in the mailing list
>>> archives.
>>>
>>> It appears I can't downgrade OSDs back to Luminous now that 12.2.7 is
>>> out, but I'm unsure how to proceed now that the damaged cluster is
>>> running under Mimic. Is there anything I can do to get the cluster back
>>> online and the objects readable?
>>
>> That depends on what the specific problem is. Can you provide some
>> data that fills in the blanks around "randomly seeing corruption"?
>>
> Thanks for the reply, Brad. I have a feeling that almost all of this stems
> from the time the cluster spent running 12.2.6. When booting VMs that use
> rbd as a backing store, they typically get I/O errors during boot and
> cannot read critical parts of the image. I also get similar errors if I
> try to rbd export most of the images. CephFS is also offline, as ceph -s
> indicates damage.
>
> Many of the OSDs have been crashing and restarting as I've tried to rbd
> export good versions of images (from older snapshots). Here's one
> particular crash:
>
> 2018-07-18 15:52:15.809 7fcbaab77700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.0/rpm/el7/BUILD/ceph-13.2.0/src/os/bluestore/BlueStore.h: In function 'void BlueStore::SharedBlobSet::remove_last(BlueStore::SharedBlob*)' thread 7fcbaab77700 time 2018-07-18 15:52:15.750916
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.0/rpm/el7/BUILD/ceph-13.2.0/src/os/bluestore/BlueStore.h: 455: FAILED assert(sb->nref == 0)
>
>  ceph version 13.2.0 (79a10589f1f80dfe21e8f9794365ed98143071c4) mimic (stable)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0xff) [0x7fcbc197a53f]
>  2: (()+0x286727) [0x7fcbc197a727]
>  3: (BlueStore::SharedBlob::put()+0x1da) [0x5641f39181ca]
>  4: (std::_Rb_tree<boost::intrusive_ptr<BlueStore::SharedBlob>, boost::intrusive_ptr<BlueStore::SharedBlob>, std::_Identity<boost::intrusive_ptr<BlueStore::SharedBlob> >, std::less<boost::intrusive_ptr<BlueStore::SharedBlob> >, std::allocator<boost::intrusive_ptr<BlueStore::SharedBlob> > > >::_M_erase(std::_Rb_tree_node<boost::intrusive_ptr<BlueStore::SharedBlob> >*)+0x2d) [0x5641f3977cfd]
>  5: (std::_Rb_tree<boost::intrusive_ptr<BlueStore::SharedBlob>, boost::intrusive_ptr<BlueStore::SharedBlob>, std::_Identity<boost::intrusive_ptr<BlueStore::SharedBlob> >, std::less<boost::intrusive_ptr<BlueStore::SharedBlob> >, std::allocator<boost::intrusive_ptr<BlueStore::SharedBlob> > > >::_M_erase(std::_Rb_tree_node<boost::intrusive_ptr<BlueStore::SharedBlob> >*)+0x1b) [0x5641f3977ceb]
>  6: (std::_Rb_tree<boost::intrusive_ptr<BlueStore::SharedBlob>, boost::intrusive_ptr<BlueStore::SharedBlob>, std::_Identity<boost::intrusive_ptr<BlueStore::SharedBlob> >, std::less<boost::intrusive_ptr<BlueStore::SharedBlob> >, std::allocator<boost::intrusive_ptr<BlueStore::SharedBlob> > > >::_M_erase(std::_Rb_tree_node<boost::intrusive_ptr<BlueStore::SharedBlob> >*)+0x1b) [0x5641f3977ceb]
>  7: (std::_Rb_tree<boost::intrusive_ptr<BlueStore::SharedBlob>, boost::intrusive_ptr<BlueStore::SharedBlob>, std::_Identity<boost::intrusive_ptr<BlueStore::SharedBlob> >, std::less<boost::intrusive_ptr<BlueStore::SharedBlob> >, std::allocator<boost::intrusive_ptr<BlueStore::SharedBlob> > > >::_M_erase(std::_Rb_tree_node<boost::intrusive_ptr<BlueStore::SharedBlob> >*)+0x1b) [0x5641f3977ceb]
>  8: (BlueStore::TransContext::~TransContext()+0xf7) [0x5641f3979297]
>  9: (BlueStore::_txc_finish(BlueStore::TransContext*)+0x610) [0x5641f391c9b0]
>  10: (BlueStore::_txc_state_proc(BlueStore::TransContext*)+0x9a) [0x5641f392a38a]
>  11: (BlueStore::_kv_finalize_thread()+0x41e) [0x5641f392b3be]
>  12: (BlueStore::KVFinalizeThread::entry()+0xd) [0x5641f397d85d]
>  13: (()+0x7e25) [0x7fcbbe4d2e25]
>  14: (clone()+0x6d) [0x7fcbbd5c3bad]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>
> Here's the output of ceph -s, which should fill in some of the
> configuration questions. Since the OSDs continually restart whenever I
> put load on the cluster, it seems to be churning a bit. That's why I set
> nodown for now.
>
>   cluster:
>     id:     b2873c9a-5539-4c76-ac4a-a6c9829bfed2
>     health: HEALTH_ERR
>             1 filesystem is degraded
>             1 filesystem is offline
>             1 mds daemon damaged
>             nodown,noscrub,nodeep-scrub flag(s) set
>             9 scrub errors
>             Reduced data availability: 61 pgs inactive, 56 pgs peering, 4 pgs stale
>             Possible data damage: 3 pgs inconsistent
>             16 slow requests are blocked > 32 sec
>             26 stuck requests are blocked > 4096 sec
>
>   services:
>     mon: 5 daemons, quorum a,b,c,d,e
>     mgr: a(active), standbys: b, d, e, c
>     mds: lcs-0/1/1 up , 2 up:standby, 1 damaged
>     osd: 34 osds: 34 up, 34 in
>          flags nodown,noscrub,nodeep-scrub
>
>   data:
>     pools:   15 pools, 640 pgs
>     objects: 9.73 M objects, 13 TiB
>     usage:   24 TiB used, 55 TiB / 79 TiB avail
>     pgs:     23.438% pgs not active
>              487 active+clean
>              73  peering
>              70  activating
>              5   stale+peering
>              3   active+clean+inconsistent
>              2   stale+activating
>
>   io:
>     client:   1.3 KiB/s wr, 0 op/s rd, 0 op/s wr
>
> If there's any other information I can provide that could help point to
> the problem, I'd be glad to share.

If you leave the cluster to recover, what point does it get to (ceph -s
output)?

-- 
Cheers,
Brad
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
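
[Reference note] The snapshot-based export Troy describes above uses rbd's
standard snapshot and export commands. A minimal sketch follows; the pool,
image, and snapshot names are placeholders, not taken from Troy's cluster:

    # list the snapshots available for an image
    rbd snap ls vmpool/vm-disk-1

    # export a known-good older snapshot instead of the possibly damaged
    # live image; rbd export is read-only against the source
    rbd export vmpool/vm-disk-1@snap-20180714 /backup/vm-disk-1.img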
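
[Reference note] The nodown/noscrub/nodeep-scrub flags shown in the ceph -s
output are cluster-wide OSD map flags. A rough sketch of how they are set
and cleared, and how one might watch recovery to answer Brad's question;
this is only an illustrative example, not advice specific to this cluster:

    # flags Troy set to keep flapping OSDs from being marked down and to
    # suppress scrubbing while debugging
    ceph osd set nodown
    ceph osd set noscrub
    ceph osd set nodeep-scrub

    # to see how far recovery gets, optionally clear nodown and observe
    # where the PG states settle
    ceph osd unset nodown
    ceph -s              # point-in-time cluster summary
    ceph -w              # follow cluster events as they happen
    ceph health detail   # per-PG detail on inconsistent/stale PGs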