Re: Recovery from 12.2.5 (corruption) -> 12.2.6 (hair on fire) -> 13.2.0 (some objects inaccessible and CephFS damaged)

On 07/17/2018 11:14 PM, Brad Hubbard wrote:
> On Wed, Jul 18, 2018 at 2:57 AM, Troy Ablan <tablan@xxxxxxxxx> wrote:
>> I was on 12.2.5 for a couple of weeks and started randomly seeing
>> corruption, moved to 12.2.6 via yum update on Sunday, and all hell broke
>> loose.  I panicked and moved to Mimic, and when that didn't solve the
>> problem, only then did I start to root around in the mailing list archives.
>>
>> It appears I can't downgrade OSDs back to Luminous now that 12.2.7 is
>> out, but I'm unsure how to proceed now that the damaged cluster is
>> running under Mimic.  Is there anything I can do to get the cluster back
>> online and the objects readable?
>
> That depends on what the specific problem is. Can you provide some
> data that fills in the blanks around "randomly seeing corruption"?

Thanks for the reply, Brad.  I have a feeling that almost all of this stems from the time the cluster spent running 12.2.6.  VMs that use rbd as a backing store typically get I/O errors during boot and cannot read critical parts of the image, and I get similar errors if I try to rbd export most of the images.  CephFS also won't start, as ceph -s reports the filesystem damaged.
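
For context, the export attempts described here look roughly like the following (pool, image, and snapshot names are placeholders, not the actual ones in this cluster):

    # list snapshots of an image, then try to export an older snapshot to a file
    rbd snap ls <pool>/<image>
    rbd export <pool>/<image>@<snapshot> /some/backup/path/image.raw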

Many of the OSDs have been crashing and restarting as I've tried to rbd export good versions of images (from older snapshots).  Here's one particular crash:

2018-07-18 15:52:15.809 7fcbaab77700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.0/rpm/el7/BUILD/ceph-13.2.0/src/os/bluestore/BlueStore.h: In function 'void BlueStore::SharedBlobSet::remove_last(BlueStore::SharedBlob*)' thread 7fcbaab77700 time 2018-07-18 15:52:15.750916
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.0/rpm/el7/BUILD/ceph-13.2.0/src/os/bluestore/BlueStore.h: 455: FAILED assert(sb->nref == 0)

 ceph version 13.2.0 (79a10589f1f80dfe21e8f9794365ed98143071c4) mimic (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0xff) [0x7fcbc197a53f]
 2: (()+0x286727) [0x7fcbc197a727]
 3: (BlueStore::SharedBlob::put()+0x1da) [0x5641f39181ca]
 4: (std::_Rb_tree<boost::intrusive_ptr<BlueStore::SharedBlob>, boost::intrusive_ptr<BlueStore::SharedBlob>, std::_Identity<boost::intrusive_ptr<BlueStore::SharedBlob> >, std::less<boost::intrusive_ptr<BlueStore::SharedBlob> >, std::allocator<boost::intrusive_ptr<BlueStore::SharedBlob> > >::_M_erase(std::_Rb_tree_node<boost::intrusive_ptr<BlueStore::SharedBlob> >*)+0x2d) [0x5641f3977cfd]
 5: (std::_Rb_tree<boost::intrusive_ptr<BlueStore::SharedBlob>, boost::intrusive_ptr<BlueStore::SharedBlob>, std::_Identity<boost::intrusive_ptr<BlueStore::SharedBlob> >, std::less<boost::intrusive_ptr<BlueStore::SharedBlob> >, std::allocator<boost::intrusive_ptr<BlueStore::SharedBlob> > >::_M_erase(std::_Rb_tree_node<boost::intrusive_ptr<BlueStore::SharedBlob> >*)+0x1b) [0x5641f3977ceb]
 6: (std::_Rb_tree<boost::intrusive_ptr<BlueStore::SharedBlob>, boost::intrusive_ptr<BlueStore::SharedBlob>, std::_Identity<boost::intrusive_ptr<BlueStore::SharedBlob> >, std::less<boost::intrusive_ptr<BlueStore::SharedBlob> >, std::allocator<boost::intrusive_ptr<BlueStore::SharedBlob> > >::_M_erase(std::_Rb_tree_node<boost::intrusive_ptr<BlueStore::SharedBlob> >*)+0x1b) [0x5641f3977ceb]
 7: (std::_Rb_tree<boost::intrusive_ptr<BlueStore::SharedBlob>, boost::intrusive_ptr<BlueStore::SharedBlob>, std::_Identity<boost::intrusive_ptr<BlueStore::SharedBlob> >, std::less<boost::intrusive_ptr<BlueStore::SharedBlob> >, std::allocator<boost::intrusive_ptr<BlueStore::SharedBlob> > >::_M_erase(std::_Rb_tree_node<boost::intrusive_ptr<BlueStore::SharedBlob> >*)+0x1b) [0x5641f3977ceb]
 8: (BlueStore::TransContext::~TransContext()+0xf7) [0x5641f3979297]
 9: (BlueStore::_txc_finish(BlueStore::TransContext*)+0x610) [0x5641f391c9b0]
 10: (BlueStore::_txc_state_proc(BlueStore::TransContext*)+0x9a) [0x5641f392a38a]
 11: (BlueStore::_kv_finalize_thread()+0x41e) [0x5641f392b3be]
 12: (BlueStore::KVFinalizeThread::entry()+0xd) [0x5641f397d85d]
 13: (()+0x7e25) [0x7fcbbe4d2e25]
 14: (clone()+0x6d) [0x7fcbbd5c3bad]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
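
For completeness, one way to produce what that NOTE asks for on a CentOS 7 node would be something along these lines (assuming the matching ceph debuginfo packages are available; package names and paths are the stock ones and may need adjusting):

    yum install -y yum-utils
    debuginfo-install -y ceph-osd                  # pull matching debug symbols
    objdump -rdS /usr/bin/ceph-osd > /tmp/ceph-osd.objdump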


Here's the output of ceph -s, which might fill in some configuration questions.  Since the OSDs keep restarting whenever I put load on the cluster, it's churning a bit; that's why I set nodown for now (the flag commands are sketched just after the status output below).

  cluster:
    id:     b2873c9a-5539-4c76-ac4a-a6c9829bfed2
    health: HEALTH_ERR
            1 filesystem is degraded
            1 filesystem is offline
            1 mds daemon damaged
            nodown,noscrub,nodeep-scrub flag(s) set
            9 scrub errors
            Reduced data availability: 61 pgs inactive, 56 pgs peering, 4 pgs stale
            Possible data damage: 3 pgs inconsistent
            16 slow requests are blocked > 32 sec
            26 stuck requests are blocked > 4096 sec

  services:
    mon: 5 daemons, quorum a,b,c,d,e
    mgr: a(active), standbys: b, d, e, c
    mds: lcs-0/1/1 up , 2 up:standby, 1 damaged
    osd: 34 osds: 34 up, 34 in
         flags nodown,noscrub,nodeep-scrub

  data:
    pools:   15 pools, 640 pgs
    objects: 9.73 M objects, 13 TiB
    usage:   24 TiB used, 55 TiB / 79 TiB avail
    pgs:     23.438% pgs not active
             487 active+clean
             73  peering
             70  activating
             5   stale+peering
             3   active+clean+inconsistent
             2   stale+activating

  io:
    client:   1.3 KiB/s wr, 0 op/s rd, 0 op/s wr
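
(For reference, the flags shown above were set with the standard ceph CLI, roughly as below; they can be cleared the same way once things settle down:)

    ceph osd set nodown          # keep flapping OSDs from being marked down
    ceph osd set noscrub
    ceph osd set nodeep-scrub
    # later: ceph osd unset nodown, etc.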


If there's any other information I can provide that can help point to the problem, I'd be glad to share.

Thanks

-Troy
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



