recorded data digest != on disk

Hello!

I have a 3-node cluster running ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)
on Ubuntu 14.04. When scrubbing, I get this error:

    -9> 2016-03-21 17:36:09.047029 7f253a4f6700  5 -- op tracker -- seq: 48045, time: 2016-03-21 17:36:09.046984, event: all_read, op: osd_sub_op(unknown.0.0:0 5.ca 0//0//-1 [scrub-map] v 0'0 snapset=0=[]:[] snapc=0=[])
    -8> 2016-03-21 17:36:09.047035 7f253a4f6700  5 -- op tracker -- seq: 48045, time: 0.000000, event: dispatched, op: osd_sub_op(unknown.0.0:0 5.ca 0//0//-1 [scrub-map] v 0'0 snapset=0=[]:[] snapc=0=[])
    -7> 2016-03-21 17:36:09.047066 7f254411b700  5 -- op tracker -- seq: 48045, time: 2016-03-21 17:36:09.047066, event: reached_pg, op: osd_sub_op(unknown.0.0:0 5.ca 0//0//-1 [scrub-map] v 0'0 snapset=0=[]:[] snapc=0=[])
    -6> 2016-03-21 17:36:09.047086 7f254411b700  5 -- op tracker -- seq: 48045, time: 2016-03-21 17:36:09.047086, event: started, op: osd_sub_op(unknown.0.0:0 5.ca 0//0//-1 [scrub-map] v 0'0 snapset=0=[]:[] snapc=0=[])
    -5> 2016-03-21 17:36:09.047127 7f254411b700  5 -- op tracker -- seq: 48045, time: 2016-03-21 17:36:09.047127, event: done, op: osd_sub_op(unknown.0.0:0 5.ca 0//0//-1 [scrub-map] v 0'0 snapset=0=[]:[] snapc=0=[])
    -4> 2016-03-21 17:36:09.047173 7f253f912700  2 osd.13 pg_epoch: 23286 pg[5.ca( v 23286'8176779 (23286'8173729,23286'8176779] local-les=23286 n=8132 ec=114 les/c 23286/23286 23285/23285/23285) [13,21] r=0 lpr=23285 crt=23286'8176777 lcod 23286'8176778 mlcod 23286'8176778 active+clean+scrubbing+deep+repair] scrub_compare_maps   osd.13 has 10 items
    -3> 2016-03-21 17:36:09.047377 7f253f912700  2 osd.13 pg_epoch: 23286 pg[5.ca( v 23286'8176779 (23286'8173729,23286'8176779] local-les=23286 n=8132 ec=114 les/c 23286/23286 23285/23285/23285) [13,21] r=0 lpr=23285 crt=23286'8176777 lcod 23286'8176778 mlcod 23286'8176778 active+clean+scrubbing+deep+repair] scrub_compare_maps replica 21 has 10 items
    -2> 2016-03-21 17:36:09.047983 7f253f912700  2 osd.13 pg_epoch: 23286 pg[5.ca( v 23286'8176779 (23286'8173729,23286'8176779] local-les=23286 n=8132 ec=114 les/c 23286/23286 23285/23285/23285) [13,21] r=0 lpr=23285 crt=23286'8176777 lcod 23286'8176778 mlcod 23286'8176778 active+clean+scrubbing+deep+repair] 5.ca recorded data digest 0xb284fef9 != on disk 0x43d61c5d on 6134ccca/rbd_data.86280c78aaf7da.00000000000e0bb5/17//5

    -1> 2016-03-21 17:36:09.048201 7f253f912700 -1 log_channel(cluster) log [ERR] : 5.ca recorded data digest 0xb284fef9 != on disk 0x43d61c5d on 6134ccca/rbd_data.86280c78aaf7da.00000000000e0bb5/17//5
     0> 2016-03-21 17:36:09.050672 7f253f912700 -1 osd/osd_types.cc: In function 'uint64_t SnapSet::get_clone_bytes(snapid_t) const' thread 7f253f912700 time 2016-03-21 17:36:09.048341
osd/osd_types.cc: 4103: FAILED assert(clone_size.count(clone))

 ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x5606c23633db]
 2: (SnapSet::get_clone_bytes(snapid_t) const+0xb6) [0x5606c1fd4666]
 3: (ReplicatedPG::_scrub(ScrubMap&, std::map<hobject_t, std::pair<unsigned int, unsigned int>, std::less<hobject_t>, std::allocator<std::pair<hobject_t const, std::pair<unsigned int, unsigned int> > > > const&)+0xa1c) [0x5606c20b3c6c]
 4: (PG::scrub_compare_maps()+0xec9) [0x5606c2020d49]
 5: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x1ee) [0x5606c20264be]
 6: (PG::scrub(ThreadPool::TPHandle&)+0x1f4) [0x5606c2027d44]
 7: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle&)+0x19) [0x5606c1f0c379]
 8: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa56) [0x5606c2353fc6]
 9: (ThreadPool::WorkThread::entry()+0x10) [0x5606c2355070]
 10: (()+0x8182) [0x7f256168e182]
 11: (clone()+0x6d) [0x7f255fbf947d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
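
If I read the assert right, the crash is in SnapSet::get_clone_bytes() because
the clone id is missing from the object's clone_size map, so the snapshot
metadata of this rbd object seems damaged as well, not only the data digest.
I was going to inspect its snapset along these lines (<pool> stands for the
name of pool id 5, which I have left out):

    # listsnaps prints the clones and the sizes/overlaps recorded in the
    # object's SnapSet, so a missing clone entry should show up here
    rados -p <pool> listsnaps rbd_data.86280c78aaf7da.00000000000e0bb5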

Is there any way to recalculate the data digest?
I removed the OSD holding the failing PG and the data was recovered, but the
error now occurs on another OSD, so I suspect I no longer have a consistent
copy of the data. What can I do to recover?
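
For reference, the usual sequence I know of is below; the +repair in the PG
state above is from such a repair attempt:

    ceph health detail        # shows which PGs are inconsistent
    ceph pg deep-scrub 5.ca   # re-run the deep scrub on the affected PG
    ceph pg repair 5.ca       # this is what seems to trigger the assert

Is repair even safe with size 2? As far as I understand, it mostly pushes the
primary's copy over the replica.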

The pool size is 2 (not ideal, I know, but I have no way to increase it for
the next two months).
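
When I am finally able to raise it, I assume it is just the usual (again with
<pool> standing in for the real pool name):

    ceph osd pool set <pool> size 3       # keep three copies of each object
    ceph osd pool set <pool> min_size 2   # still serve I/O with one copy down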

-- 
WBR, Max A. Krasilnikov