Most likely what happened is that the block represented by that file
was fully overwritten, replacing both copies.  You can probably
consider that one healed.  The others should be dealt with similarly:
the larger file should be the more correct one (since it should also
reflect writes made recently to the smaller one).
-Sam

On Thu, Dec 13, 2012 at 8:48 AM, Oliver Francke <Oliver.Francke@xxxxxxxx> wrote:
> Hi Sam,
>
> On 12/13/2012 05:15 AM, Samuel Just wrote:
>>
>> Apologies, I missed your reply on Monday.  Any attempt to read or
>
> no prob ;) We are busy, too, with preparing new nodes and the switch to
> 10GE this evening.
>
>> write the object will hit the file on the primary (the smaller one
>> with the newer syslog entries).  If you take down both OSDs (12 and
>> 40) while performing the repair, the vm in question will hang if it
>> tries to access that block, but should recover when you bring the OSDs
>> back up.  To expand on the response Sage posted, writes/reads to
>> that block have been hitting the primary (osd.12) which unfortunately
>> is the incorrect file.  I would, however, have expected that those
>> writes would have been replicated to the larger file on osd.40 as
>> well.  Are you certain that the newer syslog entries on 12 aren't also
>> present on 40?
>
> well... time heals... I re-checked right now and both files are md5-wise
> identical?!
> Not checked the other 5 inconsistencies.
> Still having three headers missing and 6 OSD's not checked with scrub,
> though.
>
> Will be back... for sure ;)
>
> Thnx for now,
>
> Oliver.
>
>> -Sam
>>
>> On Tue, Dec 11, 2012 at 11:38 AM, Oliver Francke
>> <Oliver.Francke@xxxxxxxx> wrote:
>>>
>>> Hi Sage,
>>>
>>> On 11.12.2012 at 18:04, Sage Weil <sage@xxxxxxxxxxx> wrote:
>>>
>>>> On Tue, 11 Dec 2012, Oliver Francke wrote:
>>>>>
>>>>> Hi Sam,
>>>>>
>>>>> perhaps you have overlooked my comments further down, beginning with
>>>>> "been there" ?
>>>>> ;)
>>>>
>>>> We're pretty swamped with bobtail stuff at the moment, so ceph-devel
>>>> inquiries are low on the priority list right now.
>>>>
>>> 100% agree, this thing here is "best effort" right now, true.
>>>
>>>> See below:
>>>>
>>>>> If so, please have a look, cause I'm clueless 8-)
>>>>>
>>>>> On 12/10/2012 11:48 AM, Oliver Francke wrote:
>>>>>>
>>>>>> Hi Sam,
>>>>>>
>>>>>> helpful input.. and... not so...
>>>>>>
>>>>>> On 12/07/2012 10:18 PM, Samuel Just wrote:
>>>>>>>
>>>>>>> Ah... unfortunately doing a repair in these 6 cases would probably
>>>>>>> result in the wrong object surviving.  It should work, but it might
>>>>>>> corrupt the rbd image contents.  If the images are expendable, you
>>>>>>> could repair and then delete the images.
>>>>>>>
>>>>>>> The red flag here is that the "known size" is smaller than the other
>>>>>>> size.  This indicates that it most likely chose the wrong file as the
>>>>>>> "correct" one since rbd image blocks usually get bigger over time.
>>>>>>> To fix this, you will need to manually copy the file for the larger
>>>>>>> of the two object replicas to replace the smaller of the two object
>>>>>>> replicas.
>>>>>>>
>>>>>>> For the first, soid
>>>>>>> 87c96f10/rb.0.47d9b.1014b7b4.0000000002df/head//65
>>>>>>> in pg 65.10:
>>>>>>> 1) Find the object on the primary and the replica (from above,
>>>>>>> primary is 12 and replica is 40).  You can use find in the primary
>>>>>>> and replica current/65.10_head directories to look for a file
>>>>>>> matching *rb.0.47d9b.1014b7b4.0000000002df*.  The file name should be
>>>>>>> 'rb.0.47d9b.1014b7b4.0000000002df__head_87C96F10__65' I think.
>>>>>>> 2) Stop the primary and replica osds
>>>>>>> 3) Compare the file sizes for the two files -- you should find that
>>>>>>> the file sizes do not match.
>>>>>>> 4) Replace the smaller file with the larger one (you'll probably want
>>>>>>> to keep a copy of the smaller one around just in case).
>>>>>>> 5) Restart the osds and scrub pg 65.10 -- the pg should come up clean
>>>>>>> (possibly with a relatively harmless stat mismatch)
>>>>>>
>>>>>> been there. on OSD.12 it's
>>>>>> -rw-r--r-- 1 root root 699904 Dec 9 06:25
>>>>>> rb.0.47d9b.1014b7b4.0000000002df__head_87C96F10__41
>>>>>>
>>>>>> on OSD.40:
>>>>>> -rw-r--r-- 1 root root 4194304 Dec 9 06:25
>>>>>> rb.0.47d9b.1014b7b4.0000000002df__head_87C96F10__41
>>>>>>
>>>>>> going by a short glance into the file, there are some readable
>>>>>> syslog-entries, in both files.
>>>>>> For the bad luck in this example, the shorter file contains the more
>>>>>> current entries?!
>>>>
>>>> It sounds like the larger one was at one point correct, but since they
>>>> got out of sync an update was applied to the other.  What fs is this
>>>> (inside the VM)?  If we're lucky the whole block is file data, in which
>>>> case I would extend the small one (with the more recent data) out to
>>>> the full size by taking the last chunk of the second one.  (Or, if the
>>>> bytes look like an unimportant file, just use truncate(1) to extend it,
>>>> and get zeros for that region.)  Make backups of the object first, and
>>>> fsck inside the VM afterwards.
>>>>
>>>> --
>>>>
>>>> We've seen this issue bite twice now, both times on argonaut.  So far
>>>> nobody using anything more recent has hit it, but that is a smaller
>>>> pool of people, so no real comfort there.  Working on setting up a
>>>> higher-stress long-term testing cluster to trigger this.
>>>>
>>>> Can you remind me what kernel version you are using?
>>>
>>> one of the affected nodes is driven by 3.5.4, the newer ones are
>>> nowadays Ubuntu 12.04.1 LTS with self-compiled 3.6.6.
>>> Inside the VM's you can imagine all flavors, less forgiving CentOS 5.8,
>>> some debian5.0 (ext3)… mostly ext3, I think. Not optimum, at least.
>>>
>>> Couple of problems caused by slow requests, I can see in some log-files
>>> customers pressing the "RESET" button, implemented via qemu-monitor.
>>> Destructive as can be, with having some megs of cache with the
>>> rbd-device.
>>>
>>> Thnx n regards,
>>>
>>> Oliver.
>>>
>>>> sage
>>>>
>>>>>> What exactly happens, if I try to copy or export the file? Which block
>>>>>> will be chosen?
>>>>>> VM is running as I'm writing, so flexibility reduced.
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Oliver.
>>>>>>
>>>>>>> If this worked out correctly, you can repeat for the other 5 cases.
>>>>>>>
>>>>>>> Let me know if you have any questions.
>>>>>>> -Sam
>>>>>>>
>>>>>>> On Fri, Dec 7, 2012 at 11:09 AM, Oliver Francke
>>>>>>> <Oliver.Francke@xxxxxxxx> wrote:
>>>>>>>>
>>>>>>>> Hi Sam,
>>>>>>>>
>>>>>>>> On 07.12.2012 at 19:37, Samuel Just <sam.just@xxxxxxxxxxx> wrote:
>>>>>>>>
>>>>>>>>> That is very likely to be one of the merge_log bugs fixed between
>>>>>>>>> 0.48 and 0.55.  I could confirm with a stacktrace from gdb with line
>>>>>>>>> numbers or the remainder of the logging dumped when the daemon
>>>>>>>>> crashed.
>>>>>>>>>
>>>>>>>>> My understanding of your situation is that currently all pgs are
>>>>>>>>> active+clean but you are missing some rbd image headers and some
>>>>>>>>> rbd images appear to be corrupted.  Is that accurate?
>>>>>>>>> -Sam
>>>>>>>>>
>>>>>>>> thnx for dropping in.
>>>>>>>>
>>>>>>>> Uhm almost correct, there are now 6 pg in state inconsistent:
>>>>>>>>
>>>>>>>> HEALTH_WARN 6 pgs inconsistent
>>>>>>>> pg 65.da is active+clean+inconsistent, acting [1,33]
>>>>>>>> pg 65.d7 is active+clean+inconsistent, acting [13,42]
>>>>>>>> pg 65.10 is active+clean+inconsistent, acting [12,40]
>>>>>>>> pg 65.f is active+clean+inconsistent, acting [13,31]
>>>>>>>> pg 65.75 is active+clean+inconsistent, acting [1,33]
>>>>>>>> pg 65.6a is active+clean+inconsistent, acting [13,31]
>>>>>>>>
>>>>>>>> I know which images are affected, but does a repair help?
>>>>>>>>
>>>>>>>> 0 log [ERR] : 65.10 osd.40: soid 87c96f10/rb.0.47d9b.1014b7b4.0000000002df/head//65 size 4194304 != known size 699904
>>>>>>>> 0 log [ERR] : 65.6a osd.31: soid 19a2526a/rb.0.2dcf2.1da2a31e.000000000737/head//65 size 4191744 != known size 2757632
>>>>>>>> 0 log [ERR] : 65.75 osd.33: soid 20550575/rb.0.2d520.5c17a6e3.000000000339/head//65 size 4194304 != known size 1238016
>>>>>>>> 0 log [ERR] : 65.d7 osd.42: soid fa3a5d7/rb.0.2c2a8.12ec359d.00000000205c/head//65 size 4194304 != known size 1382912
>>>>>>>> 0 log [ERR] : 65.da osd.33: soid c2a344da/rb.0.2be17.cb4bd69.000000000081/head//65 size 4191744 != known size 1815552
>>>>>>>> 0 log [ERR] : 65.f osd.31: soid e8d2430f/rb.0.2d1e9.1339c5dd.000000000c41/head//65 size 2424832 != known size 2331648
>>>>>>>>
>>>>>>>> or make things worse?
>>>>>>>>
>>>>>>>> I could only check 14 out of 20 OSD's so far, cause from two older
>>>>>>>> nodes a scrub leads to slow-requests… > couple of minutes, so VM's got
>>>>>>>> stalled… customers pressing the "reset-button", so losing caches…
>>>>>>>>
>>>>>>>> Comments welcome,
>>>>>>>>
>>>>>>>> Oliver.
>>>>>>>>
>>>>>>>>> On Fri, Dec 7, 2012 at 6:39 AM, Oliver Francke
>>>>>>>>> <Oliver.Francke@xxxxxxxx> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> is the following a "known one", too?
>>>>>>>>>> Would be good to get it out of my head:
>>>>>>>>>>
>>>>>>>>>>> /var/log/ceph/ceph-osd.40.log.1.gz: 1: /usr/bin/ceph-osd() [0x706c59]
>>>>>>>>>>> /var/log/ceph/ceph-osd.40.log.1.gz: 2: (()+0xeff0) [0x7f7f306c0ff0]
>>>>>>>>>>> /var/log/ceph/ceph-osd.40.log.1.gz: 3: (gsignal()+0x35) [0x7f7f2f35f1b5]
>>>>>>>>>>> /var/log/ceph/ceph-osd.40.log.1.gz: 4: (abort()+0x180) [0x7f7f2f361fc0]
>>>>>>>>>>> /var/log/ceph/ceph-osd.40.log.1.gz: 5:
>>>>>>>>>>> (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f7f2fbf3dc5]
>>>>>>>>>>> /var/log/ceph/ceph-osd.40.log.1.gz: 6: (()+0xcb166) [0x7f7f2fbf2166]
>>>>>>>>>>> /var/log/ceph/ceph-osd.40.log.1.gz: 7: (()+0xcb193) [0x7f7f2fbf2193]
>>>>>>>>>>> /var/log/ceph/ceph-osd.40.log.1.gz: 8: (()+0xcb28e) [0x7f7f2fbf228e]
>>>>>>>>>>> /var/log/ceph/ceph-osd.40.log.1.gz: 9:
>>>>>>>>>>> (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x793) [0x77e903]
>>>>>>>>>>> /var/log/ceph/ceph-osd.40.log.1.gz: 10:
>>>>>>>>>>> (PG::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&, int)+0x1de3) [0x63db93]
>>>>>>>>>>> /var/log/ceph/ceph-osd.40.log.1.gz: 11:
>>>>>>>>>>> (PG::RecoveryState::Stray::react(PG::RecoveryState::MLogRec const&)+0x2cc) [0x63e00c]
>>>>>>>>>>> /var/log/ceph/ceph-osd.40.log.1.gz: 12:
>>>>>>>>>>> (boost::statechart::simple_state<PG::RecoveryState::Stray,
>>>>>>>>>>> PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na,
>>>>>>>>>>> mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
>>>>>>>>>>> mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
>>>>>>>>>>> mpl_::na, mpl_::na, mpl_::na, mpl_::na>,
>>>>>>>>>>> (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base
>>>>>>>>>>> const&, void const*)+0x203) [0x658a63]
>>>>>>>>>>> /var/log/ceph/ceph-osd.40.log.1.gz: 13:
>>>>>>>>>>> (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine,
>>>>>>>>>>> PG::RecoveryState::Initial, std::allocator<void>,
>>>>>>>>>>> boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base
>>>>>>>>>>> const&)+0x6b) [0x650b4b]
>>>>>>>>>>> /var/log/ceph/ceph-osd.40.log.1.gz: 14:
>>>>>>>>>>> (PG::RecoveryState::handle_log(int, MOSDPGLog*, PG::RecoveryCtx*)+0x190) [0x60a520]
>>>>>>>>>>> /var/log/ceph/ceph-osd.40.log.1.gz: 15:
>>>>>>>>>>> (OSD::handle_pg_log(std::tr1::shared_ptr<OpRequest>)+0x666) [0x5c62e6]
>>>>>>>>>>> /var/log/ceph/ceph-osd.40.log.1.gz: 16:
>>>>>>>>>>> (OSD::dispatch_op(std::tr1::shared_ptr<OpRequest>)+0x11b) [0x5c6f3b]
>>>>>>>>>>> /var/log/ceph/ceph-osd.40.log.1.gz: 17: (OSD::_dispatch(Message*)+0x173) [0x5d1983]
>>>>>>>>>>> /var/log/ceph/ceph-osd.40.log.1.gz: 18: (OSD::ms_dispatch(Message*)+0x184) [0x5d2254]
>>>>>>>>>>> /var/log/ceph/ceph-osd.40.log.1.gz: 19:
>>>>>>>>>>> (SimpleMessenger::DispatchQueue::entry()+0x5e9) [0x7d3c09]
>>>>>>>>>>> /var/log/ceph/ceph-osd.40.log.1.gz: 20:
>>>>>>>>>>> (SimpleMessenger::dispatch_entry()+0x15) [0x7d5195]
>>>>>>>>>>> /var/log/ceph/ceph-osd.40.log.1.gz: 21:
>>>>>>>>>>> (SimpleMessenger::DispatchThread::entry()+0xd) [0x726bad]
>>>>>>>>>>> /var/log/ceph/ceph-osd.40.log.1.gz: 22: (()+0x68ca) [0x7f7f306b88ca]
>>>>>>>>>>> /var/log/ceph/ceph-osd.40.log.1.gz: 23: (clone()+0x6d) [0x7f7f2f3fc92d]
>>>>>>>>>>>
>>>>>>>>>> Thnx for looking,
>>>>>>>>>>
>>>>>>>>>> Oliver.
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>>
>>>>>>>>>> Oliver Francke
>>>>>>>>>>
>>>>>>>>>> filoo GmbH
>>>>>>>>>> Moltkestraße 25a
>>>>>>>>>> 33330 Gütersloh
>>>>>>>>>> HRB4355 AG Gütersloh
>>>>>>>>>>
>>>>>>>>>> Managing directors: S.Grewing | J.Rehpöhler | C.Kunz
>>>>>>>>>>
>>>>>>>>>> Follow us on Twitter: http://twitter.com/filoogmbh
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>>>>>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
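
For reference, steps 3 and 4 of the manual procedure quoted in this thread
(compare the two replica files, keep a copy of the smaller one, replace it
with the larger one) could be sketched in Python roughly as below. This is
only a sketch under assumptions that are not part of the thread: both object
files are reachable from one machine (for example as staged copies), both
OSDs are already stopped, and the function names and the backup directory
are made up for illustration. It copies file contents only and says nothing
about extended attributes or any other OSD metadata.

```python
import hashlib
import shutil
from pathlib import Path
from typing import Optional


def md5sum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_replicas(primary: Path, replica: Path) -> str:
    """Classify a pair of replica files (hypothetical helper).

    Returns 'primary-larger', 'replica-larger', 'identical', or
    'same-size-differs' (same length, different checksum).
    """
    size_primary = primary.stat().st_size
    size_replica = replica.stat().st_size
    if size_primary > size_replica:
        return "primary-larger"
    if size_replica > size_primary:
        return "replica-larger"
    if md5sum(primary) == md5sum(replica):
        return "identical"
    return "same-size-differs"


def replace_smaller_with_larger(
    primary: Path, replica: Path, backup_dir: Path
) -> Optional[Path]:
    """Back up the smaller file, then overwrite it with the larger one.

    Returns the path of the backup, or None when the sizes are equal and
    nothing was changed. Assumption: the OSDs holding both files are
    stopped while this runs.
    """
    verdict = compare_replicas(primary, replica)
    if verdict == "primary-larger":
        larger, smaller = primary, replica
    elif verdict == "replica-larger":
        larger, smaller = replica, primary
    else:
        return None  # equal sizes: this sketch does not pick a winner

    backup_dir.mkdir(parents=True, exist_ok=True)
    backup = backup_dir / smaller.name
    shutil.copy2(smaller, backup)      # keep the smaller copy, just in case
    shutil.copyfile(larger, smaller)   # overwrite file contents only
    return backup
```

A call such as replace_smaller_with_larger(Path("osd12/obj"), Path("osd40/obj"),
Path("backup")) uses placeholder paths; the real object files live in the
current/65.10_head directories mentioned above. The 'identical' result
corresponds to the md5 check Oliver describes for the first object, and
'same-size-differs' is left for a human to decide, as is the case where the
smaller file holds the newer data.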