Re: A couple of OSD-crashes after serious network trouble

Hi Sage,

On 11.12.2012, at 18:04, Sage Weil <sage@xxxxxxxxxxx> wrote:

> On Tue, 11 Dec 2012, Oliver Francke wrote:
>> Hi Sam,
>> 
>> perhaps you have overlooked my comments further down, beginning with
>> "been there" ? ;)
> 
> We're pretty swamped with bobtail stuff at the moment, so ceph-devel 
> inquiries are low on the priority list right now.
> 

100% agree, this thing here is "best effort" right now, true.

> See below:
> 
>> 
>> If so, please have a look, cause I'm clueless 8-)
>> 
>> On 12/10/2012 11:48 AM, Oliver Francke wrote:
>>> Hi Sam,
>>> 
>>> helpful input.. and... not so...
>>> 
>>> On 12/07/2012 10:18 PM, Samuel Just wrote:
>>>> Ah... unfortunately doing a repair in these 6 cases would probably
>>>> result in the wrong object surviving.  It should work, but it might
>>>> corrupt the rbd image contents.  If the images are expendable, you
>>>> could repair and then delete the images.
>>>> 
>>>> The red flag here is that the "known size" is smaller than the other
>>>> size.  This indicates that it most likely chose the wrong file as the
>>>> "correct" one since rbd image blocks usually get bigger over time.  To
>>>> fix this, you will need to manually copy the file for the larger of
>>>> the two object replicas to replace the smaller of the two object
>>>> replicas.
>>>> 
>>>> For the first, soid 87c96f10/rb.0.47d9b.1014b7b4.0000000002df/head//65
>>>> in pg 65.10:
>>>> 1) Find the object on the primary and the replica (from above, primary
>>>> is 12 and replica is 40).  You can use find in the primary and replica
>>>> current/65.10_head directories to look for a file matching
>>>> *rb.0.47d9b.1014b7b4.0000000002df*).  The file name should be
>>>> 'rb.0.47d9b.1014b7b4.0000000002df__head_87C96F10__65' I think.
>>>> 2) Stop the primary and replica osds
>>>> 3) Compare the file sizes for the two files -- you should find that
>>>> the file sizes do not match.
>>>> 4) Replace the smaller file with the larger one (you'll probably want
>>>> to keep a copy of the smaller one around just in case).
>>>> 5) Restart the osds and scrub pg 65.10 -- the pg should come up clean
>>>> (possibly with a relatively harmless stat mismatch)
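(A rough shell sketch of those five steps; the OSD data path /var/lib/ceph/osd/ceph-<id>, the "service ceph" init commands and the /path/to/... bits are assumptions and placeholders, adjust them to the actual setup:)

    # 1) locate the object on the primary (osd.12) and the replica (osd.40)
    find /var/lib/ceph/osd/ceph-12/current/65.10_head -name '*rb.0.47d9b.1014b7b4.0000000002df*'
    find /var/lib/ceph/osd/ceph-40/current/65.10_head -name '*rb.0.47d9b.1014b7b4.0000000002df*'
    # 2) stop both osds, each on its own node
    service ceph stop osd.12
    service ceph stop osd.40
    # 3) compare the sizes; they should differ
    ls -l /path/to/object/on/osd.12 /path/to/object/on/osd.40
    # 4) back up the smaller file, then overwrite its contents with the larger one
    #    (scp the larger file over first if the two osds live on different hosts)
    cp /path/to/smaller/object /root/backup.65.10.object
    cp /path/to/larger/object /path/to/smaller/object
    # 5) restart both osds and scrub the pg
    service ceph start osd.12
    service ceph start osd.40
    ceph pg scrub 65.10
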
>>> 
>>> been there. on OSD.12 it's
>>> -rw-r--r-- 1 root root 699904 Dec  9 06:25
>>> rb.0.47d9b.1014b7b4.0000000002df__head_87C96F10__41
>>> 
>>> on OSD.40:
>>> -rw-r--r-- 1 root root 4194304 Dec  9 06:25
>>> rb.0.47d9b.1014b7b4.0000000002df__head_87C96F10__41
>>> 
>>> Going by a quick glance into the files, there are some readable
>>> syslog entries in both. As bad luck would have it in this example, the
>>> shorter file contains the more recent entries?!
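(Something like the following is enough for such a glance; file name as above, copy-from-osd.* are placeholder names for the two copies pulled onto one host:)

    # dump the readable text near the end of each copy
    strings rb.0.47d9b.1014b7b4.0000000002df__head_87C96F10__41 | tail -n 20
    # or list the first byte offsets where the two copies differ
    cmp -l copy-from-osd.12 copy-from-osd.40 | head
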
> 
> It sounds like the larger one was at one point correct, but after they got 
> out of sync an update was applied to the other.  What fs is this (inside 
> the VM)?  If we're lucky the whole block is file data, in which case I 
> would extend the small one (with the more recent data) out to the full 
> size by taking the last chunk of the larger one.  (Or, if the bytes look 
> like an unimportant file, just use truncate(1) to extend it, and get 
> zeros for that region.)  Make backups of the object first, and fsck 
> inside the VM afterwards.
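(A sketch of the two options described above, using the sizes from this example; small_copy/large_copy are placeholder names for the two object files, and both should be backed up first:)

    # option A: graft the tail of the larger copy onto the smaller one
    #   699904 bytes = 1367 blocks of 512 bytes, the current size of the small copy
    dd if=large_copy of=small_copy bs=512 skip=1367 seek=1367 conv=notrunc
    # option B: simply extend the small copy with zeros to the full 4 MB object size
    truncate -s 4194304 small_copy
    # either way, fsck inside the VM afterwards
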
> 
> --
> 
> We've seen this issue bite twice now, both times on argonaut.  So far 
> nobody using anything more recent has hit it... but that is a smaller 
> pool of people, so no real comfort there.  Working on setting up a 
> higher-stress long-term testing cluster to trigger this.
> 
> Can you remind me what kernel version you are using?

One of the affected nodes is running 3.5.4; the newer ones are nowadays Ubuntu 12.04.1 LTS with a self-compiled 3.6.6.
Inside the VMs you can imagine all flavors: less forgiving CentOS 5.8, some Debian 5.0 (ext3)… mostly ext3, I think. Not optimal, at least.

A couple of problems were caused by slow requests: in some log files I can see customers pressing the "RESET" button, implemented via the qemu monitor.
That is about as destructive as it gets when there are a few megabytes of cache in front of the rbd device.

Thnx n regards,

Oliver.

> 
> sage
> 
> 
>>> 
>>> What exactly happens if I try to copy or export the file? Which block
>>> will be chosen?
>>> The VM is running as I'm writing, so flexibility is reduced.
>>> 
>>> Regards,
>>> 
>>> Oliver.
>>> 
>>>> If this worked out correctly, you can repeat for the other 5 cases.
>>>> 
>>>> Let me know if you have any questions.
>>>> -Sam
>>>> 
>>>> On Fri, Dec 7, 2012 at 11:09 AM, Oliver Francke <Oliver.Francke@xxxxxxxx>
>>>> wrote:
>>>>> Hi Sam,
>>>>> 
>>>>> On 07.12.2012, at 19:37, Samuel Just <sam.just@xxxxxxxxxxx> wrote:
>>>>> 
>>>>>> That is very likely to be one of the merge_log bugs fixed between 0.48
>>>>>> and 0.55.  I could confirm with a stacktrace from gdb with line
>>>>>> numbers or the remainder of the logging dumped when the daemon
>>>>>> crashed.
>>>>>> 
>>>>>> My understanding of your situation is that currently all pgs are
>>>>>> active+clean but you are missing some rbd image headers and some rbd
>>>>>> images appear to be corrupted.  Is that accurate?
>>>>>> -Sam
>>>>>> 
>>>>> thnx for dropping in.
>>>>> 
>>>>> Uhm, almost correct; there are now 6 pgs in state inconsistent:
>>>>> 
>>>>> HEALTH_WARN 6 pgs inconsistent
>>>>> pg 65.da is active+clean+inconsistent, acting [1,33]
>>>>> pg 65.d7 is active+clean+inconsistent, acting [13,42]
>>>>> pg 65.10 is active+clean+inconsistent, acting [12,40]
>>>>> pg 65.f is active+clean+inconsistent, acting [13,31]
>>>>> pg 65.75 is active+clean+inconsistent, acting [1,33]
>>>>> pg 65.6a is active+clean+inconsistent, acting [13,31]
>>>>> 
>>>>> I know which images are affected, but does a repair help?
>>>>> 
>>>>> 0 log [ERR] : 65.10 osd.40: soid
>>>>> 87c96f10/rb.0.47d9b.1014b7b4.0000000002df/head//65 size 4194304 != known
>>>>> size 699904
>>>>> 0 log [ERR] : 65.6a osd.31: soid
>>>>> 19a2526a/rb.0.2dcf2.1da2a31e.000000000737/head//65 size 4191744 != known
>>>>> size 2757632
>>>>> 0 log [ERR] : 65.75 osd.33: soid
>>>>> 20550575/rb.0.2d520.5c17a6e3.000000000339/head//65 size 4194304 != known
>>>>> size 1238016
>>>>> 0 log [ERR] : 65.d7 osd.42: soid
>>>>> fa3a5d7/rb.0.2c2a8.12ec359d.00000000205c/head//65 size 4194304 != known
>>>>> size 1382912
>>>>> 0 log [ERR] : 65.da osd.33: soid
>>>>> c2a344da/rb.0.2be17.cb4bd69.000000000081/head//65 size 4191744 != known
>>>>> size 1815552
>>>>> 0 log [ERR] : 65.f osd.31: soid
>>>>> e8d2430f/rb.0.2d1e9.1339c5dd.000000000c41/head//65 size 2424832 != known
>>>>> size 2331648
>>>>> 
>>>>> or make things worse?
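(For completeness, a sketch of mapping those rb.* prefixes back to image names; <pool> stands in for the pool behind id 65:)

    # print "image block_name_prefix" for every image, then grep for the affected prefixes
    for img in $(rbd -p <pool> ls); do
        echo "$img $(rbd -p <pool> info "$img" | awk '/block_name_prefix/ {print $2}')"
    done | grep -E 'rb\.0\.(47d9b|2dcf2|2d520|2c2a8|2be17|2d1e9)'
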
>>>>> 
>>>>> I could only check 14 out of 20 OSDs so far, because on two older nodes
>>>>> a scrub leads to slow requests of more than a couple of minutes, so VMs
>>>>> got stalled, customers pressed the "reset button", and caches were lost.
>>>>> 
>>>>> Comments welcome,
>>>>> 
>>>>> Oliver.
>>>>> 
>>>>>> On Fri, Dec 7, 2012 at 6:39 AM, Oliver Francke
>>>>>> <Oliver.Francke@xxxxxxxx> wrote:
>>>>>>> Hi,
>>>>>>> 
>>>>>>> is the following a "known one", too? Would be good to get it out of
>>>>>>> my head:
>>>>>>> 
>>>>>>> 
>>>>>>>> /var/log/ceph/ceph-osd.40.log.1.gz: 1: /usr/bin/ceph-osd()
>>>>>>>> [0x706c59]
>>>>>>>> /var/log/ceph/ceph-osd.40.log.1.gz: 2: (()+0xeff0)
>>>>>>>> [0x7f7f306c0ff0]
>>>>>>>> /var/log/ceph/ceph-osd.40.log.1.gz: 3: (gsignal()+0x35)
>>>>>>>> [0x7f7f2f35f1b5]
>>>>>>>> /var/log/ceph/ceph-osd.40.log.1.gz: 4: (abort()+0x180)
>>>>>>>> [0x7f7f2f361fc0]
>>>>>>>> /var/log/ceph/ceph-osd.40.log.1.gz: 5:
>>>>>>>> (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f7f2fbf3dc5]
>>>>>>>> /var/log/ceph/ceph-osd.40.log.1.gz: 6: (()+0xcb166)
>>>>>>>> [0x7f7f2fbf2166]
>>>>>>>> /var/log/ceph/ceph-osd.40.log.1.gz: 7: (()+0xcb193)
>>>>>>>> [0x7f7f2fbf2193]
>>>>>>>> /var/log/ceph/ceph-osd.40.log.1.gz: 8: (()+0xcb28e)
>>>>>>>> [0x7f7f2fbf228e]
>>>>>>>> /var/log/ceph/ceph-osd.40.log.1.gz: 9:
>>>>>>>> (ceph::__ceph_assert_fail(char
>>>>>>>> const*, char const*, int, char const*)+0x793) [0x77e903]
>>>>>>>> /var/log/ceph/ceph-osd.40.log.1.gz: 10:
>>>>>>>> (PG::merge_log(ObjectStore::Transaction&, pg_info_t&, pg_log_t&,
>>>>>>>> int)+0x1de3) [0x63db93]
>>>>>>>> /var/log/ceph/ceph-osd.40.log.1.gz: 11:
>>>>>>>> (PG::RecoveryState::Stray::react(PG::RecoveryState::MLogRec
>>>>>>>> const&)+0x2cc)
>>>>>>>> [0x63e00c]
>>>>>>>> /var/log/ceph/ceph-osd.40.log.1.gz: 12:
>>>>>>>> (boost::statechart::simple_state<PG::RecoveryState::Stray,
>>>>>>>> PG::RecoveryState::Started, boost::mpl::list<mpl_::na, mpl_::na,
>>>>>>>> mpl_::na,
>>>>>>>> mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
>>>>>>>> mpl_::na,
>>>>>>>> mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
>>>>>>>> mpl_::na,
>>>>>>>> mpl_::na, mpl_::na, mpl_::na>,
>>>>>>>> (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base 
>>>>>>>> const&, void const*)+0x203) [0x658a63]
>>>>>>>> /var/log/ceph/ceph-osd.40.log.1.gz: 13:
>>>>>>>> (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, 
>>>>>>>> PG::RecoveryState::Initial, std::allocator<void>,
>>>>>>>> boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base 
>>>>>>>> const&)+0x6b) [0x650b4b]
>>>>>>>> /var/log/ceph/ceph-osd.40.log.1.gz: 14:
>>>>>>>> (PG::RecoveryState::handle_log(int, MOSDPGLog*,
>>>>>>>> PG::RecoveryCtx*)+0x190)
>>>>>>>> [0x60a520]
>>>>>>>> /var/log/ceph/ceph-osd.40.log.1.gz: 15:
>>>>>>>> (OSD::handle_pg_log(std::tr1::shared_ptr<OpRequest>)+0x666)
>>>>>>>> [0x5c62e6]
>>>>>>>> /var/log/ceph/ceph-osd.40.log.1.gz: 16:
>>>>>>>> (OSD::dispatch_op(std::tr1::shared_ptr<OpRequest>)+0x11b)
>>>>>>>> [0x5c6f3b]
>>>>>>>> /var/log/ceph/ceph-osd.40.log.1.gz: 17:
>>>>>>>> (OSD::_dispatch(Message*)+0x173)
>>>>>>>> [0x5d1983]
>>>>>>>> /var/log/ceph/ceph-osd.40.log.1.gz: 18:
>>>>>>>> (OSD::ms_dispatch(Message*)+0x184)
>>>>>>>> [0x5d2254]
>>>>>>>> /var/log/ceph/ceph-osd.40.log.1.gz: 19:
>>>>>>>> (SimpleMessenger::DispatchQueue::entry()+0x5e9) [0x7d3c09]
>>>>>>>> /var/log/ceph/ceph-osd.40.log.1.gz: 20:
>>>>>>>> (SimpleMessenger::dispatch_entry()+0x15) [0x7d5195]
>>>>>>>> /var/log/ceph/ceph-osd.40.log.1.gz: 21:
>>>>>>>> (SimpleMessenger::DispatchThread::entry()+0xd) [0x726bad]
>>>>>>>> /var/log/ceph/ceph-osd.40.log.1.gz: 22: (()+0x68ca)
>>>>>>>> [0x7f7f306b88ca]
>>>>>>>> /var/log/ceph/ceph-osd.40.log.1.gz: 23: (clone()+0x6d)
>>>>>>>> [0x7f7f2f3fc92d]
>>>>>>>> 
>>>>>>> Thnx for looking,
>>>>>>> 
>>>>>>> 
>>>>>>> Oliver.
>>>>>>> 
>>>>>>> -- 
>>>>>>> 
>>>>>>> Oliver Francke
>>>>>>> 
>>>>>>> filoo GmbH
>>>>>>> Moltkestraße 25a
>>>>>>> 33330 Gütersloh
>>>>>>> HRB4355 AG Gütersloh
>>>>>>> 
>>>>>>> Geschäftsführer: S.Grewing | J.Rehpöhler | C.Kunz
>>>>>>> 
>>>>>>> Folgen Sie uns auf Twitter: http://twitter.com/filoogmbh
>>>>>>> 
>>>>>>> -- 
>>>>>>> To unsubscribe from this list: send the line "unsubscribe
>>>>>>> ceph-devel" in
>>>>>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>>> -- 
>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>>> in
>>>>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>> 
>>> 
>> 
>> 
>> -- 
>> 
>> Oliver Francke
>> 
>> filoo GmbH
>> Moltkestraße 25a
>> 33330 Gütersloh
>> HRB4355 AG Gütersloh
>> 
>> Geschäftsführer: S.Grewing | J.Rehpöhler | C.Kunz
>> 
>> Folgen Sie uns auf Twitter: http://twitter.com/filoogmbh
>> 
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> 
>> 

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

