Hi Sage,

Losing a message would have been plausible given the network issue we
had today. I tried:

# ceph osd pg-temp 75.45 6689
set 75.45 pg_temp mapping to [6689]

then waited a bit. It's still incomplete -- the only difference is that
now I see two more past_intervals in the pg. Full query here:
http://pastebin.com/TU7vVLpj

I didn't have debug_osd above zero when I did that. Should I try again
with debug_osd 20?
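If so, I assume bumping it on the fly would go something like this on
the primary (just a sketch -- osd.6689 is the current primary from the
acting set in the query):

# ceph tell osd.6689 injectargs '--debug_osd 20'

and then I'd re-run the pg-temp command and capture the osd log.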
Thanks :)
Dan

On Fri, Mar 13, 2015 at 12:59 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> This looks a bit like the osds may have lost a message, actually. You can
> kick an individual pg to repeer with something like
>
> ceph osd pg-temp 75.45 6689
>
> See if that makes it go?
>
> sage
>
>
> On March 13, 2015 7:24:48 AM EDT, Dan van der Ster <dan@xxxxxxxxxxxxxx>
> wrote:
>>
>> On Mon, Mar 9, 2015 at 4:47 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
>>>
>>> On Mon, Mar 9, 2015 at 8:42 AM, Dan van der Ster <dan@xxxxxxxxxxxxxx>
>>> wrote:
>>>>
>>>> Hi Sage,
>>>>
>>>> On Tue, Feb 10, 2015 at 2:51 AM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>>>>>
>>>>> On Mon, 9 Feb 2015, David McBride wrote:
>>>>>>
>>>>>> On 09/02/15 15:31, Gregory Farnum wrote:
>>>>>>
>>>>>>> So, memory usage of an OSD is usually linear in the number of PGs
>>>>>>> it hosts. However, that memory can also grow based on at least one
>>>>>>> other thing: the number of OSD maps required to go through peering.
>>>>>>> It *looks* to me like this is what you're running into, not growth
>>>>>>> in the number of state machines. In particular, those
>>>>>>> past_intervals you mentioned. ;)
>>>>>>
>>>>>> Hi Greg,
>>>>>>
>>>>>> Right, that sounds entirely plausible, and is very helpful.
>>>>>>
>>>>>> In practice, that means I'll need to be careful to avoid this
>>>>>> situation occurring in production -- but given that's unlikely to
>>>>>> occur except in the case of non-trivial neglect, I don't think I
>>>>>> need be particularly concerned.
>>>>>>
>>>>>> (Happily, I'm in the situation that my existing cluster is purely
>>>>>> for testing purposes; the data is expendable.)
>>>>>>
>>>>>> That said, for my own peace of mind, it would be valuable to have a
>>>>>> procedure that can be used to recover from this state, even if it's
>>>>>> unlikely to occur in practice.
>>>>>
>>>>> The best luck I've had recovering from such situations is something
>>>>> like:
>>>>>
>>>>> - stop all osds
>>>>> - osd set nodown
>>>>> - osd set nobackfill
>>>>> - osd set noup
>>>>> - set the map cache size smaller to reduce the memory footprint:
>>>>>
>>>>>   osd map cache size = 50
>>>>>   osd map max advance = 25
>>>>>   osd map share max epochs = 25
>>>>>   osd pg epoch persisted max stale = 25
>>>
>>> It can cause extreme slowness if you get into a failure situation and
>>> your OSDs need to calculate past intervals across more maps than will
>>> fit in the cache. :(
>>
>> ... extreme slowness, or is it also possible to get into a situation
>> where the PGs are stuck incomplete forever?
>>
>> The reason I ask is that we actually had a network issue this morning
>> that left OSDs flapping and caused a lot of osdmap epoch churn. Our
>> network has now stabilized, but 10 PGs are incomplete even though all
>> the OSDs are up. One PG looks like this, for example:
>>
>> pg 75.45 is stuck inactive for 87351.077529, current state incomplete,
>> last acting [6689,1919,2329]
>> pg 75.45 is stuck unclean for 87351.096198, current state incomplete,
>> last acting [6689,1919,2329]
>> pg 75.45 is incomplete, acting [6689,1919,2329]
>>
>> 1919  3.62000  osd.1919  up  1.00000  1.00000
>> 2329  3.62000  osd.2329  up  1.00000  1.00000
>> 6689  3.62000  osd.6689  up  1.00000  1.00000
>>
>> The pg query output is here: http://pastebin.com/WyTAU69W
>>
>> Is that a result of these short map caches, or could it be something
>> else? (We're running 0.93-76-gc35f422.)
>> WWGD (what would Greg do?) to activate these PGs?
>>
>> Thanks! Dan
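For reference, Sage's bullet procedure above maps to commands along
these lines -- a sketch, assuming a standard admin node; the config
values are the ones quoted in the thread, and the stop command depends
on your init system:

# stop all osds on each host (exact command varies by init system)
service ceph stop osd

# set the cluster flags from the procedure
ceph osd set nodown
ceph osd set nobackfill
ceph osd set noup

# then shrink the map cache in ceph.conf on each OSD host before
# restarting the osds:
[osd]
    osd map cache size = 50
    osd map max advance = 25
    osd map share max epochs = 25
    osd pg epoch persisted max stale = 25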