Re: Bounding OSD memory requirements during peering/recovery

On Fri, Mar 13, 2015 at 1:52 PM, Dan van der Ster <dan@xxxxxxxxxxxxxx> wrote:
> Hi Sage,
>
> Losing a message would have been plausible given the network issue we had today.
>
> I tried:
>
> # ceph osd pg-temp 75.45 6689
> set 75.45 pg_temp mapping to [6689]
>
> then waited a bit. It's still incomplete -- the only difference is now
> I see two more past_intervals in the pg. Full query here:
> http://pastebin.com/TU7vVLpj
>
> I didn't have debug_osd above zero when I did that. Should I try again
> with debug_osd 20?

I tried again with logging. The pg goes like this:

incomplete -> inactive -> remapped -> remapped+peering -> remapped ->
inactive -> peering -> incomplete
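
(For reference, in case anyone wants to reproduce this: I believe the
debug level can be raised at runtime, without restarting the OSD, with
something along the lines of

# ceph tell osd.6689 injectargs '--debug_osd 20'

and set back down afterwards -- the exact injectargs syntax may vary a
bit between releases.)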

The killer seems to be:

2015-03-13 16:15:43.476925 7f3c2e055700 10 osd.6689 pg_epoch: 67050
pg[75.45( v 66245'4028 (49044'1025,66245'4028] local-les=61515 n=3994
ec=48759 les/c 66791/66791 67037/67050/67037) [6689,1919,2329]/[6689]
r=0 lpr=67050 pi=66787-67049/13 crt=66226'4026 lcod 0'0 mlcod 0'0
remapped+peering] choose_acting no suitable info found (incomplete
backfills?), reverting to up

Full log is here: http://pastebin.com/hZUBD9NT

Do you have any idea what went wrong here? BTW, our firefly "prod"
cluster suffered from the same network problem today, but all of that
cluster's PGs recovered nicely. Does the hammer RC have different
peering logic that might apply here?

Thanks! Dan



>
> Thanks :)
>
> Dan
>
> On Fri, Mar 13, 2015 at 12:59 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>> This looks a bit like the osds may have lost a message, actually.  You can
>> kick an individual pg to repeer with something like
>>
>> ceph osd pg-temp 75.45 6689
>>
>> See if that makes it go?
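>>
>> (Afterwards, something like "ceph pg 75.45 query" or watching "ceph -w"
>> should show whether the PG goes through a fresh peering attempt -- the
>> pg_temp change is just a nudge to re-peer that one PG.)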
>>
>> sage
>>
>>
>>
>> On March 13, 2015 7:24:48 AM EDT, Dan van der Ster <dan@xxxxxxxxxxxxxx>
>> wrote:
>>>
>>> On Mon, Mar 9, 2015 at 4:47 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
>>>>
>>>>  On Mon, Mar 9, 2015 at 8:42 AM, Dan van der Ster <dan@xxxxxxxxxxxxxx>
>>>> wrote:
>>>>>
>>>>>  Hi Sage,
>>>>>
>>>>>  On Tue, Feb 10, 2015 at 2:51 AM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>>>>>>
>>>>>>  On Mon, 9 Feb 2015, David McBride wrote:
>>>>>>>
>>>>>>>  On 09/02/15 15:31, Gregory Farnum wrote:
>>>>>>>
>>>>>>>>  So, memory usage of an OSD is usually linear in the number of PGs
>>>>>>>>  it hosts. However, that memory can also grow based on at least one
>>>>>>>>  other thing: the number of OSD Maps required to go through peering.
>>>>>>>>  It *looks* to me like this is what you're running into, not growth
>>>>>>>>  in the number of state machines. In particular, those past_intervals
>>>>>>>>  you mentioned. ;)
>>>>>>>
>>>>>>>
>>>>>>>  Hi Greg,
>>>>>>>
>>>>>>>  Right, that sounds entirely plausible, and is very helpful.
>>>>>>>
>>>>>>>  In practice, that means I'll need to be careful to avoid this
>>>>>>>  situation occurring in production -- but given that's unlikely to
>>>>>>>  occur except in the case of non-trivial neglect, I don't think I
>>>>>>>  need be particularly concerned.
>>>>>>>
>>>>>>>  (Happily, I'm in the situation that my existing cluster is purely
>>>>>>>  for testing purposes; the data is expendable.)
>>>>>>>
>>>>>>>  That said, for my own peace of mind, it would be valuable to have a
>>>>>>>  procedure that can be used to recover from this state, even if it's
>>>>>>>  unlikely to occur in practice.
>>>>>>
>>>>>>
>>>>>>  The best luck I've had recovering from situations like this is something like:
>>>>>>
>>>>>>  - stop all osds
>>>>>>  - osd set nodown
>>>>>>  - osd set nobackfill
>>>>>>  - osd set noup
>>>>>>  - set map cache size smaller to reduce memory footprint.
>>>>>>
>>>>>>    osd map cache size = 50
>>>>>>    osd map max advance = 25
>>>>>>    osd map share max epochs = 25
>>>>>>    osd pg epoch persisted max stale = 25
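>>>>>>
>>>>>>  For concreteness, that amounts to roughly the following (the config
>>>>>>  values are the illustrative ones above, not a recommendation):
>>>>>>
>>>>>>    # with all osds stopped, set the cluster flags
>>>>>>    ceph osd set nodown
>>>>>>    ceph osd set nobackfill
>>>>>>    ceph osd set noup
>>>>>>
>>>>>>    # and in ceph.conf on the osd hosts, before restarting them:
>>>>>>    [osd]
>>>>>>      osd map cache size = 50
>>>>>>      osd map max advance = 25
>>>>>>      osd map share max epochs = 25
>>>>>>      osd pg epoch persisted max stale = 25
>>>>>>
>>>>>>  (and presumably unset the flags again, e.g. ceph osd unset noup,
>>>>>>  once things have settled)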
>>>>
>>>>
>>>>  It can cause extreme slowness if you get into a failure situation and
>>>>  your OSDs need to calculate past intervals across more maps than will
>>>>  fit in the cache. :(
>>>
>>>
>>> ... extreme slowness, or is it also possible to get into a situation
>>> where the PGs are stuck incomplete forever?
>>>
>>> The reason I ask is that we actually had a network issue this morning
>>> that left OSDs flapping and caused a lot of osdmap epoch churn. The
>>> network has since stabilized, but 10 PGs are incomplete even though all
>>> the OSDs are up. One PG looks like this, for example:
>>>
>>> pg 75.45 is stuck inactive for 87351.077529, current state incomplete,
>>> last acting [6689,1919,2329]
>>> pg 75.45 is stuck unclean for 87351.096198, current state incomplete,
>>> last acting [6689,1919,2329]
>>> pg 75.45 is incomplete, acting [6689,1919,2329]
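>>>
>>> (That's the stuck-PG summary as printed by something like ceph health
>>> detail; ceph pg dump_stuck inactive gives a similar listing.)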
>>>
>>> 1919   3.62000   osd.1919   up   1.00000   1.00000
>>> 2329   3.62000   osd.2329   up   1.00000   1.00000
>>> 6689   3.62000   osd.6689   up   1.00000   1.00000
>>>
>>> The pg query output here: http://pastebin.com/WyTAU69W
>>>
>>> Is that a result of these short map caches or could it be something
>>> else?  (we're running 0.93-76-gc35f422)
>>> WWGD (what would Greg do?) to activate these PGs?
>>>
>>> Thanks! Dan