Re: backfill_toofull while OSDs are not full

Wido den Hollander <wido@xxxxxxxx> · Wed, 30 Jan 2019 21:26:44 +0100

On 1/30/19 9:08 PM, David Zafman wrote:
> 
> Strange, I can't reproduce this with v13.2.4.  I tried the following
> scenarios:
> 
> pg acting 1, 0, 2 -> up 1, 0, 4 (osd.2 marked out).  The df on osd.2
> shows 0 space, but only osd.4 (backfill target) checks full space.
> 
> pg acting 1, 0, 2 -> up 4, 3, 5 (osd.1, 0, 2 all marked out).  The df for
> 1, 0, 2 shows 0 space, but osd.4, 3, 5 (backfill targets) check full space.
> 
> FYI, in a later release, even when a backfill target is below
> backfillfull_ratio, if there isn't enough room for the pg to fit then
> backfill_toofull occurs.
> 
> 
> The question in your case is whether any of OSDs 999, 1900, or 145 was
> above 90% (backfillfull_ratio) usage.
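
For readers following along, the two conditions David describes can be modelled roughly as below. This is only a sketch of the description in this thread, not Ceph's actual code: the function name, its parameters, the `>=` comparison, and reading "enough room for the pg to fit" as "free space is at least the PG size" are all assumptions. It is meant to be applied to backfill targets only.

```python
def backfill_toofull(used_bytes, total_bytes, pg_bytes, backfillfull_ratio=0.90):
    """Hypothetical model of the backfill_toofull decision for ONE backfill target.

    Not taken from the Ceph source; it only mirrors the two checks
    described in the thread.
    """
    if total_bytes <= 0:
        # Choice made for this sketch: refuse to evaluate a zero-capacity OSD
        # (for example one that is marked out) instead of guessing a result.
        raise ValueError("total_bytes must be positive")

    # Check 1: the target is already at or above backfillfull_ratio.
    if used_bytes / total_bytes >= backfillfull_ratio:
        return True

    # Check 2 (later releases, per the description above): the PG would not
    # fit in the remaining free space. "Fit" is an assumption of this sketch.
    free_bytes = total_bytes - used_bytes
    return pg_bytes > free_bytes
```

With a target at 75% utilization, check 1 does not fire, so in this model only a PG larger than the remaining 25% would produce backfill_toofull.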

I triple-checked and this was not the case. I've had two instances of
Mimic 13.2.4 where I ran into this, and somebody else has reported it to
me as well.

In a few weeks I'll be performing an expansion with a customer where I'm
expecting this to show up again.

I'll check again, note the utilization of all OSDs and report back.
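
One way to record that utilization is to save the output of `ceph osd df --format json` and filter it for the OSDs of interest. The sketch below assumes that the JSON has a top-level `nodes` list whose entries carry `id`, `kb` and `kb_used` fields; those field names, the function name and the report layout are assumptions and should be checked against the real output on the cluster.

```python
def utilization_report(osd_df, osd_ids, backfillfull_ratio=0.90):
    """Summarise usage for selected OSDs from parsed `ceph osd df` JSON.

    `osd_df` is the already-parsed JSON document (a dict). The field names
    used here ("nodes", "id", "kb", "kb_used") are assumed, not verified.
    """
    wanted = set(osd_ids)
    report = {}
    for node in osd_df.get("nodes", []):
        osd_id = node.get("id")
        if osd_id not in wanted:
            continue
        total_kb = node.get("kb", 0)
        used_kb = node.get("kb_used", 0)
        # An OSD that is marked out reports zero capacity; record that
        # explicitly instead of dividing by zero.
        ratio = used_kb / total_kb if total_kb else None
        report[osd_id] = {
            "ratio": ratio,
            "over_backfillfull": ratio is not None and ratio >= backfillfull_ratio,
        }
    return report
```

Loading the saved file with `json.load()` and calling `utilization_report(data, [999, 1900, 145])` would give one entry per backfill target. An OSD with zero reported capacity appears with a ratio of `None`.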

Wido

> 
> David
> 
> On 1/27/19 11:34 PM, Wido den Hollander wrote:
>>
>> On 1/25/19 8:33 AM, Gregory Farnum wrote:
>>> This doesn’t look familiar to me. Is the cluster still doing recovery so
>>> we can at least expect them to make progress when the “out” OSDs get
>>> removed from the set?
>> The recovery has already finished. It resolves itself, but in the
>> meantime I saw many PGs in the backfill_toofull state for a long time.
>>
>> This is new since Mimic.
>>
>> Wido
>>
>>> On Tue, Jan 22, 2019 at 2:44 PM Wido den Hollander <wido@xxxxxxxx> wrote:
>>>
>>>      Hi,
>>>
>>>      I've got a couple of PGs which are stuck in backfill_toofull,
>>>      but none of them are actually full.
>>>
>>>        "up": [
>>>          999,
>>>          1900,
>>>          145
>>>        ],
>>>        "acting": [
>>>          701,
>>>          1146,
>>>          1880
>>>        ],
>>>        "backfill_targets": [
>>>          "145",
>>>          "999",
>>>          "1900"
>>>        ],
>>>        "acting_recovery_backfill": [
>>>          "145",
>>>          "701",
>>>          "999",
>>>          "1146",
>>>          "1880",
>>>          "1900"
>>>        ],
>>>
>>>      I checked all these OSDs, but they are all <75% utilization.
>>>
>>>      full_ratio 0.95
>>>      backfillfull_ratio 0.9
>>>      nearfull_ratio 0.9
>>>
>>>      So I started checking all the PGs and I've noticed that each of
>>>      these PGs has one OSD in the 'acting_recovery_backfill' which is
>>>      marked as out.
>>>
>>>      In this case osd.1880 is marked as out and thus its capacity is
>>>      shown as zero.
>>>
>>>      [ceph@ceph-mgr ~]$ ceph osd df|grep 1880
>>>      1880   hdd 4.54599        0     0 B      0 B      0 B     0    0  27
>>>      [ceph@ceph-mgr ~]$
>>>
>>>      This is on a Mimic 13.2.4 cluster. Is this expected or is this an
>>>      unknown side-effect of one of the OSDs being marked as out?
>>>
>>>      Thanks,
>>>
>>>      Wido
>>>      _______________________________________________
>>>      ceph-users mailing list
>>>      ceph-users@xxxxxxxxxxxxxx
>>>      http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>