Re: backfill_toofull seen on cluster where the most full OSD is at 1%

Brad Hubbard <bhubbard@xxxxxxxxxx> · Mon, 19 Aug 2019 11:26:34 +1000



+dev@ceph

On Thu, Aug 15, 2019 at 10:42 PM Paul Emmerich <paul.emmerich@xxxxxxxx> wrote:
>
> We've also seen this bug several times since Mimic; it seems to happen
> whenever a backfill target goes down. It always resolves itself but is
> still annoying.
>
> The original fix (making this a warning instead of an error)
> unfortunately doesn't help on Nautilus because we often have clusters
> that would be HEALTH_OK without this bug on Nautilus (i.e., some PGs
> in remapped+backfill*) but they will show up as HEALTH_WARN with this
> fix (and HEALTH_ERR without it).
>
>
>
> Paul
>
>
>
> On Wed, Aug 14, 2019 at 11:44 PM Bryan Stillwell <bstillwell@xxxxxxxxxxx> wrote:
> >
> > We've run into this issue on the first two clusters after upgrading them to Nautilus (14.2.2).
> >
> > When marking a single OSD back in to the cluster, some PGs switch to the active+remapped+backfill_wait+backfill_toofull state for a while; the state clears after some of the other PGs finish backfilling.  This is rather odd because all the data on the cluster could fit on a single drive, and we have over 100 of them:
> >
> > # ceph -s
> >   cluster:
> >     id:     XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
> >     health: HEALTH_ERR
> >             Degraded data redundancy (low space): 1 pg backfill_toofull
> >
> >   services:
> >     mon: 3 daemons, quorum a1cephmon002,a1cephmon003,a1cephmon004 (age 21h)
> >     mgr: a1cephmon002(active, since 21h), standbys: a1cephmon003, a1cephmon004
> >     mds: cephfs:2 {0=a1cephmon002=up:active,1=a1cephmon003=up:active} 1 up:standby
> >     osd: 143 osds: 142 up, 142 in; 106 remapped pgs
> >     rgw: 11 daemons active (radosgw.a1cephrgw008, radosgw.a1cephrgw009, radosgw.a1cephrgw010, radosgw.a1cephrgw011, radosgw.a1tcephrgw002, radosgw.a1tcephrgw003, radosgw.a1tcephrgw004, radosgw.a1tcephrgw005, radosgw.a1tcephrgw006, radosgw.a1tcephrgw007, radosgw.a1tcephrgw008)
> >
> >   data:
> >     pools:   19 pools, 5264 pgs
> >     objects: 1.45M objects, 148 GiB
> >     usage:   658 GiB used, 436 TiB / 437 TiB avail
> >     pgs:     44484/4351770 objects misplaced (1.022%)
> >              5158 active+clean
> >              104  active+remapped+backfill_wait
> >              1    active+remapped+backfilling
> >              1    active+remapped+backfill_wait+backfill_toofull
> >
> >   io:
> >     client:   19 MiB/s rd, 13 MiB/s wr, 431 op/s rd, 509 op/s wr
> >
> >
> > I searched the archives, but most of the other reports came from fuller clusters, where this state can sometimes be valid.  This bug report seems similar, but the fix was just to make it a warning instead of an error:
> >
> > https://tracker.ceph.com/issues/39555
> >
> >
> > So I've created a new tracker ticket to troubleshoot this issue:
> >
> > https://tracker.ceph.com/issues/4125
> >
> >
> > Let me know what you guys think,
> >
> > Bryan
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
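
For anyone who wants to watch for this state while it is being debugged, here is a minimal sketch that counts PG states from the plain-text `ceph -s` output quoted above and picks out any that include backfill_toofull. It assumes the "pgs:" table keeps the layout shown in Bryan's paste; the names pg_state_counts and toofull_states are hypothetical helpers, not part of Ceph.

```python
import re

# One row of the "pgs:" state table: an optional "pgs:" label, a count,
# then "+"-joined state names such as "active+remapped+backfill_wait".
# Assumption: the layout matches the `ceph -s` paste quoted in this thread.
PG_STATE_LINE = re.compile(
    r"^[ \t]*(?:pgs:[ \t]+)?(\d+)[ \t]+([a-z_]+(?:\+[a-z_]+)*)[ \t]*$",
    re.MULTILINE,
)


def pg_state_counts(status_text):
    """Map each combined PG state string to its PG count."""
    return {state: int(count) for count, state in PG_STATE_LINE.findall(status_text)}


def toofull_states(counts):
    """Keep only the states that include the backfill_toofull flag."""
    return {s: n for s, n in counts.items() if "backfill_toofull" in s.split("+")}


# The "data:" section from the quoted output, with the quote prefixes removed.
SAMPLE = """\
  data:
    pools:   19 pools, 5264 pgs
    objects: 1.45M objects, 148 GiB
    usage:   658 GiB used, 436 TiB / 437 TiB avail
    pgs:     44484/4351770 objects misplaced (1.022%)
             5158 active+clean
             104  active+remapped+backfill_wait
             1    active+remapped+backfilling
             1    active+remapped+backfill_wait+backfill_toofull
"""

if __name__ == "__main__":
    counts = pg_state_counts(SAMPLE)
    print(toofull_states(counts))
```

Feeding it the real output of `ceph -s` instead of SAMPLE should work the same way, as long as the table format has not changed between releases.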


-- 
Cheers,
Brad