Let's try to restrict discussion to the original thread
"backfill_toofull while OSDs are not full" and get a tracker opened up
for this issue.

On Sat, Feb 2, 2019 at 11:52 AM Fyodor Ustinov <ufm@xxxxxx> wrote:
>
> Hi!
>
> Right now, after adding an OSD:
>
> # ceph health detail
> HEALTH_ERR 74197563/199392333 objects misplaced (37.212%); Degraded data redundancy (low space): 1 pg backfill_toofull
> OBJECT_MISPLACED 74197563/199392333 objects misplaced (37.212%)
> PG_DEGRADED_FULL Degraded data redundancy (low space): 1 pg backfill_toofull
>     pg 6.eb is active+remapped+backfill_wait+backfill_toofull, acting [21,0,47]
>
> # ceph pg ls-by-pool iscsi backfill_toofull
> PG   OBJECTS DEGRADED MISPLACED UNFOUND BYTES      LOG  STATE                                          STATE_STAMP                VERSION   REPORTED   UP         ACTING       SCRUB_STAMP                DEEP_SCRUB_STAMP
> 6.eb     645        0      1290       0 1645654016 3067 active+remapped+backfill_wait+backfill_toofull 2019-02-02 00:20:32.975300 7208'6567 9790:16214 [5,1,21]p5 [21,0,47]p21 2019-01-18 04:13:54.280495 2019-01-18 04:13:54.280495
>
> All OSDs have less than 40% USE.
>
> ID CLASS WEIGHT  REWEIGHT SIZE    USE     AVAIL   %USE  VAR  PGS
>  0   hdd 9.56149  1.00000 9.6 TiB 3.2 TiB 6.3 TiB 33.64 1.31 313
>  1   hdd 9.56149  1.00000 9.6 TiB 3.3 TiB 6.3 TiB 34.13 1.33 295
>  5   hdd 9.56149  1.00000 9.6 TiB 756 GiB 8.8 TiB  7.72 0.30 103
> 47   hdd 9.32390  1.00000 9.3 TiB 3.1 TiB 6.2 TiB 33.75 1.31 306
>
> (All other OSDs are also below 40%.)
>
> ceph version 13.2.4 (b10be4d44915a4d78a8e06aa31919e74927b142e) mimic (stable)
>
> Maybe the developers will pay attention to this email and say something?
>
> ----- Original Message -----
> From: "Fyodor Ustinov" <ufm@xxxxxx>
> To: "Caspar Smit" <casparsmit@xxxxxxxxxxx>
> Cc: "Jan Kasprzak" <kas@xxxxxxxxxx>, "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
> Sent: Thursday, 31 January, 2019 16:50:24
> Subject: Re: backfill_toofull after adding new OSDs
>
> Hi!
>
> I have seen the same thing several times when adding a new OSD to the
> cluster: one or two PGs in the "backfill_toofull" state.
>
> In all versions of mimic.
>
> ----- Original Message -----
> From: "Caspar Smit" <casparsmit@xxxxxxxxxxx>
> To: "Jan Kasprzak" <kas@xxxxxxxxxx>
> Cc: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
> Sent: Thursday, 31 January, 2019 15:43:07
> Subject: Re: backfill_toofull after adding new OSDs
>
> Hi Jan,
>
> You might be hitting the same issue as Wido here:
>
> https://www.spinics.net/lists/ceph-users/msg50603.html
>
> Kind regards,
> Caspar
>
> On Thu, 31 Jan 2019 at 14:36, Jan Kasprzak <kas@xxxxxxxxxx> wrote:
>
> Hello, ceph users,
>
> I see the following HEALTH_ERR during cluster rebalance:
>
> Degraded data redundancy (low space): 8 pgs backfill_toofull
>
> Detailed description:
> I have upgraded my cluster to mimic and added 16 new bluestore OSDs
> on 4 hosts. The hosts are in a separate region in my crush map, and crush
> rules prevented data from being moved onto the new OSDs. Now I want to move
> all data to the new OSDs (and possibly decommission the old filestore OSDs).
> I have created the following rule:
>
> # ceph osd crush rule create-replicated on-newhosts newhostsroot host
>
> After this, I am slowly moving the pools one-by-one to this new rule:
>
> # ceph osd pool set test-hdd-pool crush_rule on-newhosts
>
> When I do this, I get the above error. This is misleading, because
> ceph osd df does not suggest the OSDs are getting full (the most full
> OSD is about 41% full). After rebalancing is done, the HEALTH_ERR
> disappears. Why am I getting this error?
>
> # ceph -s
>   cluster:
>     id:     ...my UUID...
>     health: HEALTH_ERR
>             1271/3803223 objects misplaced (0.033%)
>             Degraded data redundancy: 40124/3803223 objects degraded (1.055%), 65 pgs degraded, 67 pgs undersized
>             Degraded data redundancy (low space): 8 pgs backfill_toofull
>
>   services:
>     mon: 3 daemons, quorum mon1,mon2,mon3
>     mgr: mon2(active), standbys: mon1, mon3
>     osd: 80 osds: 80 up, 80 in; 90 remapped pgs
>     rgw: 1 daemon active
>
>   data:
>     pools:   13 pools, 5056 pgs
>     objects: 1.27 M objects, 4.8 TiB
>     usage:   15 TiB used, 208 TiB / 224 TiB avail
>     pgs:     40124/3803223 objects degraded (1.055%)
>              1271/3803223 objects misplaced (0.033%)
>              4963 active+clean
>              41   active+recovery_wait+undersized+degraded+remapped
>              21   active+recovery_wait+undersized+degraded
>              17   active+remapped+backfill_wait
>              5    active+remapped+backfill_wait+backfill_toofull
>              3    active+remapped+backfill_toofull
>              2    active+recovering+undersized+remapped
>              2    active+recovering+undersized+degraded+remapped
>              1    active+clean+remapped
>              1    active+recovering+undersized+degraded
>
>   io:
>     client:   6.6 MiB/s rd, 2.7 MiB/s wr, 75 op/s rd, 89 op/s wr
>     recovery: 2.0 MiB/s, 92 objects/s
>
> Thanks for any hint,
>
> -Yenya
>
> --
> | Jan "Yenya" Kasprzak <kas at {fi.muni.cz - work | yenya.net - private}> |
> | http://www.fi.muni.cz/~kas/                        GPG: 4096R/A45477D5  |
> This is the world we live in: the way to deal with computers is to google
> the symptoms, and hope that you don't have to watch a video. --P. Zaitcev
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Cheers,
Brad

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
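
For anyone landing on this thread with the same symptom, one sanity
check worth doing before assuming a bug (a sketch only, not something
the posters above reported running): as far as I understand it, the
backfill_toofull state is evaluated against the backfillfull_ratio
stored in the OSDMap, not directly against the percentages shown by
ceph osd df, so it is worth confirming the configured ratios first:

# ceph osd dump | grep ratio
full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.85

(The values shown are the usual mimic defaults; your cluster may
differ.) If backfillfull_ratio turns out to be unexpectedly low, it can
be raised with the corresponding mon command, for example:

# ceph osd set-backfillfull-ratio 0.9

If the ratios are at their defaults and the OSDs really are under 40%
used, as in the reports above, then the flag is most likely the
spurious backfill_toofull behaviour this thread is about rather than a
real space problem.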