Hi!

Right now, after adding an OSD:

# ceph health detail
HEALTH_ERR 74197563/199392333 objects misplaced (37.212%); Degraded data redundancy (low space): 1 pg backfill_toofull
OBJECT_MISPLACED 74197563/199392333 objects misplaced (37.212%)
PG_DEGRADED_FULL Degraded data redundancy (low space): 1 pg backfill_toofull
    pg 6.eb is active+remapped+backfill_wait+backfill_toofull, acting [21,0,47]

# ceph pg ls-by-pool iscsi backfill_toofull
PG   OBJECTS DEGRADED MISPLACED UNFOUND BYTES      LOG  STATE                                          STATE_STAMP                VERSION   REPORTED   UP         ACTING       SCRUB_STAMP                DEEP_SCRUB_STAMP
6.eb 645     0        1290      0       1645654016 3067 active+remapped+backfill_wait+backfill_toofull 2019-02-02 00:20:32.975300 7208'6567 9790:16214 [5,1,21]p5 [21,0,47]p21 2019-01-18 04:13:54.280495 2019-01-18 04:13:54.280495

All OSDs are below 40% used:

ID CLASS WEIGHT  REWEIGHT SIZE    USE     AVAIL   %USE  VAR  PGS
 0   hdd 9.56149  1.00000 9.6 TiB 3.2 TiB 6.3 TiB 33.64 1.31 313
 1   hdd 9.56149  1.00000 9.6 TiB 3.3 TiB 6.3 TiB 34.13 1.33 295
 5   hdd 9.56149  1.00000 9.6 TiB 756 GiB 8.8 TiB  7.72 0.30 103
47   hdd 9.32390  1.00000 9.3 TiB 3.1 TiB 6.2 TiB 33.75 1.31 306
(all other OSDs are also below 40%)

ceph version 13.2.4 (b10be4d44915a4d78a8e06aa31919e74927b142e) mimic (stable)

Perhaps the developers will notice this email and say something?

----- Original Message -----
From: "Fyodor Ustinov" <ufm@xxxxxx>
To: "Caspar Smit" <casparsmit@xxxxxxxxxxx>
Cc: "Jan Kasprzak" <kas@xxxxxxxxxx>, "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
Sent: Thursday, 31 January, 2019 16:50:24
Subject: Re: backfill_toofull after adding new OSDs

Hi!

I have seen the same thing several times when adding a new OSD to the cluster: one or two PGs in the "backfill_toofull" state, on every version of mimic.

----- Original Message -----
From: "Caspar Smit" <casparsmit@xxxxxxxxxxx>
To: "Jan Kasprzak" <kas@xxxxxxxxxx>
Cc: "ceph-users" <ceph-users@xxxxxxxxxxxxxx>
Sent: Thursday, 31 January, 2019 15:43:07
Subject: Re: backfill_toofull after adding new OSDs

Hi Jan,

You might be hitting the same issue as Wido here:
https://www.spinics.net/lists/ceph-users/msg50603.html

Kind regards,
Caspar

On Thu, 31 Jan 2019 at 14:36, Jan Kasprzak <kas@xxxxxxxxxx> wrote:

Hello, ceph users,

I see the following HEALTH_ERR during cluster rebalance:

Degraded data redundancy (low space): 8 pgs backfill_toofull

Detailed description: I have upgraded my cluster to mimic and added 16 new bluestore OSDs on 4 hosts. The hosts are in a separate region in my crush map, and crush rules prevented data from being moved onto the new OSDs. Now I want to move all data to the new OSDs (and possibly decommission the old filestore OSDs). I have created the following rule:

# ceph osd crush rule create-replicated on-newhosts newhostsroot host

After this, I am slowly moving the pools one by one to this new rule:

# ceph osd pool set test-hdd-pool crush_rule on-newhosts

When I do this, I get the above error. This is misleading, because ceph osd df does not suggest the OSDs are getting full (the most full OSD is about 41% full). After rebalancing is done, the HEALTH_ERR disappears. Why am I getting this error?
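A brief aside before the status output (hedged: the values shown below are the stock mimic defaults, not confirmed from this cluster): as far as I understand, backfill_toofull is evaluated against the backfillfull_ratio threshold rather than the %USE column of ceph osd df, so it is worth checking what the cluster is actually configured with:

# ceph osd dump | grep ratio
full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.85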
# ceph -s
  cluster:
    id:     ...my UUID...
    health: HEALTH_ERR
            1271/3803223 objects misplaced (0.033%)
            Degraded data redundancy: 40124/3803223 objects degraded (1.055%), 65 pgs degraded, 67 pgs undersized
            Degraded data redundancy (low space): 8 pgs backfill_toofull

  services:
    mon: 3 daemons, quorum mon1,mon2,mon3
    mgr: mon2(active), standbys: mon1, mon3
    osd: 80 osds: 80 up, 80 in; 90 remapped pgs
    rgw: 1 daemon active

  data:
    pools:   13 pools, 5056 pgs
    objects: 1.27 M objects, 4.8 TiB
    usage:   15 TiB used, 208 TiB / 224 TiB avail
    pgs:     40124/3803223 objects degraded (1.055%)
             1271/3803223 objects misplaced (0.033%)
             4963 active+clean
             41   active+recovery_wait+undersized+degraded+remapped
             21   active+recovery_wait+undersized+degraded
             17   active+remapped+backfill_wait
             5    active+remapped+backfill_wait+backfill_toofull
             3    active+remapped+backfill_toofull
             2    active+recovering+undersized+remapped
             2    active+recovering+undersized+degraded+remapped
             1    active+clean+remapped
             1    active+recovering+undersized+degraded

  io:
    client:   6.6 MiB/s rd, 2.7 MiB/s wr, 75 op/s rd, 89 op/s wr
    recovery: 2.0 MiB/s, 92 objects/s

Thanks for any hint,

-Yenya

--
| Jan "Yenya" Kasprzak <kas at {fi.muni.cz - work | yenya.net - private}> |
| http://www.fi.muni.cz/~kas/                         GPG: 4096R/A45477D5 |
  This is the world we live in: the way to deal with computers is to google
  the symptoms, and hope that you don't have to watch a video. --P. Zaitcev
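A closing note for anyone who lands on this thread with a PG stuck in backfill_toofull on a cluster that is nowhere near full: one workaround sometimes used (a sketch under the assumption that the backfillfull threshold itself is what is blocking the reservation; it was not confirmed by anyone in this thread) is to raise the ratio slightly so the pending backfill can be granted, then restore the default afterwards:

# ceph osd set-backfillfull-ratio 0.92
(wait until "ceph pg ls backfill_toofull" no longer lists the affected PG)
# ceph osd set-backfillfull-ratio 0.90

This only makes sense when ceph osd df confirms there is ample free space, as it does here; on a genuinely full cluster, raising the ratio risks filling an OSD completely.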