Hi,
since we upgraded our cluster to Nautilus we also see these messages
occasionally while it's rebalancing. There are several reports about this
[1] [2]; we never saw it on Luminous. But eventually the rebalancing
finished and the error cleared, so I'd say there's (probably) nothing
to worry about as long as there are no other issues.
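If you want to keep an eye on it in the meantime, the usual commands are
enough; this is just a generic sketch, nothing specific to your cluster:

   # list the PGs currently flagged as backfill_toofull; the list should
   # shrink as the backfill proceeds
   ceph health detail
   ceph pg ls backfill_toofull

   # only if a backfill target really is close to the threshold (the
   # default backfillfull_ratio is 0.90), it can be raised temporarily:
   # ceph osd set-backfillfull-ratio 0.92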
Regards,
Eugen
[1] https://tracker.ceph.com/issues/39555
[2] https://tracker.ceph.com/issues/41255
Quoting Simone Lazzaris <simone.lazzaris@xxxxxxx>:
Hi all,
Long story short: I have a cluster of 26 OSDs in 3 nodes (8+9+9). One
of the disks is showing some read errors, so I've added an OSD to the
faulty node (osd.26) and set the (re)weight of the faulty OSD (osd.12)
to zero.
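For reference, this was done with the standard commands, roughly as
follows (the device path is only an example):

   # create the new OSD on the spare disk in the faulty node (osd.26)
   ceph-volume lvm create --data /dev/sdX

   # drain the faulty OSD by setting its reweight to zero
   ceph osd reweight 12 0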
The cluster is now rebalancing, which is fine, but I now have 2 PGs in
the "backfill_toofull" state, so the cluster health is HEALTH_ERR:
  cluster:
    id:     9ec27b0f-acfd-40a3-b35d-db301ac5ce8c
    health: HEALTH_ERR
            Degraded data redundancy (low space): 2 pgs backfill_toofull

  services:
    mon: 3 daemons, quorum s1,s2,s3 (age 7d)
    mgr: s1(active, since 7d), standbys: s2, s3
    osd: 27 osds: 27 up (since 2h), 26 in (since 2h); 262 remapped pgs
    rgw: 3 daemons active (s1, s2, s3)

  data:
    pools:   10 pools, 1200 pgs
    objects: 11.72M objects, 37 TiB
    usage:   57 TiB used, 42 TiB / 98 TiB avail
    pgs:     2618510/35167194 objects misplaced (7.446%)
             938 active+clean
             216 active+remapped+backfill_wait
             44  active+remapped+backfilling
             2   active+remapped+backfill_wait+backfill_toofull

  io:
    recovery: 163 MiB/s, 50 objects/s

  progress:
    Rebalancing after osd.12 marked out
      [=====.........................]
As you can see, there is plenty of space and none of my OSDs is in a
full or near-full state:
+----+------+-------+-------+--------+---------+--------+---------+-----------+
| id | host |  used | avail | wr ops | wr data | rd ops | rd data |   state   |
+----+------+-------+-------+--------+---------+--------+---------+-----------+
| 0  | s1   | 2415G | 1310G |    0   |     0   |    0   |     0   | exists,up |
| 1  | s2   | 2009G | 1716G |    0   |     0   |    0   |     0   | exists,up |
| 2  | s3   | 2183G | 1542G |    0   |     0   |    0   |     0   | exists,up |
| 3  | s1   | 2680G | 1045G |    0   |     0   |    0   |     0   | exists,up |
| 4  | s2   | 2063G | 1662G |    0   |     0   |    0   |     0   | exists,up |
| 5  | s3   | 2269G | 1456G |    0   |     0   |    0   |     0   | exists,up |
| 6  | s1   | 2523G | 1202G |    0   |     0   |    0   |     0   | exists,up |
| 7  | s2   | 1973G | 1752G |    0   |     0   |    0   |     0   | exists,up |
| 8  | s3   | 2007G | 1718G |    0   |     0   |    1   |     0   | exists,up |
| 9  | s1   | 2485G | 1240G |    0   |     0   |    0   |     0   | exists,up |
| 10 | s2   | 2385G | 1340G |    0   |     0   |    0   |     0   | exists,up |
| 11 | s3   | 2079G | 1646G |    0   |     0   |    0   |     0   | exists,up |
| 12 | s1   | 2272G | 1453G |    0   |     0   |    0   |     0   | exists,up |
| 13 | s2   | 2381G | 1344G |    0   |     0   |    0   |     0   | exists,up |
| 14 | s3   | 1923G | 1802G |    0   |     0   |    0   |     0   | exists,up |
| 15 | s1   | 2617G | 1108G |    0   |     0   |    0   |     0   | exists,up |
| 16 | s2   | 2099G | 1626G |    0   |     0   |    0   |     0   | exists,up |
| 17 | s3   | 2336G | 1389G |    0   |     0   |    0   |     0   | exists,up |
| 18 | s1   | 2435G | 1290G |    0   |     0   |    0   |     0   | exists,up |
| 19 | s2   | 2198G | 1527G |    0   |     0   |    0   |     0   | exists,up |
| 20 | s3   | 2159G | 1566G |    0   |     0   |    0   |     0   | exists,up |
| 21 | s1   | 2128G | 1597G |    0   |     0   |    0   |     0   | exists,up |
| 22 | s3   | 2064G | 1661G |    0   |     0   |    0   |     0   | exists,up |
| 23 | s2   | 1943G | 1782G |    0   |     0   |    0   |     0   | exists,up |
| 24 | s3   | 2168G | 1557G |    0   |     0   |    0   |     0   | exists,up |
| 25 | s2   | 2113G | 1612G |    0   |     0   |    0   |     0   | exists,up |
| 26 | s1   | 68.9G | 3657G |    0   |     0   |    0   |     0   | exists,up |
+----+------+-------+-------+--------+---------+--------+---------+-----------+
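For what it's worth, the per-OSD utilisation can also be checked against
the thresholds that actually trigger the warning, with something like:

   ceph osd df tree             # per-OSD %USE as CRUSH sees it
   ceph osd dump | grep ratio   # full_ratio / backfillfull_ratio / nearfull_ratio

None of the existing OSDs is above roughly 72% used, well below the
default backfillfull_ratio of 0.90.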
root@s1:~# ceph pg dump|egrep 'toofull|PG_STAT'
PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG DISK_LOG STATE STATE_STAMP VERSION REPORTED UP UP_PRIMARY ACTING ACTING_PRIMARY LAST_SCRUB SCRUB_STAMP LAST_DEEP_SCRUB DEEP_SCRUB_STAMP SNAPTRIMQ_LEN
6.212 11110 0 0 22220 0 38145321727 0 0 3023 3023 active+remapped+backfill_wait+backfill_toofull 2019-12-09 11:11:39.093042 13598'212053 13713:1179718 [6,19,24] 6 [13,0,24] 13 13549'211985 2019-12-08 19:46:10.461113 11644'211779 2019-12-06 07:37:42.864325 0
6.bc 11057 0 0 22114 0 37733931136 0 0 3032 3032 active+remapped+backfill_wait+backfill_toofull 2019-12-09 10:42:25.534277 13549'212110 13713:1229839 [15,25,17] 15 [19,18,17] 19 13549'211983 2019-12-08 11:02:45.846031 11644'211854 2019-12-06 06:22:43.565313 0
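If I read the UP/ACTING columns correctly, 6.212 is moving from
[13,0,24] to [6,19,24] and 6.bc from [19,18,17] to [15,25,17], so the
backfill targets are osd.6/osd.19 and osd.15/osd.25, none of which is
anywhere near full. The mappings can also be shown directly, e.g.:

   ceph pg map 6.212
   ceph pg map 6.bc

   # full per-PG detail, if needed
   ceph pg 6.212 query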
Any hints? I'm not too worried, because I think the cluster will heal
itself, but this behaviour is neither clear nor logical to me.
--
Simone Lazzaris
Qcom S.p.A.
simone.lazzaris@xxxxxxx[1] | www.qcom.it[2]
LinkedIn[3] | Facebook[4]
--------
[1] mailto:simone.lazzaris@xxxxxxx
[2] https://www.qcom.it
[3] https://www.linkedin.com/company/qcom-spa
[4] http://www.facebook.com/qcomspa
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com