Re: Cluster in ERR status when rebalancing

This is a (harmless) bug that has existed since Mimic and will be fixed in 14.2.5 (I think?). The health error will clear up without any intervention.
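
If you want to keep an eye on it while the backfill runs, the standard
CLI is all you need; a minimal sketch (nothing bug-specific here):

  # re-check cluster health every 10 seconds until the error clears
  watch -n 10 'ceph health detail'

  # one-off check of the overall status once rebalancing is done
  ceph -s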


Paul

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Mon, Dec 9, 2019 at 12:03 PM Eugen Block <eblock@xxxxxx> wrote:
Hi,

since we upgraded our cluster to Nautilus we have also seen these 
messages sometimes when it's rebalancing. There are several reports 
about this [1] [2]; we didn't see it in Luminous. But eventually the 
rebalancing finished and the error message cleared, so I'd say there's 
(probably) nothing to worry about if there aren't any other issues.

Regards,
Eugen


[1] https://tracker.ceph.com/issues/39555
[2] https://tracker.ceph.com/issues/41255
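
If you want to convince yourself the flag is spurious, compare the
actual OSD utilization against the cluster's fullness thresholds; a
minimal sketch with the standard CLI:

  # per-OSD utilization, including the %USE column
  ceph osd df tree

  # configured thresholds: full_ratio, backfillfull_ratio, nearfull_ratio
  ceph osd dump | grep -i ratio

If every OSD is well below backfillfull_ratio (0.90 by default), the
backfill_toofull state is the cosmetic issue from the trackers above.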


Quoting Simone Lazzaris <simone.lazzaris@xxxxxxx>:

> Hi all,
> Long story short, I have a cluster of 26 OSDs in 3 nodes (8+9+9). One 
> of the disks is showing some read errors, so I've added an OSD in the 
> faulty node (OSD.26) and set the (re)weight of the faulty OSD (OSD.12) 
> to zero.
>
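> For reference, draining an OSD this way is typically done with the
> standard reweight command (the exact invocation I used may have
> differed slightly):
>
>   ceph osd reweight osd.12 0
>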
> The cluster is now rebalancing, which is fine, but I now have 2 PGs 
> in "backfill_toofull" state, so the cluster health is "ERR":
>
>   cluster:
>     id:     9ec27b0f-acfd-40a3-b35d-db301ac5ce8c
>     health: HEALTH_ERR
>             Degraded data redundancy (low space): 2 pgs backfill_toofull
>
>   services:
>     mon: 3 daemons, quorum s1,s2,s3 (age 7d)
>     mgr: s1(active, since 7d), standbys: s2, s3
>     osd: 27 osds: 27 up (since 2h), 26 in (since 2h); 262 remapped pgs
>     rgw: 3 daemons active (s1, s2, s3)
>
>   data:
>     pools:   10 pools, 1200 pgs
>     objects: 11.72M objects, 37 TiB
>     usage:   57 TiB used, 42 TiB / 98 TiB avail
>     pgs:     2618510/35167194 objects misplaced (7.446%)
>              938 active+clean
>              216 active+remapped+backfill_wait
>              44  active+remapped+backfilling
>              2   active+remapped+backfill_wait+backfill_toofull
>
>   io:
>     recovery: 163 MiB/s, 50 objects/s
>
>   progress:
>     Rebalancing after osd.12 marked out
>       [=====.........................]
>
> As you can see, there is plenty of space and none of my OSDs is in a 
> full or near-full state:
>
> +----+------+-------+-------+--------+---------+--------+---------+-----------+
> | id | host |  used | avail | wr ops | wr data | rd ops | rd data |   state   |
> +----+------+-------+-------+--------+---------+--------+---------+-----------+
> | 0  |  s1  | 2415G | 1310G |    0   |     0   |    0   |     0   | exists,up |
> | 1  |  s2  | 2009G | 1716G |    0   |     0   |    0   |     0   | exists,up |
> | 2  |  s3  | 2183G | 1542G |    0   |     0   |    0   |     0   | exists,up |
> | 3  |  s1  | 2680G | 1045G |    0   |     0   |    0   |     0   | exists,up |
> | 4  |  s2  | 2063G | 1662G |    0   |     0   |    0   |     0   | exists,up |
> | 5  |  s3  | 2269G | 1456G |    0   |     0   |    0   |     0   | exists,up |
> | 6  |  s1  | 2523G | 1202G |    0   |     0   |    0   |     0   | exists,up |
> | 7  |  s2  | 1973G | 1752G |    0   |     0   |    0   |     0   | exists,up |
> | 8  |  s3  | 2007G | 1718G |    0   |     0   |    1   |     0   | exists,up |
> | 9  |  s1  | 2485G | 1240G |    0   |     0   |    0   |     0   | exists,up |
> | 10 |  s2  | 2385G | 1340G |    0   |     0   |    0   |     0   | exists,up |
> | 11 |  s3  | 2079G | 1646G |    0   |     0   |    0   |     0   | exists,up |
> | 12 |  s1  | 2272G | 1453G |    0   |     0   |    0   |     0   | exists,up |
> | 13 |  s2  | 2381G | 1344G |    0   |     0   |    0   |     0   | exists,up |
> | 14 |  s3  | 1923G | 1802G |    0   |     0   |    0   |     0   | exists,up |
> | 15 |  s1  | 2617G | 1108G |    0   |     0   |    0   |     0   | exists,up |
> | 16 |  s2  | 2099G | 1626G |    0   |     0   |    0   |     0   | exists,up |
> | 17 |  s3  | 2336G | 1389G |    0   |     0   |    0   |     0   | exists,up |
> | 18 |  s1  | 2435G | 1290G |    0   |     0   |    0   |     0   | exists,up |
> | 19 |  s2  | 2198G | 1527G |    0   |     0   |    0   |     0   | exists,up |
> | 20 |  s3  | 2159G | 1566G |    0   |     0   |    0   |     0   | exists,up |
> | 21 |  s1  | 2128G | 1597G |    0   |     0   |    0   |     0   | exists,up |
> | 22 |  s3  | 2064G | 1661G |    0   |     0   |    0   |     0   | exists,up |
> | 23 |  s2  | 1943G | 1782G |    0   |     0   |    0   |     0   | exists,up |
> | 24 |  s3  | 2168G | 1557G |    0   |     0   |    0   |     0   | exists,up |
> | 25 |  s2  | 2113G | 1612G |    0   |     0   |    0   |     0   | exists,up |
> | 26 |  s1  | 68.9G | 3657G |    0   |     0   |    0   |     0   | exists,up |
> +----+------+-------+-------+--------+---------+--------+---------+-----------+
>
>
>
> root@s1:~# ceph pg dump|egrep 'toofull|PG_STAT'
> PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG DISK_LOG STATE STATE_STAMP VERSION REPORTED UP UP_PRIMARY ACTING ACTING_PRIMARY LAST_SCRUB SCRUB_STAMP LAST_DEEP_SCRUB DEEP_SCRUB_STAMP SNAPTRIMQ_LEN
> 6.212 11110 0 0 22220 0 38145321727 0 0 3023 3023 active+remapped+backfill_wait+backfill_toofull 2019-12-09 11:11:39.093042 13598'212053 13713:1179718 [6,19,24] 6 [13,0,24] 13 13549'211985 2019-12-08 19:46:10.461113 11644'211779 2019-12-06 07:37:42.864325 0
> 6.bc 11057 0 0 22114 0 37733931136 0 0 3032 3032 active+remapped+backfill_wait+backfill_toofull 2019-12-09 10:42:25.534277 13549'212110 13713:1229839 [15,25,17] 15 [19,18,17] 19 13549'211983 2019-12-08 11:02:45.846031 11644'211854 2019-12-06 06:22:43.565313 0
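>
> For what it's worth, the full state of one of the flagged PGs can be
> dumped with the standard query command, e.g.:
>
>   ceph pg 6.212 query | less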
>
> Any hints? I'm not worried, because I think the cluster will heal 
> itself, but this behavior is neither clear nor logical.
>
> --
> *Simone Lazzaris*
> *Qcom S.p.A.*
> simone.lazzaris@xxxxxxx[1] | www.qcom.it[2]
> * LinkedIn[3]* | *Facebook*[4]
>
>
>
> --------
> [1] mailto:simone.lazzaris@xxxxxxx
> [2] https://www.qcom.it
> [3] https://www.linkedin.com/company/qcom-spa
> [4] http://www.facebook.com/qcomspa



_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
