Re: Cluster in ERR status when rebalancing

Hi,

Since we upgraded our cluster to Nautilus we also see these messages occasionally while it is rebalancing; we never saw them on Luminous, and there are several existing reports about this [1] [2]. In our case the rebalancing eventually finished and the error cleared on its own, so as long as there are no other issues there is (probably) nothing to worry about.
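
For what it's worth, while the rebalance is still running you can double-check that it is only this reporting quirk and not a real space problem with something like (standard Nautilus commands):

ceph health detail      # lists the PGs currently flagged backfill_toofull
ceph osd df             # per-OSD %USE; nothing should be near the backfillfull ratio (0.90 by default)

If a PG ever stayed stuck in that state you could temporarily raise the threshold with "ceph osd set-backfillfull-ratio" (e.g. 0.91, just as an example value), but as said, in our case the flag cleared on its own once the backfill finished.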

Regards,
Eugen


[1] https://tracker.ceph.com/issues/39555
[2] https://tracker.ceph.com/issues/41255


Quoting Simone Lazzaris <simone.lazzaris@xxxxxxx>:

Hi all;
Long story short: I have a cluster of 26 OSDs across 3 nodes (8+9+9). One of the disks is showing read errors, so I've added an OSD to the faulty node (osd.26) and set the (re)weight of the faulty OSD (osd.12) to zero.
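
(For reference, draining a failing OSD like this is typically just something along the lines of the following; osd.12 is of course specific to my setup:)

ceph osd reweight osd.12 0      # empty osd.12; its PGs get remapped onto the remaining OSDs
# marking it out has the same effect on data placement:
# ceph osd out 12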

The cluster is now rebalancing, which is fine, but I now have 2 PGs in "backfill_toofull" state, so the cluster health is HEALTH_ERR:

  cluster:
    id:     9ec27b0f-acfd-40a3-b35d-db301ac5ce8c
    health: HEALTH_ERR
            Degraded data redundancy (low space): 2 pgs backfill_toofull

  services:
    mon: 3 daemons, quorum s1,s2,s3 (age 7d)
    mgr: s1(active, since 7d), standbys: s2, s3
    osd: 27 osds: 27 up (since 2h), 26 in (since 2h); 262 remapped pgs
    rgw: 3 daemons active (s1, s2, s3)

  data:
    pools:   10 pools, 1200 pgs
    objects: 11.72M objects, 37 TiB
    usage:   57 TiB used, 42 TiB / 98 TiB avail
    pgs:     2618510/35167194 objects misplaced (7.446%)
             938 active+clean
             216 active+remapped+backfill_wait
             44  active+remapped+backfilling
             2   active+remapped+backfill_wait+backfill_toofull

  io:
    recovery: 163 MiB/s, 50 objects/s

  progress:
    Rebalancing after osd.12 marked out
      [=====.........................]

As you can see, there is plenty of space and none of my OSDs is in a full or nearfull state:

+----+------+-------+-------+--------+---------+--------+---------+-----------+
| id | host |  used | avail | wr ops | wr data | rd ops | rd data |   state   |
+----+------+-------+-------+--------+---------+--------+---------+-----------+
| 0  | s1   | 2415G | 1310G |    0   |    0    |    0   |    0    | exists,up |
| 1  | s2   | 2009G | 1716G |    0   |    0    |    0   |    0    | exists,up |
| 2  | s3   | 2183G | 1542G |    0   |    0    |    0   |    0    | exists,up |
| 3  | s1   | 2680G | 1045G |    0   |    0    |    0   |    0    | exists,up |
| 4  | s2   | 2063G | 1662G |    0   |    0    |    0   |    0    | exists,up |
| 5  | s3   | 2269G | 1456G |    0   |    0    |    0   |    0    | exists,up |
| 6  | s1   | 2523G | 1202G |    0   |    0    |    0   |    0    | exists,up |
| 7  | s2   | 1973G | 1752G |    0   |    0    |    0   |    0    | exists,up |
| 8  | s3   | 2007G | 1718G |    0   |    0    |    1   |    0    | exists,up |
| 9  | s1   | 2485G | 1240G |    0   |    0    |    0   |    0    | exists,up |
| 10 | s2   | 2385G | 1340G |    0   |    0    |    0   |    0    | exists,up |
| 11 | s3   | 2079G | 1646G |    0   |    0    |    0   |    0    | exists,up |
| 12 | s1   | 2272G | 1453G |    0   |    0    |    0   |    0    | exists,up |
| 13 | s2   | 2381G | 1344G |    0   |    0    |    0   |    0    | exists,up |
| 14 | s3   | 1923G | 1802G |    0   |    0    |    0   |    0    | exists,up |
| 15 | s1   | 2617G | 1108G |    0   |    0    |    0   |    0    | exists,up |
| 16 | s2   | 2099G | 1626G |    0   |    0    |    0   |    0    | exists,up |
| 17 | s3   | 2336G | 1389G |    0   |    0    |    0   |    0    | exists,up |
| 18 | s1   | 2435G | 1290G |    0   |    0    |    0   |    0    | exists,up |
| 19 | s2   | 2198G | 1527G |    0   |    0    |    0   |    0    | exists,up |
| 20 | s3   | 2159G | 1566G |    0   |    0    |    0   |    0    | exists,up |
| 21 | s1   | 2128G | 1597G |    0   |    0    |    0   |    0    | exists,up |
| 22 | s3   | 2064G | 1661G |    0   |    0    |    0   |    0    | exists,up |
| 23 | s2   | 1943G | 1782G |    0   |    0    |    0   |    0    | exists,up |
| 24 | s3   | 2168G | 1557G |    0   |    0    |    0   |    0    | exists,up |
| 25 | s2   | 2113G | 1612G |    0   |    0    |    0   |    0    | exists,up |
| 26 | s1   | 68.9G | 3657G |    0   |    0    |    0   |    0    | exists,up |
+----+------+-------+-------+--------+---------+--------+---------+-----------+
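
As a rough sanity check against the default thresholds (assuming they are unchanged): the fullest OSD above is osd.3 with 2680G used out of about 2680G + 1045G = 3725G, i.e. roughly 72% full, well below the default backfillfull_ratio of 0.90. The configured ratios can be confirmed with:

root@s1:~# ceph osd dump | grep -i ratio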



root@s1:~# ceph pg dump|egrep 'toofull|PG_STAT'
PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES OMAP_BYTES* OMAP_KEYS* LOG DISK_LOG STATE STATE_STAMP VERSION REPORTED UP UP_PRIMARY ACTING ACTING_PRIMARY LAST_SCRUB SCRUB_STAMP LAST_DEEP_SCRUB DEEP_SCRUB_STAMP SNAPTRIMQ_LEN
6.212 11110 0 0 22220 0 38145321727 0 0 3023 3023 active+remapped+backfill_wait+backfill_toofull 2019-12-09 11:11:39.093042 13598'212053 13713:1179718 [6,19,24] 6 [13,0,24] 13 13549'211985 2019-12-08 19:46:10.461113 11644'211779 2019-12-06 07:37:42.864325 0
6.bc 11057 0 0 22114 0 37733931136 0 0 3032 3032 active+remapped+backfill_wait+backfill_toofull 2019-12-09 10:42:25.534277 13549'212110 13713:1229839 [15,25,17] 15 [19,18,17] 19 13549'211983 2019-12-08 11:02:45.846031 11644'211854 2019-12-06 06:22:43.565313 0
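
To dig into why exactly these two PGs are flagged, they can also be queried directly; the output includes the up/acting sets and the current recovery/backfill state (pg 6.212 taken from the dump above):

root@s1:~# ceph pg 6.212 query | less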

Any hints? I'm not worried, because I think the cluster will heal itself, but this behaviour seems neither clear nor logical to me.

--
*Simone Lazzaris*
*Qcom S.p.A.*
simone.lazzaris@xxxxxxx[1] | www.qcom.it[2]
* LinkedIn[3]* | *Facebook*[4]



--------
[1] mailto:simone.lazzaris@xxxxxxx
[2] https://www.qcom.it
[3] https://www.linkedin.com/company/qcom-spa
[4] http://www.facebook.com/qcomspa



_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


