Stuck/confused ceph cluster after physical migration of servers.

Hello everyone,

So: we have a Mimic cluster (on the most recent Mimic release) with 3 mons and 8 data nodes (160 OSDs in total).

Recently we had to physically migrate the cluster to a different location, and we had to do it in one go (partly because the new location does not currently have direct network routes to the old one, so moving it server by server was not possible).
The setup on the other side preserved the IP addresses and hostnames of all of the servers.

We followed the instructions here: 
https://ceph.io/planet/how-to-do-a-ceph-cluster-maintenance-shutdown/
to bring the system into a stable state for migration.
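
(For reference, and from memory, the procedure there essentially amounts to setting the usual maintenance flags before shutting everything down and clearing them again afterwards, roughly:

    ceph osd set noout
    ceph osd set norecover
    ceph osd set norebalance
    ceph osd set nobackfill
    ceph osd set nodown
    ceph osd set pause

followed by the matching "ceph osd unset ..." commands once all of the mons and OSDs had rejoined.)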

When we brought the system up again (again, following the above instructions), it seems to be in a weird state:

ceph health detail gives:


HEALTH_ERR 8862594/10690030 objects misplaced (82.905%); Degraded data redundancy: 571553/10690030 objects degraded (5.347%), 518 pgs degraded, 66 pgs undersized; Degraded data redundancy (low space): 30 pgs backfill_toofull; application not enabled on 3 pool(s)
OBJECT_MISPLACED 8862594/10690030 objects misplaced (82.905%)
PG_DEGRADED Degraded data redundancy: 571553/10690030 objects degraded (5.347%), 518 pgs degraded, 66 pgs undersized
    pg 11.70e is active+recovery_wait+degraded, acting [143,27,50,87,45,84,98,88,140,144]
    pg 11.711 is active+recovery_wait+degraded, acting [124,152,71,146,116,158,118,138,84,137]
    pg 11.712 is active+recovery_wait+degraded, acting [37,115,1,70,47,148,116,12,23,51]

    (snip a lot more pg 11.xxx entries which are in this state)

PG_DEGRADED_FULL Degraded data redundancy (low space): 30 pgs backfill_toofull
    pg 12.4e is active+remapped+backfill_wait+backfill_toofull, acting [103,49,81,111,86,33,7,109,65,60]
    pg 12.6b is active+remapped+backfill_wait+backfill_toofull, acting [130,101,5,45,40,9,93,119,128,145]
    pg 12.6f is active+remapped+backfill_wait+backfill_toofull, acting [99,69,18,86,28,3,100,159,127,80]
    pg 12.88 is active+remapped+backfill_wait+backfill_toofull, acting [102,20,37,150,12,135,149,18,159,10]
    pg 12.8a is active+remapped+backfill_wait+backfill_toofull, acting [144,39,157,145,4,153,129,100,150,131]
    
    (snip a lot more pg 12.xxx entries in this state)
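
I can pull more detail on individual PGs if that would help; I assume the relevant commands are along the lines of (using 12.4e below just as an example of one of the backfill_toofull PGs):

    ceph pg 12.4e query
    ceph pg dump_stuck undersized degraded
    ceph osd tree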


Confusingly, on the surface the cluster seems perfectly happy with the OSDs:

ceph osd status
+-----+-----------------------+-------+-------+--------+---------+--------+---------+-----------+
|  id |          host         |  used | avail | wr ops | wr data | rd ops | rd data |   state   |
+-----+-----------------------+-------+-------+--------+---------+--------+---------+-----------+
|  0  | localhost.localdomain |  518G | 10.1T |    0   |     0   |    0   |     0   | exists,up |
|  1  | localhost.localdomain |  519G | 10.1T |    0   |     0   |    0   |     0   | exists,up |
|  2  | localhost.localdomain |  513G | 10.1T |    0   |     0   |    0   |     0   | exists,up |
|  3  | localhost.localdomain |  520G | 10.1T |    0   |     0   |    0   |     0   | exists,up |
|  4  | localhost.localdomain |  517G | 10.1T |    0   |     0   |    0   |     0   | exists,up |
|  5  | localhost.localdomain |  517G | 10.1T |    0   |     0   |    0   |     0   | exists,up |
|  6  | localhost.localdomain |  515G | 10.1T |    0   |     0   |    0   |     0   | exists,up |
|  7  | localhost.localdomain |  515G | 10.1T |    0   |     0   |    0   |     0   | exists,up |
|  8  | localhost.localdomain |  517G | 10.1T |    0   |     0   |    0   |     0   | exists,up |
|  9  | localhost.localdomain |  515G | 10.1T |    0   |     0   |    0   |     0   | exists,up |
|  10 | localhost.localdomain |  518G | 10.1T |    0   |     0   |    0   |     0   | exists,up |

(snip the remaining OSDs, which all look much the same)

and all of the OSDs are between 513G and 526G used, so they are barely full, and all of them are marked "exists,up"; none of them is reporting any issues.
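
If per-OSD utilisation numbers would be useful, I believe the way to get them (together with where each OSD sits in the CRUSH tree) is something like:

    ceph osd df tree

and I can post that output as well.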

So: what has happened to the cluster, and how do I fix it? (How can PGs be backfill_toofull when all of the OSDs are more than 90% empty?)
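
My understanding is that backfill_toofull is decided against the ratios stored in the OSD map rather than against raw free space, so the next thing I was going to check is those ratios, presumably via something like:

    ceph osd dump | grep -i ratio

which (if I have this right) should print the full_ratio, backfillfull_ratio and nearfull_ratio currently in force.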

Any help understanding this would be appreciated.

Sam


