Stuck pages and other bad things

I'm presuming this is the correct list (rather than the -devel list); please correct me if I'm wrong there.

I set up Ceph (0.56.4) a few months ago with two disk servers and one dedicated monitor host. The disk servers also run monitors, so there are a total of 3 monitors for the cluster. Each of the disk servers has 8 OSDs.

I didn't actually save a 'ceph osd tree' output from back then, but cutting and pasting the relevant parts from what I have now, it probably looked like this:

# id weight type name up/down reweight
-1 16 root default
-3 16 rack unknownrack
-2 0 host leviathan
100 1 osd.100 up 1
101 1 osd.101 up 1
102 1 osd.102 up 1
103 1 osd.103 up 1
104 1 osd.104 up 1
105 1 osd.105 up 1
106 1 osd.106 up 1
107 1 osd.107 up 1
-4 8 host minotaur
200 1 osd.200 up 1
201 1 osd.201 up 1
202 1 osd.202 up 1
203 1 osd.203 up 1
204 1 osd.204 up 1
205 1 osd.205 up 1
206 1 osd.206 up 1
207 1 osd.207 up 1

A couple of weeks ago, for valid reasons that aren't relevant here, we decided to repurpose one of the disk servers (leviathan) and replace its role in the Ceph cluster with some other hardware. I created a new server (aergia). That changed the 'ceph osd tree' to this:

# id weight type name up/down reweight
-1 16 root default
-3 16 rack unknownrack
-2 0 host leviathan
100 1 osd.100 up 1
101 1 osd.101 up 1
102 1 osd.102 up 1
103 1 osd.103 up 1
104 1 osd.104 up 1
105 1 osd.105 up 1
106 1 osd.106 up 1
107 1 osd.107 up 1
-4 8 host minotaur
200 1 osd.200 up 1
201 1 osd.201 up 1
202 1 osd.202 up 1
203 1 osd.203 up 1
204 1 osd.204 up 1
205 1 osd.205 up 1
206 1 osd.206 up 1
207 1 osd.207 up 1
0 1 osd.0 up 1
1 1 osd.1 up 1
2 1 osd.2 up 1
3 1 osd.3 up 1
4 1 osd.4 up 1
5 1 osd.5 up 1
6 1 osd.6 up 1
7 1 osd.7 up 1

Everything was looking happy, so I began removing the OSDs on leviathan. That's when the problems started. 'ceph health detail' shows several placement groups (PGs) that either existed only on that disk server, e.g.
pg 0.312 is stuck unclean since forever, current state stale+active+degraded+remapped, last acting [103]
or PGs that were only replicated onto OSDs on that same host, e.g.
pg 0.2f4 is stuck unclean since forever, current state stale+active+remapped, last acting [106,101]
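
For context, the removal procedure I was following for each OSD was, as best I remember, the standard manual sequence; commands reconstructed from memory, with osd.100 just as an example:

    ceph osd out 100
    service ceph stop osd.100        # on leviathan itself (or however your init system stops the daemon)
    ceph osd crush remove osd.100
    ceph auth del osd.100
    ceph osd rm 100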

I brought leviathan back up, and I *think* everything is at least responding now. But 'ceph health' still shows
HEALTH_WARN 302 pgs degraded; 810 pgs stale; 810 pgs stuck stale; 3562 pgs stuck unclean; recovery 44951/2289634 degraded (1.963%)
...and it's been stuck there for a long time.
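
If more detail would help, I can post the output of something like the following for any of the stuck PGs (using 0.312 from the health output above as an example):

    ceph pg dump_stuck stale
    ceph pg dump_stuck unclean
    ceph pg map 0.312        # up/acting OSD sets for that PG
    ceph pg 0.312 query      # full state of that PG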

So my question is, how do I force data off the to-be-decommissioned server safely and get back to "HEALTH_OK"?
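
My naive guess is that I should first weight the remaining leviathan OSDs out of CRUSH so the data drains off them, wait for recovery to finish, and only then remove them; something like the following (osd.100 as an example, and very much a guess rather than something I've run):

    ceph osd crush reweight osd.100 0    # guess: migrate data off this OSD before removing it
    ceph -w                              # watch until the affected PGs are active+clean

...but I'd appreciate confirmation, or a better procedure, before I do that across all eight OSDs.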

