Re: OSD recovery stuck

On Thursday, June 27, 2013, Greg Chavez wrote:
We set up a small ceph cluster of three nodes on top of an OpenStack
deployment of three nodes (that is, each compute node was also an
OSD/MON node).  Worked great until we started to expand the ceph
cluster once the OSDs started to fill up.  I added 4 OSDs two days ago
and the recovery went smoothly.  I added another four last night, but
the recovery is stuck:

root@kvm-sn-14i:~# ceph -s
   health HEALTH_WARN 22 pgs backfill_toofull; 19 pgs degraded; 1 pgs
recovering; 23 pgs stuck unclean; recovery 157614/1775814 degraded
(8.876%);  recovering 2 o/s, 8864KB/s; 1 near full osd(s)
   monmap e1: 3 mons at
{kvm-cs-sn-10i=192.168.241.110:6789/0,kvm-cs-sn-14i=192.168.241.114:6789/0,kvm-cs-sn-15i=192.168.241.115:6789/0},
election epoch 42, quorum 0,1,2
kvm-cs-sn-10i,kvm-cs-sn-14i,kvm-cs-sn-15i
   osdmap e512: 30 osds: 27 up, 27 in
    pgmap v1474651: 448 pgs: 425 active+clean, 1
active+recovering+remapped, 3 active+remapped+backfill_toofull, 11
active+degraded+backfill_toofull, 8
active+degraded+remapped+backfill_toofull; 3414 GB data, 6640 GB used,
7007 GB / 13647 GB avail; 0B/s rd, 2363B/s wr, 0op/s; 157614/1775814
degraded (8.876%);  recovering 2 o/s, 8864KB/s
   mdsmap e1: 0/0/1 up

Even after restarting the OSDs, it hangs at 8.876%.  Consequently,
many of our virts have crashed.

I'm hoping someone on this list can provide some suggestions.
Otherwise, I may have to blow this up.  Thanks!

"Backfill_toofull"
Right now your OSDs are trying to move data around, but one or more of them is getting full, so the data transfer into those OSDs has been paused.
Now, given that all the PGs are active, the clients shouldn't really notice, but you may have hit an edge case we didn't account for. Do you have any logging enabled?
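
If it really is just the backfill-full threshold biting (osd_backfill_full_ratio defaults to 0.85), you can see exactly which OSDs and PGs are flagged with "ceph health detail" and, assuming the disks still have some genuine headroom, temporarily raise that ratio while the data shuffles around. For example (the 0.90 value is just illustrative, not a recommendation for your setup):

# nudge the backfill-full cutoff up from the default 0.85
ceph tell osd.* injectargs '--osd-backfill-full-ratio 0.90'

That only buys breathing room, though; the real fix is getting the data spread evenly across the new OSDs.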

Anyway, at a guess, your OSDs don't all have CRUSH weights proportional to their sizes. Check their disks and the output of "ceph osd tree" to make sure they match, and that the tree is set up properly compared to your crush map.
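
If the weights do turn out to be off, you can fix them in place and the cluster will rebalance; the OSD id and weight below are just placeholders for whatever looks wrong in your tree (the usual convention is a CRUSH weight roughly equal to the disk size in TB):

# weights as CRUSH sees them vs. the actual disks
ceph osd tree
df -h /var/lib/ceph/osd/ceph-*   # default data path; adjust to your mounts

# e.g. set a 2 TB OSD that was added with the wrong weight back to 2.0
ceph osd crush reweight osd.12 2.0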
-Greg


--
Software Engineer #42 @ http://inktank.com | http://ceph.com
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
