Re: OSD recovery stuck

On Thursday, June 27, 2013, Greg Chavez wrote:
We set up a small Ceph cluster of three nodes on top of an OpenStack
deployment of three nodes (that is, each compute node was also an
OSD/MON node).  It worked great until the OSDs started to fill up and
we began expanding the Ceph cluster.  I added four OSDs two days ago
and the recovery went smoothly.  I added another four last night, but
the recovery is stuck:

root@kvm-sn-14i:~# ceph -s
   health HEALTH_WARN 22 pgs backfill_toofull; 19 pgs degraded; 1 pgs
recovering; 23 pgs stuck unclean; recovery 157614/1775814 degraded
(8.876%);  recovering 2 o/s, 8864KB/s; 1 near full osd(s)
   monmap e1: 3 mons at
{kvm-cs-sn-10i=192.168.241.110:6789/0,kvm-cs-sn-14i=192.168.241.114:6789/0,kvm-cs-sn-15i=192.168.241.115:6789/0},
election epoch 42, quorum 0,1,2
kvm-cs-sn-10i,kvm-cs-sn-14i,kvm-cs-sn-15i
   osdmap e512: 30 osds: 27 up, 27 in
    pgmap v1474651: 448 pgs: 425 active+clean, 1
active+recovering+remapped, 3 active+remapped+backfill_toofull, 11
active+degraded+backfill_toofull, 8
active+degraded+remapped+backfill_toofull; 3414 GB data, 6640 GB used,
7007 GB / 13647 GB avail; 0B/s rd, 2363B/s wr, 0op/s; 157614/1775814
degraded (8.876%);  recovering 2 o/s, 8864KB/s
   mdsmap e1: 0/0/1 up

Even after restarting the OSDs, it hangs at 8.876%.  Consequently,
many of our virts have crashed.

I'm hoping someone on this list can provide some suggestions.
Otherwise, I may have to blow this up.  Thanks!

"Backfill_toofull"
Right now your OSDs are trying to move data around, but one or more of them is getting full, so Ceph has paused the data transfer.
Now, given that all the PGs are active, the clients shouldn't really be noticing, but you might have hit an edge case we didn't account for. Do you have any logging enabled?
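
In the meantime, if you want to un-stick the backfill yourself, the usual approach is to find out which OSD is near full and give the backfill a little more headroom. Roughly (the 0.90 value is only an example, and this assumes you're still on the default osd backfill full ratio of 0.85 and haven't touched the 0.95 full ratio):

  # list the near-full OSD(s) and the backfill_toofull PGs
  ceph health detail
  # compare each OSD's weight and utilization against its disk size
  ceph osd tree
  # give backfill a bit more headroom on one OSD (runtime only, not persistent)
  ceph tell osd.<id> injectargs '--osd-backfill-full-ratio 0.90'

Once the data has rebalanced you can drop the ratio back down, or just restart the OSDs, since injectargs doesn't survive a restart.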

Anyway, at a guess your OSDs don't all have weights proportional to their sizes. Check the disk sizes against the output of "ceph osd tree" to make sure they match, and that the tree is set up the way your CRUSH map intends.
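
If one of the new OSDs did come in with the wrong weight, "ceph osd crush reweight" will fix it in place. Purely as an illustration (osd.26 and 0.91 are made-up values; by convention the CRUSH weight is roughly the disk size in TB):

  # check the current weights first
  ceph osd tree
  # set the CRUSH weight of a hypothetical 1TB OSD
  ceph osd crush reweight osd.26 0.91

Expect some additional data movement once a weight changes, so you may want to wait until the current backfill has settled.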
-Greg


--
Software Engineer #42 @ http://inktank.com | http://ceph.com
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
