Adding OSD sometimes suspends cluster

Erdem Agaoglu <erdem.agaoglu@xxxxxxxxx> · Mon, 1 Apr 2013 10:23:40 +0300

Hi all,
We are currently in the process of enlarging our bobtail cluster by adding OSDs. We have 12 disks per machine and we create one OSD per disk, adding them one by one as recommended. The only thing we don't do is start with a small weight and increase it slowly. All weights are 1.
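For reference, the gradual-weight approach we are skipping could be sketched roughly as below. This is only an illustration: the step size of 0.2, the OSD id 12, and the helper names are assumptions, and the script only prints `ceph osd crush reweight` commands rather than running them, since each step should wait for the cluster to settle first.

```python
import math


def weight_steps(target=1.0, step=0.2):
    """Return the increasing CRUSH weights to apply, ending exactly at target."""
    if target <= 0 or step <= 0:
        raise ValueError("target and step must be positive")
    # Round the ratio first so float noise does not add a spurious extra step.
    count = math.ceil(round(target / step, 9))
    return [min(target, round(step * i, 4)) for i in range(1, count + 1)]


def reweight_commands(osd_id, target=1.0, step=0.2):
    """Build one 'ceph osd crush reweight' command line per weight step."""
    return [
        f"ceph osd crush reweight osd.{osd_id} {weight:g}"
        for weight in weight_steps(target, step)
    ]


if __name__ == "__main__":
    # Hypothetical OSD id; run each command only after the previous
    # round of backfilling has finished.
    for command in reweight_commands(12):
        print(command)
```

The commands are printed instead of executed so that an operator can check the cluster state between steps.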

In this scenario both rbd and radosgw are unresponsive, but only for the first two minutes after adding a new OSD. After that small hiccup, we have some pgs in states like active+remapped+wait_backfill, active+remapped+backfilling, active+recovery_wait+remapped, active+degraded+remapped+backfilling, and everything works OK. After a few hours of backfilling and recovery all pgs become active+clean and we add another OSD.

But sometimes that small hiccup lasts longer than a few minutes. At those times the status shows that some pgs are stuck in active and some are stuck in peering. When we look at the pg dump, we see that all of those active or peering pgs are on the same 2 OSDs and are unable to move forward. At this stage rbd works poorly and radosgw is completely stalled. Only after restarting one of those 2 OSDs do the pgs start to backfill and clients continue with their operations.
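The "same 2 OSDs" check we do by eye on the pg dump could be sketched as follows. It assumes the stuck pgs and their acting sets have already been copied out of the dump into a dictionary; the pg ids and OSD numbers shown are made up for illustration, and the function name is hypothetical.

```python
from collections import Counter


def osds_shared_by_stuck_pgs(acting_sets):
    """Return the OSD ids that appear in the acting set of every stuck pg."""
    if not acting_sets:
        return []
    # Count each OSD at most once per pg.
    counts = Counter(
        osd for osds in acting_sets.values() for osd in set(osds)
    )
    total = len(acting_sets)
    return sorted(osd for osd, seen in counts.items() if seen == total)


if __name__ == "__main__":
    # Hypothetical data: pg id -> acting set, as read from 'ceph pg dump'.
    stuck = {
        "3.1a": [14, 27],
        "3.2f": [27, 14],
        "4.0c": [14, 27, 5],
    }
    print(osds_shared_by_stuck_pgs(stuck))
```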

Since this is a live cluster, we don't want to wait too long and usually restart the OSD in a hurry. That's why I cannot currently provide status or pg query output. We have some logs, but I don't know what to look for or whether they are verbose enough.
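One way to grab that output quickly before restarting might look like the sketch below. It is a rough outline under assumptions: the choice of commands (`ceph -s`, `ceph health detail`, `ceph pg dump_stuck`, `ceph pg <pgid> query`) is my guess at what would be useful, the 30-second timeout and the file layout are arbitrary, and the helper names are hypothetical. The timeout is there because a query against a pg that is stuck peering may not return promptly.

```python
import subprocess
import time
from pathlib import Path


def diagnostic_commands(pgids):
    """List the ceph commands to capture, plus one query per stuck pg."""
    commands = [
        ["ceph", "-s"],
        ["ceph", "health", "detail"],
        ["ceph", "pg", "dump_stuck", "inactive"],
        ["ceph", "pg", "dump_stuck", "unclean"],
    ]
    commands += [["ceph", "pg", pgid, "query"] for pgid in pgids]
    return commands


def run_ceph(argv, timeout=30):
    """Run one command and return its combined output, or a failure note."""
    try:
        result = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"failed: {exc}\n"
    return result.stdout + result.stderr


def capture(outdir, pgids=(), runner=run_ceph):
    """Write each command's output to its own file in a timestamped directory."""
    target = Path(outdir) / time.strftime("%Y%m%d-%H%M%S")
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for argv in diagnostic_commands(pgids):
        path = target / ("_".join(argv).replace("/", "-") + ".txt")
        path.write_text(runner(argv))
        written.append(path)
    return written


if __name__ == "__main__":
    # Hypothetical pg id and output directory.
    for path in capture("ceph-diag", pgids=["3.1a"]):
        print(path)
```

Running it once at the start of a stall and then restarting the OSD would leave a directory of files that can be attached to a follow-up mail.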

Could this be a known issue? If not, where should I look to get an idea of what is happening when it occurs?

Thanks in advance

-- 
erdem agaoglu
