Re: Adding OSD sometimes suspends cluster

Erdem Agaoglu <erdem.agaoglu@xxxxxxxxx> · Thu, 4 Apr 2013 17:35:59 +0300

Thanks Sam,
I'll provide details if it keeps happening.

On Thu, Apr 4, 2013 at 4:01 PM, Sam Lang <slang@xxxxxxxxxxx> wrote:

Hi Erdem,

This is likely a bug. We've created a ticket to keep track of it:

http://tracker.ceph.com/issues/4645

-slang [inktank dev | http://www.inktank.com | http://www.ceph.com]

On Mon, Apr 1, 2013 at 3:18 AM, Erdem Agaoglu <erdem.agaoglu@xxxxxxxxx> wrote:

> In addition, I was able to extract some logs from the last time the
> active/peering problem happened.
> http://pastebin.com/BakFREFP
> It ends with me restarting it.
>
> On Mon, Apr 1, 2013 at 10:23 AM, Erdem Agaoglu <erdem.agaoglu@xxxxxxxxx>
> wrote:
>>
>> Hi all,
>>
>> We are currently in the process of enlarging our bobtail cluster by
>> adding OSDs. We have 12 disks per machine and we are creating one OSD per
>> disk, adding them one by one as recommended. The only thing we don't do is
>> start with a small weight and increase it slowly. Weights are all 1.
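
For reference, the "small weight, increased slowly" approach mentioned here
could look roughly like the sketch below. It is not from this thread: the
helper name reweight_steps, the step size of 0.2 and the OSD id 12 are
assumptions for illustration. It only builds `ceph osd crush reweight`
command lines and does not run anything.

```python
def reweight_steps(osd_id, target=1.0, step=0.2):
    """Return 'ceph osd crush reweight' commands that raise the CRUSH
    weight of one OSD from `step` up to `target` in equal increments."""
    if step <= 0 or target <= 0:
        raise ValueError("step and target must be positive")
    commands = []
    n = 1
    while True:
        # Multiply instead of accumulating, to avoid float drift.
        weight = min(round(n * step, 4), target)
        commands.append("ceph osd crush reweight osd.%d %g" % (osd_id, weight))
        if weight >= target:
            return commands
        n += 1


if __name__ == "__main__":
    # osd.12 is a made-up id; substitute the OSD that was just added.
    for cmd in reweight_steps(12):
        print(cmd)
```

The idea would be to run one command at a time and wait for all pgs to
return to active+clean before issuing the next one.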

>>
>> In this scenario both rbd and radosgw are unable to respond, but only for
>> the first two minutes after adding a new OSD. After that small hiccup, we
>> have some pgs in states like active+remapped+wait_backfill,
>> active+remapped+backfilling, active+recovery_wait+remapped and
>> active+degraded+remapped+backfilling, and everything works OK. After a few
>> hours of backfilling and recovery all pgs become active+clean and we add
>> another OSD.
>>
>> But sometimes that small hiccup takes longer than a few minutes. At those
>> times the status shows some pgs stuck in active and some stuck in
>> peering. When we look at the pg dump we see that all those active or
>> peering pgs are on the same 2 OSDs and are unable to move forward. At this
>> stage rbd works poorly and radosgw is completely stalled. Only after
>> restarting one of those 2 OSDs do the pgs start to backfill and clients
>> continue with their operations.
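
The "all stuck pgs are on the same 2 OSDs" check described here can be done
mechanically. A minimal sketch follows, assuming the pg dump has already
been reduced to records with 'pgid', 'state' and 'acting' fields. That
record shape, the function name and the sample data are assumptions, and
the actual dump layout differs between Ceph releases.

```python
from collections import Counter


def osds_hosting_stuck_pgs(pg_records, stuck_states=("active", "peering")):
    """Count, per OSD id, how many PGs in one of `stuck_states` it hosts.

    `pg_records` is an iterable of dicts with the (assumed) keys
    'state' (exact state string) and 'acting' (list of OSD ids).
    """
    counts = Counter()
    for pg in pg_records:
        # Exact match: 'active+clean' or 'active+remapped+backfilling'
        # are making progress and are deliberately not counted.
        if pg["state"] in stuck_states:
            counts.update(pg["acting"])
    return counts.most_common()


if __name__ == "__main__":
    sample = [  # made-up records, for illustration only
        {"pgid": "3.1a", "state": "peering", "acting": [4, 7]},
        {"pgid": "3.2b", "state": "active", "acting": [7, 4]},
        {"pgid": "3.3c", "state": "active+clean", "acting": [1, 2]},
        {"pgid": "3.4d", "state": "active+remapped+backfilling", "acting": [4, 9]},
    ]
    print(osds_hosting_stuck_pgs(sample))
```

If only a couple of OSD ids come out with high counts, those are the
candidates to look at (or, as above, to restart).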

>>
>> Since this is a live cluster we don't want to wait too long and usually
>> restart the OSD in a hurry. That's why I cannot currently provide status or
>> pg query outputs. We have some logs but I don't know what to look for or
>> whether they are verbose enough.
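
One way to keep the status and pg query outputs without delaying the
restart much is to script the capture. The following is only a sketch: the
function names, the file naming scheme and the pg ids in the usage note are
assumptions. It shells out to `ceph -s`, `ceph pg dump` and
`ceph pg <pgid> query` and saves each output to a file.

```python
import subprocess
import time


def diagnostic_commands(pgids):
    """Build the argv lists to capture: cluster status, full pg dump,
    and one 'pg query' per suspicious placement group id."""
    commands = [["ceph", "-s"], ["ceph", "pg", "dump"]]
    commands += [["ceph", "pg", pgid, "query"] for pgid in pgids]
    return commands


def capture(pgids, prefix="ceph-diag"):
    """Run each command and save its output to a timestamped file."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    for i, argv in enumerate(diagnostic_commands(pgids)):
        path = "%s-%s-%02d.txt" % (prefix, stamp, i)
        with open(path, "wb") as out:
            # Record which command produced this file.
            out.write((" ".join(argv) + "\n").encode())
            out.flush()
            # stderr is kept too, since errors are also useful evidence.
            subprocess.call(argv, stdout=out, stderr=subprocess.STDOUT)
```

Calling capture(["3.1a", "3.2b"]) (hypothetical pg ids) right before
restarting the OSD would leave one text file per command in the current
directory, which can then be attached to the ticket.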

>>
>> Could this be a known issue? If not, where should I look to get
>> an idea of what is happening when it occurs?
>>
>> Thanks in advance
>>
>> --
>> erdem agaoglu
>
> --
> erdem agaoglu
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

-- 
erdem agaoglu
