Re: Adding OSD sometimes suspends cluster

Sam Lang <slang@xxxxxxxxxxx> · Thu, 4 Apr 2013 08:01:02 -0500

Hi Erdem,

This is likely a bug.  We've created a ticket to keep track:
http://tracker.ceph.com/issues/4645.

-slang [inktank dev | http://www.inktank.com | http://www.ceph.com]

On Mon, Apr 1, 2013 at 3:18 AM, Erdem Agaoglu <erdem.agaoglu@xxxxxxxxx> wrote:
> In addition, I was able to extract some logs from the last time the
> active/peering problem happened:
> http://pastebin.com/BakFREFP
> The log ends with me restarting the OSD.
>
>
> On Mon, Apr 1, 2013 at 10:23 AM, Erdem Agaoglu <erdem.agaoglu@xxxxxxxxx>
> wrote:
>>
>> Hi all,
>>
>> We are currently in the process of enlarging our bobtail cluster by
>> adding OSDs. We have 12 disks per machine and we are creating one OSD per
>> disk, adding them one by one as recommended. The only thing we don't do is
>> start with a small weight and increase it slowly. Weights are all 1.
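
For reference, the "start with a small weight and increase it slowly"
approach mentioned above could be sketched roughly as follows. This is only
an illustration: the helper name, the OSD id and the number of steps are
assumptions, and the script only prints `ceph osd crush reweight` commands
rather than running them. Waiting for the cluster to settle between steps is
left to the operator.

```python
def reweight_commands(osd_id, target=1.0, steps=5):
    """Build 'ceph osd crush reweight' commands that raise one OSD's CRUSH
    weight to `target` in `steps` equal increments.

    The step count is an arbitrary illustrative choice, not a recommendation.
    """
    if steps < 1:
        raise ValueError("steps must be at least 1")
    commands = []
    for i in range(1, steps + 1):
        # Round to avoid printing floating-point noise such as 0.6000000001.
        weight = round(target * i / steps, 4)
        commands.append("ceph osd crush reweight osd.%d %s" % (osd_id, weight))
    return commands


if __name__ == "__main__":
    # osd.12 is a made-up id used only for this example.
    for cmd in reweight_commands(12):
        print(cmd)
```

Each printed command would be issued only after the previous step's
backfill has finished, for example once all pgs report active+clean again.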
>>
>> In this scenario both rbd and radosgw are unresponsive, but only for the
>> first two minutes after adding a new OSD. After that small hiccup, we have
>> some pgs like active+remapped+wait_backfill, active+remapped+backfilling,
>> active+recovery_wait+remapped, active+degraded+remapped+backfilling and
>> everything works OK. After a few hours of backfilling and recovery all pgs
>> become active+clean and we add another OSD.
>>
>> But sometimes that small hiccup takes longer than a few minutes. At those
>> times the status shows some pgs stuck in active and some stuck in
>> peering. When we look at the pg dump we see that all those active or
>> peering pgs are on the same 2 OSDs and are unable to move forward. At this
>> stage rbd works poorly and radosgw is completely stalled. Only after
>> restarting one of those 2 OSDs do the pgs start to backfill and clients
>> continue with their operations.
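
The "same 2 OSDs" observation above can also be checked mechanically once
the stuck pgs and their OSDs have been copied out of the pg dump. The sketch
below assumes that mapping has already been extracted by hand into a dict;
the function name, the pg ids and the OSD ids are all made up for
illustration, and no particular pg dump output format is assumed.

```python
from collections import Counter


def stuck_pg_counts(stuck_pgs):
    """Count how many stuck pgs each OSD appears in.

    `stuck_pgs` maps a pg id (string) to the list of OSD ids it is on.
    Returns (osd_id, count) pairs, most frequent first, ties by OSD id.
    """
    counts = Counter()
    for osds in stuck_pgs.values():
        # Use a set so an OSD listed twice for one pg is counted once.
        counts.update(set(osds))
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    # Hypothetical sample data, not taken from the cluster in this thread.
    sample = {
        "3.1a": [4, 17, 9],
        "3.2f": [17, 4, 22],
        "5.0c": [4, 17],
    }
    for osd, count in stuck_pg_counts(sample):
        print("osd.%d appears in %d stuck pgs" % (osd, count))
```

OSDs whose count equals the total number of stuck pgs are the ones shared
by all of them, which narrows down which daemons to look at first.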
>>
>> Since this is a live cluster we don't want to wait too long, so we usually
>> restart the OSD in a hurry. That's why I cannot currently provide status or
>> pg query outputs. We have some logs, but I don't know what to look for or
>> whether they are verbose enough.
>>
>> Could this be a known issue? If not, where should I look to get an idea
>> of what's happening when it occurs?
>>
>> Thanks in advance
>>
>> --
>> erdem agaoglu
>
>
>
>
> --
> erdem agaoglu
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>