Hi Erdem,

This is likely a bug. We've created a ticket to keep track:
http://tracker.ceph.com/issues/4645.

-slang [inktank dev | http://www.inktank.com | http://www.ceph.com]

On Mon, Apr 1, 2013 at 3:18 AM, Erdem Agaoglu <erdem.agaoglu@xxxxxxxxx> wrote:
> In addition, I was able to extract some logs from the last time the
> active/peering problem happened.
> http://pastebin.com/BakFREFP
> It ends with me restarting it.
>
>
> On Mon, Apr 1, 2013 at 10:23 AM, Erdem Agaoglu <erdem.agaoglu@xxxxxxxxx>
> wrote:
>>
>> Hi all,
>>
>> We are currently in the process of enlarging our bobtail cluster by
>> adding OSDs. We have 12 disks per machine and we are creating one OSD per
>> disk, adding them one by one as recommended. The only thing we don't do is
>> start with a small weight and increase it slowly. Weights are all 1.
>>
>> In this scenario both rbd and radosgw are unable to respond, but only for
>> the first two minutes after adding a new OSD. After that small hiccup, we
>> have some pgs like active+remapped+wait_backfill,
>> active+remapped+backfilling, active+recovery_wait+remapped,
>> active+degraded+remapped+backfilling and everything works OK. After a few
>> hours of backfilling and recovery all pgs come active+clean and we add
>> another OSD.
>>
>> But sometimes that small hiccup takes longer than a few minutes. At those
>> times status shows some pgs are stuck in active and some are stuck in
>> peering. When we look at the pg dump we see all those active or peering pgs
>> are on the same 2 OSDs and are unable to move forward. At this stage rbd
>> works poorly and radosgw is completely stalled. Only after restarting one of
>> those 2 OSDs do the pgs start to backfill and clients continue with their
>> operations.
>>
>> Since this is a live cluster we don't want to wait too long and usually go
>> restart the OSD in a hurry. That's why I cannot currently provide status or
>> pg query outputs. We have some logs but I don't know what to look for or if
>> they are verbose enough.
>>
>> Can this be any kind of a known issue? If not, where should I look to get
>> any ideas about what's happening when it occurs?
>>
>> Thanks in advance
>>
>> --
>> erdem agaoglu
>
>
> --
> erdem agaoglu
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com