Re: two OSDs stuck on peering after starting an OSD for recovery

Dominik Mostowiec <dominikmostowiec@xxxxxxxxx> · Fri, 28 Jun 2013 23:24:39 +0200

Today I hit the peering problem not while marking osd.71 out, but during normal Ceph operation.

Regards
Dominik

2013/6/28 Andrey Korolyov <andrey@xxxxxxx>:
> There is almost the same problem with the 0.61 cluster, at least with the
> same symptoms. It can be reproduced quite easily: remove an osd and then
> mark it as out, and with quite high probability one of its neighbors will
> be stuck at the end of the peering process with a couple of peering pgs
> that have their primary copy on it. Such an osd process seems to be stuck
> in some kind of lock, eating exactly 100% of one core.
>
> On Thu, Jun 13, 2013 at 8:42 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
>> On Thu, Jun 13, 2013 at 6:33 AM, Sławomir Skowron <szibis@xxxxxxxxx> wrote:
>>> Hi, sorry for the late response.
>>>
>>> https://docs.google.com/file/d/0B9xDdJXMieKEdHFRYnBfT3lCYm8/view
>>>
>>> Logs from today are attached and also on Google Drive.
>>>
>>> https://docs.google.com/file/d/0B9xDdJXMieKEQzVNVHJ1RXFXZlU/view
>>>
>>> We had the same problem today. The new logs are on Google Drive under today's date.
>>>
>>> It is strange that the problematic osd.71 has about 10-15% more space used
>>> than the other osds in the cluster.
>>>
>>> Today, within one hour, osd.71 was reported as failed 3 times in the mon
>>> log, and after the third time recovery got stuck and many 500 errors
>>> appeared in the http layer on top of rgw. When it is stuck, restarting
>>> osd.71, osd.23, and osd.108, all from the stuck pg, helps, but I also ran
>>> a repair on this osd, just in case.
>>>
>>> My theory is that either this pg holds the rgw index of objects, or one of
>>> the osds in this pg has a problem with its local filesystem or the drive
>>> below it (the raid controller reports nothing about that), but I do not
>>> see any problem in the system.
>>>
>>> How can we find which pg/osd holds the index of objects for an rgw bucket?
>>
>> You can find the location of any named object by grabbing the OSD map
>> from the cluster and using the osdmaptool: "osdmaptool <mapfile>
>> --test-map-object <objname> --pool <poolid>".
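
A minimal sketch of the lookup described above, for readers who want to script it. It only builds and prints the two commands; it does not run them. The osdmaptool invocation is taken from the message itself. Fetching the map with "ceph osd getmap -o <mapfile>" is an assumption about how the map is grabbed, and the function name, object name, pool id, and map path are hypothetical placeholders.

```python
import shlex


def locate_object_commands(objname, pool_id, mapfile="/tmp/osdmap"):
    """Return argv lists for mapping an object name to its PG and OSDs.

    Hypothetical helper: nothing is executed here, the caller decides
    whether to run the commands on a node with admin access.
    """
    # Step 1 (assumed): dump the cluster's current OSD map to a local file.
    getmap = ["ceph", "osd", "getmap", "-o", mapfile]
    # Step 2 (from the thread): ask osdmaptool where the object maps to.
    test_map = [
        "osdmaptool", mapfile,
        "--test-map-object", objname,
        "--pool", str(pool_id),
    ]
    return [getmap, test_map]


if __name__ == "__main__":
    # "my-object" and pool id 3 are placeholders, not values from this thread.
    for argv in locate_object_commands("my-object", 3):
        print(" ".join(shlex.quote(arg) for arg in argv))
```

The printed lines can be pasted into a shell on a host that has the ceph and osdmaptool binaries. The name of the rgw bucket index object still has to be determined separately.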
>>
>> You're not providing any context for your issue though, so we really
>> can't help. What symptoms are you observing?
>> -Greg
>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

-- 
Regards
Dominik