Re: two osds stuck in peering after starting an osd for recovery

It's not a loop as it first looked, sorry - I have reproduced the issue
many times and there is no such cpu-eating behavior in most cases; only
the locked pgs are present. Also, the 'wrong down mark' bug seems to
have returned, at least in the 0.61.4 tag. For the first issue I'll
send a link to a core dump as soon as I can reproduce it on my test
environment; the second one is tied to 100% disk utilization, so I'm
not sure whether that behavior is right or wrong.
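
For the backtrace, something along these lines should do (a sketch,
assuming gdb and the ceph debug symbols are installed; the osd id
below is only an example):

  # find the pid of the suspect osd daemon
  pid=$(pgrep -f 'ceph-osd -i 12')
  # dump backtraces of all threads without an interactive prompt
  gdb -p "$pid" --batch -ex 'thread apply all bt'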

On Sat, Jun 29, 2013 at 1:28 AM, Sage Weil <sage@xxxxxxxxxxx> wrote:
> On Sat, 29 Jun 2013, Andrey Korolyov wrote:
>> There is almost the same problem on a 0.61 cluster, at least with the
>> same symptoms. It can be reproduced quite easily: remove an osd and
>> then mark it as out, and with quite high probability one of its
>> neighbors will get stuck at the end of the peering process, with a
>> couple of peering pgs whose primary copy is on it. The stuck osd
>> process seems to be caught in some kind of lock, eating exactly 100%
>> of one core.
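>>
>> For reference, the reproduction boils down to something like this (a
>> sketch; the osd id is only an example, and the sysvinit service
>> syntax is assumed):
>>
>>   service ceph stop osd.12   # stop the osd daemon
>>   ceph osd out 12            # mark it out so peering/recovery kicks in
>>
>> Then watch 'ceph -s' until a neighboring osd gets stuck in peering.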
>
> Which version?
> Can you attach with gdb and get a backtrace to see what it is chewing on?
>
> Thanks!
> sage
>
>
>>
>> On Thu, Jun 13, 2013 at 8:42 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
>> > On Thu, Jun 13, 2013 at 6:33 AM, Sławomir Skowron <szibis@xxxxxxxxx> wrote:
>> >> Hi, sorry for the late response.
>> >>
>> >> https://docs.google.com/file/d/0B9xDdJXMieKEdHFRYnBfT3lCYm8/view
>> >>
>> >> Logs are in the attachment and on Google Drive, from today.
>> >>
>> >> https://docs.google.com/file/d/0B9xDdJXMieKEQzVNVHJ1RXFXZlU/view
>> >>
>> >> We hit this problem again today; the new logs on Google Drive carry today's date.
>> >>
>> >> Strangely, the problematic osd.71 uses about 10-15% more space than
>> >> the other osds in the cluster.
>> >>
>> >> Today, within one hour, osd.71 failed 3 times in the mon log, and
>> >> after the third failure recovery got stuck and many 500 errors
>> >> appeared in the http layer on top of rgw. When it's stuck, restarting
>> >> osd.71, osd.23, and osd.108 (all from the stuck pg) helps, but I even
>> >> ran a repair on this osd, just in case.
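>> >>
>> >> For the record, the workaround is roughly this (a sketch; osd.71 is
>> >> one of the osds from the stuck pg, and the sysvinit service syntax
>> >> is assumed):
>> >>
>> >>   ceph health detail            # shows which pgs are stuck
>> >>   ceph pg dump_stuck inactive   # lists stuck pgs and their osds
>> >>   service ceph restart osd.71   # repeat for each osd in the stuck pg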
>> >>
>> >> My theory is that this pg holds the rgw index of objects, or that one
>> >> of the osds in this pg has a problem with its local filesystem or the
>> >> drive below it (the raid controller reports nothing), but I do not
>> >> see any problem in the system.
>> >>
>> >> How can we find out which pg/osd holds the index of objects for an rgw bucket?
>> >
>> > You can find the location of any named object by grabbing the OSD map
>> > from the cluster and using the osdmaptool: "osdmaptool <mapfile>
>> > --test-map-object <objname> --pool <poolid>".
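>> >
>> > For example, fetching the current map and testing a (hypothetical)
>> > bucket index object could look like this; the object name and pool
>> > id below are only placeholders:
>> >
>> >   ceph osd getmap -o /tmp/osdmap
>> >   osdmaptool /tmp/osdmap --test-map-object .dir.default.4567.1 --pool 11
>> >
>> > The output shows the pg id and the acting set of osds for the object.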
>> >
>> > You're not providing any context for your issue though, so we really
>> > can't help. What symptoms are you observing?
>> > -Greg
>> > Software Engineer #42 @ http://inktank.com | http://ceph.com
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



