Re: two osds stuck in peering after starting an osd for recovery

Sage Weil <sage@xxxxxxxxxxx> · Fri, 28 Jun 2013 15:16:54 -0700 (PDT)

> Version 0.56.6.
> Hmm, the osd has not died; one or more pgs are stuck in peering on it.

Can you get a pgid from 'ceph health detail' and then do 'ceph pg <pgid>  
query' and attach that output?
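
A minimal sketch of those two steps, in case it helps to script them. The
sample 'ceph health detail' text below is an assumption (the exact wording
differs between versions), and the helper names stuck_pgids and
query_commands are made up for illustration. It only builds the
'ceph pg <pgid> query' command lines; it does not run them.

```python
import re
import shlex

# Hypothetical sample of `ceph health detail` output. The exact wording
# varies between Ceph versions, so treat this text as an assumption.
SAMPLE = """\
HEALTH_WARN 2 pgs peering; 2 pgs stuck inactive
pg 3.1a7 is stuck inactive for 1820.53, current state peering, last acting [71,23,108]
pg 3.5c is stuck inactive for 1820.53, current state peering, last acting [71,40,12]
"""

# A pgid has the form "<pool number>.<hex seed>", for example 3.1a7.
PG_LINE = re.compile(r"^pg (\d+\.[0-9a-f]+) is stuck\b", re.MULTILINE)


def stuck_pgids(health_detail):
    """Return the unique pgids from 'pg ... is stuck' lines, in order."""
    seen = []
    for pgid in PG_LINE.findall(health_detail):
        if pgid not in seen:
            seen.append(pgid)
    return seen


def query_commands(pgids):
    """Build one 'ceph pg <pgid> query' argument list per pgid."""
    return [["ceph", "pg", pgid, "query"] for pgid in pgids]


if __name__ == "__main__":
    # Print the commands so they can be reviewed before being run by hand.
    for cmd in query_commands(stuck_pgids(SAMPLE)):
        print(" ".join(shlex.quote(arg) for arg in cmd))
```

In practice the SAMPLE string would be replaced with the real output of
'ceph health detail' from the affected cluster.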

Thanks!
sage

> 
> Regards
> Dominik
> 
> On Jun 28, 2013 11:28 PM, "Sage Weil" <sage@xxxxxxxxxxx> wrote:
>       On Sat, 29 Jun 2013, Andrey Korolyov wrote:
>       > There is almost the same problem with a 0.61 cluster, at least
>       > with the same symptoms. It can be reproduced quite easily: remove
>       > an osd and then mark it as out, and with quite high probability
>       > one of its neighbors will be stuck at the end of the peering
>       > process with a couple of peering pgs whose primary copy is on it.
>       > Such an osd process seems to be stuck in some kind of lock,
>       > eating exactly 100% of one core.
> 
>       Which version?
>       Can you attach with gdb and get a backtrace to see what it is
>       chewing on?
> 
>       Thanks!
>       sage
> 
> 
>       >
>       > On Thu, Jun 13, 2013 at 8:42 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
>       > > On Thu, Jun 13, 2013 at 6:33 AM, Sławomir Skowron <szibis@xxxxxxxxx> wrote:
>       > >> Hi, sorry for the late response.
>       > >>
>       > >> https://docs.google.com/file/d/0B9xDdJXMieKEdHFRYnBfT3lCYm8/view
>       > >>
>       > >> Logs are in the attachment, and on Google Drive, from today.
>       > >>
>       > >> https://docs.google.com/file/d/0B9xDdJXMieKEQzVNVHJ1RXFXZlU/view
>       > >>
>       > >> We had this problem again today, and the new logs are on Google
>       > >> Drive with today's date.
>       > >>
>       > >> Strangely, the problematic osd.71 has about 10-15% more space
>       > >> used than the other osds in the cluster.
>       > >>
>       > >> Today, within one hour, osd.71 failed 3 times according to the
>       > >> mon log, and after the third failure recovery got stuck, and
>       > >> many 500 errors appeared in the http layer on top of rgw. When
>       > >> it is stuck, restarting osd.71, osd.23, and osd.108, all from
>       > >> the stuck pg, helps, but I even ran a repair on this osd, just
>       > >> in case.
>       > >>
>       > >> I have a theory that this pg holds the rgw index of objects, or
>       > >> that one of the osds in this pg has some problem with its local
>       > >> filesystem or the drive below (the raid controller reports
>       > >> nothing about that), but I do not see any problem in the system.
>       > >>
>       > >> How can we find which pg/osd holds the index of objects for an
>       > >> rgw bucket?
>       > >
>       > > You can find the location of any named object by grabbing the
>       > > OSD map from the cluster and using the osdmaptool: "osdmaptool
>       > > <mapfile> --test-map-object <objname> --pool <poolid>".
>       > >
>       > > You're not providing any context for your issue though, so we
>       > > really can't help. What symptoms are you observing?
>       > > -Greg
>       > > Software Engineer #42 @ http://inktank.com | http://ceph.com
>       > > _______________________________________________
>       > > ceph-users mailing list
>       > > ceph-users@xxxxxxxxxxxxxx
>       > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>       >