Re: two osds stuck in peering after starting an osd to recover

I only have the 'ceph health detail' output from the previous crash.

ceph health detail
HEALTH_WARN 6 pgs peering; 9 pgs stuck unclean
pg 3.c62 is stuck unclean for 583.220063, current state active, last acting [57,23,51]
pg 4.269 is stuck unclean for 4842.519837, current state peering, last acting [23,57,106]
pg 3.26a is stuck unclean for 764.413502, current state peering, last acting [23,57,106]
pg 3.556 is stuck unclean for 888.097879, current state peering, last acting [108,57,14]
pg 4.555 is stuck unclean for 4842.518997, current state peering, last acting [108,57,14]
pg 3.e59 is stuck unclean for 1036.717811, current state active, last acting [57,8,108]
pg 3.78c is stuck unclean for 508.459454, current state peering, last acting [23,47,57]
pg 4.54c is stuck unclean for 4842.365307, current state active, last acting [57,108,23]
pg 3.ef0 is stuck unclean for 827.882363, current state active, last acting [57,23,117]
pg 3.78c is peering, acting [23,47,57]
pg 3.556 is peering, acting [108,57,14]
pg 4.555 is peering, acting [108,57,14]
pg 3.54d is peering, acting [57,108,23]
pg 4.269 is peering, acting [23,57,106]
pg 3.26a is peering, acting [23,57,106]

The 'ceph pg .. query' output taken just now for the stuck pg:
https://www.dropbox.com/s/xhdga2qvgygecav/query_pgid.txt.tar.gz
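
For reference, commands along these lines produce that output (a sketch following Sage's suggestion in the quoted mail below; the pgid 4.269 is just one of the stuck pgs from the listing above, and the output file name is only an example):

    # list the stuck/peering pgs and pick a pgid from the output
    ceph health detail

    # dump the full peering state of that pg to a file
    ceph pg 4.269 query > query_4.269.txt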

--
Regards
Dominik


2013/6/29 Sage Weil <sage@xxxxxxxxxxx>:
>> Version 0.56.6
>> Hmm, the osd did not die; 1 or more pgs are stuck peering on it.
>
> Can you get a pgid from 'ceph health detail' and then do 'ceph pg <pgid>
> query' and attach that output?
>
> Thanks!
> sage
>
>>
>> Regards
>> Dominik
>>
>> On Jun 28, 2013 11:28 PM, "Sage Weil" <sage@xxxxxxxxxxx> wrote:
>>       On Sat, 29 Jun 2013, Andrey Korolyov wrote:
>>       > There is almost the same problem with the 0.61 cluster, at
>>       > least with the same symptoms. It can be reproduced quite
>>       > easily - remove an osd and then mark it out, and with quite
>>       > high probability one of its neighbors will get stuck at the
>>       > end of the peering process with a couple of peering pgs
>>       > whose primary copy is on it. Such an osd process seems to
>>       > be stuck in some kind of lock, eating exactly 100% of one
>>       > core.
>>
>>       Which version?
>>       Can you attach with gdb and get a backtrace to see what it is
>>       chewing on?
>>
>>       Thanks!
>>       sage
>>
>>
>>       >
>>       > On Thu, Jun 13, 2013 at 8:42 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
>>       > > On Thu, Jun 13, 2013 at 6:33 AM, Sławomir Skowron <szibis@xxxxxxxxx> wrote:
>>       > >> Hi, sorry for the late response.
>>       > >>
>>       > >> https://docs.google.com/file/d/0B9xDdJXMieKEdHFRYnBfT3lCYm8/view
>>       > >>
>>       > >> Logs are in the attachment, and on Google Drive, from today.
>>       > >>
>>       > >> https://docs.google.com/file/d/0B9xDdJXMieKEQzVNVHJ1RXFXZlU/view
>>       > >>
>>       > >> We had this problem again today, and the new logs are on
>>       > >> Google Drive with today's date.
>>       > >>
>>       > >> What is strange is that the problematic osd.71 has about
>>       > >> 10-15% more space used than the other osds in the cluster.
>>       > >>
>>       > >> Today, within one hour, osd.71 failed 3 times in the mon
>>       > >> log; after the third failure, recovery got stuck and many
>>       > >> 500 errors appeared in the http layer on top of rgw. When
>>       > >> it is stuck, restarting osd.71, osd.23 and osd.108 (all
>>       > >> from the stuck pg) helps, but I also ran a repair on this
>>       > >> osd, just in case.
>>       > >>
>>       > >> I have a theory that the rgw index of objects is on this
>>       > >> pg, or that one of the osds in this pg has a problem with
>>       > >> the local filesystem or the drive below it (the raid
>>       > >> controller reports nothing about that), but I do not see
>>       > >> any problem in the system.
>>       > >>
>>       > >> How can we find in which pg/osd the index of objects in an
>>       > >> rgw bucket exists?
>>       > >
>>       > > You can find the location of any named object by grabbing
>>       > > the OSD map from the cluster and using the osdmaptool:
>>       > > "osdmaptool <mapfile> --test-map-object <objname> --pool <poolid>".
>>       > >
>>       > > You're not providing any context for your issue though, so
>>       > > we really can't help. What symptoms are you observing?
>>       > > -Greg
>>       > > Software Engineer #42 @ http://inktank.com | http://ceph.com
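
As a concrete example of the osdmaptool invocation Greg describes above (a sketch only: the map file path and pool id are placeholders, and <objname> would be the rgw bucket index object in question):

    # grab the current osd map from the cluster
    ceph osd getmap -o /tmp/osdmap

    # map a named object in pool 4 to its pg and acting osds
    osdmaptool /tmp/osdmap --test-map-object <objname> --pool 4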
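
For the gdb backtrace Sage asks for above, something along these lines should work (a sketch, assuming ceph debug symbols are installed; <pid> is the pid of the ceph-osd process that is spinning at 100% CPU):

    # find the pid of the busy ceph-osd process (via top, or:)
    pidof ceph-osd

    # attach non-interactively and dump backtraces of all threads
    gdb -p <pid> -batch -ex 'thread apply all bt' > osd-backtrace.txt 2>&1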



-- 
Regards
Dominik
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



