Re: emperor -> firefly 0.80.7 upgrade problem

Gregory Farnum <greg@xxxxxxxxxxx> · Wed, 5 Nov 2014 09:41:41 -0800



On Wed, Nov 5, 2014 at 7:24 AM, Chad Seys <cwseys@xxxxxxxxxxxxxxxx> wrote:
> Hi Sam,
>
>> Incomplete usually means the pgs do not have any complete copies.  Did
>> you previously have more osds?
>
> No.  But could OSDs quitting after hitting assert(0 == "we got a bad
> state machine event"), or interacting with kernel 3.14 clients, have caused
> the incomplete copies?
>
> How can I probe the fate of one of the incomplete PGs? e.g.
> pg 4.152 is incomplete, acting [1,11]
>
> Also, how can I investigate why one osd has a blocked request?  The hardware
> appears normal and the OSD is performing other requests like scrubs without
> problems.  From its log:
>
> 2014-11-05 00:57:26.870867 7f7686331700  0 log [WRN] : 1 slow requests, 1
> included below; oldest blocked for > 61440.449534 secs
> 2014-11-05 00:57:26.870873 7f7686331700  0 log [WRN] : slow request
> 61440.449534 seconds old, received at 2014-11-04 07:53:26.421301:
> osd_op(client.11334078.1:592 rb.0.206609.238e1f29.0000000752e8 [read 512~512]
> 4.17df39a7 RETRY=1 retry+read e115304) v4 currently reached pg
> 2014-11-05 00:57:31.816534 7f7665e4a700  0 -- 192.168.164.187:6800/7831 >>
> 192.168.164.191:6806/30336 pipe(0x44a98780 sd=89 :6800 s=0 pgs=0
> cs=0 l=0 c=0x42f482c0).accept connect_seq 14 vs existing 13 state standby
> 2014-11-05 00:59:10.749429 7f7666e5a700  0 -- 192.168.164.187:6800/7831 >>
> 192.168.164.191:6800/20375 pipe(0x44a99900 sd=169 :6800 s=2
> pgs=443 cs=29 l=0 c=0x42528b00).fault with nothing to send, going to standby
> 2014-11-05 01:02:09.746857 7f7664d39700  0 -- 192.168.164.187:6800/7831 >>
> 192.168.164.192:6802/9779 pipe(0x44a98280 sd=63 :6800 s=0 pgs=0
> cs=0 l=0 c=0x42f48c60).accept connect_seq 26 vs existing 25 state standby
>
> Greg, I attempted to copy/paste you the 'ceph scrub' output.  Did I get the
> relevant bits?

Looks like you provided the monitor log, which is actually distinct
from the central log. I don't think it matters, though; I was looking
for a very specific type of corruption that would have put them into a
HEALTH_WARN or HEALTH_ERR state if they detected it. At this point
Sam is going to be a lot more help than I am. :)
-Greg
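
For anyone triaging a similar "slow request" warning, here is a minimal
sketch (not part of the original exchange) that pulls the age, receive
time, and current state out of one such OSD log entry. The names
SLOW_RE and parse_slow_request are hypothetical, and the pattern is
inferred only from the single entry quoted above, so other Ceph
releases may format the line differently.

```python
import re
from datetime import datetime

# Pattern inferred from the one log entry quoted in this thread (an assumption,
# not a documented Ceph log format).
SLOW_RE = re.compile(
    r"slow request (?P<age>\d+(?:\.\d+)?) seconds old, "
    r"received at (?P<received>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+): "
    r"(?P<op>osd_op\(.*?\)) v\d+ currently (?P<state>.+)$"
)


def parse_slow_request(text):
    """Return a dict describing a slow-request entry, or None if none is found."""
    # Mail clients hard-wrap long log lines, so collapse all whitespace first.
    flat = " ".join(text.split())
    m = SLOW_RE.search(flat)
    if m is None:
        return None
    return {
        "age_hours": float(m["age"]) / 3600.0,
        "received": datetime.strptime(m["received"], "%Y-%m-%d %H:%M:%S.%f"),
        "op": m["op"],
        "state": m["state"],
    }


# The entry from the quoted log, with the mail quote markers removed.
SAMPLE = """2014-11-05 00:57:26.870873 7f7686331700  0 log [WRN] : slow request
61440.449534 seconds old, received at 2014-11-04 07:53:26.421301:
osd_op(client.11334078.1:592 rb.0.206609.238e1f29.0000000752e8 [read 512~512]
4.17df39a7 RETRY=1 retry+read e115304) v4 currently reached pg"""

if __name__ == "__main__":
    info = parse_slow_request(SAMPLE)
    print(info["state"], round(info["age_hours"], 2), info["received"])
```

Converting the age to hours makes it easier to see that this request has
been stuck for most of a day rather than being merely slow.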