Sounds like you needed osd 20. You can mark osd 20 lost.
-Sam

On Wed, Nov 5, 2014 at 9:41 AM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
> On Wed, Nov 5, 2014 at 7:24 AM, Chad Seys <cwseys@xxxxxxxxxxxxxxxx> wrote:
>> Hi Sam,
>>
>>> Incomplete usually means the pgs do not have any complete copies. Did
>>> you previously have more osds?
>>
>> No. But could OSDs quitting after hitting assert(0 == "we got a bad
>> state machine event"), or interactions with kernel 3.14 clients, have
>> caused the incomplete copies?
>>
>> How can I probe the fate of one of the incomplete PGs? e.g.
>> pg 4.152 is incomplete, acting [1,11]
>>
>> Also, how can I investigate why one OSD has a blocked request? The
>> hardware appears normal and the OSD is performing other requests like
>> scrubs without problems. From its log:
>>
>> 2014-11-05 00:57:26.870867 7f7686331700 0 log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 61440.449534 secs
>> 2014-11-05 00:57:26.870873 7f7686331700 0 log [WRN] : slow request 61440.449534 seconds old, received at 2014-11-04 07:53:26.421301: osd_op(client.11334078.1:592 rb.0.206609.238e1f29.0000000752e8 [read 512~512] 4.17df39a7 RETRY=1 retry+read e115304) v4 currently reached pg
>> 2014-11-05 00:57:31.816534 7f7665e4a700 0 -- 192.168.164.187:6800/7831 >> 192.168.164.191:6806/30336 pipe(0x44a98780 sd=89 :6800 s=0 pgs=0 cs=0 l=0 c=0x42f482c0).accept connect_seq 14 vs existing 13 state standby
>> 2014-11-05 00:59:10.749429 7f7666e5a700 0 -- 192.168.164.187:6800/7831 >> 192.168.164.191:6800/20375 pipe(0x44a99900 sd=169 :6800 s=2 pgs=443 cs=29 l=0 c=0x42528b00).fault with nothing to send, going to standby
>> 2014-11-05 01:02:09.746857 7f7664d39700 0 -- 192.168.164.187:6800/7831 >> 192.168.164.192:6802/9779 pipe(0x44a98280 sd=63 :6800 s=0 pgs=0 cs=0 l=0 c=0x42f48c60).accept connect_seq 26 vs existing 25 state standby
>>
>> Greg, I attempted to copy/paste the 'ceph scrub' output for you. Did I
>> get the relevant bits?
>
> Looks like you provided the monitor log, which is actually distinct
> from the central log. I don't think it matters, though — I was looking
> for a very specific type of corruption that would have put them into a
> HEALTH_WARN or HEALTH_FAIL state if they detected it. At this point
> Sam is going to be a lot more help than I am. :)
> -Greg

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
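
A minimal sketch of Sam's suggestion, assuming osd.20 is the down OSD that
held the missing copies and that its data is accepted as unrecoverable:

    # Confirm osd.20 really is down/out before declaring it lost.
    ceph osd tree | grep osd.20

    # Tell the cluster the OSD is permanently gone so peering can proceed
    # without it. Objects whose only surviving copy was on osd.20 are lost.
    ceph osd lost 20 --yes-i-really-mean-it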
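
To probe the fate of an incomplete PG such as 4.152, a sketch of the usual
commands (the recovery_state field names are typical of releases from that
era and may vary):

    # Show which PGs are unhealthy and why.
    ceph health detail

    # Dump the PG's full peering state. In the recovery_state section, look
    # for entries like "down_osds_we_would_probe" and "peering_blocked_by"
    # to see which OSDs the PG still needs to hear from.
    ceph pg 4.152 query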
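
For the blocked request, one way to dig in is through the OSD's admin socket
on the host running the affected daemon (osd.N below is a placeholder;
substitute the real id):

    # List requests currently in flight, with each op's current state
    # (the slow op quoted above is stuck at "reached pg").
    ceph daemon osd.N dump_ops_in_flight

    # Show recently completed slow ops and the time spent at each step.
    ceph daemon osd.N dump_historic_ops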