I'm still trying to find a way to reactivate this one pg which is
incomplete. There are a lot of past intervals in its history, the result
of a peering storm a couple of weeks ago combined with min_size being
set too low for safety. At this point I think there is no chance of
bringing back the full set of most recent osds, so I'd like to
understand the process for rolling back to an earlier interval, no
matter how long ago it was.
My understanding is that the process is to set
osd_find_best_info_ignore_history_les=1 for the primary osd, so
something like:
ceph tell osd.448 injectargs '--osd_find_best_info_ignore_history_les=1'
then mark that osd down to make it re-peer. But whenever I have tried
this, the pg never becomes active again. Possibly I have misunderstood
or am doing something else wrong...
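For reference, the full sequence I have been attempting looks roughly
like this (just a sketch, using 70.82d from the message below as the
example pgid and assuming osd.448 is still the acting primary; I revert
the option afterwards):

  ceph tell osd.448 injectargs '--osd_find_best_info_ignore_history_les=1'
  # confirm on the osd's host, via the admin socket, that the option took effect
  ceph daemon osd.448 config get osd_find_best_info_ignore_history_les
  # mark the primary down so the pg goes through peering again
  ceph osd down 448
  # then watch whether the peering blocker clears
  ceph pg 70.82d query | grep -A3 peering_blocked_by
  # and revert the option once done
  ceph tell osd.448 injectargs '--osd_find_best_info_ignore_history_les=0'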
Output from pg query is here, if it adds any insight...
https://gist.githubusercontent.com/gtallan/e72b4461fb315983ae9a62cbbcd851d4/raw/0d30ceb315dd5567cb05fd0dc3e2e2c4975d8c01/pg70.b1c-query.txt
(Out of curiosity, is there any way to relate the first and last numbers
in an interval to an actual timestamp?)
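I assume they are osdmap epochs; the best I've come up with is dumping
the map for a specific epoch, which includes a "modified" timestamp,
though the mons only keep a limited range of old maps so this may not
reach back far enough, e.g.:

  ceph osd dump 704439 | grep -E '^epoch|^created|^modified'

...where 704439 is just one of the epochs taken from the query output.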
Thanks,
Graham
On 10/03/2018 12:18 PM, Graham Allan wrote:
Following on from my previous adventure with recovering pgs in the face
of failed OSDs, I now have my EC 4+2 pool operating with min_size=5,
which is as things should be.
However, I have one pg which is stuck in state remapped+incomplete
because only 4 of its 6 osds are running, and I have been unable to
bring the missing two back into service.
PG_AVAILABILITY Reduced data availability: 1 pg inactive, 1 pg incomplete
    pg 70.82d is remapped+incomplete, acting [2147483647,2147483647,190,448,61,315] (reducing pool .rgw.buckets.ec42 min_size from 5 may help; search ceph.com/docs for 'incomplete')
I don't think I want to do anything with min_size, as that would leave
all the other pgs vulnerable to running dangerously undersized (unless
there is some way to force that state for only a single pg). It seems to
me that with 4 of 6 osds available, it should be possible to force ceph
to select one or two new osds and rebalance this pg onto them?
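For reference, the change the health message is hinting at would be
pool-wide, which is exactly why I'm reluctant; as far as I know it can't
be scoped to a single pg. It would be something like:

  ceph osd pool set .rgw.buckets.ec42 min_size 4
  # ...and back to 5 once the pg recovers
  ceph osd pool set .rgw.buckets.ec42 min_size 5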
ceph pg query gives me (snippet):
"down_osds_we_would_probe": [
98,
233,
238,
239
],
"peering_blocked_by": [],
"peering_blocked_by_detail": [
{
"detail": "peering_blocked_by_history_les_bound"
}
]
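That snippet comes from the peering entry under recovery_state in the
query output; for what it's worth, I pulled it out with something like
the following (assuming jq is available and that I'm reading the layout
right):

  ceph pg 70.82d query | jq '.recovery_state[] | select(has("down_osds_we_would_probe"))'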
Of those osds (98, 233, 238, 239), osd 98 appears to have a corrupt xfs
filesystem. osd 239 was the original osd to hold a shard of this pg, but
it would not keep running, exiting with:
/build/ceph-12.2.7/src/osd/ECBackend.cc: 619: FAILED assert(pop.data.length() == sinfo.aligned_logical_offset_to_chunk_offset(after_progress.data_recovered_to - op.recovery_progress.data_recovered_to))
osds 233 and 238 were otherwise-evacuated (weight 0) osds to which I
imported the pg shard from osd 239 (using ceph-objectstore-tool), after
which they crash with the same assert. More specifically, they seem to
crash in the same way each time the pg becomes active and starts to
backfill, on the same object:
-9> 2018-10-03 11:30:28.174586 7f94ce9c4700 5 osd.233 pg_epoch: 704441 pg[70.82ds1( v 704329'703106 (586066'698574,704329'703106] local-lis/les=704439/704440 n=102585 ec=21494/21494 lis/c 704439/588565 les/c/f 704440/588566/0 680666/704439/704439) [820,761,105,789,562,485]/[2147483647,233,190,448,61,315]p233(1) r=1 lpr=704439 pi=[21494,704439)/4 rops=1 bft=105(2),485(5),562(4),761(1),789(3),820(0) crt=704329'703106 lcod 0'0 mlcod 0'0 active+undersized+remapped+backfilling] backfill_pos is 70:b415ca14:::default.630943.7__shadow_Barley_GC_Project%2fBarley_GC_Project%2fRawdata%2fReads%2fCZOA.6150.7.38741.TGCTGG.fastq.gz.2~Vn8g0rMwpVY8eaW83TDzJ2mczLXAl3z.3_24:head
-8> 2018-10-03 11:30:28.174887 7f94ce9c4700 1 -- 10.31.0.1:6854/2210291 --> 10.31.0.1:6854/2210291 -- MOSDECSubOpReadReply(70.82ds1 704441/704439 ECSubReadReply(tid=1, attrs_read=0)) v2 -- 0x7f9500472280 con 0
-7> 2018-10-03 11:30:28.174902 7f94db9de700 1 -- 10.31.0.1:6854/2210291 <== osd.233 10.31.0.1:6854/2210291 0 ==== MOSDECSubOpReadReply(70.82ds1 704441/704439 ECSubReadReply(tid=1, attrs_read=0)) v2 ==== 0+0+0 (0 0 0) 0x7f9500472280 con 0x7f94fb72b000
-6> 2018-10-03 11:30:28.176267 7f94ead66700 5 -- 10.31.0.1:6854/2210291 >> 10.31.0.4:6880/2181727 conn(0x7f94ff2a6000 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=946 cs=1 l=0). rx osd.61 seq 9 0x7f9500472500 MOSDECSubOpReadReply(70.82ds1 704441/704439 ECSubReadReply(tid=1, attrs_read=0)) v2
-5> 2018-10-03 11:30:28.176281 7f94ead66700 1 -- 10.31.0.1:6854/2210291 <== osd.61 10.31.0.4:6880/2181727 9 ==== MOSDECSubOpReadReply(70.82ds1 704441/704439 ECSubReadReply(tid=1, attrs_read=0)) v2 ==== 786745+0+0 (875698380 0 0) 0x7f9500472500 con 0x7f94ff2a6000
-4> 2018-10-03 11:30:28.177723 7f94ead66700 5 -- 10.31.0.1:6854/2210291 >> 10.31.0.9:6920/13427 conn(0x7f94ff2bc800 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=46152 cs=1 l=0). rx osd.448 seq 8 0x7f94fe9d5980 MOSDECSubOpReadReply(70.82ds1 704441/704439 ECSubReadReply(tid=1, attrs_read=0)) v2
-3> 2018-10-03 11:30:28.177738 7f94ead66700 1 -- 10.31.0.1:6854/2210291 <== osd.448 10.31.0.9:6920/13427 8 ==== MOSDECSubOpReadReply(70.82ds1 704441/704439 ECSubReadReply(tid=1, attrs_read=0)) v2 ==== 786745+0+0 (2772477454 0 0) 0x7f94fe9d5980 con 0x7f94ff2bc800
-2> 2018-10-03 11:30:28.185788 7f94ea565700 5 -- 10.31.0.1:6854/2210291 >> 10.31.0.7:6868/2012671 conn(0x7f94ff5c3800 :6854 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=4193 cs=1 l=0). rx osd.190 seq 10 0x7f9500472780 MOSDECSubOpReadReply(70.82ds1 704441/704439 ECSubReadReply(tid=1, attrs_read=0)) v2
-1> 2018-10-03 11:30:28.185815 7f94ea565700 1 -- 10.31.0.1:6854/2210291 <== osd.190 10.31.0.7:6868/2012671 10 ==== MOSDECSubOpReadReply(70.82ds1 704441/704439 ECSubReadReply(tid=1, attrs_read=0)) v2 ==== 786745+0+0 (2670842780 0 0) 0x7f9500472780 con 0x7f94ff5c3800
0> 2018-10-03 11:30:28.194795 7f94ce9c4700 -1 /build/ceph-12.2.7/src/osd/ECBackend.cc: In function 'void ECBackend::continue_recovery_op(ECBackend::RecoveryOp&, RecoveryMessages*)' thread 7f94ce9c4700 time 2018-10-03 11:30:28.190260
/build/ceph-12.2.7/src/osd/ECBackend.cc: 619: FAILED assert(pop.data.length() == sinfo.aligned_logical_offset_to_chunk_offset(after_progress.data_recovered_to - op.recovery_progress.data_recovered_to))
Is there anything I can do to help one of these osds (probably 233 or
238) start, such as "ceph-objectstore-tool --op repair"...? There seems
little to lose by trying but there isn't a lot of documentation on the
operations available in ceph-objectstore-tool.
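For what it's worth, the export/import I did earlier followed roughly
this pattern (a sketch from memory; these are filestore osds, the
daemons were stopped first, and the paths reflect my local layout, so
exact flags may differ):

  systemctl stop ceph-osd@239
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-239 \
      --journal-path /var/lib/ceph/osd/ceph-239/journal \
      --pgid 70.82ds1 --op export --file /root/70.82ds1.export
  systemctl stop ceph-osd@233
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-233 \
      --journal-path /var/lib/ceph/osd/ceph-233/journal \
      --op import --file /root/70.82ds1.export
  # the supported --op values are at least listed in the help output
  ceph-objectstore-tool --help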
I also know of the option "osd_find_best_info_ignore_history_les", but
know little about what it actually does other than that it is
"dangerous". There are many past intervals listed by pg query, but no
"maybe_went_rw" flags, so perhaps it is safe to revert...?
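To check for that I simply grepped the raw query output rather than
trusting my reading of the JSON, something like:

  ceph pg 70.82d query > /tmp/pg-query.json
  grep -c maybe_went_rw /tmp/pg-query.json

...and it doesn't turn up anywhere.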
--
Graham Allan
Minnesota Supercomputing Institute - gta@xxxxxxx
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com