As the nature of the osd crash implies, setting "nobackfill" appears to let all the
osds keep running; the pg stays active and can apparently serve data.
If I track down the object referenced below in the object store, I can
download it without error via s3... though as I can't generate a
matching etag, it may well be corrupt.
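(As a sanity check on that etag comparison: for a single-part upload the s3 etag
should just be the md5 of the object, so the check amounts to something like the
commands below, with the client, bucket and key as placeholders. For a multipart
upload, which the "2~..." tag in the shadow object name suggests this was, the etag
is instead the md5 of the concatenated per-part md5s followed by "-<number of
parts>", so a whole-object md5sum wouldn't be expected to match even if the data
were intact.)

    s3cmd get s3://<bucket>/<key> /tmp/object.nii
    md5sum /tmp/object.nii   # only directly comparable to the etag for single-part uploads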
Still, I do wonder whether deleting this object (either via s3, or more likely
directly within the filestore) might permit backfill to continue.
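If it comes to removing it directly from the filestore, I imagine the procedure
would be something like the following on each OSD holding a shard, with that OSD
stopped first (osd.190 / shard pgid 70.82ds2 here just as the example from the log
below; I'd export the pg, or at least record the exact object spec from --op list,
before removing anything):

    systemctl stop ceph-osd@190
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-190 \
        --pgid 70.82ds2 --op list | grep NDARINV6BVVAY29
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-190 \
        --pgid 70.82ds2 '<object spec printed by --op list>' remove
    systemctl start ceph-osd@190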
Of the other objects referenced in "last_backfill" for each osd in the pg, both
also download via s3. One of them I have seen implicated in similar crash logs
for other OSDs, and its etag again does not match; the other I have not seen in
any crash logs, and it does generate a matching etag.
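(For reference, the last_backfill markers themselves come from the pg query
output, e.g.

    ceph pg 70.82d query | grep last_backfill

with one value in the primary's info section and one per peer under peer_info.)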
Opened a tracker issue for this: http://tracker.ceph.com/issues/36411
Graham
On 10/09/2018 06:55 PM, Graham Allan wrote:
On 10/09/2018 01:14 PM, Graham Allan wrote:
On 10/9/2018 12:19 PM, Gregory Farnum wrote:
I think, unfortunately, the easiest way for you to fix this will be
to set min_size back to 4 until the PG is recovered (or at least
has 5 shards done). This will be fixed in a later version of Ceph and
probably backported, but sadly it's not done yet.
-Greg
Thanks Greg, though sadly I've tried that; whatever I do, one of the 4
osds involved will simply crash (not just the ones I previously tried
to re-import via ceph-objectstore-tool). I just spent time chasing
them around, never succeeding in keeping a complete set running long
enough to make progress. They seem to crash when starting backfill on
the next object, so there must be something in the current set of
shards which they can't handle.
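(For anyone following along, the min_size change itself is just the usual
pool-level setting, with the pool name substituted for whichever EC data pool
this is:

    ceph osd pool set <ec-data-pool> min_size 4

and back to its original value afterwards.)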
Since then I've been focusing on trying to get the pg to revert to an
earlier interval using osd_find_best_info_ignore_history_les, though
the information I can find about that option is minimal.
Since my experiments with osd_find_best_info_ignore_history_les haven't
produced any results, I'm looking again at the osd crashes that occur when I
get enough of them running for backfill to start.
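(For reference, setting that option looks something like the following; part of
what's unclear to me is whether injectargs alone is enough, or whether it needs
to go into ceph.conf with the OSDs restarted so that it takes effect during
peering:

    ceph tell osd.190 injectargs '--osd_find_best_info_ignore_history_les=true'
    # or in ceph.conf on the relevant OSD hosts:
    # [osd]
    # osd find best info ignore history les = true

with osd.190 standing in for each OSD in the acting set.)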
They all crash in the same way; with "debug osd=10", the very last bit is:
-2> 2018-10-09 16:27:25.425004 7faa866bd700 10 osd.190 pg_epoch:
710808 pg[70.82ds2( v 710799'704745 (586066'698574,710799'704745]
local-lis/les=710807/710808 n=102929 ec=21494/21494 lis/c
710807/588565 les/c/f 710808/588566/0 710711/710807/710807)
[820,761,105,789,562,485]/[2147483647,2147483647,190,448,61,315]p190(2) r=2
lpr=710807 pi=[588565,710807)/39 rops=1
bft=105(2),485(5),562(4),761(1),789(3),820(0) crt=710799'704745 lcod 0'0 mlcod 0'0
active+undersized+degraded+remapped+backfilling] continue_recovery_op:
continuing
RecoveryOp(hoid=70:b415ca12:::default.10341087.1__shadow_Subjects%2fsub-NDARINV6BVVAY29%2fses-baselineYear1Arm1%2ffunc%2frun-04_AROMA%2fdenoised_func_data_nonaggr.nii.2~VfI6rAYnU4XtzUcFAoecJDU5TpVR8AP.16_3:head
v=579167'695462 missing_on=105(2),485(5),562(4),761(1),789(3),820(0)
missing_on_shards=0,1,2,3,4,5
recovery_info=ObjectRecoveryInfo(70:b415ca12:::default.10341087.1__shadow_Subjects%2fsub-NDARINV6BVVAY29%2fses-baselineYear1Arm1%2ffunc%2frun-04_AROMA%2fdenoised_func_data_nonaggr.nii.2~VfI6rAYnU4XtzUcFAoecJDU5TpVR8AP.16_3:head@579167'695462,
size: 0, copy_subset: [], clone_subset: {}, snapset: 0=[]:[])
recovery_progress=ObjectRecoveryProgress(first, data_recovered_to:0,
data_complete:false, omap_recovered_to:, omap_complete:true,
error:false) obc refcount=3 state=READING waiting_on_pushes=
extent_requested=0,8388608)
-1> 2018-10-09 16:27:25.425105 7faa866bd700 10 osd.190 pg_epoch:
710808 pg[70.82ds2( v 710799'704745 (586066'698574,710799'704745]
local-lis/les=710807/710808 n=102929 ec=21494/21494 lis/c
710807/588565 les/c/f 710808/588566/0 710711/710807/710807)
[820,761,105,789,562,485]/[2147483647,2147483647,190,448,61,315]p190(2) r=2
lpr=710807 pi=[588565,710807)/39 rops=1
bft=105(2),485(5),562(4),761(1),789(3),820(0) crt=710799'704745 lcod 0'0 mlcod 0'0
active+undersized+degraded+remapped+backfilling] continue_recovery_op:
before_progress=ObjectRecoveryProgress(first, data_recovered_to:0,
data_complete:false, omap_recovered_to:, omap_complete:true,
error:false), after_progress=ObjectRecoveryProgress(!first,
data_recovered_to:0, data_complete:true, omap_recovered_to:,
omap_complete:true, error:false), pop.data.length()=786432, size=0
0> 2018-10-09 16:27:25.430239 7faa866bd700 -1
/build/ceph-12.2.8/src/osd/ECBackend.cc: In function 'void
ECBackend::continue_recovery_op(ECBackend::RecoveryOp&,
RecoveryMessages*)' thread 7faa866bd700 time 2018-10-09 16:27:25.425135
/build/ceph-12.2.8/src/osd/ECBackend.cc: 619: FAILED
assert(pop.data.length() ==
sinfo.aligned_logical_offset_to_chunk_offset(
after_progress.data_recovered_to -
op.recovery_progress.data_recovered_to))
It sounds like we're getting an unexpected data chunk size, presumably
due to some kind of data corruption. However, if I search all the acting OSDs
(up or down) for the object on which backfill crashed, I do find every shard,
and they do at least have a consistent size... though perhaps that size itself
is different from what's expected.
osd 233:
root@ceph01:/var/lib/ceph/osd/ceph-233/current# find 70.82ds1_head
-name '*NDARINV6BVVAY29*' -type f -ls 23624833062 772 -rw-r--r-- 1
root root 786432 Sep 26 15:32
70.82ds1_head/DIR_D/DIR_2/DIR_8/DIR_A/DIR_3/default.10341087.1\\u\\ushadow\\uSubjects\\ssub-NDARINV6BVVAY29\\sses-baselineYear1Arm1\\sfunc\\srun-04\\uAROMA\\sdenoised\\ufunc\\udata\\unonaggr.nii.2~VfI6rAYnU4XtzUcFAoecJDU5TpVR8AP.16\\u3__head_4853A82D__46_ffffffffffffffff_1
osd 190:
root@ceph07:/var/lib/ceph/osd/ceph-190/current# find 70.82ds2_head
-name '*NDARINV6BVVAY29*' -type f -ls 23710021684 772 -rw-r--r-- 1
root root 786432 Jun 27 07:13
70.82ds2_head/DIR_D/DIR_2/DIR_8/DIR_A/DIR_3/DIR_5/default.10341087.1\\u\\ushadow\\uSubjects\\ssub-NDARINV6BVVAY29\\sses-baselineYear1Arm1\\sfunc\\srun-04\\uAROMA\\sdenoised\\ufunc\\udata\\unonaggr.nii.2~VfI6rAYnU4XtzUcFAoecJDU5TpVR8AP.16\\u3__head_4853A82D__46_ffffffffffffffff_2
osd 448:
root@ceph09:/var/lib/ceph/osd/ceph-448/current# find 70.82ds3_head
-name '*NDARINV6BVVAY29*' -type f -ls 3259920428 772 -rw-r--r-- 1
root root 786432 Jun 27 07:13
70.82ds3_head/DIR_D/DIR_2/DIR_8/DIR_A/DIR_3/DIR_5/default.10341087.1\\u\\ushadow\\uSubjects\\ssub-NDARINV6BVVAY29\\sses-baselineYear1Arm1\\sfunc\\srun-04\\uAROMA\\sdenoised\\ufunc\\udata\\unonaggr.nii.2~VfI6rAYnU4XtzUcFAoecJDU5TpVR8AP.16\\u3__head_4853A82D__46_ffffffffffffffff_3
osd 315:
root@ceph03:/var/lib/ceph/osd/ceph-315/current# find 70.82ds5_head
-name '*NDARINV6BVVAY29*' -type f -ls 64439371813 772 -rw-r--r-- 1
root root 786432 Jun 27 07:13
70.82ds5_head/DIR_D/DIR_2/DIR_8/DIR_A/DIR_3/DIR_5/default.10341087.1\\u\\ushadow\\uSubjects\\ssub-NDARINV6BVVAY29\\sses-baselineYear1Arm1\\sfunc\\srun-04\\uAROMA\\sdenoised\\ufunc\\udata\\unonaggr.nii.2~VfI6rAYnU4XtzUcFAoecJDU5TpVR8AP.16\\u3__head_4853A82D__46_ffffffffffffffff_5
osd 98:
root@ceph05:/var/lib/ceph/osd/ceph-98/current# find 70.82ds0_head
-name '*NDARINV6BVVAY29*' -type f -ls 2428799806 768 -rw-r--r-- 1
root root 786432 Jun 27 07:13
70.82ds0_head/DIR_D/DIR_2/DIR_8/DIR_A/DIR_3/DIR_5/default.10341087.1\\u\\ushadow\\uSubjects\\ssub-NDARINV6BVVAY29\\sses-baselineYear1Arm1\\sfunc\\srun-04\\uAROMA\\sdenoised\\ufunc\\udata\\unonaggr.nii.2~VfI6rAYnU4XtzUcFAoecJDU5TpVR8AP.16\\u3__head_4853A82D__46_ffffffffffffffff_0
osd 61:
root@ceph04:~# ls -l
/var/lib/ceph/osd/ceph-61/current/70.82ds4_head/DIR_D/DIR_2/DIR_8/DIR_A/DIR_3/DIR_5/*NDARINV6BVVAY29*
-rw-r--r-- 1 root root 786432 Jun 27 07:13
/var/lib/ceph/osd/ceph-61/current/70.82ds4_head/DIR_D/DIR_2/DIR_8/DIR_A/DIR_3/DIR_5/default.10341087.1\u\ushadow\uSubjects\ssub-NDARINV6BVVAY29\sses-baselineYear1Arm1\sfunc\srun-04\uAROMA\sdenoised\ufunc\udata\unonaggr.nii.2~VfI6rAYnU4XtzUcFAoecJDU5TpVR8AP.16\u3__head_4853A82D__46_ffffffffffffffff_4
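To spell out the numbers: every shard above is 786432 bytes. Assuming this pool
uses the 4+2 profile I believe it does, that would correspond to 4 x 786432 =
3145728 bytes (3 MiB) of logical object data, yet the recovery op in the log
reports "size: 0" for this same object; a mismatch of that sort looks like just
the kind of thing that would trip the failed assert on pop.data.length(). For
what it's worth, a quick way to gather the shard sizes in one pass (hosts and
name pattern taken from the listings above; the glob may also match shards
belonging to other pgs):

    for h in ceph01 ceph03 ceph04 ceph05 ceph07 ceph09; do
        ssh $h "find /var/lib/ceph/osd/ceph-*/current/70.82ds*_head \
            -name '*NDARINV6BVVAY29*' -type f -printf '%s %p\n'"
    done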
I'm highly motivated to keep looking for a solution... there's around 1PB of
data in this pool!
Graham
--
Graham Allan
Minnesota Supercomputing Institute - gta@xxxxxxx