Hi everyone,
I'm seeing different results when reading files from an erasure-coded pool
in CephFS, depending on which OSDs are running, including some incorrect
reads even with all OSDs up. I'm running Ceph 17.2.6.
# More detail
In particular, I have a relatively large backup of some files, along with
SHA-256 hashes of those files (which were verified when the backup was
created, approximately 7 months ago). Verifying the hashes now produces
several mismatches, in both large and small files, though somewhat skewed
towards larger files.
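For concreteness, the verification is just a standard checksum pass over the
backup tree, roughly like the following (the mount point and hash file name
here are placeholders, not my real paths):

  # Hypothetical paths; the hash file was generated when the backup was taken.
  cd /mnt/cephfs/backups
  sha256sum -c SHA256SUMS | grep -v ': OK$'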
Investigating which PGs store the affected files (using
cephfs-data-scan pg_files) showed the problem isn't isolated to a single
PG, but several of the affected PGs have OSD 15 in their acting set.
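For reference, that investigation looked roughly like this (the path and PG
IDs are placeholders for my real ones):

  # List files under the backup path that touch the suspect PGs.
  cephfs-data-scan pg_files /backups 3.1f 3.2a
  # Show the up/acting OSD sets for a suspect PG.
  ceph pg map 3.1f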
Taking OSD 15 offline leads to *better* reads (more files with correct
SHA-256 hashes), but not entirely correct ones. Further investigation
implicated OSD 34 as another potential culprit; taking it offline likewise
yields more correct files, but still not all of them.
Bringing the stopped OSDs (15 and 34) back online brings back the earlier
(incorrect) hashes when reading the files, as might be expected, but this
suggests that the correct data (or at least more-correct data) is still on
the drives.
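For clarity, "taking an OSD offline" above just means stopping the daemon
with noout set so nothing rebalances in the meantime, along these lines
(the unit name will differ on a cephadm-managed cluster):

  ceph osd set noout
  systemctl stop ceph-osd@15    # and/or ceph-osd@34
  # ... re-run the SHA-256 verification ...
  systemctl start ceph-osd@15
  ceph osd unset noout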
The hashes I get for a given corrupted file are consistent from read to
read (including across different hosts, to rule out caching), although they
can of course change when I take an affected OSD offline.
# Recent history
I have Ceph configured with a deep scrub interval of approximately 30
days, and deep scrubs have completed regularly with no issues identified.
However, within the past two weeks I added two additional drives to the
cluster, and the resulting rebalance took about two weeks to complete: the
placement groups I've noticed issues with have not been deep scrubbed since
the rebalance finished, so it is possible something was corrupted during
the rebalance.
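(For reference, the interval corresponds to osd_deep_scrub_interval, and the
last-deep-scrub timestamps can be checked per PG; the PG ID below is a
placeholder.)

  # ~30 days, in seconds (roughly how I have it configured).
  ceph config set osd osd_deep_scrub_interval 2592000
  # When was this PG last deep scrubbed?
  ceph pg 3.1f query | grep last_deep_scrub_stamp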
Neither OSD 15 nor 34 is a new drive, and as far as I have experienced
(and Ceph's health indications have shown), all of the existing OSDs
have behaved correctly up to this point.
# Configuration
I created an erasure coding profile for the pool in question using the
following command:
ceph osd erasure-code-profile set erasure_k4_m2 \
    plugin=jerasure \
    k=4 m=2 \
    technique=blaum_roth \
    crush-device-class=hdd
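The profile as the cluster reports it can be dumped with the following, in
case any of the defaulted parameters matter:

  ceph osd erasure-code-profile get erasure_k4_m2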
And the following CRUSH rule is used for the pool:
rule erasure_k4_m2_hdd_rule {
    id 3
    type erasure
    min_size 4
    max_size 6
    step take default class hdd
    step choose indep 3 type host
    step chooseleaf indep 2 type osd
    step emit
}
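For what it's worth, the pool's rule and profile assignments can be
confirmed with the following (the pool name is a placeholder for my EC data
pool):

  ceph osd pool get cephfs_data_ec crush_rule
  ceph osd pool get cephfs_data_ec erasure_code_profile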
# Questions
1. Does this behavior ring a bell for anyone? Is there something obvious
I'm missing or should be doing?
2. Is deep scrubbing likely to help the situation? Hurt it? (Hopefully
not hurt: I've prioritized deep scrubbing of the PGs on OSDs 15 and 34,
roughly as sketched below the questions, and will likely follow up with
the rest of the pool.)
3. Is there a way to force "full reads" or otherwise use all of the EC
chunks (potentially in tandem with the on-disk checksums) to identify the
correct data, rather than just the combination of data from the primary OSDs?
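For question 2, the prioritized deep scrubbing I mentioned looks roughly
like this, and for question 3 my current plan is to inspect whatever the
scrubs flag using rados list-inconsistent-obj (the awk filter and the PG ID
below are just illustrative):

  # Queue deep scrubs for every PG mapped to osd.15 (and likewise osd.34).
  for pg in $(ceph pg ls-by-osd osd.15 | awk '$1 ~ /^[0-9]+\./ {print $1}'); do
      ceph pg deep-scrub "$pg"
  done
  # Once a scrub reports inconsistencies, inspect them (placeholder PG ID):
  rados list-inconsistent-obj 3.1f --format=json-pretty

I'm deliberately holding off on ceph pg repair until I understand which
copies are actually trustworthy.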
Thanks for any insights you might have,
aschmitz