Hi,
Since the last issue in our cluster
https://marc.info/?t=150297924500005&r=1&w=2
we have had all sorts of files partially written to disk, due to the many
crashes or for reasons we are not sure of. We reported it here:
http://tracker.ceph.com/issues/21173
and it is probably related to, or the same as, this one:
http://tracker.ceph.com/issues/14009
since we hit exactly the same trace and the same assert.

We had thousands of these corrupted shards; we could have removed them
manually, but that was not practical. Having been hit by this before, I
decided to try to patch it (my first time patching Ceph, so excuse my
mistakes); any comments are welcome. I basically borrowed code from the
redundant/fast read block, and I am not sure whether I satisfied the
"TODO" above in the same function.
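
Before the diff, here is a rough standalone sketch of the idea. It is not
part of the patch and not against the Ceph tree; the names and the plain
shard-length map are invented for illustration, and the k-out-of-n check
merely stands in for ec_impl->minimum_to_decode(). Only shards whose
returned length matches the expected one are treated as usable, and if
those are no longer enough to decode, more reads have to be sent.

// Standalone illustration only; in the real patch the data is a
// map<pg_shard_t, bufferlist> and the "can we still decode?" question is
// answered by ec_impl->minimum_to_decode().
#include <cstdint>
#include <iostream>
#include <map>
#include <set>

struct ReadCheck {
  std::set<int> usable;  // shards whose length matches the reference
  bool need_more_reads;  // usable shards are not enough to decode
};

// 'lengths' maps shard id -> bytes returned by that shard's sub-read,
// 'ref' is the length we trust, 'k' is the number of chunks needed.
ReadCheck check_returned_shards(const std::map<int, uint64_t> &lengths,
                                uint64_t ref, unsigned k)
{
  ReadCheck res;
  for (const auto &p : lengths) {
    if (p.second == ref)
      res.usable.insert(p.first);  // same size as the reference: keep it
    else
      std::cerr << "shard " << p.first << " has mismatched size "
                << p.second << " != " << ref << "\n";
  }
  // stand-in for minimum_to_decode(): assume any k distinct shards suffice
  res.need_more_reads = res.usable.size() < k;
  return res;
}

int main()
{
  // 4+2 profile; shard 5 came back short, e.g. partially written to disk
  std::map<int, uint64_t> lengths =
      {{0, 8192}, {1, 8192}, {2, 8192}, {3, 8192}, {4, 8192}, {5, 4096}};
  ReadCheck res = check_returned_shards(lengths, 8192, 4);
  std::cout << "usable shards: " << res.usable.size()
            << ", need more reads: " << std::boolalpha
            << res.need_more_reads << "\n";
  return 0;
}

The patch below does the same thing in two places: it skips mismatched
shards when building the decode input, and in the read-completion path it
asks minimum_to_decode() whether the matching shards are still enough,
sending the remaining reads if they are not.
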
--- ceph-12.2.0/src/osd/ECBackend.cc 2017-08-28 12:30:20.000000000 -0400
+++ ceph-12.2.0/src/osd/ECBackend.cc 2017-09-01 08:55:38.209400108 -0400
@@ -402,10 +402,13 @@
     target[*i] = &(op.returned_data[*i]);
   }
   map<int, bufferlist> from;
+  uint64_t size = to_read.get<2>().begin()->second.length();
   for(map<pg_shard_t, bufferlist>::iterator i = to_read.get<2>().begin();
       i != to_read.get<2>().end();
       ++i) {
-    from[i->first.shard].claim(i->second);
+    if (size == i->second.length()) {
+      from[i->first.shard].claim(i->second);
+    }
   }
   dout(10) << __func__ << ": " << from << dendl;
   int r = ECUtil::decode(sinfo, ec_impl, from, target);
@@ -1247,7 +1250,93 @@
   if (rop.in_progress.empty() || is_complete == rop.complete.size()) {
     dout(20) << __func__ << " Complete: " << rop << dendl;
     rop.trace.event("ec read complete");
-    complete_read_op(rop, m);
+
+    // send more reads if recovery needs them
+    bool size_mismatched = false;
+    bool sent_for_more = false;
+    for (map<hobject_t, read_result_t>::const_iterator iter =
+           rop.complete.begin();
+         iter != rop.complete.end();
+         ++iter) {
+      uint64_t total_data_size =
+        iter->second.returned.front().get<2>().begin()->second.length();
+      set<int> have;
+      for (map<pg_shard_t, bufferlist>::const_iterator j =
+             iter->second.returned.front().get<2>().begin();
+           j != iter->second.returned.front().get<2>().end();
+           ++j) {
+
+        if (j->second.length() == total_data_size) {
+          have.insert(j->first.shard);
+        } else {
+          size_mismatched = true;
+          dout(0) << __func__ << " Checking Complete: [ERR] shard "
+                  << j->first.shard << " has mismatched size: "
+                  << j->second.length() << " != " << total_data_size
+                  << ". " << rop << dendl;
+        }
+      }
+
+      // one or more shards have different size. corrupted?! maybe a
+      // crashed node or oom kill.
+      set<int> want_to_read, dummy_minimum;
+      int err;
+      if (size_mismatched || true) { // not sure if we always should check this, may be check for read op errors too
+//    if (size_mismatched) {
+        get_want_to_read_shards(&want_to_read);
+        if ((err = ec_impl->minimum_to_decode(want_to_read, have,
+                                              &dummy_minimum)) < 0) {
+          dout(20) << __func__ << " minimum_to_decode failed, we only have shards "
+                   << have << dendl;
+          if (rop.in_progress.empty()) {
+            // If we don't have enough copies and we haven't sent reads
+            // for all shards we can send the rest of the reads, if any.
+            if (!rop.do_redundant_reads) {
+              dout(10) << __func__ << ": sending for more reads" << dendl;
+              int r = send_all_remaining_reads(iter->first, rop);
+              if (r == 0) {
+                sent_for_more = true;
+                continue;
+              }
+              // Couldn't read any additional shards so handle as
+              // completed with errors
+            }
+            // We don't want to confuse clients / RBD with objectstore error
+            // values in particular ENOENT. We may have different error
+            // returns from different shards, so we'll return
+            // minimum_to_decode() error (usually EIO) to reader. It is
+            // likely an error here is due to a damaged pg.
+            rop.complete[iter->first].r = err;
+          }
+        } else {
+          dout(20) << __func__ << " minimum_to_decode succeeded "
+                   << dummy_minimum << dendl;
+          assert(rop.complete[iter->first].r == 0);
+          if (!rop.complete[iter->first].errors.empty()) {
+            if (cct->_conf->osd_read_ec_check_for_errors) {
+              dout(10) << __func__ << ": Not ignoring errors, use one shard err="
+                       << err << dendl;
+              err = rop.complete[iter->first].errors.begin()->second;
+              rop.complete[iter->first].r = err;
+              dout(10) << __func__ << ": Not ignoring errors 2, use one shard err="
+                       << err << dendl;
+            } else {
+              get_parent()->clog_warn() << "Error(s) ignored for "
+                                        << iter->first
+                                        << " enough copies available";
+              dout(10) << __func__ << " Error(s) ignored for " << iter->first
+                       << " enough copies available" << dendl;
+              rop.complete[iter->first].errors.clear();
+            }
+          }
+//        ++is_complete;
+        }
+      } else {
+        dout(10) << __func__
+                 << " Checking Complete: all shards are the same size" << dendl;
+      }
+    }
+    if (!sent_for_more) {
+      complete_read_op(rop, m);
+    } else {
+      dout(10) << __func__ << " readop needed more reads: " << rop << dendl;
+    }
   } else {
     dout(10) << __func__ << " readop not complete: " << rop << dendl;
   }
The patch is not perfect; I had a few improvements in mind after
deployment. But since I have already deployed it to our cluster and we do
not want to interrupt the recovery (it has been recovering for the past 3
days with this patch), I will not have the chance to test it again, and I
could not replicate the same error in a test cluster.

Truncating a file in our test cluster, or touching it in any way,
generated a read error in the sub read op, while I can clearly see sub
read ops in the production cluster reporting success with an incorrect
size! If anyone has an idea of how to replicate this, I would very much
like to finish the patch properly. If you need this in any other form, or
as a pull request, I will be more than happy to do it.
The caveats we have seen with this patch so far: if the shard being
compared against is the wrong/corrupted one, the op fails and we get a
"failed_push" error; and if the primary OSD holds a corrupted shard,
reconstruction of the chunk fails and we get

FAILED assert(pop.data.length() ==
  sinfo.aligned_logical_offset_to_chunk_offset(
    after_progress.data_recovered_to - op.recovery_progress.data_recovered_to))

But those cases are far fewer than having to deal with every shard by
hand. Maybe it needs a more sophisticated way of choosing which shard to
compare against (for example, find the distinct sizes, count how many
shards report each, and take the size with the largest count), roughly as
sketched below.
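
Here is a minimal standalone sketch of that idea, again not against the
Ceph tree: the helper names and the plain map of shard lengths are made
up for illustration, and the real code would operate on the
map<pg_shard_t, bufferlist> of returned data inside ECBackend.

// Hypothetical helper: pick the length reported by the most shards as
// the reference, and flag every shard with a different length as suspect.
#include <cstdint>
#include <iostream>
#include <map>
#include <set>

// shard id -> number of bytes that shard's sub-read returned
using ShardLengths = std::map<int, uint64_t>;

// Return the length reported by the largest number of shards.
uint64_t pick_reference_size(const ShardLengths &lengths)
{
  std::map<uint64_t, unsigned> count;  // length -> how many shards report it
  for (const auto &p : lengths)
    ++count[p.second];
  uint64_t best_len = 0;
  unsigned best_count = 0;
  for (const auto &c : count) {
    if (c.second > best_count) {
      best_count = c.second;
      best_len = c.first;
    }
  }
  return best_len;
}

// Shards whose length differs from the reference are the suspects.
std::set<int> suspect_shards(const ShardLengths &lengths, uint64_t ref)
{
  std::set<int> out;
  for (const auto &p : lengths)
    if (p.second != ref)
      out.insert(p.first);
  return out;
}

int main()
{
  // e.g. a 4+2 object where shard 3 was truncated by a crash
  ShardLengths lengths = {{0, 8192}, {1, 8192}, {2, 8192},
                          {3, 4096}, {4, 8192}, {5, 8192}};
  uint64_t ref = pick_reference_size(lengths);
  std::cout << "reference size: " << ref << "\n";
  for (int s : suspect_shards(lengths, ref))
    std::cout << "suspect shard: " << s << "\n";
  return 0;
}

With something like this, even when the first shard in the map happens to
be the corrupted one, it would not end up as the reference that all the
other shards are compared against.
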
thanks
ali