RADOS EC: is it okay to reduce the number of commits required for reply to client?

Alex Xu <alexu4993@xxxxxxxxx> · Thu, 19 Sep 2019 15:05:52 +0800

Hi Cephers,

We are testing the write performance of Ceph EC (Luminous, 8 + 4), and
noticed that tail latency is extremly high. Say, avgtime of 10th
commit is 40ms, acceptable as it's an all HDD cluster; 11th is 80ms,
doubled; then 12th is 160ms, doubled again, which is not so good. Then
we made a small modification and tested again, and did get a much
better result. The patch is quite simple (for test only of course):

--- a/src/osd/ECBackend.cc
+++ b/src/osd/ECBackend.cc
@@ -1188,7 +1188,7 @@ void ECBackend::handle_sub_write_reply(
     i->second.on_all_applied = 0;
     i->second.trace.event("ec write all applied");
   }
-  if (i->second.pending_commit.empty() && i->second.on_all_commit) {
+  if (i->second.pending_commit.size() == 2 &&
i->second.on_all_commit) {  // 8 + 4 - 10 = 2
     dout(10) << __func__ << " Calling on_all_commit on " << i->second << dendl;
     i->second.on_all_commit->complete(0);
     i->second.on_all_commit = 0;

As far as what I see, everything still goes well (maybe because of the
rwlock in primary OSD? not sure though), but I'm afraid it might break
data consistency in some ways not aware of. So I'm writing to ask if
someone could kindly provide expertise comments on this or maybe share
any known drawbacks. Thank you!

PS: OSD is backended with filestore, not bluestore, if that matters.

Regards,
Alex