Hi Cephers,

We are testing the write performance of Ceph EC (Luminous, 8 + 4) and noticed that tail latency is extremely high. For example, the average time of the 10th commit is 40 ms, which is acceptable for an all-HDD cluster; the 11th is 80 ms, doubled; and the 12th is 160 ms, doubled again, which is not so good.

We then made a small modification and tested again, and got a much better result. The patch is quite simple (for testing only, of course). It calls on_all_commit as soon as only 2 of the 12 shard commits are still pending, that is, after 10 commits:

--- a/src/osd/ECBackend.cc
+++ b/src/osd/ECBackend.cc
@@ -1188,7 +1188,7 @@ void ECBackend::handle_sub_write_reply(
     i->second.on_all_applied = 0;
     i->second.trace.event("ec write all applied");
   }
-  if (i->second.pending_commit.empty() && i->second.on_all_commit) {
+  if (i->second.pending_commit.size() == 2 && i->second.on_all_commit) {  // 8 + 4 - 10 = 2
     dout(10) << __func__ << " Calling on_all_commit on " << i->second << dendl;
     i->second.on_all_commit->complete(0);
     i->second.on_all_commit = 0;

As far as I can see, everything still works (maybe because of the rwlock in the primary OSD? I'm not sure, though), but I'm afraid it might break data consistency in ways I am not aware of. So I'm writing to ask whether someone could kindly provide expert comments on this, or share any known drawbacks.

Thank you!

PS: The OSDs are backed by filestore, not bluestore, if that matters.

Regards,
Alex
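
Addendum: to make the latency effect of the change concrete, here is a minimal standalone sketch. It is not Ceph code, and ack_latency_ms is a hypothetical helper introduced only for illustration. It assumes the client-visible write latency is simply the time at which the required number of shard commits has arrived, and it ignores everything else the OSD does.

```cpp
#include <algorithm>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper, not part of Ceph: returns the time (in ms) at which a
// write could be acknowledged if the primary waited for `required` shard
// commits. `commit_ms` holds one commit latency per shard, in any order.
double ack_latency_ms(std::vector<double> commit_ms, std::size_t required) {
    if (required == 0 || required > commit_ms.size()) {
        throw std::invalid_argument("required must be in [1, number of shards]");
    }
    // Place the required-th smallest latency at index required - 1.
    std::nth_element(commit_ms.begin(),
                     commit_ms.begin() + (required - 1),
                     commit_ms.end());
    return commit_ms[required - 1];
}
```

For a write whose 10th, 11th, and 12th commits arrive at 40, 80, and 160 ms (the averages quoted above), ack_latency_ms(latencies, 12) gives 160 and ack_latency_ms(latencies, 10) gives 40, whatever the nine faster values are. The sketch says nothing about whether acknowledging early is safe, which is the open question.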