Re: RADOS EC: is it okay to reduce the number of commits required for reply to client?

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu, Sep 19, 2019 at 12:06 AM Alex Xu <alexu4993@xxxxxxxxx> wrote:
>
> Hi Cephers,
>
> We are testing the write performance of Ceph EC (Luminous, 8 + 4), and
> noticed that tail latency is extremly high. Say, avgtime of 10th
> commit is 40ms, acceptable as it's an all HDD cluster; 11th is 80ms,
> doubled; then 12th is 160ms, doubled again, which is not so good. Then
> we made a small modification and tested again, and did get a much
> better result. The patch is quite simple (for test only of course):
>
> --- a/src/osd/ECBackend.cc
> +++ b/src/osd/ECBackend.cc
> @@ -1188,7 +1188,7 @@ void ECBackend::handle_sub_write_reply(
>      i->second.on_all_applied = 0;
>      i->second.trace.event("ec write all applied");
>    }
> -  if (i->second.pending_commit.empty() && i->second.on_all_commit) {
> +  if (i->second.pending_commit.size() == 2 &&
> i->second.on_all_commit) {  // 8 + 4 - 10 = 2
>      dout(10) << __func__ << " Calling on_all_commit on " << i->second << dendl;
>      i->second.on_all_commit->complete(0);
>      i->second.on_all_commit = 0;
>
> As far as what I see, everything still goes well (maybe because of the
> rwlock in primary OSD? not sure though), but I'm afraid it might break
> data consistency in some ways not aware of. So I'm writing to ask if
> someone could kindly provide expertise comments on this or maybe share
> any known drawbacks. Thank you!

Unfortunately this is one of those things that will work okay in
everyday use but fail catastrophically if something else goes wrong.

Ceph assumes throughout its codebase that if a write committed to
disk, it can be retrieved as long as k OSDs are available from the
PG's acting set when that write was committed. But by letting writes
"commit" on less than the full acting set, you might lose some of the
OSDs which *did* ack the write and no longer be able to find out what
it was even though you have >k OSDs available!

This will result in a very, very confused Ceph recovery algorithm,
lots of badness, and unhappy cluster users. :(
-Greg

>
> PS: OSD is backended with filestore, not bluestore, if that matters.
>
> Regards,
> Alex





[Index of Archives]     [CEPH Users]     [Ceph Large]     [Ceph Dev]     [Information on CEPH]     [Linux BTRFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux