Re: Attempt to rethink log-based replication in Ceph on fast IO path

On Mon, Jan 27, 2020 at 2:58 PM Roman Penyaev <rpenyaev@xxxxxxx> wrote:
> rados snapshots are based on clone, right? The clone operation should
> follow the sync path (through the single primary) but can still be
> a bit tricky (it requires the object to be identical on all replicas
> at the moment the clone is done) and can be implemented by
> communication between the replicas, with the primary as the main
> coordinator.  Here is what I imagine (some sort of lazy data
> migration)

Uh, they are clones, but not in the way you're thinking/hoping. Client
IO *can* (but does not need to) include a snap context, which contains data
about the snapshots that logically exist (but may or may not have been
seen by any particular object yet). When the client does a write with
a snapid the object doesn't already contain, the OSD does a clone
locally. And then any subsequent write applies to the new head, not
the old snapid.
So unfortunately neither the client nor the OSD has the slightest idea
whether a particular write operation requires any snapshot work until
it arrives at the OSD.
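
To make that concrete, here's a toy sketch of the clone-on-write decision.
The types and names below are simplified stand-ins, not the real OSD code;
the point is only where the branch lives:

#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

using snapid_t = uint64_t;

struct SnapContext {                  // sent by the client alongside the write
  snapid_t seq = 0;                   // newest snapshot the client knows about
  std::vector<snapid_t> snaps;        // existing snapshot ids, newest first
};

struct ObjectState {                  // what the OSD holds for the head object
  std::string data;
  snapid_t seen_seq = 0;              // newest snapshot this object has seen
  std::vector<std::string> clones;    // frozen copies, one per clone made
};

// The decision happens here, on the OSD, when the op arrives: neither side
// can know in advance whether this particular write needs a clone.
void write_with_snapc(ObjectState& obj, const SnapContext& snapc,
                      const std::string& new_data) {
  if (snapc.seq > obj.seen_seq) {     // object hasn't seen this snapshot yet
    obj.clones.push_back(obj.data);   // clone the current head locally
    obj.seen_seq = snapc.seq;
  }
  obj.data = new_data;                // later writes go to the new head
}

int main() {
  ObjectState obj{"v1"};
  write_with_snapc(obj, {1, {1}}, "v2");  // snap 1 unseen here -> clone "v1"
  write_with_snapc(obj, {1, {1}}, "v3");  // snap 1 already seen -> no clone
  std::cout << obj.clones.size() << " clone(s), head=" << obj.data << "\n";
  return 0;
}

The branch that decides whether a clone happens only has the information it
needs once the op and the object's on-disk state meet on the OSD.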

<snip>

>
> Everything which is not a plain read/write is treated as a management
> or metadata request.
>
> > (Among other things, any write op has the
> > potential to change the object’s size and that needs to be ordered
> > with truncate operations.)
>
> Why do these two operations have to be ordered? No, I will ask another
> way: why should distributed storage care about the order of these
> two operations? That is what I do not understand.  Why can't the client
> be responsible for waiting for the IO to complete and only then issuing
> a truncate?  (Direct analogy: you issue an IO to your file and do not
> wait for the completion, then you truncate your file -- what is the result?)
>
> But if we are talking about the concurrent-clients case, where one of the
> clients issues a write while another one issues a truncate, then
> I do not understand how the sync log helps, because the primary
> replica can receive these two requests in any order (we assume no
> proper locking is used, right?)

If a client dispatches multiple operations on a single object, we
guarantee they are applied in the order the client dispatched them. So
the client can do an async truncate, async write, async read, async
append, whatever, from a single thread, and we promise to process them
in that order.
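
A minimal sketch of what that contract means for one object (this only
illustrates the ordering semantics, not how the OSD actually queues things):

#include <deque>
#include <functional>
#include <iostream>
#include <string>

struct Object { std::string data; };

struct ObjectQueue {                  // one FIFO per object
  Object obj;
  std::deque<std::function<void(Object&)>> pending;

  void submit(std::function<void(Object&)> op) {   // the client never waits here
    pending.push_back(std::move(op));
  }
  void drain() {                      // ops are applied strictly in arrival order
    while (!pending.empty()) {
      pending.front()(obj);
      pending.pop_front();
    }
  }
};

int main() {
  ObjectQueue q;
  q.obj.data = "old contents";
  // async truncate, then async append, issued back-to-back from one thread
  q.submit([](Object& o) { o.data.clear(); });
  q.submit([](Object& o) { o.data += "hello"; });
  q.drain();
  std::cout << q.obj.data << "\n";    // "hello": the truncate never lands after the append
  return 0;
}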

Some of that could probably be maintained in the client library rather
than on the OSDs, but not all of it given the timestamp-based retries
you describe and the problem of snapshots I mentioned above.

Basically what I'm getting at is that given the way the RADOS protocol
works, we really don't have any ops which are plain read/write.
That especially includes some of the ones you care about most -- for
instance, RBD volumes without an object map include an IO hint on
every write, to make sure that any newly-created objects get set to
the proper 4MB size. These IO hints mean they're all compound
operations, not single-operation writes!
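
Roughly, what one of those writes looks like by the time it reaches the
OSD -- the encoding below is made up, not the actual RADOS wire format:

#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

enum class OpCode { SetAllocHint, Write };

struct SubOp {
  OpCode code;
  uint64_t offset = 0;
  uint64_t length = 0;
  std::string payload;
};

// One RBD "write" as it arrives at the OSD: an allocation hint plus the
// write itself, applied atomically -- so it is never a plain single write.
std::vector<SubOp> make_rbd_write(uint64_t off, const std::string& data) {
  return {
    {OpCode::SetAllocHint, 0, 4ull << 20, ""},   // expect a 4MB object
    {OpCode::Write, off, data.size(), data},
  };
}

int main() {
  auto ops = make_rbd_write(0, "abc");
  std::cout << "sub-ops in a single RBD write: " << ops.size() << "\n";
  return 0;
}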

Of course, all these issues are not the result of changing our
durability guarantees, but of trying to provide client-side
replication...

>
> > Now, making these changes isn’t necessarily bad if we want to develop
> > a faster but less ridiculously-consistent storage system to better
> > serve the needs of the interfaces that actually get deployed — I have
> > long found it a little weird that RADOS, a strictly-consistent
> > transactional object store, is one of the premier providers of virtual
> > block-device IO and S3 storage. But if that’s the goal, we should
> > embrace being not-RADOS and be willing to explicitly take much larger
> > departures from it than just “in the happy path we drop ordering and
> > fall back to backfill if there’s a problem”.
>
> Current constraints are blockers for IO performance.  It does not
> matter how much we squeeze from the CPU (the crimson project): unless
> we can relax IO ordering or reduce the journaling effects, the overall
> CPU-cycle improvements may not be that impressive.
>
> So I hope Ceph can take a step forward and be less conservative,
> especially now that we have hardware which breaks all the possible
> rules.
>
> > The second big point is that if you want to have a happy path and a
> > fallback ordered path, you’ll need to map out in a lot more detail how
> > those interact and how the clients and OSDs switch between them. Ideas
> > like this have come up before but almost every one (or literally every
> > one?) has had a fatal flaw that prevented it actually being safe.
>
> Here I rely on the fact that replicas know the PG state (as they do
> right now).  If the PG is active and clean, the replica accepts the IO.
> If not, the IO is rejected with the proper error: "dear client, go to
> the primary, I'm not in a condition to serve your request, but the
> primary can".
>
> Several scenarios are possible here. Say the client is the first one to
> observe a replica in an unhealthy state.  We can expect that all the
> other replicas will observe the same unhealthy state soon, but the
> client can propagate this information to the other replicas in the PG
> (this needs to be discussed in detail).

"Not a healthy state" isn't really meaningful in Ceph — we make
decisions based on the settings of the newest OSDMap we have access
to, but peer OSDs might or might not match that state. When the
primary processes an OSDMap marking one of the peers down and he sets
it to degraded, there's a window where the peers haven't seen that
update. The clients will probably take even longer. And merely not
getting an op reply as fast as the client wants isn't indicative of
anything that RADOS cares about. Those states and the transitions
between them and the recovery logic for ops-in-flight are all very
hard to get right, and having the solutions mapped out in detail is a
requirement for merging any kind of change in RADOS.
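
To illustrate the window I mean (all the types below are invented; the real
OSDMap and peering machinery is far more involved):

#include <cstdint>
#include <initializer_list>
#include <iostream>

struct OSDMapView {
  uint64_t epoch = 1;
  bool peer_marked_down = false;      // what this epoch says about some peer OSD
};

struct Party {
  const char* name;
  OSDMapView map;                     // primary, replica and client each cache their own
  bool thinks_pg_degraded() const { return map.peer_marked_down; }
};

int main() {
  Party primary{"primary"}, replica{"replica"}, client{"client"};

  // The monitors publish epoch 2 marking a peer down; only the primary has
  // processed it so far.  The replica and the client still act on epoch 1.
  primary.map = {2, true};

  for (const Party* p : {&primary, &replica, &client}) {
    std::cout << p->name << " (epoch " << p->map.epoch << "): degraded="
              << p->thinks_pg_degraded() << "\n";
  }
  // In this window the three parties legitimately disagree about PG health,
  // which is why "the client observed an unhealthy replica" is not a stable fact.
  return 0;
}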

<snip>

> *Hybrid* client-side replication :) The client is responsible for
> fanning out write requests only in the case of a healthy PG.
>
> > It is frequently undesirable
> > since OSDs tend to have lower latency and more bandwidth to their
> > peers than the clients do to the OSDs;
>
> Latency is the answer.  I want to squeeze everything from RDMA.  For
> current Ceph, RDMA is dead.  Basically, with the current implementation
> any per-client improvements on the transport side bring nothing. (I
> spent some time poking at the protocol v1 and got a good speedup on the
> transport side, which went unnoticed in the overall per-client IO
> performance. sigh)

Can you explain why client-side replication over RDMA is a better idea
than over Ethernet IP? Like I said with the math, I think in most cases
it is actually slower, and it DEFINITELY makes all the other kinds of
changes you want to make harder. I think you will be a lot happier if
you drop that.
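
To sketch the shape of that argument again -- the link speeds and payload
size here are completely made up, the point is only where the extra cost
shows up:

#include <iostream>

int main() {
  // All numbers below are assumptions picked for illustration; only the
  // shape of the two formulas is the point.
  const double write_mb     = 1.0;    // payload size in MB
  const double client_gbps  = 10.0;   // client <-> OSD link
  const double cluster_gbps = 25.0;   // OSD <-> OSD link
  const int    replicas     = 3;

  // milliseconds to push `mb` megabytes over a `gbps` link
  auto xfer_ms = [](double mb, double gbps) { return mb * 8.0 / gbps; };

  // Primary-copy: one copy over the client link, then the primary fans the
  // data out to the other replicas in parallel over the faster cluster link.
  double primary_copy = xfer_ms(write_mb, client_gbps)
                      + xfer_ms(write_mb, cluster_gbps);

  // Client-side fan-out: the client NIC has to push every copy itself, so
  // all the replicas share (and saturate) the slower client link.
  double client_side = replicas * xfer_ms(write_mb, client_gbps);

  std::cout << "primary-copy: " << primary_copy << " ms of transfer\n"
            << "client-side:  " << client_side  << " ms of transfer\n";
  return 0;
}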

(Also: we are doing a lot of work where read-from-replica will become
desirable for things like rack-local reads, and not being able to do
that would be sad.)
-Greg