Re: Attempt to rethink log-based replication in Ceph on fast IO path

On 2020-02-03 13:06, Gregory Farnum wrote:
On Mon, Jan 27, 2020 at 2:58 PM Roman Penyaev <rpenyaev@xxxxxxx> wrote:
rados snapshots are based on clone, right? A clone operation should
follow the sync path (through the single primary), but it can still be
a bit tricky (it requires the object to be identical on all replicas
at the moment the clone is done) and can be implemented by
communication between the replicas, with the primary as the main
coordinator.  Here is what I imagine (some sort of lazy data
migration)

Uh, they are clones, but not in the way you're thinking/hoping. Client
IO *can* (but does not need to) include a SnapSet, which contains data
about the snapshots that logically exist (but may or may not have been
seen by any particular object yet). When the client does a write with
a snapid the object doesn't already contain, the OSD does a clone
locally. And then any subsequent write applies to the new head, not
the old snapid.
So unfortunately neither the client nor the OSD has the slightest idea
whether a particular write operation requires any snapshot work until
it arrives at the OSD.

Yes, what I missed here is the fact that there is no special command
for doing a local clone; the IO with a new snapid is itself the
initiator of the clone.  But even this does not change the main
logic: since each replica receives the same IO with the new snap id,
which should spawn a local clone, it should not be difficult to keep
the cloned object in sync on all replicas and only then reply to the
client with IO completion.
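
To make that concrete, here is a toy in-memory sketch of what each
replica could do, independently and deterministically, when a write
arrives carrying a snapid the object has not seen yet.  All names are
simplified stand-ins for illustration, not the real Ceph classes:

from dataclasses import dataclass, field

@dataclass
class Obj:
    head: bytearray = field(default_factory=bytearray)
    clones: dict = field(default_factory=dict)  # snapid -> frozen copy of head
    snap_seq: int = 0                           # newest snapid seen so far

def apply_write(obj: Obj, snap_seq: int, offset: int, data: bytes) -> None:
    """A write with a previously unseen snapid first clones the current
    head locally, then applies the write to head.  Every replica that
    receives the same op performs the same clone, so no extra command
    is needed to keep the clones in sync."""
    if snap_seq > obj.snap_seq:
        obj.clones[snap_seq] = bytes(obj.head)  # preserve the pre-write state
        obj.snap_seq = snap_seq
    end = offset + len(data)
    if len(obj.head) < end:
        obj.head.extend(b'\0' * (end - len(obj.head)))
    obj.head[offset:end] = data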

But this extra logic will be needed only in the case of client-based
replication, so let's set it aside.  I feel this is quite a popular
proposal for Ceph and people tend to argue exactly this particular
case.  I still do not give up on the idea of having the client
responsible for fanning out the write IOs, but what I want to stress
is that it is not that important.  It is just an option I dream of
having in my perfect storage.  What is important in what I proposed
is the relaxation of strict ordering.


<snip>


Everything which is not a plain read/write is treated as a management
or metadata request.

> (Among other things, any write op has the
> potential to change the object’s size and that needs to be ordered
> with truncate operations.)

Why do these two operations have to be ordered?  No, let me ask it
another way: why should a distributed storage care about the order of
these two operations?  That is what I do not understand.  Why can't
the client be responsible for waiting for the IO to complete and only
then issuing the truncate?  (Direct analogy: you issue an IO to your
file and do not wait for completion, then you truncate the file; what
is the result?)
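
That is, nothing stops the client from enforcing the order itself.  A
minimal sketch with the Python librados bindings (the pool and object
names are placeholders, error handling omitted):

import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')   # placeholder pool name

comp = ioctx.aio_write('myobj', b'payload', offset=0)
comp.wait_for_complete()     # the client itself orders the two ops:
ioctx.trunc('myobj', 0)      # truncate only after the write is done

ioctx.close()
cluster.shutdown()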

But if we are talking about the concurrent-clients case, where one
client issues a write while another one issues a truncate, then I do
not understand how the sync log helps, because the primary replica
can receive these two requests in any order (we assume no proper
locking is used, right?)

If a client dispatches multiple operations on a single object, we
guarantee they are ordered in the same order they were dispatched by
the client. So he can do async truncate, async write, async read,
async append, whatever from a single thread and we promise to process
them in that order.

Yes, exactly.  RADOS guarantees that a single thread can throw out
many async requests and that these requests are applied on each
replica in exactly the order in which the thread sent them.  This is
what imposes a bunch of restrictions on the major layers: PGLog and
ObjectStore.

So my question is: what exactly are the cases where rbd or cephfs
need to throw out a bunch of overlapping requests?  (By overlapping I
mean pairs like: a read and a write to the same offset, a create and
a delete of the same object, a truncate and a write, etc.)
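
Concretely, by overlapping I mean something like this, where two ops
touching the same bytes are in flight at once and only RADOS's
per-object ordering makes the result deterministic (Python librados
again, placeholder names):

import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')   # placeholder pool name

# Two overlapping async ops on the same object, dispatched back to back
# from one thread without waiting in between.  Today RADOS promises they
# are applied in dispatch order, so the read must observe b'AAAA'.
c1 = ioctx.aio_write('myobj', b'AAAA', offset=0)
c2 = ioctx.aio_read('myobj', 4, 0, lambda comp, data: print(data))
c1.wait_for_complete()
c2.wait_for_complete()

ioctx.close()
cluster.shutdown()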

Since I don't see the whole complicated picture, I can only speculate
from simple facts: the VFS layer in the kernel separates the metadata
path from the data path, so the filesystem has to be sure the
metadata is durable on disk before any IO arrives, and thus it has to
wait for the completion of metadata updates; the block layer, in its
turn, does not care about ordering, so request ordering and flushes
are the responsibility of the filesystem on top.  So RBD (since it is
a block device) should not care about ordering either.  Of course,
the metadata path is different (rbd layering, snapshots, etc.), but
again, why can't RBD wait for the completion of its metadata requests?

And the crucial thing I want to understand is: would it be possible
to change the clients (cephfs and rbd) so that ordering is relaxed
and they wait explicitly for metadata requests?

A negative answer to this question makes the whole proposal and the
discussion on this topic pointless, because client-based replication
(on which we have concentrated so far) is just on my wish list and
does not influence (or at least not so much) the whole picture.

Some of that could probably be maintained in the client library rather
than on the OSDs, but not all of it given the timestamp-based retries
you describe and the problem of snapshots I mentioned above.

Not all of them; that's why those which cannot are routed through the
primary.  But I got your point.


Basically what I'm getting at is that given the way the RADOS protocol
works, we really don't have any ops which are plain read/write.
Especially including some of the ones you care about most -- for
instance, RBD volumes without an object map include an IO hint on
every write, to make sure that any newly-created objects get set to
the proper 4MB size. These IO hints mean they're all compound
operations, not single-operation writes!

It seems RBD already has two paths: with and without an object map,
right?  Then why can't a new path with a raw CEPH_OSD_OP_WRITE be a
modification which lets the IO go straight to the replica and then
straight to the disk?
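
Something like the following dispatch rule on the client side is what
I have in mind; the op names mirror the wire-protocol constants,
everything else is invented for illustration:

from enum import Enum, auto

class OsdOp(Enum):       # toy stand-ins for the wire-protocol op codes
    WRITE = auto()       # CEPH_OSD_OP_WRITE
    WRITEFULL = auto()   # CEPH_OSD_OP_WRITEFULL
    SETALLOCHINT = auto()
    TRUNCATE = auto()

FAST_PATH = {OsdOp.WRITE, OsdOp.WRITEFULL}

def choose_path(ops: list) -> str:
    """'fast': a single raw write, fanned out to the replicas directly.
    'primary': anything compound or non-write keeps today's ordered path."""
    if len(ops) == 1 and ops[0] in FAST_PATH:
        return 'fast'
    return 'primary'

# An RBD write without an object map ships an alloc hint plus the write
# in one transaction, so it stays on the primary path:
assert choose_path([OsdOp.SETALLOCHINT, OsdOp.WRITE]) == 'primary'
assert choose_path([OsdOp.WRITE]) == 'fast'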

Of course, all these issues are not the result of changing our
durability guarantees, but of trying to provide client-side
replication...

Ok, let's remove client-side replication from the discussion for a
while.  The RADOS ordering constraints are what seem important.


> Now, making these changes isn’t necessarily bad if we want to develop
> a faster but less ridiculously-consistent storage system to better
> serve the needs of the interfaces that actually get deployed — I have
> long found it a little weird that RADOS, a strictly-consistent
> transactional object store, is one of the premier providers of virtual
> block-device IO and S3 storage. But if that’s the goal, we should
> embrace being not-RADOS and be willing to explicitly take much larger
> departures from it than just “in the happy path we drop ordering and
> fall back to backfill if there’s a problem”.

The current constraints are blockers for IO performance.  It does not
matter how much we squeeze out of the CPU (the crimson project):
unless we can relax IO ordering or reduce the journaling overhead,
the overall gain from the saved CPU cycles may not be that
impressive.

So I hope Ceph can take a step forward and be less conservative,
especially now that we have hardware which breaks all the old
assumptions.

> The second big point is that if you want to have a happy path and a
> fallback ordered path, you’ll need to map out in a lot more detail how
> those interact and how the clients and OSDs switch between them. Ideas
> like this have come up before but almost every one (or literally every
> one?) has had a fatal flaw that prevented it actually being safe.

Here I rely on the fact that replicas know the PG state (as they do
right now).  If the PG is active+clean, the replica accepts the IO.
If not, the IO is rejected with a proper error: "dear client, go to
the primary; I'm not in a condition to serve your request, but the
primary can".
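
A minimal sketch of that acceptance rule (all the names here are
invented for illustration):

from enum import Enum, auto

class PGState(Enum):     # simplified; not the real PG state machine
    ACTIVE_CLEAN = auto()
    DEGRADED = auto()
    PEERING = auto()

REDIRECT_TO_PRIMARY = 'redirect-to-primary'   # hypothetical error code

def replica_accepts_direct_write(state: PGState):
    """A replica serves a client-direct write only while its PG is
    active+clean; otherwise it bounces the client back to the ordered
    path through the primary."""
    if state is PGState.ACTIVE_CLEAN:
        return 'accepted'
    return REDIRECT_TO_PRIMARY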

Several scenarios are possible here.  Say the client is the first one
to observe a replica in an unhealthy state.  We can expect that all
other replicas will observe the same unhealthy state sooner or later,
but the client can also propagate this information to the other
replicas in the PG (this needs to be discussed in detail).

"Not a healthy state" isn't really meaningful in Ceph — we make
decisions based on the settings of the newest OSDMap we have access
to, but peer OSDs might or might not match that state. When the
primary processes an OSDMap marking one of the peers down and he sets
it to degraded, there's a window where the peers haven't seen that
update. The clients will probably take even longer.

Since clients are the request initiators, they will immediately
observe the difference in osdmap versions, simply because one of the
replicas replies with an osdmap version newer than the one the client
knows about.  In this respect there is no difference between a client
acting as the primary in client-driven replication and a primary in
the primary-copy model.
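
A minimal sketch of that trigger, assuming (as is the case today) that
every OSD reply carries the sender's osdmap epoch; the names here are
invented:

from dataclasses import dataclass

@dataclass
class Client:
    osdmap_epoch: int
    wanted_epoch: int = 0

def on_reply(client: Client, reply_epoch: int) -> None:
    """If any replica answers with a newer osdmap epoch, the client
    fetches the newer map and recomputes placement before resending
    the affected requests; the trigger is the same whether the client
    fans out the writes itself or talks only to the primary."""
    if reply_epoch > client.osdmap_epoch:
        client.wanted_epoch = max(client.wanted_epoch, reply_epoch)

c = Client(osdmap_epoch=10)
on_reply(c, 12)
assert c.wanted_epoch == 12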

And merely not
getting an op reply as fast as the client wants isn't indicative of
anything that RADOS cares about. Those states and the transitions
between them and the recovery logic for ops-in-flight are all very
hard to get right, and having the solutions mapped out in detail is a
requirement for merging any kind of change in RADOS.

In the client-driven case I am talking about raw CEPH_OSD_OP_WRITE*
requests only (everything else follows the primary path).  So
recovery for this particular case (the OP_WRITE* requests) is
block-based and does not imply any log-based replication.

And yes, this breaks the RADOS guarantees: WRITE requests are not
ordered with respect to other requests.


<snip>

*Hybrid* client-side replication :)  The client is responsible for
fanning out write requests only when the PG is healthy.

> It is frequently undesirable
> since OSDs tend to have lower latency and more bandwidth to their
> peers than the clients do to the OSDs;

Latency is the answer.  I want to squeeze everything out of RDMA.
For current Ceph, RDMA is dead: with the current implementation, any
per-client improvement on the transport side brings nothing.  (I
spent some time poking at protocol v1 and got a good speedup on the
transport side, which went unnoticed in the overall per-client IO
performance.  Sigh.)

Can you explain why client-side replication over RDMA is a better idea
than over ethernet IP?

It seems I introduced some confusion here.  I do not want to mix the
RDMA and client-side replication cases, so let me rephrase my
previous answer: if we have client-driven replication and no ordering
on IO, then it would be great to have an RDMA transport, because then
we would have microsecond latency from the client to all replicas.

If we do not have client-driven replication, but read/write IO
ordering is relaxed and writes can go directly to disk, bypassing
journaling on the objectstore side, then it would also be great to
have an RDMA transport, but with one extra network hop; latency is
increased, but OK, I can deal with that.

But if we have log-based replication for IO (what we have right now),
strict ordering for the whole PG, and full journaling of all the
data, then it does not matter which transport we use: the transport
is not the bottleneck, so of course in this case a TCP (or UDP)
socket copes with its task perfectly well.  No doubt.

So again: I just do not want the extra hop (hence client-driven
replication), and for me latency matters (hence RDMA).

Like I said with math, I think in most cases it
is actually slower,

My equations are much simpler :) Having one extra hop on the fast IO
path doubles network latency.
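
For example, assuming a symmetric one-way hop latency and ignoring OSD
service time (the numbers are purely illustrative):

# Toy back-of-the-envelope for the "extra hop doubles latency" claim.
HOP = 5.0   # e.g. ~5 us one-way on an RDMA fabric (illustrative number)

# primary-copy: client -> primary -> replicas, acks travel back the same way
primary_copy = 2 * HOP + 2 * HOP    # 20 us round trip

# client-driven: client -> each replica in parallel, replicas ack directly
client_driven = HOP + HOP           # 10 us round trip

print(primary_copy / client_driven)  # -> 2.0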

and it DEFINITELY makes all the other kinds of changes you want to
make harder. I think you will be a lot happier if you drop that.

Let's stop discussing it for a while.  The RADOS API ordering
constraints are of much greater importance to me.  So if it is
possible to modify the clients to teach them to wait for completions
properly and never let overlapping IO happen (not-RADOS anymore),
then it would be great to continue the discussion on that topic.  If
this is an impossible task, well, then cephfs/rbd are not good
candidates, and probably other cluster filesystems are better suited
to a more relaxed API.

(Also: we are doing a lot of work where read-from-replica will become
desirable for things like rack-local reads and not being able to do
that would be sad.)

What exactly are these rack-local reads?  Could you please elaborate
or give me a link?

--
Roman
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx



