Re: Ann Arbor Team's Flexible I/O Proposals (Ceph Next)

Gregory Farnum <gfarnum@xxxxxxxxxx> · Fri, 15 Apr 2016 15:09:57 -0700

On Fri, Apr 15, 2016 at 2:05 PM, Adam C. Emerson <aemerson@xxxxxxxxxx> wrote:
> Ceph Developers,
>
> We've put together a few of the main ideas from our previous work in a
> brief form that we hope people will be able to digest, consider, and
> debate. We'd also like to discuss them with you at Ceph Next this
> Tuesday.
>
> Thank you.
>
>
> ---8<---
>
>
> We have been looking at improvements to Ceph, particularly RADOS,
> while focusing on flexibility (allowing users to do more things)
> and performance. We have come up with a few proposals with these two
> things in mind. Sessions and read-write transactions aim to allow
> clients to batch up multiple operations in a way that is safe and
> correct, while allowing clients to gain the advantages of atomic
> read-write operations without having to lock. Sessions also provide
> a foundation for flow-control which ultimately improves performance
> by preventing an OSD from being ground into uselessness under a
> storm of impossible requests. The CLS proposal is a logical follow-on
> from the read-write proposal, as we attempt to address some problems
> of correctness that exist now and consider how to integrate the
> facility into an asynchronous world.
>
> Flexible Placement, as you would expect from the name, is about
> allowing users more control, as are Flexible Semantics. They both
> have profound performance implications, as tuning placement to better
> match a workload can increase throughput, and relaxed consistency can
> decrease latency. The proposed Interfaces are meant to support both as
> well as work currently being done to allow an asynchronous OSD and to
> hide details like locking and thread pools so that backends can be
> written with different forms of concurrency and load-balancing
> across processors.
>
> Finally, Map Partitioning is not directly related to code paths within
> the OSD itself, but does affect everything that can be done with Ceph.
> People are beginning to run into limits on how large a Ceph cluster can
> grow and how many ways they can be partitioned, and both these problems
> fundamentally derive from the way the OSD map is handled by the monitors.
>
> There are also some notes at the end. They are not critical, but if you
> find yourself asking "What were they thinking?" the notes might help.
>
> # Sessions and Read-Write #
>
> From `ReplicatedPG.cc`.
>
> ```c++
> // Write operations aren't allowed to return a data payload because
> // we can't do so reliably. If the client has to resend the request
> // and it has already been applied, we will return 0 with no
> // payload.  Non-deterministic behavior is no good.  However, it is
> // possible to construct an operation that does a read, does a guard
> // check (e.g., CMPXATTR), and then a write.  Then we either succeed
> // with the write, or return a CMPXATTR and the read value.
> …
> if (ctx->op_t->empty() || result < 0) {
>   …
>   if (ctx->pending_async_reads.empty()) {
>     complete_read_ctx(result, ctx);
>   } else {
>     in_progress_async_reads.push_back(make_pair(op, ctx));
>     ctx->start_async_reads(this);
>   }
>   return;
> }
> …
> // issue replica writes
> ceph_tid_t rep_tid = osd->get_tid();
>
> RepGather *repop = new_repop(ctx, obc, rep_tid);
>
> issue_repop(repop, ctx);
> eval_repop(repop);
> ```
>
> As you can see, if we have any writes (all mutations end up in the
> `op_t` transaction), we just flat out don't do the requested read
> operations. If we don't have any writes, we perform the read
> operations and return.  This is justified in the comment above because
> of the non-deterministic behavior of resent read-write operations.
>
> This is not an unsolved problem and we can bootstrap a solution on our
> existing `Session` infrastructure.
>
> ## An upgraded session ##
>
> Behold, `OSDSession`:
> ```c++
> struct Session : public RefCountedObject {
>   EntityName entity_name;
>   OSDCap caps;
>   int64_t auid;
>   ConnectionRef con;
>   WatchConState wstate;
>   …
> };
> ```
>
> This structure exists once for every connection to the OSD. Where they
> are created depends on who is doing the creation. In the case of
> clients (what we're interested in) it occurs in `ms_handle_authorizeri`
> ```c++
> …
> isvalid = authorize_handler->verify_authorizer(cct, monc->rotating_secrets,
>                                                authorizer_data, authorizer_reply, name, global_id, caps_info, session_key, &auid);
>
> if (isvalid) {
>   Session *s = static_cast<Session *>(con->get_priv());
>   if (!s) {
>     s = new Session(cct);
>     con->set_priv(s->get());
>     s->con = con;
>     dout(10) << " new session " << s << " con=" << s->con << " addr=" << s->con->get_peer_addr() << dendl;
>   }
>
>   s->entity_name = name;
>   if (caps_info.allow_all)
>     s->caps.set_allow_all();
>   s->auid = auid;
>   …
> }
> ```
>
> In order to solve this problem, we propose a new data structure,
> modelled on NFSv4.1
> ```c++
> struct OpSlot {
>   uint64_t seq;
>   int r;
>   MOSDOpReplyRef cached; // Nullable
>   bool completed;
> };
> ```
>
> We do not want to give the OSD an unbounded obligation to hang on to
> old message replies: that way lies madness. So, the additions to
> `Session` we might make are:
>
> ```c++
> struct Session : public RefCountedObject {
>   …
>   uint32_t maxslots; // The maximum number of operations this client
>                      // may have in flight at once;
>   std::vector<OpSlot> slots // The vector of in-progress operations
>   ceph::timespan slots_expire; // How long we wait to hear from a
>                                // client before the OSD is free to
>                                // drop session resources
>   cepu::coarse_mono_time last_contact; // When (by our measure) we
>                                        // last received an operation
>                                        // from the client.
> };
> ```
>
> ## Message Additions ##
>
> The OSD needs to communicate this information to the client. The most
> useful way to do this is with an addition to `MOSDOpReply`.
>
> ```c++
> class MOSDOpReply : public Message {
>   …
>   uint32_t this_slot;
>   uint64_t this_seq;
>   uint32_t max_slot;
>   ceph::timespan timeout;
>   …
> };
> ```
>
> This overlaps with the function of the transaction ID, since the
> slot/sequence/OSD triple uniquely identifies an operation. Unlike the
> transaction ID, this provides consistent semantics and a measure of
> flow control.
>
> To match our reply, the `MOSDOp` would need to be amended.
> ```c++
> class MOSDOp : public Message {
>   …
>   uint32_t this_slot;
>   uint64_t this_seq;
>   bool please_cache;
>   …
> };
> ```
>
> ## Operations ##
>
> ### Connecting ###
>
> A client, upon connecting to an OSD for the first time should send a
> `this_slot` of 0 and a `this_seq` of 0. If it reconnects to an OSD it
> should use the `this_slot` and `this_seq` values from before it lost
> its connection. If an OSD has state for a client and receives a
> `(slot,seq) = (0,0)` then it should feel free to free any saved state
> and start anew.
>
> ### OSD Feedback ###
>
> In every `MOSDOpReply` the OSD should send `this_slot` and `this_seq` to
> the value from the `MOSDOp` to which we're replying.
>
> More usefully, the OSD can inform the client how many operations it is
> allowed to send concurrently with `max_slot`. The client must **not**
> send a slot value higher than `max_slot`. (The OSD should error if it
> does.)
>
> The OSD may increase the number of operations allowed in-flight
> if it has capacity by increasing `max_slot`. If it finds itself
> lacking capacity, it may decrease `max_slot`. If it does, the client
> should respect the new bound. (The OSD should feel free to free the
> rescinded slots as soon as the client sends another `MOSDOp` with a
> slot value equal to one on which the new `max_slot` has been sent.)
>
> If the client sends a `this_seq` lower than the one held for a slot by
> the OSD, the OSD should error. If it is more than one greater than the
> current `this_seq`, the OSD should error.
>
> ### Caching ###
>
> The client is in an excellent position to know whether it **requires**
> the output of a previous operation of mixed reads and writes on
> resend, or whether it merely needs the status on resend. Thus, we let
> the client set `please_cache` to request that the OSD store a
> reference to the sent message in the appropriate `OpSlot`.
>
> The OSD is in an excellent position to know how loaded it is. It can
> calculate a bound on how large a given reply will be before executing
> it. Thus, the OSD can send an error if the client has requested it
> cache something larger than it feels comfortable caching.
>
> Assuming no errors, the behavior, for any slot, is this: If the client
> sends an `MOSDOp` with a `this_seq` one greater than the current value
> of `OpSlot::seq`, that represents a new operation. Increment
> `OpSlot::seq`, clear `OpSlot::completed` and begin the operation. When
> the operation finishes, set `OpSlot::completed`. If `please_cache` has been
> set, store the `MOSDOpReply` in `OpSlot::cached`. Otherwise simply store the
> result code in `OpSlot::r`.
>
> If the client sends an `MOSDOp` with a `this_seq` equal to
> `OpSlot::seq` and `OpSlot::completed` is false, drop the request. (We
> will reply when it completes.) If it has completed, send the stored
> `OpSlot::MOSDOpReply` if there is one, otherwise send just a replay
> with just `OpSlot::r`.
>
> ### Reconnection ###
>
> Currently the `Session` is destroyed on reset and a new one is created
> on authorization. In our proposed system the `Session` will not be
> destroyed on reset, it will be moved to a structure where it can be
> looked up and destroyed after `timeout` since the last message
> received.
>
> On connection, the OSD should first look up a `Session` keyed
> on the entity name and create one if that fails.

So the most common time we really get replay operations is when one of
the OSDs crash or a PG's acting set changes for some other reason.
Which means these "cached" operation results need to be persisted to
disk and then cleaned up, a la the pglog.
I don't see anything in these data structures that explains how we do
that efficiently, which is the biggest problem and the reason we don't
already do reply caching. Am I missing something?

And do you think maybe you could split this up into a thread for each
topic? I'm having trouble digesting it as such a wall of text. :)
-Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html