Re: Ann Arbor Team's Flexible I/O Proposals (Ceph Next)

On Fri, Apr 15, 2016 at 5:05 PM, Adam C. Emerson <aemerson@xxxxxxxxxx> wrote:
> Ceph Developers,
>
> We've put together a few of the main ideas from our previous work in a
> brief form that we hope people will be able to digest, consider, and
> debate. We'd also like to discuss them with you at Ceph Next this
> Tuesday.
>
> Thank you.
>
>
> ---8<---
>
>
> We have been looking at improvements to Ceph, particularly RADOS,
> while focusing on flexibility (allowing users to do more things)
> and performance. We have come up with a few proposals with these two
> things in mind. Sessions and read-write transactions aim to allow
> clients to batch up multiple operations in a way that is safe and
> correct, while allowing clients to gain the advantages of atomic
> read-write operations without having to lock. Sessions also provide
> a foundation for flow control, which ultimately improves performance
> by preventing an OSD from being ground into uselessness under a
> storm of impossible requests. The CLS proposal is a logical follow-on
> from the read-write proposal, as we attempt to address some problems
> of correctness that exist now and consider how to integrate the
> facility into an asynchronous world.
>
> Flexible Placement, as you would expect from the name, is about
> allowing users more control, as are Flexible Semantics. They both
> have profound performance implications, as tuning placement to better
> match a workload can increase throughput, and relaxed consistency can
> decrease latency. The proposed Interfaces are meant to support both as
> well as work currently being done to allow an asynchronous OSD and to
> hide details like locking and thread pools so that backends can be
> written with different forms of concurrency and load-balancing
> across processors.
>
> Finally, Map Partitioning is not directly related to code paths within
> the OSD itself, but does affect everything that can be done with Ceph.
> People are beginning to run into limits on how large a Ceph cluster can
> grow and how many ways it can be partitioned, and both these problems
> fundamentally derive from the way the OSD map is handled by the monitors.
>
> There are also some notes at the end. They are not critical, but if you
> find yourself asking "What were they thinking?" the notes might help.
>
> # Sessions and Read-Write #
>
> From `ReplicatedPG.cc`.
>
> ```c++
> // Write operations aren't allowed to return a data payload because
> // we can't do so reliably. If the client has to resend the request
> // and it has already been applied, we will return 0 with no
> // payload.  Non-deterministic behavior is no good.  However, it is
> // possible to construct an operation that does a read, does a guard
> // check (e.g., CMPXATTR), and then a write.  Then we either succeed
> // with the write, or return a CMPXATTR and the read value.
>
> if (ctx->op_t->empty() || result < 0) {
>
>   if (ctx->pending_async_reads.empty()) {
>     complete_read_ctx(result, ctx);
>   } else {
>     in_progress_async_reads.push_back(make_pair(op, ctx));
>     ctx->start_async_reads(this);
>   }
>   return;
> }
>
> // issue replica writes
> ceph_tid_t rep_tid = osd->get_tid();
>
> RepGather *repop = new_repop(ctx, obc, rep_tid);
>
> issue_repop(repop, ctx);
> eval_repop(repop);
> ```
>
> As you can see, if we have any writes (all mutations end up in the
> `op_t` transaction), we just flat out don't do the requested read
> operations. If we don't have any writes, we perform the read
> operations and return.  This is justified in the comment above because
> of the non-deterministic behavior of resent read-write operations.
>
> This is not an unsolved problem and we can bootstrap a solution on our
> existing `Session` infrastructure.
>
> ## An upgraded session ##
>
> Behold, `OSDSession`:
> ```c++
> struct Session : public RefCountedObject {
>   EntityName entity_name;
>   OSDCap caps;
>   int64_t auid;
>   ConnectionRef con;
>   WatchConState wstate;
>
> };
> ```
>
> This structure exists once for every connection to the OSD. Where it
> is created depends on who is doing the creation. In the case of
> clients (what we're interested in), it happens in `ms_verify_authorizer`:
> ```c++
>
> isvalid = authorize_handler->verify_authorizer(cct, monc->rotating_secrets,
>                                                authorizer_data, authorizer_reply, name, global_id, caps_info, session_key, &auid);
>
> if (isvalid) {
>   Session *s = static_cast<Session *>(con->get_priv());
>   if (!s) {
>     s = new Session(cct);
>     con->set_priv(s->get());
>     s->con = con;
>     dout(10) << " new session " << s << " con=" << s->con << " addr=" << s->con->get_peer_addr() << dendl;
>   }
>
>   s->entity_name = name;
>   if (caps_info.allow_all)
>     s->caps.set_allow_all();
>   s->auid = auid;
>
> }
> ```
>
> In order to solve this problem, we propose a new data structure,
> modelled on NFSv4.1:
> ```c++
> struct OpSlot {
>   uint64_t seq;
>   int r;
>   MOSDOpReplyRef cached; // Nullable
>   bool completed;
> };
> ```
>
> We do not want to give the OSD an unbounded obligation to hang on to
> old message replies: that way lies madness. So, the additions to
> `Session` we might make are:
>
> ```c++
> struct Session : public RefCountedObject {
>
>   uint32_t maxslots; // The maximum number of operations this client
>                      // may have in flight at once;
>   std::vector<OpSlot> slots; // The vector of in-progress operations
>   ceph::timespan slots_expire; // How long we wait to hear from a
>                                // client before the OSD is free to
>                                // drop session resources
>   ceph::coarse_mono_time last_contact; // When (by our measure) we
>                                        // last received an operation
>                                        // from the client.
> };
> ```
>
> ## Message Additions ##
>
> The OSD needs to communicate this information to the client. The most
> useful way to do this is with an addition to `MOSDOpReply`.
>
> ```c++
> class MOSDOpReply : public Message {
>
>   uint32_t this_slot;
>   uint64_t this_seq;
>   uint32_t max_slot;
>   ceph::timespan timeout;
>
> };
> ```
>
> This overlaps with the function of the transaction ID, since the
> slot/sequence/OSD triple uniquely identifies an operation. Unlike the
> transaction ID, this provides consistent semantics and a measure of
> flow control.
>
> To match our reply, the `MOSDOp` would need to be amended.
> ```c++
> class MOSDOp : public Message {
>
>   uint32_t this_slot;
>   uint64_t this_seq;
>   bool please_cache;
>
> };
> ```
>
> ## Operations ##
>
> ### Connecting ###
>
> A client, upon connecting to an OSD for the first time, should send a
> `this_slot` of 0 and a `this_seq` of 0. If it reconnects to an OSD it
> should use the `this_slot` and `this_seq` values from before it lost
> its connection. If an OSD has state for a client and receives a
> `(slot,seq) = (0,0)` then it should feel free to free any saved state
> and start anew.
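>
> A minimal sketch of the client-side rule this implies; the names here
> are ours for illustration, nothing like this exists in the Objecter yet:
>
> ```c++
> #include <cstdint>
> #include <utility>
>
> // Hypothetical per-OSD state kept by the client.
> struct ClientSlotState {
>   uint32_t slot = 0;
>   uint64_t seq = 0;
>   bool ever_connected = false;
> };
>
> // Values to put in the next MOSDOp's this_slot/this_seq fields.
> std::pair<uint32_t, uint64_t> connect_ids(const ClientSlotState& s) {
>   if (!s.ever_connected)
>     return {0, 0};          // first contact: the OSD may discard stale state
>   return {s.slot, s.seq};   // reconnect: resume where we left off
> }
> ```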
>
> ### OSD Feedback ###
>
> In every `MOSDOpReply` the OSD should set `this_slot` and `this_seq` to
> the values from the `MOSDOp` to which it is replying.
>
> More usefully, the OSD can inform the client how many operations it is
> allowed to send concurrently with `max_slot`. The client must **not**
> send a slot value higher than `max_slot`. (The OSD should error if it
> does.)
>
> The OSD may increase the number of operations allowed in-flight
> if it has capacity by increasing `max_slot`. If it finds itself
> lacking capacity, it may decrease `max_slot`. If it does, the client
> should respect the new bound. (The OSD should feel free to free the
> rescinded slots as soon as the client sends another `MOSDOp` with a
> slot value equal to one on which the new `max_slot` has been sent.)
>
> If the client sends a `this_seq` lower than the one held for a slot by
> the OSD, the OSD should error. If it is more than one greater than the
> current `this_seq`, the OSD should error.
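>
> Putting these rules together, a sketch of the check the OSD might run
> on each incoming `MOSDOp`; the helper and enum are ours, not existing
> code:
>
> ```c++
> #include <cstdint>
>
> enum class SlotCheck { NewOp, Resend, Error };
>
> // seq_held is OpSlot::seq for the slot named in the request.
> SlotCheck check_slot(uint32_t max_slot, uint32_t this_slot,
>                      uint64_t seq_held, uint64_t this_seq)
> {
>   if (this_slot > max_slot)     return SlotCheck::Error;  // over the allowance
>   if (this_seq == seq_held + 1) return SlotCheck::NewOp;  // the next operation
>   if (this_seq == seq_held)     return SlotCheck::Resend; // a retransmission
>   return SlotCheck::Error;      // behind, or more than one ahead
> }
> ```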
>
> ### Caching ###
>
> The client is in an excellent position to know whether it **requires**
> the output of a previous operation of mixed reads and writes on
> resend, or whether it merely needs the status on resend. Thus, we let
> the client set `please_cache` to request that the OSD store a
> reference to the sent message in the appropriate `OpSlot`.
>
> The OSD is in an excellent position to know how loaded it is. It can
> calculate a bound on how large a given reply will be before executing
> it. Thus, the OSD can send an error if the client has requested it
> cache something larger than it feels comfortable caching.
>
> Assuming no errors, the behavior, for any slot, is this: If the client
> sends an `MOSDOp` with a `this_seq` one greater than the current value
> of `OpSlot::seq`, that represents a new operation. Increment
> `OpSlot::seq`, clear `OpSlot::completed` and begin the operation. When
> the operation finishes, set `OpSlot::completed`. If `please_cache` has been
> set, store the `MOSDOpReply` in `OpSlot::cached`. Otherwise simply store the
> result code in `OpSlot::r`.
>
> If the client sends an `MOSDOp` with a `this_seq` equal to
> `OpSlot::seq` and `OpSlot::completed` is false, drop the request. (We
> will reply when it completes.) If it has completed, send the
> `MOSDOpReply` stored in `OpSlot::cached` if there is one; otherwise send
> a reply carrying just `OpSlot::r`.
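>
> A sketch of that per-slot behavior, written against the `OpSlot` fields
> proposed above; `start_op`, `resend_reply`, and `send_result_only` are
> hypothetical helpers standing in for the real dispatch and messaging
> paths:
>
> ```c++
> // Assumes the OpSlot proposed above; all helpers are placeholders.
> void handle_slot(OpSlot& slot, uint64_t this_seq, bool please_cache,
>                  OpRequestRef op)
> {
>   if (this_seq == slot.seq + 1) {      // a new operation
>     slot.seq = this_seq;
>     slot.completed = false;
>     slot.cached = nullptr;
>     start_op(std::move(op), please_cache); // on completion: set completed and
>                                            // store either the reply or just r
>   } else if (this_seq == slot.seq) {   // a resend of the current operation
>     if (!slot.completed)
>       return;                          // still running; we'll reply when done
>     if (slot.cached)
>       resend_reply(slot.cached);       // full cached MOSDOpReply
>     else
>       send_result_only(slot.r);        // just the stored status code
>   }
> }
> ```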
>
> ### Reconnection ###
>
> Currently the `Session` is destroyed on reset and a new one is created
> on authorization. In our proposed system the `Session` will not be
> destroyed on reset; instead, it will be moved to a structure where it can be
> looked up and destroyed after `timeout` since the last message
> received.
>
> On connection, the OSD should first look up a `Session` keyed
> on the entity name and create one if that fails.
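>
> A minimal sketch of the reclaim structure this implies; the map and the
> lookup helper are ours, not existing OSD code:
>
> ```c++
> // Sessions detached from their connection, keyed by entity name and
> // kept until the timeout has passed since last_contact.
> std::map<EntityName, SessionRef> detached_sessions;
>
> SessionRef session_for_entity(const EntityName& name, CephContext* cct) {
>   auto it = detached_sessions.find(name);
>   if (it != detached_sessions.end()) {
>     SessionRef s = it->second;         // reattach the saved slots
>     detached_sessions.erase(it);
>     return s;
>   }
>   return SessionRef(new Session(cct)); // nothing saved: start fresh
> }
> // A periodic sweep would erase any detached Session whose last_contact
> // is older than the advertised timeout.
> ```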
>
> # Read as a part of Transaction #
>
> We don't have code examples here since most of the interface
> changes are obvious. Codes and parameters would be added to
> `PGBackend::Transaction` and executing a transaction would have to
> return data.
>
> ## Motivation ##
>
> -   Mixed reads and writes are an efficiency win, since a client can
>     save round trips by batching up operations in a single request.
>     Current Ceph does not allow them for reasons which are quoted and
>     addressed in the preceding section.
> -   Mixed reads and writes are a semantic win. If an `MOSDOp` is
>     atomic (it is in current Ceph), read-after-write can often remove
>     the need for explicit locking.
> -   Transactional reads may seem complicated, but the Erasure Coding
>     backend already has to execute complex read transactions to
>     reassemble or recover data. We want an asynchronous read capability
>     in the Store anyway and there's no reason not to have it be shared
>     with our asynchronous write path.
> -   While it might seem that separating reads and writes, as we do
>     now, allows us to simplify code and rule out edge cases, we would
>     like to point out the existence of CLS, which can have problems if
>     two method calls occur in the same `MOSDOp`.
>
> ## Sketch ##
>
> The main problem with mixed read-write transactions is that replicas
> need to write but not read. The key to handling this is dependency
> checking. Outside CLS (which will be discussed below) it is very easy
> to see whether reads and writes are independent. (Simply go down the
> ops and see if their ranges overlap and whether getattrs and setattrs
> have keys in common.) Reads coming after overlapping writes depend on
> the previous writes. Then:
> -   If an op is all reads, simply do all the reads. We don't have
>     to get write locks or anything.
> -   If an op is all writes, it's no different than a replicated
>     operation now.
> -   For mixed reads and writes, if the reads aren't dependent on the
>     writes, dispatch the writes and do the reads before, after, or
>     concurrently with the writes on the primary.  (So long as we
>     prevent writes from other transactions from intervening.)
> -   Dependent reads are the difficult case. For erasure coding it
>     shouldn't make any difference since we'd have to dispatch reads and
>     writes to all stripes anyway. For replication, we would want to
>     execute the mixed read-write transaction on the local store in
>     strict order and dispatch one consisting of only writes to the
>     remotes.
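>
> A self-contained sketch of the overlap test described above, over
> deliberately simplified op descriptors (real `OSDOp`s carry much more
> state):
>
> ```c++
> #include <algorithm>
> #include <cstdint>
> #include <set>
> #include <string>
> #include <vector>
>
> struct SimpleOp {
>   bool is_write;
>   uint64_t off, len;            // byte range touched by data ops
>   std::set<std::string> attrs;  // attribute keys touched by xattr ops
> };
>
> // A read depends on earlier writes if its byte range or attribute keys
> // intersect anything already written. (Coarse: we track the union of
> // write extents rather than each extent separately.)
> bool reads_depend_on_writes(const std::vector<SimpleOp>& ops) {
>   uint64_t wbegin = UINT64_MAX, wend = 0;
>   std::set<std::string> wattrs;
>   for (const auto& op : ops) {
>     if (op.is_write) {
>       wbegin = std::min(wbegin, op.off);
>       wend = std::max(wend, op.off + op.len);
>       wattrs.insert(op.attrs.begin(), op.attrs.end());
>     } else {
>       if (op.off < wend && op.off + op.len > wbegin)
>         return true;
>       for (const auto& a : op.attrs)
>         if (wattrs.count(a))
>           return true;
>     }
>   }
>   return false;
> }
> ```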
>
> # CLS #
>
> ## Current Problem ##
>
> The CLS API works by making an ops vector and handing it to
> `do_osd_ops`.
>
> ```c++
> int cls_cxx_getxattr(cls_method_context_t hctx, const char *name,
>                      bufferlist *outbl)
> {
>   ReplicatedPG::OpContext **pctx = (ReplicatedPG::OpContext **)hctx;
>   bufferlist name_data;
>   vector<OSDOp> nops(1);
>   OSDOp& op = nops[0];
>   int r;
>
>   op.op.op = CEPH_OSD_OP_GETXATTR;
>   op.indata.append(name);
>   op.op.xattr.name_len = strlen(name);
>   r = (*pctx)->pg->do_osd_ops(*pctx, nops);
>   if (r < 0)
>     return r;
>
>   outbl->claim(op.outdata);
>   return outbl->length();
> }
>
> int cls_cxx_setxattr(cls_method_context_t hctx, const char *name,
>                      bufferlist *inbl)
> {
>   ReplicatedPG::OpContext **pctx = (ReplicatedPG::OpContext **)hctx;
>   bufferlist name_data;
>   vector<OSDOp> nops(1);
>   OSDOp& op = nops[0];
>   int r;
>
>   op.op.op = CEPH_OSD_OP_SETXATTR;
>   op.indata.append(name);
>   op.indata.append(*inbl);
>   op.op.xattr.name_len = strlen(name);
>   op.op.xattr.value_len = inbl->length();
>   r = (*pctx)->pg->do_osd_ops(*pctx, nops);
>
>   return r;
> }
> ```
>
> The `do_osd_ops` function performs reads inline, synchronously, right
> then and there for replicated pools. (Erasure coded pools are more
> limited.) Writes are batched up and added to the transaction
> associated with the current `OpContext`.
>
> This is bad. If one has a CLS method that performs a read-modify-write
> and one calls it twice in the same `MOSDOp`, it becomes a
> read-modify-read-modify-write-write which may produce incorrect
> results.
>
> ## Desiderata ##
>
> -   CLS operations should be composable. We should be able to have many
>     of them in a single operation.
> -   They should remain transactional. If a CLS operation does some
>     reads and hits an error, it stops and nothing is written to the
>     store. We should not allow situations where a CLS method can write
>     a partial result to the store then error.
> -   They should be capable. We should not put too many restrictions on
>     what an operation is allowed to do. It should be possible to run
>     them on Erasure Coded Pools once ECOverwrite is in place. (At
>     least some subset of them).
> -   They should be consistent. A CLS operation should be able to call
>     rand or generate a UUID without each replica holding a different
>     value. (This rules out solutions like calling the method on each
>     replica.)
> -   They should be efficient and optimizable.
> -   They should work in an asynchronous framework.
>
> There are several ways we could change their implementation to address
> these.
>
> ## Futures ##
>
> This is an attractive way to think about CLS. It allows things to
> proceed asynchronously and would solve the RMRMWW problem. One would
> simply make every I/O operation in the CLS API a call returning a
> future and write each method in continuation passing style. Executing
> the transaction in the primary OSD (on a replicated pool) would create
> a write-only transaction that could then be sent to replicas. (Having the
> execution of a CLS method also compile a write-only transaction is a
> property of any composable design.)
> -   Tracking dependencies before the operation is executed would be
>     problematic. There would be no way to know whether later reads
>     overlapped with previous writes before doing them. This could lead
>     to an unbounded obligation on the OSD to maintain state to
>     evaluate an `OSDOp`, including potentially large writes, before
>     actually committing in order for CLS methods to remain
>     transactional.
>
> Futures are, on their own, insufficient to provide everything we need
> from CLS, largely because they are opaque to the OSD. They could be
> combined with…
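>
> For concreteness, a sketch of what a futures-based CLS call might look
> like; the `future` type, the `_async` calls, and the encode/decode
> helpers are all hypothetical:
>
> ```c++
> // Assume a future<T> (whichever implementation is chosen) whose then()
> // schedules a continuation and flattens future<future<U>> to future<U>;
> // none of these names exist today.
> future<bufferlist> cls_cxx_getxattr_async(cls_method_context_t hctx,
>                                           const char* name);
> future<int> cls_cxx_setxattr_async(cls_method_context_t hctx,
>                                    const char* name, bufferlist val);
>
> // A read-modify-write method in continuation-passing style. The read
> // completes before the dependent write is staged, so calling the method
> // twice in one MOSDOp can no longer interleave into
> // read-modify-read-modify-write-write.
> future<int> cls_example_increment(cls_method_context_t hctx) {
>   return cls_cxx_getxattr_async(hctx, "counter")
>     .then([hctx](bufferlist bl) {
>       uint64_t v = decode_u64(bl) + 1;  // decode_u64/encode_u64: hypothetical
>       return cls_cxx_setxattr_async(hctx, "counter", encode_u64(v));
>     });
> }
> ```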
>
> ## Pre-declaration ##
>
> We could remove some of the generality. A simple way of doing so would
> be to have methods declare, as a part of their signature, everything
> that they may ever read or write, with the expectation that methods
> will name the fewest resources required. This doesn't mean that every
> method will always write to and read from everything it mentions,
> merely that we have a known bound of the maximum it will ever use.
>
> This makes analysis easier for the OSD, and in the composition case,
> it could go in two passes. In the first, it would execute CLS calls
> and pre-stage results and in the second it would pass its compiled
> write transaction into the store.
>
> This is the most attractive solution, but it depends on pre-declaration
> placing a reasonable bound on the resources used in pre-staging.
>
> One could make things easier by being even more restrictive and
> imposing ordering:
> 1.  The method declares in advance all read operations it might ever
>     perform.
> 2.  The method declares in advance all write operations it might ever
>     perform.
> 3.  The method examines the parameters passed by the client and
>     indicates which subset of the named inputs and outputs it will
>     use.
> 4.  The method performs its read operations and denotes exactly which
>     output operations it will perform (not the data to be output, but
>     the ranges and names).
> 5.  The method performs write operations.
>
> The most restrictive form of this would operate in two phases. In the
> first, the CLS method would be presented with its parameters and would
> name all of the things it plans to read or write (objects with ranges
> and attribute keys). In the second, it would be called with the contents
> of all the reads it requested and would supply the data for all the
> writes it requested.
>
> This would obviate the need for futures or other asynchronous I/O,
> and make evaluation very easy. This approach would disallow some
> operations, like indirecting through an attribute key to read another,
> but is very appealing.
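>
> One hypothetical shape for that two-phase interface; none of these
> types exist in Ceph today:
>
> ```c++
> #include <cstdint>
> #include <string>
> #include <utility>
> #include <vector>
>
> struct Extent { uint64_t off, len; };
>
> // Phase one: given only its parameters, the method names everything it
> // might touch.
> struct IoDeclaration {
>   std::vector<Extent> reads;
>   std::vector<std::string> read_attrs;
>   std::vector<Extent> writes;
>   std::vector<std::string> write_attrs;
> };
>
> // Phase two: called with the contents of the declared reads, the method
> // returns the data for (a subset of) the declared writes.
> struct WriteSet {
>   std::vector<std::pair<Extent, std::string>> data_writes;      // range -> bytes
>   std::vector<std::pair<std::string, std::string>> attr_writes; // key -> value
> };
> ```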
>
> ## Be Transactional ##
>
> Our transactions are pre-checked and must succeed. If we want the most
> expressive version of CLS consistent with our other goals, then we
> should add commit and rollback. EC Overwrite will already require some
> form of commit and rollback, so it's not beyond the realm of thought.
>
> It could also be a foundation for some future multi-object-transaction
> supporting backend.
>
> This idea might have appeal on its own, but the concerns of CLS are
> not sufficient to motivate it.
>
> ## Domain Specific Language ##
>
> One could make a domain specific language, based on something simple,
> that the OSD can execute to perform CLS methods. The OSD could then
> analyze each method to see what I/O operations it calls and try
> to track them:
> -   Dependency tracking for compilers is a major area of research. It
>     would be a whole lot of fun, but as a short term solution it is
>     not really practical.
> -   We still wouldn't be able to rule out problems in the general
>     case.
>
> This approach would be interesting as a long term academic research
> project, but is not suitable for a short-range improvement.
>
> # Flexible Placement #
>
> This is a large topic which should be discussed on its own, but it
> motivates the interface designs below, so we shall briefly mention why
> it's interesting.
>
> CRUSH/PG is a fine placement system for several workloads, but it has
> two well-known limitations.
>
> ## Motivation ##
>
> -   Data distribution can be much less uniform than one might like,
>     giving uneven use of disks. This has caused some Ceph developers
>     to experiment with Monte Carlo based placement algorithms.
> -   Data distribution can be much more uniform than one would
>     like. This is the fundamental cause of Ceph's slow sequential read
>     performance. More generally, unrelated workloads contend
>     with each other due to a lack of affinity for related data. The effects are
>     especially pronounced on spinning disk (due to seek times), but
>     still exist on Flash (due to bus/network contention.)  This is a
>     tension between competing goods. CRUSH gains wide dispersion and
>     uniformity to defend against correlated failures but this imposes
>     a tradeoff.
>
> ## Goal ##
>
> Ceph should support placement methods other than CRUSH/PG. Currently,
> the OSD dispatches operations based on placement group ID, which will
> need to be varied.
>
> We also need some way to get new types of functions into the cluster.
>
> ## Proposal ##
>
> Our proposal is, in a way, CRUSH taken to its logical
> conclusion. Instead of distributing CRUSH rules, we propose to
> distribute general computable functions from (oid, volume/dataset) pairs to
> sequences of OSDs with their supporting data structures.  One of our
> ongoing research projects has been an in-process executor for these
> functions based on Google's NaCl. The benefits are:
> -   Administrators can fine-tune placement functions to fit their
>     workloads well.
> -   They can also experiment easily without having to recompile all of
>     Ceph and make heavy architectural changes.
> -   Entirely new placement strategies can be deployed without having
>     to upgrade every machine in the cluster. Or any machine in the
>     cluster, once they've been upgraded to a Flexible Placement
>     capable version.
> -   Possibilities for annealing and machine learning to gradually
>     adapt placement in response to load data become available.
> -   NaCl builds on LLVM, which has a rich set of tools for optimizations
>     like partial evaluation.
> -   NaCl is fast.
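>
> To make the idea concrete, here is the kind of signature such a
> distributed function might have; the names are ours, and the real
> interface would be defined by the NaCl executor work:
>
> ```c++
> #include <cstdint>
> #include <string>
> #include <vector>
>
> struct PlacementInput {
>   std::string oid;       // object name
>   uint64_t dataset_id;   // volume/dataset identifier
>   uint64_t map_epoch;    // epoch of the supporting data structures
> };
>
> // A pure, deterministic mapping to an ordered sequence of OSD ids, so
> // every client and OSD computes the same answer from the same inputs.
> using PlacementFn = std::vector<int> (*)(const PlacementInput&);
> ```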
>
> # Flexible Semantics #
>
> Another motivating example. Originally, Ceph did replication and only
> replication under a very specific consistency model. There has been
> desire for more flexibility.
> -   Erasure Coding. It still follows the Ceph consistency model
>     (though leaves out many operations) but is very different in
>     back-end dispatch, enough so that it inspired a major rewrite of
>     the OSD's bottom half.
> -   Append-only immutable objects have been discussed.
> -   Many people have asked for relaxed consistency to improve
>     performance. This may not be suitable for all workloads, but people
>     have repeatedly asked for the ability to set up low-latency,
>     relaxed-consistency volumes that still provide Ceph's ability to
>     easily use new storage and scale well.
> -   Transactional storage. As mentioned above, cross-object
>     transactional semantics are something people have desired.
>
> # Interfaces #
>
> Right now our class hierarchy is a bit of a mess. Eventually we'll do
> something about `PG` and `ReplicatedPG`, refactor, support
> asynchronous I/O, reduce lock contention, support in core affinity,
> and build Jerusalem here in England's green and pleasant land.
>
> While we're stringing up our bows of burning gold, we should support
> non-PG based placement and flexible semantics. Right now, parts of the
> PG and the OSD (since the OSD manages the collection of PGs, spins
> them up, and manages thread pools shared by sets of PGs) are
> intertwined. Thus, we need to abstract out both pieces.
>
> As we also want to support having multiple "logical" OSDs running in a
> single `ceph-osd` process, this would be a natural time to add that
> capability.
>
> Both these are sketches and should be considered a work in progress.
>
> ## `DataSetInterface` ##
>
> Here is a sketch of what a flexible abstraction based on PG could look
> like, at least parts of one. Since we have focused only on the object
> operation path and are not well informed about Scrub, Recovery, or
> Cache Tiering, we won't include those details here.
>
> We also leave out functions called from the PG itself or other objects
> invoked from its own call stack.
>
> ```c++
> class DataSetInterface {
> protected:
>   LogicalOSD& losd; // LogicalOSD is a means to have different
>                     // stores/semantics run in the same process.
>
>   MapPartRef curmap; // Subset of map relevant to this DSI
> public:
>   // The OSD (things Up the Stack, generally) should not call 'lock'
>   // on us. If we have locking of some sort things down the stack that
>   // we have some relationship with (friend or whatever) could lock or
>   // unlock us, but that should not be baked in as part of the interface.
>
>   // Things like the info struct and details about loading the Place
>   // wouldn't actually be here. As there is an intimate relation
>   // between the LogicalOSD and an implementation of DataSetInterface (it
>   // holds all those loaded in memory and controls dispatch), they
>   // would not need to be part of the generic interface.
>
>   const coll_t coll; // The subdivision of the Store we control
>
>   // In the PG case we always know we're the primary or not for
>   // anything within the same pgid. That is not expected to be the
>   // case generally.
>   virtual bool is_primary(const OpRequest&) = 0;
>   // No 'is_replica' since 'replica' may not be applicable
>   // generally. It's a bit off even in the erasure coded case.
>   virtual bool is_acting(const OpRequest&) = 0;
>   virtual bool is_inactive() = 0;
>
>  public:
>   // No identifier. The descendent will take that.
>   DataSetInterface(LogicalOSD& o, OSDMapRef curmap);
>   virtual ~DataSetInterface();
>
>   DataSetInterface(const DataSetInterface&) = delete;
>   DataSetInterface& operator =(const DataSetInterface&) = delete;
>   DataSetInterface(DataSetInterface&&) = delete;
>   DataSetInterface& operator =(DataSetInterface&&) = delete;
>
>   virtual void on_removal(ObjectStore::Transaction *t) = 0;
>
>   // Yes, there's no 'queue' and no 'do_op' or any of
>   // that. This is intentional. There's no dequeue or do_op because
>   // those functions are either called only by the PG currently OR
>   // they're called in OSD functions called by the PG as part of the
>   // thread switch. They should not be part of the public interface.
>
>   // There's no queue because we can either put queue here or we can
>   // put queue in LogicalOSD. (We could do both, but that seems bad to
>   // me.) If there is some combination of locking and checking that
>   // must be done before queueing an operation, it seems that it's
>   // better to do it in LogicalOSD so that it doesn't leak out and
>   // become part of the abstraction for other implementations.
> };
> ```
>
> ## `LogicalOSD` ##
>
> The OSD class itself (representing the single OSD process) should have
> a map (*perhaps* a Boost.Intrusive.Set?) mapping OSD IDs to
> `LogicalOSD` instances.
>
> ```c++
> class LogicalOSD {
>   OSD& osd;
>   ObjectStore& store;
>
>   // Look up the DataSetInterface instance appropriate to the given
>   // OpRequest.
>   virtual future<DataSetInterface,int> get_place_for(const OpRequest&) = 0;
>
>   // Every logical OSD will have its own watchers as well as slot
>   // cache. Someone familiar with flow control should check this
>   // idea. Since LogicalOSDs will, ideally, share messengers we might
>   // want them to share the same slot cache. In that case we should
>   // just re-dimension watchers within Session
>   SessionRef session_for(const entity_name_t& name);
>
>   void queue(DataSetInterfaceRef&& pi, OpRequestRef&& to_queue);
>   void queue_front(DataSetInterfaceRef&& pi, OpRequestRef&& to_queue);
>
>   // Dequeue and the like are currently called in the PG itself and so
>   // have no place in the interface presented to the OSD.
>
>   void pause();
>   void resume();
>   void drain();
> };
> ```
>
> ## Library ##
>
> Both these interfaces are quite thin and intentionally so. Scrubbing
> and recovery have not been addressed at all, as mentioned, so those
> parts will be expanded.  Asynchrony should allow simpler interfaces
> since some of the complexity of requeueing will be handled by futures and
> continuations.
>
> We obviously do not want to rewrite all our existing code. Instead
> most of the existing work on `PG` and `ReplicatedPG` should be
> refactored into a templated library from which implementations of
> `LogicalOSD` and `DataSetInterface` can be constructed.
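>
> A rough sketch of the shape such a library might take; purely
> illustrative, and the policy names are ours:
>
> ```c++
> // The common object-operation machinery is written once as a template;
> // concrete DataSetInterface implementations are instantiations of it.
> template <typename SemanticsPolicy, typename PlacementPolicy>
> class DataSetImpl : public DataSetInterface {
>   SemanticsPolicy semantics;  // e.g. replicated, erasure coded, relaxed
>   PlacementPolicy placement;  // e.g. PG gridding, flexible placement
>   // Operation dispatch, recovery hooks, etc. are expressed in terms of
>   // the two policies rather than hard-coded against PG.
> };
> ```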
>
> # Map Partitioning #
>
> There are two huge problems with scalability in Ceph:
> 1.  The OSDMap knows too many things.
> 2.  A single monitor manages all updates of everything and replicates them to
>     other monitors.
>
> ## Too Big to Not Fail ##
>
> The monitor map and MDS maps are fine. Each holds data needed to
> locate servers and that's it. It would be very hard to put enough data
> in them to cause problems. The OSD map, however, contains a trove of data that
> must be updated serially in Paxos and propagated to every OSD,
> monitor, MDS, and client in the cluster.
>
> Pools are a notorious example. We can't create as many pools as users
> would like. Pools are heavyweight, and while they depend on other
> items in the OSD map (like erasure code profiles), it would be nice if
> we could divide them among several monitor clusters, each of which would
> hold a subset of pools. We would need to make sure that clients had up
> to date versions of whatever pools they are using along with the
> status of the OSDs they're speaking to, but that's not
> impossible. Likewise, we should split placement rules out of the OSD
> map, especially once we get into larger numbers of potentially larger
> Flexible Placement style functions.
>
> Nodes should then only need to subscribe to the set of pools and
> placement functions they need to access their data. Changes like these
> should allow users to create the number of pools they want without
> causing the cluster difficulty.
>
> ### Consistency ###
>
> Partitioning makes consistency harder. A simple remedy might be to
> stop referring to data by name or integer. An erasure code profile
> should be specified by UUID and version. So should pools and placement
> functions. When sending a request to the OSD, a client should send the
> versions of the pool, the ruleset, and the OSDMap it used and the OSD
> should check that all three are current.
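>
> A sketch of the compound version a client might attach to each request
> under this scheme; the structures are ours, not existing Ceph types:
>
> ```c++
> #include <cstdint>
> #include <string>
>
> struct VersionedRef {
>   std::string uuid;   // stands in for Ceph's uuid_d
>   uint64_t version;
> };
>
> struct CompoundVersion {
>   VersionedRef pool;            // the pool definition the client used
>   VersionedRef placement_rule;  // the ruleset / placement function
>   uint64_t osdmap_epoch;        // the OSD status set it believes current
> };
> // The OSD rejects or holds a request whose triple is stale relative to
> // what it has, much as it checks the map epoch today.
> ```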
>
> ## The OSD Set ##
>
> The complicating case here is the OSD status set.  Running this
> through a single Paxos limits the number of OSDs that can coexist in a
> cluster.  We ought to split the set of OSDs between multiple masters to
> distribute the load. Each 'Up' or 'Down' event is independent of
> others, so all we require is that events get propagated to the
> correct OSDs and that primaries and followers act as they're supposed to.
>
> Versioning is a bigger problem here. We might have all masters
> increment their version when one increments its version if that could
> be managed without inefficiency. We might send a compound version with
> `MOSDOp`s, but combining that with the compound version above might be
> unwieldy. (Feedback on this issue would be greatly appreciated.)
>
> ### Subscription ###
>
> For a large number of OSDs, it would be nice if not everyone were
> notified of all state changes.
>
> For a pool whose placement rule spans only a subset of all OSDs,
> clients using that pool should be able to subscribe to a subset of the
> OSD set corresponding to that pool. This should be fairly easy so long
> as the subset is explicit.
>
> In the case of pools not providing an explicit subset, a monitor (or
> perhaps a proxy in front of a set of monitors) could look at common
> patterns of subscription requests and merge those with significant
> overlap together, so as to give clients a subset without being
> destroyed by the irresistible force of combinatorial explosion.
>
> # Notes #
>
> These are notes taken when reviewing the code and thinking out
> ideas. You don't have to read them, but they are provided as a
> supplement in case you wanted to know what we were thinking and why.
>
> ## ShardedOpWQ ##
>
> -   What is the purpose of `sdata_op_ordering_lock`? A shard is not a
>     PG, so why do things need to be ordered within shards as well as
>     within PGs?
> -   `sdata_lock` pairs up with the condition variable
>
> ## OSD Upper Half ##
>
> ### Regular Dispatch ###
>
> -   Does not overlap with `fast_dispatch`. Operations in
>     `ms_can_fast_dispatch` are not handled in `_dispatch` and vice versa.
> -   Lock the entire OSD
> -   If another dispatch is executing, go to sleep and wait for it to
>     finish. What the heck?
> -   Do Waiters
>     * Waiters are a list of `OpRequestRef`s called `finished` for some
>           reason
>     * Whenever we activate an `OSDMap` the requests waiting for the
>       map get put onto 'finished'
> -   Call `_dispatch`
> -   Do some more waiters
> -   Wake up other dispatch threads
> -   Unlock the entire OSD
>
> #### `_dispatch`? ####
>
> A giant case statement that does a bunch of things.
>
> In the case of `OSDOp`, if we have an `OSDMap`, create an `OpRequest` and
> pass it to `dispatch_op`. This is for things like PG commands, not
> actual object operations.
>
> #### `dispatch_op` ####
>
> Another giant case statement.
>
> ### Fast Dispatch ###
>
> #### `ms_fast_preprocess` ####
>
> Update the map epoch if an OSD sends us an OSDMap.
>
> #### `ms_fast_dispatch` ####
>
> -   Make an `OpRequest`
> -   A bit weird and convoluted, it looks like we use the 'op waiting
>     for map' stuff to queue up an op on a reserved map and remove the
>     reservation preventing it from running before we return.
> -   Specifically we mark the op as waiting for its PG in the `Session`
>     and then mark the `Session` as waiting for the new map.
> -   Ultimately things end up in `dispatch_op_fast`
>
> #### `dispatch_op_fast` ####
>
> Shovels operations into type specific calls like…
>
> #### `handle_op` ####
>
> -   Set up map share (if needed)
> -   Calculate the True PGID and Pool (sanity check against the client?)
> -   Either get the pointer to the PG (a base class) or, if it hasn't
>     been loaded in, queue the session to wait for it
> -   If we have the PG, `enqueue_op` (which just calls `PG::queue_op`)
>
> ## OSD Lower Half (Currently PG) ##
>
> `ReplicatedPG` and `PG` are separate for historical reasons and actual
> differentiation occurs in choice of backend according to Word of Sam.
>
> PGs with different consistency properties are explicitly a goal
> now. The idea of a `PGInterface` has been floated to facilitate their
> creation and `ReplicatedPG` would become a child of that.
>
> ### `PG::queue_op` ###
>
> -   Delay if other people are waiting for maps (to preserve the PG Ordering)
> -   Enqueue in `op_wq` (owned by the OSD)
> -   (Why call into the PG then? Just to enforce the map ordering?)
> -   The work queue gathers operations which, during `_process`, are later
>     reassembled into a list of work to be done.
> -   `_process` is called by a worker thread in the thread pool, so the
>     call to `dequeue_op` is in worker thread. Since it's sharded, we
>     get multiple groups of threads each serving some subset of PGs.
>
> ### `OSD::dequeue_op` ###
>
> -   After a bit of fiddling about sharing maps, call `PG::do_op`
>
> ### `PG::do_op` ###
>
> -   Sam says he plans to rewrite this to allow read asynchrony
> -   We want to see reads and writes share the same transaction
>     structure and similar semantics.
> -   We also want to allow reads and writes in the same operation and
>     to use a session mechanism to allow that.
> -   We'll need transaction transforms to, for example, filter out
>     reads before sending an operation to a replicating OSD. This
>     shouldn't be too hard, since the output of read operations can't
>     be the input for write operations. (Except in CLS?)
> -   `do_op` is a virtual function, but the only implementation is in
>     ReplicatedPG.
> -   Here looks to be where we apply ordering to Writes
> -   `execute_ctx` actually performs the operations after `do_op` has set
>     everything up
>
> ### `execute_ctx` ###
>
> -   May be called multiple times on the same `OpContext`
> -   In the case of clone operations (that's the only thing that takes
>     `src_obc`?), get a read-lock for the object context
> -   It's called `ondisk`, but I'm not sure why; it doesn't look like they get serialized
> -   Then we have a brief detour into `prepare_transaction`
> -   Here's the read-write restriction. ReplicatedPG.cc:2975. Later we can
>     create a better session abstraction to fix that.
> -   For reads
>     *   `do_osd_op_effects`!
>     *   If all our reads were synchronous (or there were none)
>         `complete_read_ctx`, which creates and sends the reply
>     *   Otherwise, `start_async_reads`, which passes the pending reads off
>         to `objects_read_async`
>     *   Once the backend completes, we go to `finish_read`, which calls
>         `complete_read_ctx`
> -   Trim the PG Log
> -   Hey, cool, there's a lambda! Register an `onack` closure that sends a reply
> -   And `oncommit`. And `onsuccess`. And `finish`.
> -   Package up the `OpContext` and its transaction and whatnot into a
>     `RepOp`. This is where all the mutations get done.
> -   Call `issue_repop`
> -   Call `eval_repop`
> -   Adam really wishes we would use `boost::intrusive_ptr` everywhere
>     and stop using explicit gets and puts.
>
> ### `prepare_transaction` ###
>
> -   `do_osd_ops`!
> -   If we're not full, `finish_ctx`.
>
> ### `do_osd_ops` ###
>
> -   Loop over the ops in a gigantic case statement
> -   If we hit any modification ops set the `user_modify` flag. This is
>     used to update the object version as part of the transaction
> -   On EC pools, do reads asynchronously, pushing them onto a list of
>     reads to complete.
> -   Otherwise do the reads synchronously
> -   CLS calls can be tricky since they read or write depending on the
>     method invoked
> -   It looks like operations performed by CLS are done by calling each
>     operation individually with `do_osd_ops` with reads being done
>     immediately and writes being queued up as part of the transaction
> -   Making the CLS API a futures-based interface may be a good thing to do.
> -   Cache ops like flushing seem to be about shovelling triggers to
>     perform actions into the `onack`/`oncomplete` lists.
> -   For write operations, stuff them into the Transaction
> -   In the case of CLS operations which do both reads and writes
>     (which some of them do), it appears that putting two CLS operations
>     in the same OSDOp might lead to weird results since all the reads
>     will happen then all the writes.
>
> ### `finish_ctx` ###
>
> -   Fiddle with object state and logs to update snapshot foo and to make
>     sure the object exists in the form we need it
> -   Update user version if we modified the object
> -   Save the updated `object_info_t`
> -   Append the updated object info to the `PGLog`
> -   Apply context stats
>
> ### `do_osd_op_effects`
>
> -   Add watches if we need to add watches
> -   If there are notifies, notify the watchers
> -   Why do we ack notifies?
>
> ### `issue_repop` ###
>
> -   Acquire locks (I'm still not clear why they're called `ondisk`. Is
>     it a lock acquired to use the store and thus it locks the on-disk
>     representation?)
> -   Apply built-up attributes (likely versions and things that had been
>     stuck in the PGLog before.)
> -   Submit the transaction to the PG Backend, which is where it gets
>     divided up for Erasure Coding or sent out for replication. I'll
>     count that as Bottom End for the moment, alongside the Store.
>     Changes to the backend will be for new consistency models.
>
>     We might be able to get a separation of concerns by varying what
>     is now ReplicatedPG to support different 'gridding' of objects on
>     the OSD and rejigger things so the consistency model is purely a
>     property of the backend. That's appealing from a maintenance
>     perspective, but breaks down if we want things like explicitly
>     marked transactions across multiple objects for some volumes while not
>     paying for them on others. It might not be workable in the general
>     case.
> -   That's also where local application takes place.
>
> ### `eval_repop` ###
>
> -   This function just sends notifications and cleans up when we finish.
> -   Its name is not very appropriate for what it does.
> -   If we're already done, return.
>     *   This isn't bad, but it's specifically necessary because `eval_repop`
>         gets called from several places including the handlers for our
>         subservient OSDs completing an operation.
> -   If everyone's ack'd, fire off our ack handlers. If everyone's
>     completed, fire off our completion handlers.
> -   Notify anyone waiting for the version we've committed…
> -   And for those waiting on the one we've applied
> -   If we've done everything, update usage stats
>     *   Fire off `on_success` callbacks
>     *   Remove ourselves
>
> ## Flex Points ##
>
> ### PlacementGroup/FlexiblePlacement/OtherConsistencyStrategy ###
>
> -   Fast Dispatch currently shoves requests into a PG.
> -   `handle_op` calculates a pgid and actually gets the pointer to or
>     queues the session to wait on the associated PG
> -   If we implement `queue_op` in FlexiblePlacement we can do whatever we
>     want with it. We can ignore the WorkQueue.
> -   Much of the code in `ReplicatedPG` is useful even with other
>     semantic models than PG-ordered replication
> -   We might want to make `ReplicatedPG` a template and
>     supply the `PG` specific parts as a class instantiation. Then we
>     could create more classes for other partition/dispatch models.
> -   We will want a consistency/semantic variation orthogonal to the
>     partition/dispatch model.
>     *   In this divide, dividing objects into PGs, where all operations
>         are dispatched to the PG for whatever object they affect, would
>         be the partition/gridding
>     *   Whereas the total ordering on PG operations and the constraints
>         on when a request blocks versus being served are the
>         consistency/semantics
>
> ### Allocation/Locking/Dispatch ###
>
> -   `OpRequest` (currently allocated in `do_op`) and other structures
>     might be allocated at various points. In our earlier prototype we
>     allocated the `OpRequest` and another structure alongside the `MOSDOp` and
>     reused `MOSDOp`s rather than deallocating them to cut down on
>     allocator use in the fast path.
>
>     That might fight with also-promising designs using core-affine
>     memory management, unless we can determine core affinity quickly
>     before allocating the message. (Maybe peeking into the undecoded
>     bytes?)
> -   Lock freedom should be orthogonal to flexible placement. There may
>     be situations where we want lockful systems in flexible placement
>     (since flexible placement can have a variety of sync behaviors.)
>     and we know that Sam and others are interested in pursuing
>     lock-free designs in PG-placement.
> -   In a lock-free design, if PGs are core-affine,
>     `enqueue_op` could just submit a message to a core without locking
>     or some of the thread/worker complexity.
> -   For Volumes, where the volume itself may be partitioned across cores
>     `enqueue_op` would have to look at the object name to find its target.
> -   Thus, we would want to pull that logic into a separate function
>     giving our dispatch target.
>
> ### Read-Write Symmetry ###
>
> -   Thankfully, `init_op_flags` is happy to set both read and write
> -   CLS in particular falls afoul of this. Futures might be the best
>     way to deal with it.
>
> ### Things we know we had to do anyway from previous work ###
>
> -   Use `std::map` less as a parameter/return type, same for `std::set`
> -   Objecter improvements
>     *   Less allocation, change data structures. A dual to some of the
>         work we want to do to make the EC interface less memory
>         intensive.
>     *   If we have zero copy there should be a way to materialize that
>         at the level of the client.
> -   See about bootstrapping client-side EC from EC overwrite
> -   Librados4 should be more like Objecter than it is like librados3
>
> ## Sam and `do_op` (♪ Doo-Wop? ♪) ##
>
> ### Discussion ###
>
> Notes taken during a BlueJeans call between Adam Emerson and Sam
> Just. (Sorry for any mistakes, recording a conversation while having
> it is tricky.)
>
> -   We should never have to block for I/O
> -   It's not `do_op` per se, though we are rewriting that to put it into a
>     continuation passing style with trampolines
> -   Various bits should be allowed to block, but whether they do or
>     don't should not affect the caller's code-flow.
> -   Once we've got to that point, everything after is easier
> -   We have to make sure we don't introduce so much overhead that it's measurable
> -   Eventually plans to go to a lock-free/sharded/partitioned style like Seastar
> -   We are not using Seastar's system because, when you fulfil a
>     promise, you don't necessarily want the completion to run in that
>     thread; it should be easy to fulfill it in a different thread.
> -   Also adapting an existing codebase to Seastar is much harder than
>     writing one from scratch to use it.
> -   It should also allow us to run all the OSDs in the same process
> -   We might want to have one messenger per logical OSD and have those share
>     threads (loses some efficiency gains but is backwards compatible.)
> -   These sorts of changes will also make EC overwrites much easier.
> -   Any refactors in the code should move us in this direction as a side effect
> -   The sooner the better, so if it does cause performance problems we
>     can find out soon
> -   Branch is wip-do-op in athanatos
>
> ### Brief Exploration of the code ###
>
> Adam Emerson looked briefly through the `wip-do-op` branch in
> `https://github.com/athanatos/ceph.git` to see what the general design
> looked like and how it matched up with our goals.
>
> -   Getting rid of the 'ondisk lock' looks good, someone good at
>     scheduling (Matt?) should review the queue. It should not use
>     `std::list`, though.
> -   The `do_replica_safe_reads` refactor isn't bad but doesn't seem to
>     have an immediate effect. Sam described it as providing safety by
>     shunting things replicas can safely do into their own function, so
>     it should make future development and refactoring easier.
> -   It reinforces the idea that reads inhabit a separate magisterium
>     with its own law and dispensation from writes, and is the opposite
>     direction from the read/write transactions we want. At least
>     potentially, we could use it as a fast/safe path and have it do a
>     more specialized transaction dispatch for reads, maybe.
> -   The `do_op`/`do_replica_op` split seems reasonable for the
>     replicated case, since in that one we want to transform the
>     transaction before sending it to the replicas. If we want to allow
>     CLS methods on EC pools (which we do, in principle) or mixed
>     read-write, then the distinction between primary and replica might
>     break down.
> -   Not sure if the error channel is better per se, but since we
>     currently have a bunch of functions that return `int` to indicate
>     errors, it might be easier to integrate.
> -   C++ should have a `void` type a bit more like unit so you could
>     explicitly return `void()` from void functions. You'd think they
>     could put *that* in C++17 since their list of things to add to the
>     standard now consists entirely of "3 to the version number".
> -   The `future` implementation looks promising. I'll need to review
>     how it's put together in more detail later, how it's used is more
>     pertinent at the moment.
> -   Things make sense from a gradualist position. Given the desire for
>     a progression from here to _A Really Fast OSD_ where we have
>     _A Working OSD_ at every point along the way, this approach makes
>     sense. Restructuring everything around a blocking-agnostic futures
>     design then opens the way to introducing asynchronous, lock-free code.
> -   This is also compatible with flexing, since we can have multiple
>     `LogicalOSD` implementations with different locking strategies or
>     core affinity.
> -   `aio_read` looks to be less aio than the name would suggest. This
>     isn't bad; it's reasonable to do a transform by having things call
>     blocking procedures in a way that will work if they become non-blocking.
> -   Reimplementing the blocking calls in terms of nonblocking calls is
>     smart.
> -   `OSDReactor` looks like it could be adapted, at least the public
>     interface, into LogicalOSD once we made it less PG specific.
> -   In principle it's a good idea. A LogicalOSD would have to be bound
>     closely to the DataSetInterface it worked with since they're two
>     halves of a queueing mechanism.
> -   The futures stuff definitely isn't naïve. We need to understand
>     the blockers and other details.  The idea of having a future yield
>     when it needs to wait for something is a good one.
> -   It uses `std::list` though.
>
> ## Why librados is not wonderful ##
>
> Not that we hate RADOS; we just like Objecter way more:
> -   Does not support read and write in same op. Neither does RADOS, to
>     be fair, but we plan to fix that.
> -   Takes a giant lock with every operation. Yuck.
> -   Has its own 'callback' interface
> -   Its handling of asynchronous operations seems very heavyweight and
>     not natural.
> -   Hides the internal structure of RADOS operations
> -   Does not expose object locator in a useful way
> -   Does way too many allocations
> -   The dimensioning of the interface is weird, like binding the IoCtx
>     to a pool

As librados is today, it encapsulates its own networking and event
code. This is pretty frustrating to apps that already have their own
event loop and want to integrate it.



-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@xxxxxxxxx