Ann Arbor Team's Flexible I/O Proposals (Ceph Next)

Ceph Developers,

We've put together a few of the main ideas from our previous work in a
brief form that we hope people will be able to digest, consider, and
debate. We'd also like to discuss them with you at Ceph Next this
Tuesday.

Thank you.


---8<---


We have been looking at improvements to Ceph, particularly RADOS,
while focusing on flexibility (allowing users to do more things)
and performance. We have come up with a few proposals with these two
things in mind. Sessions and read-write transactions aim to allow
clients to batch up multiple operations in a way that is safe and
correct, while allowing clients to gain the advantages of atomic
read-write operations without having to lock. Sessions also provide
a foundation for flow-control which ultimately improves performance
by preventing an OSD from being ground into uselessness under a
storm of impossible requests. The CLS proposal is a logical follow-on
from the read-write proposal, as we attempt to address some problems
of correctness that exist now and consider how to integrate the
facility into an asynchronous world.

Flexible Placement, as you would expect from the name, is about
allowing users more control, as are Flexible Semantics. They both
have profound performance implications, as tuning placement to better
match a workload can increase throughput, and relaxed consistency can 
decrease latency. The proposed Interfaces are meant to support both as
well as work currently being done to allow an asynchronous OSD and to
hide details like locking and thread pools so that backends can be
written with different forms of concurrency and load-balancing
across processors.

Finally, Map Partitioning is not directly related to code paths within
the OSD itself, but does affect everything that can be done with Ceph.
People are beginning to run into limits on how large a Ceph cluster can
grow and how many ways they can be partitioned, and both these problems
fundamentally derive from the way the OSD map is handled by the monitors.

There are also some notes at the end. They are not critical, but if you
find yourself asking "What were they thinking?" the notes might help.

# Sessions and Read-Write #

From `ReplicatedPG.cc`:

```c++
// Write operations aren't allowed to return a data payload because
// we can't do so reliably. If the client has to resend the request
// and it has already been applied, we will return 0 with no
// payload.  Non-deterministic behavior is no good.  However, it is
// possible to construct an operation that does a read, does a guard
// check (e.g., CMPXATTR), and then a write.  Then we either succeed
// with the write, or return a CMPXATTR and the read value.
…
if (ctx->op_t->empty() || result < 0) {
  …
  if (ctx->pending_async_reads.empty()) {
    complete_read_ctx(result, ctx);
  } else {
    in_progress_async_reads.push_back(make_pair(op, ctx));
    ctx->start_async_reads(this);
  }
  return;
}
…
// issue replica writes
ceph_tid_t rep_tid = osd->get_tid();

RepGather *repop = new_repop(ctx, obc, rep_tid);

issue_repop(repop, ctx);
eval_repop(repop);
```

As you can see, if we have any writes (all mutations end up in the
`op_t` transaction), we just flat out don't do the requested read
operations. If we don't have any writes, we perform the read
operations and return.  This is justified in the comment above because
of the non-deterministic behavior of resent read-write operations.

This is not an unsolved problem and we can bootstrap a solution on our
existing `Session` infrastructure.

## An upgraded session ##

Behold, `OSDSession`:
```c++
struct Session : public RefCountedObject {
  EntityName entity_name;
  OSDCap caps;
  int64_t auid;
  ConnectionRef con;
  WatchConState wstate;
  …
};
```

This structure exists once for every connection to the OSD. Where they
are created depends on who is doing the creation. In the case of
clients (what we're interested in) it occurs in `ms_handle_authorizer`:
```c++
…
isvalid = authorize_handler->verify_authorizer(cct, monc->rotating_secrets,
                                               authorizer_data, authorizer_reply, name, global_id, caps_info, session_key, &auid);

if (isvalid) {
  Session *s = static_cast<Session *>(con->get_priv());
  if (!s) {
    s = new Session(cct);
    con->set_priv(s->get());
    s->con = con;
    dout(10) << " new session " << s << " con=" << s->con << " addr=" << s->con->get_peer_addr() << dendl;
  }

  s->entity_name = name;
  if (caps_info.allow_all)
    s->caps.set_allow_all();
  s->auid = auid;
  …
}
```

In order to solve this problem, we propose a new data structure,
modelled on NFSv4.1
```c++
struct OpSlot {
  uint64_t seq;
  int r;
  MOSDOpReplyRef cached; // Nullable
  bool completed;
};
```

We do not want to give the OSD an unbounded obligation to hang on to
old message replies: that way lies madness. So, the additions to
`Session` we might make are:

```c++
struct Session : public RefCountedObject {
  …
  uint32_t maxslots; // The maximum number of operations this client
                     // may have in flight at once;
  std::vector<OpSlot> slots;   // The vector of in-progress operations
  ceph::timespan slots_expire; // How long we wait to hear from a
                               // client before the OSD is free to
                               // drop session resources
  ceph::coarse_mono_time last_contact; // When (by our measure) we
                                       // last received an operation
                                       // from the client.
};
```

## Message Additions ##

The OSD needs to communicate this information to the client. The most
useful way to do this is with an addition to `MOSDOpReply`.

```c++
class MOSDOpReply : public Message {
  …
  uint32_t this_slot;
  uint64_t this_seq;
  uint32_t max_slot;
  ceph::timespan timeout;
  …
};
```

This overlaps with the function of the transaction ID, since the
slot/sequence/OSD triple uniquely identifies an operation. Unlike the
transaction ID, this provides consistent semantics and a measure of
flow control.

To match our reply, the `MOSDOp` would need to be amended.
```c++
class MOSDOp : public Message {
  …
  uint32_t this_slot;
  uint64_t this_seq;
  bool please_cache;
  …
};
```

## Operations ##

### Connecting ###

A client, upon connecting to an OSD for the first time should send a
`this_slot` of 0 and a `this_seq` of 0. If it reconnects to an OSD it
should use the `this_slot` and `this_seq` values from before it lost
its connection. If an OSD has state for a client and receives a
`(slot,seq) = (0,0)` then it should feel free to free any saved state
and start anew.
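
As a rough illustration of the client-side bookkeeping this implies (a
sketch only; `SlotState` and `ClientOSDSession` are made-up names, not
existing Objecter structures):

```c++
#include <cstdint>
#include <utility>
#include <vector>

struct SlotState {
  uint64_t seq = 0;   // last sequence number sent on this slot
};

struct ClientOSDSession {
  std::vector<SlotState> slots{SlotState{}};  // slot 0 always exists

  // First contact with an OSD: (slot, seq) = (0, 0) tells it to drop
  // any saved state and start anew.
  static std::pair<uint32_t, uint64_t> first_contact() {
    return {0, 0};
  }

  // A new operation on a slot bumps that slot's sequence number.
  std::pair<uint32_t, uint64_t> next_op(uint32_t slot) {
    return {slot, ++slots.at(slot).seq};
  }

  // A resend after reconnect reuses the (slot, seq) from before the
  // connection was lost, so the OSD can recognize the replay.
  std::pair<uint32_t, uint64_t> resend(uint32_t slot) const {
    return {slot, slots.at(slot).seq};
  }
};
```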

### OSD Feedback ###

In every `MOSDOpReply` the OSD should set `this_slot` and `this_seq` to
the values from the `MOSDOp` to which it is replying.

More usefully, the OSD can inform the client how many operations it is
allowed to send concurrently with `max_slot`. The client must **not**
send a slot value higher than `max_slot`. (The OSD should error if it
does.)

The OSD may increase the number of operations allowed in-flight
if it has capacity by increasing `max_slot`. If it finds itself
lacking capacity, it may decrease `max_slot`. If it does, the client
should respect the new bound. (The OSD should feel free to free the
rescinded slots as soon as the client sends another `MOSDOp` with a
slot value equal to one on which the new `max_slot` has been sent.)

If the client sends a `this_seq` lower than the one held for a slot by
the OSD, the OSD should error. If it is more than one greater than the
current `this_seq`, the OSD should error.

### Caching ###

The client is in an excellent position to know whether it **requires**
the output of a previous operation of mixed reads and writes on
resend, or whether it merely needs the status on resend. Thus, we let
the client set `please_cache` to request that the OSD store a
reference to the sent message in the appropriate `OpSlot`.

The OSD is in an excellent position to know how loaded it is. It can
calculate a bound on how large a given reply will be before executing
it. Thus, the OSD can send an error if the client has requested it
cache something larger than it feels comfortable caching.

Assuming no errors, the behavior, for any slot, is this: If the client
sends an `MOSDOp` with a `this_seq` one greater than the current value
of `OpSlot::seq`, that represents a new operation. Increment
`OpSlot::seq`, clear `OpSlot::completed` and begin the operation. When
the operation finishes, set `OpSlot::completed`. If `please_cache` has been
set, store the `MOSDOpReply` in `OpSlot::cached`. Otherwise simply store the
result code in `OpSlot::r`.

If the client sends an `MOSDOp` with a `this_seq` equal to
`OpSlot::seq` and `OpSlot::completed` is false, drop the request. (We
will reply when it completes.) If it has completed, send the reply
stored in `OpSlot::cached` if there is one; otherwise send a replay
reply carrying just `OpSlot::r`.
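
Putting the rules above together, the per-slot handling on the OSD
might look roughly like the following sketch. It ignores the initial
`(0,0)` handshake and the flow-control checks; `SlotAction` and the
helper names are assumptions.

```c++
#include <cstdint>
#include <memory>
#include <utility>

struct MOSDOpReply {};                        // stand-in for the real message
using MOSDOpReplyRef = std::shared_ptr<MOSDOpReply>;

struct OpSlot {
  uint64_t seq = 0;
  int r = 0;
  MOSDOpReplyRef cached;                      // nullable
  bool completed = false;
};

enum class SlotAction { NewOp, Dup, Replay, Error };

// Classify an incoming (slot, seq) pair against the stored slot state.
SlotAction handle_slot(OpSlot& s, uint64_t this_seq)
{
  if (this_seq == s.seq + 1) {                // a new operation
    s.seq = this_seq;
    s.completed = false;
    s.cached.reset();
    return SlotAction::NewOp;                 // go execute it
  }
  if (this_seq == s.seq) {
    if (!s.completed)
      return SlotAction::Dup;                 // drop; we reply when it finishes
    return SlotAction::Replay;                // resend cached reply, or just r
  }
  return SlotAction::Error;                   // too old or too far ahead
}

// On completion: remember the result code, and the whole reply if the
// client asked for it with please_cache.
void complete_slot(OpSlot& s, int result, bool please_cache, MOSDOpReplyRef reply)
{
  s.completed = true;
  s.r = result;
  if (please_cache)
    s.cached = std::move(reply);
}
```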

### Reconnection ###

Currently the `Session` is destroyed on reset and a new one is created
on authorization. In our proposed system the `Session` will not be
destroyed on reset; instead it will be moved to a structure where it
can be looked up, and destroyed only after `timeout` has elapsed since
the last message received.

On connection, the OSD should first look up a `Session` keyed
on the entity name and create one if that fails.

# Read as a part of Transaction #

We don't have code examples here since the necessary interface
changes are mostly obvious. Codes and parameters would be added to
`PGBackend::Transaction` and executing a transaction would have to
return data.

## Motivation ##

-   Mixed reads and writes are an efficiency win, since a client can
    save round trips by batching up operations in a single request.
    Current Ceph does not allow them for reasons which are quoted and
    addressed in the preceding section.
-   Mixed reads and writes are a semantic win. If an `MOSDOp` is
    atomic (it is in current Ceph), read-after-write can often remove
    the need for explicit locking.
-   Transactional reads may seem complicated, but the Erasure Coding
    backend already has to execute complex read transactions to
    reassemble or recover data. We want an asynchronous read capability
    in the Store anyway and there's no reason not to have it be shared
    with our asynchronous write path.
-   While it might seem that separating reads and writes, as we do
    now, allows us to simplify code and rule out edge cases, we would
    like to point out the existence of CLS, which can have problems if
    two method calls occur in the same `MOSDOp`.

## Sketch ##

The main problem with mixed read-write transactions is that replicas
need to write but not read. The key to handling this is dependency
checking. Outside CLS (which will be discussed below) it is very easy
to see whether reads and writes are independent. (Simply go down the
ops and see if their ranges overlap and whether getattrs and setattrs
have keys in common.) Reads coming after overlapping writes depend on
the previous writes. Then:
-   If an op is all reads, simply do all the reads. We don't have
    to get write locks or anything.
-   If an op is all writes, it's no different than a replicated
    operation now.
-   For mixed reads and writes, if the reads aren't dependent on the
    writes, dispatch the writes and do the reads before, after, or
    concurrently with the writes on the primary.  (So long as we
    prevent writes from other transactions from intervening.)
-   Dependent reads are the difficult case. For erasure coding it
    shouldn't make any difference since we'd have to dispatch reads and
    writes to all stripes anyway. For replication, we would want to
    execute the mixed read-write transaction on the local store in
    strict order and dispatch one consisting of only writes to the
    remotes.
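
A minimal sketch of the dependency check described above, assuming each
op has been flattened into the byte extent and attribute keys it
touches (the `OpEffect` structure is an illustration, not proposed
code):

```c++
#include <cstddef>
#include <cstdint>
#include <set>
#include <string>
#include <vector>

// Flattened view of what a single op touches.
struct OpEffect {
  bool is_write = false;
  uint64_t off = 0, len = 0;              // byte extent touched
  std::set<std::string> attrs;            // xattr keys touched
};

static bool extents_overlap(const OpEffect& a, const OpEffect& b) {
  return a.off < b.off + b.len && b.off < a.off + a.len;
}

static bool attrs_overlap(const OpEffect& a, const OpEffect& b) {
  for (const auto& k : a.attrs)
    if (b.attrs.count(k))
      return true;
  return false;
}

// A read depends on the writes in the same transaction if any earlier
// write overlaps it by extent or by attribute key.
bool has_dependent_reads(const std::vector<OpEffect>& ops) {
  for (size_t i = 0; i < ops.size(); ++i) {
    if (ops[i].is_write)
      continue;
    for (size_t j = 0; j < i; ++j) {
      if (ops[j].is_write &&
          (extents_overlap(ops[i], ops[j]) || attrs_overlap(ops[i], ops[j])))
        return true;
    }
  }
  return false;
}
```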

# CLS #

## Current Problem ##

The CLS API works by making an ops vector and handing it to
`do_osd_ops`.

```c++
int cls_cxx_getxattr(cls_method_context_t hctx, const char *name,
                     bufferlist *outbl)
{
  ReplicatedPG::OpContext **pctx = (ReplicatedPG::OpContext **)hctx;
  bufferlist name_data;
  vector<OSDOp> nops(1);
  OSDOp& op = nops[0];
  int r;

  op.op.op = CEPH_OSD_OP_GETXATTR;
  op.indata.append(name);
  op.op.xattr.name_len = strlen(name);
  r = (*pctx)->pg->do_osd_ops(*pctx, nops);
  if (r < 0)
    return r;

  outbl->claim(op.outdata);
  return outbl->length();
}

int cls_cxx_setxattr(cls_method_context_t hctx, const char *name,
                     bufferlist *inbl)
{
  ReplicatedPG::OpContext **pctx = (ReplicatedPG::OpContext **)hctx;
  bufferlist name_data;
  vector<OSDOp> nops(1);
  OSDOp& op = nops[0];
  int r;

  op.op.op = CEPH_OSD_OP_SETXATTR;
  op.indata.append(name);
  op.indata.append(*inbl);
  op.op.xattr.name_len = strlen(name);
  op.op.xattr.value_len = inbl->length();
  r = (*pctx)->pg->do_osd_ops(*pctx, nops);

  return r;
}
```

The `do_osd_ops` function performs reads inline, synchronously, right
then and there for replicated pools. (Erasure coded pools are more
limited.) Writes are batched up and added to the transaction
associated with the current `OpContext`.

This is bad. If one has a CLS method that performs a read-modify-write
and one calls it twice in the same `MOSDOp`, it becomes a
read-modify-read-modify-write-write which may produce incorrect
results.
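
To make the failure concrete, here is a toy model of the current
behavior: reads see the on-disk state immediately, while writes only
accumulate in a pending transaction applied at the end. The
`cls_increment` method is hypothetical.

```c++
#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy model: reads hit the "disk" directly, writes are only queued.
struct ToyStore {
  std::map<std::string, int> data;                  // on-disk state
  std::vector<std::pair<std::string, int>> txn;     // pending writes

  int read(const std::string& k) const { return data.at(k); }
  void queue_write(const std::string& k, int v) { txn.emplace_back(k, v); }
  void apply() { for (auto& w : txn) data[w.first] = w.second; }
};

// Hypothetical read-modify-write CLS method.
void cls_increment(ToyStore& s, const std::string& key) {
  int v = s.read(key);        // read happens immediately
  s.queue_write(key, v + 1);  // write is only queued
}

int main() {
  ToyStore s;
  s.data["counter"] = 0;
  cls_increment(s, "counter"); // reads 0, queues a write of 1
  cls_increment(s, "counter"); // still reads 0, queues a write of 1 again
  s.apply();
  std::cout << s.data["counter"] << "\n"; // prints 1, not the expected 2
}
```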

## Desiderata ##

-   CLS operations should be composable. We should be able to have many
    of them in a single operation.
-   They should remain transactional. If a CLS operation does some
    reads and hits an error, it stops and nothing is written to the
    store. We should not allow situations where a CLS method can write
    a partial result to the store then error.
-   They should be capable. We should not put too many restrictions on
    what an operation is allowed to do. It should be possible to run
    them on Erasure Coded Pools once ECOverwrite is in place. (At
    least some subset of them).
-   They should be consistent. A CLS operation should be able to call
    rand or generate a UUID without each replica holding a different
    value. (This rules out solutions like calling the method on each
    replica.)
-   They should be efficient and optimizable.
-   They should work in an asynchronous framework.

There are several ways we could change their implementation to address
these.

## Futures ##

This is an attractive way to think about CLS. It allows things to
proceed asynchronously and would solve the RMRMWW problem. One would
simply make every I/O operation in the CLS API a call returning a
future and write each method in continuation passing style. Executing
the transaction in the primary OSD (on a replicated pool) would create
a write-only transaction that could then be sent to replicas. (Having
the execution of a CLS method also compile a write-only transaction is
a property of any composable design.)
-   Tracking dependencies before the operation is executed would be
    problematic. There would be no way to know whether later reads
    overlapped with previous writes before doing them. This could lead
    to an unbounded obligation on the OSD to maintain state while
    evaluating an OSDOp, including potentially large writes, before
    actually committing, in order for CLS methods to remain
    transactional.
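
For concreteness, a CLS method written against a futures-based I/O API
might look roughly like this. The `cls_future` alias (modelled here as
a simple callback registration), the `cls_io` interface, and the method
names are all assumptions, not an existing interface.

```c++
#include <functional>
#include <string>

// Continuation-passing "future" stand-in: something you call with the
// continuation to run when the value is ready.
template <typename T>
using cls_future = std::function<void(std::function<void(T)>)>;

struct bufferlist { std::string data; };   // stand-in for the real type

// What the CLS I/O API might look like if every call returned a future.
struct cls_io {
  virtual cls_future<bufferlist> getxattr(const std::string& name) = 0;
  virtual cls_future<int> setxattr(const std::string& name,
                                   const bufferlist& val) = 0;
  virtual ~cls_io() = default;
};

// A method in continuation-passing style: read an attribute, then queue
// the write, never blocking the OSD in between.
inline void cls_copy_xattr(cls_io& io, const std::string& from,
                           const std::string& to,
                           std::function<void(int)> done) {
  io.getxattr(from)([&io, to, done](bufferlist val) {
    io.setxattr(to, val)([done](int r) { done(r); });
  });
}
```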

Futures are, on their own, insufficient to provide everything we need
from CLS, largely because they are opaque to the OSD. They could be
combined with…

## Pre-declaration ##

We could remove some of the generality. A simple way of doing so would
be to have methods declare, as a part of their signature, everything
that they may ever read or write, with the expectation that methods
will name the fewest resources required. This doesn't mean that every
method will always write to and read from everything it mentions,
merely that we have a known bound of the maximum it will ever use.

This makes analysis easier for the OSD, and in the composition case,
it could go in two passes. In the first, it would execute CLS calls
and pre-stage results; in the second, it would pass its compiled
write transaction into the store.

This is the most attractive solution, but it depends on pre-declaration
bounding the resources used in pre-staging reasonably well.

One could make things easier by being even more restrictive and
imposing ordering:
1.  The method declares in advance all read operations it might ever
    perform.
2.  The method declares in advance all write operations it might ever
    perform.
3.  The method examines the parameters passed by the client and
    indicates which subset of the named inputs and outputs it will
    use.
4.  The method performs its read operations and denotes exactly which
    output operations it will perform. (not the data to be output, but
    ranges and names.)
5.  The method performs write operations.

The most restrictive form of this would operate in two phases. First,
the CLS method would be presented with its parameters and all of the
things it plans to read or write (objects with ranges and attribute
keys.) In the second it would be called with the contents of all the
reads it requested and supply the data for all the writes it requested.

This would obviate the need for futures or other asynchronous I/O,
and make evaluation very easy. This approach would disallow some
operations, like indirecting through an attribute key to read another,
but is very appealing.
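
A rough sketch of what that most restrictive, two-phase interface could
look like (every name here is an assumption):

```c++
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Phase 1 output: everything the method intends to touch, declared
// before any I/O happens.
struct DeclaredIO {
  std::vector<std::pair<uint64_t, uint64_t>> read_extents;   // off, len
  std::vector<std::string> read_attrs;
  std::vector<std::pair<uint64_t, uint64_t>> write_extents;
  std::vector<std::string> write_attrs;
};

// Phase 2 input: the contents of the declared reads.
struct ReadResults {
  std::map<std::string, std::string> attrs;
  std::map<uint64_t, std::string> extents;   // keyed by offset
};

// Phase 2 output: the data for the declared writes.
struct WriteSet {
  std::map<std::string, std::string> attrs;
  std::map<uint64_t, std::string> extents;
};

// A two-phase CLS method: declare, then compute writes from reads.
struct TwoPhaseMethod {
  virtual DeclaredIO declare(const std::string& params) = 0;
  virtual int execute(const std::string& params,
                      const ReadResults& in, WriteSet& out) = 0;
  virtual ~TwoPhaseMethod() = default;
};
```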

## Be Transactional ##

Our transactions are pre-checked and must succeed. If we want the most
expressive version of CLS consistent with our other goals, then we
should add commit and rollback. EC Overwrite will already require some
form of commit and rollback, so it's not beyond the realm of thought.

It could also be a foundation for some future backend supporting
multi-object transactions.

This idea might have appeal on its own, but the concerns of CLS are
not sufficient to motivate it.

## Domain Specific Language ##

One could make a domain specific language, based on something simple,
that the OSD can execute to perform CLS methods. The OSD could then
analyze each method to see what I/O operations it calls and try
to track them.
-   Dependency tracking for compilers is a major area of research. It
    would be a whole lot of fun, but as a short term solution it is
    not really practical.
-   We still wouldn't be able to rule out problems in the general
    case.

This approach would be interesting as a long term academic research
project, but is not suitable for a short-range improvement.

# Flexible Placement #

This is a large topic which should be discussed on its own, but it
motivates the interface designs below, so we shall briefly mention why
it's interesting.

CRUSH/PG is a fine placement system for several workloads, but it has
two well-known limitations.

## Motivation ##

-   Data distribution can be much less uniform than one might like,
    giving uneven use of disks. This has caused some Ceph developers
    to experiment with Monte Carlo based placement algorithms.
-   Data distribution can be much more uniform than one would
    like. This is the fundamental cause of Ceph's slow sequential read
    performance. More generally, unrelated workloads contend
    with each other due to a lack of affinity for related data. The effects are
    especially pronounced on spinning disk (due to seek times), but
    still exist on Flash (due to bus/network contention).  This is a
    tension between competing goods: CRUSH gains wide dispersion and
    uniformity to defend against correlated failures, but that
    dispersion comes at the cost of locality.

## Goal ##

Ceph should support placement methods other than CRUSH/PG. Currently,
the OSD dispatches operations based on placement group ID, which will
need to be varied.

We also need some way to get new types of functions into the cluster.

## Proposal ##

Our proposal is, in a way, CRUSH taken to its logical
conclusion. Instead of distributing CRUSH rules, we propose to
distribute general computable functions from (oid, volume/dataset) pairs to
sequences of OSDs with their supporting data structures.  One of our
ongoing research projects has been an in-process executor for these
functions based on Google's NaCl. The benefits are:
-   Administrators can fine-tune placement functions to fit their
    workloads well.
-   They can also experiment easily without having to recompile all of
    Ceph and make heavy architectural changes.
-   Entirely new placement strategies can be deployed without having
    to upgrade every machine in the cluster. Or any machine in the
    cluster, once they've been upgraded to a Flexible Placement
    capable version.
-   Possibilities for annealing and machine learning to gradually
    adapt placement in response to load data become available.
-   NaCl builds on LLVM which has a rich set of tools for optimizations
    like partial evaluation.
-   NaCl is fast.
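
To make the shape of such a function concrete, here is a sketch of the
interface a distributed placement function might present. The types and
the trivial hashing example are illustrations only; the real ABI would
be whatever the NaCl-hosted executor exposes.

```c++
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Identity of an object within a volume/dataset.
struct PlacementInput {
  std::string oid;
  uint64_t volume_id;
};

// The distributed artifact: a computable function from an object to an
// ordered sequence of OSD ids, plus whatever supporting data it needs.
struct PlacementFunction {
  virtual std::vector<int> place(const PlacementInput& in,
                                 uint32_t want) const = 0; // want = OSD count
  virtual ~PlacementFunction() = default;
};

// Trivial example: hash the object name over a flat list of OSDs. A
// real function could encode hierarchy, affinity, or learned layouts.
struct FlatHashPlacement : PlacementFunction {
  std::vector<int> osds;   // supporting data structure
  std::vector<int> place(const PlacementInput& in,
                         uint32_t want) const override {
    std::vector<int> out;
    uint64_t h = std::hash<std::string>{}(in.oid) ^ in.volume_id;
    for (uint32_t i = 0; i < want && i < osds.size(); ++i)
      out.push_back(osds[(h + i) % osds.size()]);
    return out;
  }
};
```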

# Flexible Semantics #

Another motivating example. Originally, Ceph did replication and only
replication under a very specific consistency model. There has been
desire for more flexibility.
-   Erasure Coding. It still follows the Ceph consistency model
    (though leaves out many operations) but is very different in
    back-end dispatch, enough so that it inspired a major rewrite of
    the OSD's bottom half.
-   Append-only immutable objects have been discussed.
-   Many people have asked for relaxed consistency to improve
    performance. This will not be suitable for all workloads, but people
    have repeatedly asked for the ability to set up low-latency,
    relaxed-consistency volumes that still provide Ceph's ability to
    easily use new storage and scale well.
-   Transactional storage. As mentioned above, cross-object
    transactional semantics are something people have desired.

# Interfaces #

Right now our class hierarchy is a bit of a mess. Eventually we'll do
something about `PG` and `ReplicatedPG`, refactor, support
asynchronous I/O, reduce lock contention, support core affinity,
and build Jerusalem here in England's green and pleasant land.

While we're stringing up our bows of burning gold, we should support
non-PG based placement and flexible semantics. Right now, parts of the
PG and the OSD (since the OSD manages the collection of PGs, spins
them up, and manages thread pools shared by sets of PGs) are
intertwined. Thus, we need to abstract out both pieces.

As we also want to support having multiple "logical" OSDs running in a
single `ceph-osd` process, this would be a natural time to add that
capability.

Both these are sketches and should be considered a work in progress.

## `DataSetInterface` ##

Here is a sketch of what a flexible abstraction based on PG could look
like, at least parts of one. Since we have focused only on the object
operation path and are less familiar with Scrub, Recovery, and Cache
Tiering, we won't include those details here.

We also leave out functions called from the PG itself or other objects
invoked from our own stack.

```c++
class DataSetInterface {
protected:
  LogicalOSD& losd; // LogicalOSD is a means to have different
                    // stores/semantics run in the same process.

  MapPartRef curmap; // Subset of map relevant to this DSI
public:
  // The OSD (things Up the Stack, generally) should not call 'lock'
  // on us. If we have locking of some sort things down the stack that
  // we have some relationship with (friend or whatever) could lock or
  // unlock us, but that should not be baked in as part of the interface.

  // Things like the info struct and details about loading the Place
  // wouldn't actually be here. As there is an intimate relation
  // between the LogicalOSD and an implementation of DataSetInterface (it
  // holds all those loaded in memory and controls dispatch), they
  // would not need to be part of the generic interface.

  const coll_t coll; // The subdivision of the Store we control

  // In the PG case we always know we're the primary or not for
  // anything within the same pgid. That is not expected to be the
  // case generally.
  virtual bool is_primary(const OpRequest&) = 0;
  // No 'is_replica' since 'replica' may not be applicable
  // generally. It's a bit off even in the erasure coded case.
  virtual bool is_acting(const OpRequest&) = 0;
  virtual bool is_inactive() = 0;

 public:
  // No identifier. The descendent will take that.
  DataSetInterface(LogicalOSD& o, OSDMapRef curmap);
  virtual ~DataSetInterface();

  DataSetInterface(const DataSetInterface&) = delete;
  DataSetInterface& operator =(const DataSetInterface&) = delete;
  DataSetInterface(DataSetInterface&&) = delete;
  DataSetInterface& operator =(DataSetInterface&&) = delete;

  virtual void on_removal(ObjectStore::Transaction *t) = 0;

  // Yes, there's no 'queue' and no 'do_op' or any of
  // that. This is intentional. There's no dequeue or do_op because
  // those functions are either called only by the PG currently OR
  // they're called in OSD functions called by the PG as part of the
  // thread switch. They should not be part of the public interface.

  // There's no queue because we can either put queue here or we can
  // put queue in LogicalOSD. (We could do both, but that seems bad to
  // me.) If there is some combination of locking and checking that
  // must be done before queueing an operation, it seems that it's
  // better to do it in LogicalOSD so that it doesn't leak out and
  // become part of the abstraction for other implementations.
};
```

## `LogicalOSD` ##

The OSD class itself (representing the single OSD process) should have
a map (*perhaps* a Boost.Intrusive.Set?) from OSD IDs to
`LogicalOSD` instances.

```c++
class LogicalOSD {
  OSD& osd;
  ObjectStore& store;

  // Look up the DataSetInterface instance appropriate to the given
  // OpRequest.
  virtual future<DataSetInterface,int> get_place_for(const OpRequest&) = 0;

  // Every logical OSD will have its own watchers as well as slot
  // cache. Someone familiar with flow control should check this
  // idea. Since LogicalOSDs will, ideally, share messengers we might
  // want them to share the same slot cache. In that case we should
  // just re-dimension watchers within Session
  SessionRef session_for(const entity_name_t& name);

  void queue(DataSetInterfaceRef&& pi, OpRequestRef&& to_queue);
  void queue_front(DataSetInterfaceRef&& pi, OpRequestRef&& to_queue);

  // Dequeue and the like are currently called in the PG itself and so
  // have no place in the interface presented to the OSD.

  void pause();
  void resume();
  void drain();
};
```
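
The intended routing through such a map might look roughly like the
following sketch. The types here are stand-ins (the real
`LogicalOSD::queue` takes a `DataSetInterfaceRef` as sketched above),
and how the logical OSD id is carried in the request is an open
question.

```c++
#include <cstdint>
#include <map>
#include <memory>
#include <utility>

// Stand-ins for illustration; the real types are sketched above.
struct OpRequest { int64_t logical_osd_id; };
struct LogicalOSD {
  virtual void queue(OpRequest&& op) = 0;
  virtual ~LogicalOSD() = default;
};

// The process-wide OSD keeps one LogicalOSD per logical id and routes
// fast-dispatched requests to it.
struct OSDProcess {
  std::map<int64_t, std::unique_ptr<LogicalOSD>> logical;

  void fast_dispatch(OpRequest&& op) {
    auto it = logical.find(op.logical_osd_id);
    if (it == logical.end())
      return;                 // unknown logical OSD: drop or error
    it->second->queue(std::move(op));
  }
};
```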

## Library ##

Both these interfaces are quite thin and intentionally so. Scrubbing
and recovery have not been addressed at all, as mentioned, so those
parts will be expanded.  Asynchrony should allow us simpler interfaces
since some complexity of requeing will be handled by futures and
continuations.

We obviously do not want to rewrite all our existing code. Instead
most of the existing work on `PG` and `ReplicatedPG` should be
refactored into a templated library from which implementations of
`LogicalOSD` and `DataSetInterface` can be constructed.

# Map Partitioning #

There are two huge problems with scalability in Ceph.
1.  The OSDMap knows too many things
2.  A single monitor manages all updates of everything and replicates them to
    other monitors.

## Too Big to Not Fail ##

The monitor map and MDS maps are fine. Each holds data needed to
locate servers and that's it. It would be very hard to put enough data
in them to cause problems. The OSD map however contains a trove of data that
must be updated serially in Paxos and propagated to every OSD,
monitor, MDS, and client in the cluster.

Pools are a notorious example. We can't create as many pools as users
would like. Pools are heavyweight, and while they depend on other
items in the OSD map (like erasure code profiles), it would be nice if
we could divide them between several monitor clusters, each of which would
hold a subset of pools. We would need to make sure that clients had up
to date versions of whatever pools they are using along with the
status of the OSDs they're speaking to, but that's not
impossible. Likewise, we should split placement rules out of the OSD
map, especially once we get into larger numbers of potentially larger
Flexible Placement style functions.

Nodes should then only need to subscribe to the set of pools and
placement functions they need to access their data. Changes like these
should allow users to create the number of pools they want without
causing the cluster difficulty.

### Consistency ###

Partitioning makes consistency harder. A simple remedy might be to
stop referring to data by name or integer. An erasure code profile
should be specified by UUID and version. So should pools and placement
functions. When sending a request to the OSD, a client should send the
versions of the pool, the ruleset, and the OSDMap it used and the OSD
should check that all three are current.
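
A sketch of what such a compound version check might look like on the
OSD side (the field names and the exact pieces included are
assumptions):

```c++
#include <cstdint>

// Compound version a client might attach to each request once the map
// is partitioned.
struct CompoundVersion {
  uint64_t osdmap_epoch;
  uint64_t pool_version;      // version of the pool's own map entry
  uint64_t placement_version; // version of the placement rule/function
};

// The OSD compares against its own view and asks the client to
// refresh whichever piece is stale.
enum class Stale { None, OSDMap, Pool, Placement };

inline Stale check_versions(const CompoundVersion& client,
                            const CompoundVersion& osd) {
  if (client.osdmap_epoch < osd.osdmap_epoch) return Stale::OSDMap;
  if (client.pool_version < osd.pool_version) return Stale::Pool;
  if (client.placement_version < osd.placement_version) return Stale::Placement;
  return Stale::None;
}
```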

## The OSD Set ##

The complicating case here is the OSD status set.  Running this
through a single Paxos limits the number of OSDs that can coexist in a
cluster.  We ought to split the set of OSDs between multiple masters to
distribute the load. Each 'Up' or 'Down' event is independent of the
others, so all we require is that events get propagated to the
correct OSDs, and that primaries and followers act as they're supposed to.

Versioning is a bigger problem here. We might have all masters
increment their version when one increments its version if that could
be managed without inefficiency. We might send a compound version with
`MOSDOp`s, but combining that with the compound version above might be
unwieldy. (Feedback on this issue would be greatly appreciated.)

### Subscription ###

For a large number of OSDs, it would be nice if not everyone were
notified of all state changes.

For a pool whose placement rule spans only a subset of all OSDs,
clients using that pool should be able to subscribe to a subset of the
OSD set corresponding to that pool. This should be fairly easy so long
as the subset is explicit.

In the case of pools not providing an explicit subset, a monitor (or
perhaps a proxy in front of a set of monitors) could look at common
patterns of subscription requests and merge those with significant
overlap together, so as to give clients a subset without being
destroyed by the irresistible force of combinatorial explosion.

# Notes #

These are notes taken when reviewing the code and thinking out
ideas. You don't have to read them, but they are provided as a
supplement in case you wanted to know what we were thinking and why.

## ShardedOpWQ ##

-   What is the purpose of `sdata_op_ordering_lock`? A shard is not a
    PG, so why do things need to be ordered within shards as well as
    within PGs?
-   `sdata_lock` pairs up with the condition variable

## OSD Upper Half ##

### Regular Dispatch ###

-   Does not overlap with `fast_dispatch`. Operations in
    `ms_can_fast_dispatch` are not handled in `_dispatch` and vice versa.
-   Lock the entire OSD
-   If another dispatch is executing, go to sleep and wait for it to
    finish. What the heck?
-   Do Waiters
    * Waiters are a list of `OpRequestRef`s called `finished` for some
      reason
    * Whenever we activate an `OSDMap` the requests waiting for the
      map get put onto `finished`
-   Call `_dispatch`
-   Do some more waiters
-   Wake up other dispatch threads
-   Unlock the entire OSD

#### `_dispatch`? ####

A giant case statement that does a bunch of things.

In the case of `OSDOp`, if we have an `OSDMap`, create an `OpRequest` and
pass it to `dispatch_op`. This is for things like PG commands, not
actual object operations.

#### `dispatch_op` ####

Another giant case statement. 

### Fast Dispatch ###

#### `ms_fast_preprocess` ####

Update the map epoch if an OSD sends us an OSDMap.

#### `ms_fast_dispatch` ####

-   Make an `OpRequest`
-   A bit weird and convoluted, it looks like we use the 'op waiting
    for map' stuff to queue up an op on a reserved map and remove the
    reservation preventing it from running before we return.
-   Specifically we mark the op as waiting for its PG in the `Session`
    and then mark the `Session` as waiting for the new map.
-   Ultimately things end up in `dispatch_op_fast`

#### `dispatch_op_fast` ####

Shovels operations into type specific calls like…

#### `handle_op` ####

-   Set up map share (if needed)
-   Calculate the True PGID and Pool (sanity check against the client?)
-   Either get the pointer to the PG (a base class) or, if it hasn't
    been loaded in, queue the session to wait for it
-   If we have the PG, `enqueue_op` (which just calls `PG::queue_op`)

## OSD Lower Half (Currently PG) ##

`ReplicatedPG` and `PG` are separate for historical reasons and actual
differentiation occurs in choice of backend according to Word of Sam.

PGs with different consistency properties are explicitly a goal
now. The idea of a `PGInterface` has been floated to facilitate their
creation and `ReplicatedPG` would become a child of that.

### `PG::queue_op` ###

-   Delay if other people are waiting for maps (to preserve the PG Ordering)
-   Enqueue in `op_wq` (owned by the OSD)
-   (Why call into the PG then? Just to enforce the map ordering?)
-   The work queue gathers operations which, during `_process`, are later
    reassembled into a list of work to be done.
-   `_process` is called by a worker thread in the thread pool, so the
    call to `dequeue_op` is in a worker thread. Since it's sharded, we
    get multiple groups of threads each serving some subset of PGs.

### `OSD::dequeue_op` ###

-   After a bit of fiddling about sharing maps, call `PG::do_op`

### `PG::do_op` ###

-   Sam says he plans to rewrite this to allow read asynchrony
-   We want to see reads and writes share the same transaction
    structure and similar semantics.
-   We also want to allow reads and writes in the same operation and
    to use a session mechanism to allow that.
-   We'll need transaction transforms to, for example, filter out
    reads before sending an operation to a replicating OSD. This
    shouldn't be too hard, since the output of read operations can't
    be the input for write operations. (Except in CLS?)
-   `do_op` is a virtual function, but the only implementation is in
    ReplicatedPG.
-   This looks to be where we apply ordering to Writes
-   `execute_ctx` actually performs the operations after `do_op` has set
    everything up

### `execute_ctx` ###

-   May be called multiple times on the same `OpContext`
-   In the case of clone operations (that's the only thing that takes
    `src_obc`?), get a read-lock for the object context
-   It's called `ondisk`, but I'm not sure why; it doesn't look like they get serialized
-   Then we have a brief detour into `prepare_transaction`
-   Here's the read-write restriction. ReplicatedPG.cc:2975. Later we can
    create a better session abstraction to fix that.
-   For reads
    *   `do_osd_op_effects`!
    *   If all our reads were synchronous (or there were none)
        `complete_read_ctx`, which creates and sends the reply
    *   Otherwise, `start_async_reads`, which passes the pending reads off
        to `objects_read_async`
    *   Once the backend completes, we go to `finish_read`, which calls
        `complete_read_ctx`
-   Trim the PG Log
-   Hey, cool, there's a lambda! Register an `onack` closure that sends a reply
-   And `oncommit`. And `onsuccess`. And `finish`.
-   Package up the `OpContext` and its transaction and whatnot into a
    `RepOp`. This is where all the mutations get done.
-   Call `issue_repop`
-   Call `eval_repop`
-   Adam really wishes we would use `boost::intrusive_ptr` everywhere
    and stop using explicit gets and puts.

### `prepare_transaction` ###

-   `do_osd_ops`!
-   If we're not full, `finish_ctx`.

### `do_osd_ops` ###

-   Loop over the ops in a gigantic case statement
-   If we hit any modification ops, set the `user_modify` flag. This is
    used to update the object version as part of the transaction
-   On EC pools, do reads asynchronously, pushing them onto a list of
    reads to complete.
-   Otherwise do the reads synchronously
-   CLS calls can be tricky since they read or write depending on the
    method invoked
-   It looks like operations performed by CLS are done by calling
    `do_osd_ops` for each operation individually, with reads being done
    immediately and writes queued up as part of the transaction
-   Making the CLS API a futures-based interface may be a good thing to do.
-   Cache ops like flushing seem to be about shovelling triggers to
    perform actions into the `onack`/`oncomplete` lists.
-   For write operations, stuff them into the Transaction
-   In the case of CLS operations which do both reads and writes
    (which some of them do), it appears that putting two CLS operations
    in the same `MOSDOp` might lead to weird results, since all the reads
    will happen first and then all the writes.

### `finish_ctx` ###

-   Fiddle with object state and logs to update snapshot foo and to make
    sure the object exists in the form we need it
-   Update user version if we modified the object
-   Save the updated `object_info_t`
-   Append the updated object info to the `PGLog`
-   Apply context stats

### `do_osd_op_effects` ###

-   Add watches if we need to add watches
-   If there are notifies, notify the watchers
-   Why do we ack notifies?

### `issue_repop` ###

-   Acquire locks (I'm still not clear why they're called `ondisk`. Is
    it a lock acquired to use the store and thus it locks the on-disk
    representation?)
-   Apply built-up attributes (likely versions and things that had been
    stuck in the PGLog before.)
-   Submit the transaction to the PG Backend, which is where it gets
    divided up for Erasure Coding or sent out for replication. I'll
    count that as Bottom End for the moment, alongside the Store.
    Changes to the backend will be for new consistency models.

    We might be able to get a separation of concerns by varying what
    is now ReplicatedPG to support different 'gridding' of objects on
    the OSD and rejigger things so the consistency model is purely a
    property of the backend. That's appealing from a maintenance
    perspective, but breaks down if we want things like explicitly
    marked transactions across multiple objects for some volumes while
    not paying for them on others. It might not be workable in the
    general case.
-   That's also where local application takes place.

### `eval_repop` ###

-   This function just sends notifications and cleans up when we finish.
-   Its name is not very appropriate for what it does.
-   If we're already done, return.
    *   This isn't bad, but it's specifically necessary because `eval_repop`
        gets called from several places including the handlers for our
        subservient OSDs completing an operation.
-   If everyone's ack'd, fire off our ack handlers. If everyone's
    completed, fire off our completion handlers.
-   Notify anyone waiting for the version we've committed…
-   And for those waiting on the one we've applied
-   If we've done everything, update usage stats
    *   Fire off `on_success` callbacks
    *   Remove ourselves

## Flex Points ##

### PlacementGroup/FlexiblePlacement/OtherConsistencyStrategy ###

-   Fast Dispatch currently shoves requests into a PG.
-   `handle_op` calculates a pgid and actually gets the pointer to or
    queues the session to wait on the associated PG
-   If we implement `queue_op` in FlexiblePlacement we can do whatever we
    want with it. We can ignore the WorkQueue.
-   Much of the code in `ReplicatedPG` is useful even with other
    semantic models than PG-ordered replication
-   We might want to make `ReplicatedPG` a template and
    supply the `PG` specific parts as a class instantiation. Then we
    could create more classes for other partition/dispatch models.
-   We will want a consistency/semantics variation orthogonal to the
    partition/dispatch model.
    *   In this divide, grouping objects into PGs, where all operations
        are dispatched into the PG for whatever object they affect,
        would be the partition/gridding.
    *   Whereas the total ordering on PG operations and the constraints
        on when a request blocks versus being served are the
        consistency/semantics.

### Allocation/Locking/Dispatch ###

-   `OpRequest` (currently allocated in `do_op`) and other structures
    might be allocated at various points. In our earlier prototype we
    allocated the OpRequest and another structure alongside the MOSDOp
    and reused MOSDOps rather than deallocating them, to cut down on
    allocator use in the fast path.

    That might fight with other promising designs using core-affine
    memory management, unless we can determine core affinity quickly
    before allocating the message. (Maybe peeking into the undecoded
    bytes?)
-   Lock freedom should be orthogonal to flexible placement. There may
    be situations where we want lockful systems in flexible placement
    (since flexible placement can have a variety of sync behaviors),
    and we know that Sam and others are interested in pursuing
    lock-free designs in PG-placement.
-   In a lock-free design, if PGs are core-affine,
    `enqueue_op` could just submit a message to a core without locking
    or some of the thread/worker complexity.
-   For Volumes, where the volume itself may be partitioned across cores
    `enqueue_op` would have to look at the object name to find its target.
-   Thus, we would want to pull that logic into a separate function
    giving our dispatch target.

### Read-Write Symmetry ###

-   Thankfully, `init_op_flags` is happy to set both read and write
-   CLS in particular falls afoul of this. Futures might be the best
    way to deal with it.

### Things we know we had to do anyway from previous work ###

-   Use `std::map` less as a parameter/return type, same for `std::set`
-   Objecter improvements
    *   Less allocation, change data structures. A dual to some of the
        work we want to do to make the EC interface less memory
        intensive.
    *   If we have zero copy there should be a way to materialize that
        at the level of the client.
-   See about bootstrapping client-side EC from EC overwrite
-   Librados4 should be more like Objecter than it is like librados3

## Sam and `do_op` (♪ Doo-Wop? ♪) ##

### Discussion ###

Notes taken during a BlueJeans call between Adam Emerson and Sam
Just. (Sorry for any mistakes, recording a conversation while having
it is tricky.)

-   We should never have to block for I/O
-   It's not `do_op` per se, though we are rewriting that to put it into a
    continuation passing style with trampolines
-   Various bits should be allowed to block, but whether they do or
    don't should not affect the caller's code-flow.
-   Once we've got to that point, everything after is easier
-   We have to make sure we don't introduce so much overhead that it's measurable
-   Eventually plans to go to a lock-free/sharded/partitioned style like Seastar
-   We are not using Seastar's system because, when you fulfil a
    promise, you don't necessarily want the promise fulfilled in that
    thread; it should be easy to fulfill it in a different thread.
-   Also adapting an existing codebase to Seastar is much harder than
    writing one from scratch to use it.        
-   It should also allow us to run all the OSDs in the same process
-   We might want to have one messenger per logical OSD and have those share
    threads (loses some efficiency gains but is backwards compatible.)
-   These sorts of changes will also make EC overwrites much easier.
-   Any refactors in the code should move us in this direction as a side effect
-   The sooner the better, so if it does cause performance problems we
    can find out soon
-   Branch is wip-do-op in athanatos

### Brief Exploration of the code ###

Adam Emerson looked briefly through the `wip-do-op` branch in
`https://github.com/athanatos/ceph.git` to see what the general design
looked like and how it matched up with our goals.

-   Getting rid of the 'ondisk lock' looks good; someone good at
    scheduling (Matt?) should review the queue. It should not use
    `std::list`, though.
-   The `do_replica_safe_reads` refactor isn't bad but doesn't seem to
    have an immediate effect. Sam described it as providing safety by
    shunting things replicas could do into their own function, so it
    should make future development and refactoring easier.
-   It reinforces the idea that reads inhabit a separate magisterium
    with its own law and dispensation from writes, and is the opposite
    direction from the read/write transactions we want. At least
    potentially, we could use it as a fast/safe path and have it do a
    more specialized transaction dispatch for reads, maybe.
-   The `do_op`/`do_replica_op` split seems reasonable for the
    replicated case, since in that one we want to transform the
    transaction before sending it to the replicas. If we want to allow
    CLS methods on EC pools (which we do, in principle) or mixed
    read-write, then the distinction between primary and replica might
    break down.
-   Not sure if the error channel is better per se, but since we
    currently have a bunch of functions that return `int` to indicate
    errors, it might be easier to integrate.
-   C++ should have a `void` type a bit more like unit so you could
    explicitly return `void()` from void functions. You'd think they
    could put *that* in C++17 since their list of things to add to the
    standard now consists entirely of "3 to the version number".
-   The `future` implementation looks promising. I'll need to review
    how it's put together in more detail later, how it's used is more
    pertinent at the moment.
-   Things make sense from a gradualist position. Given the desire for
    a progression from here to _A Really Fast OSD_ where we have
    _A Working OSD_ at every point along the way, this approach makes
    sense. Restructuring everything around a blocking-agnostic futures
    design then opens the way to introducing asynchronous, lock-free code.
-   This is also compatible with flexing, since we can have multiple
    `LogicalOSD` implementations with different locking strategies or
    core affinity.
-   `aio_read` looks to be less aio than the name would suggest. This
    isn't bad; it's reasonable to do a transform by having things call
    blocking procedures in a way that will work if they become non-blocking.
-   Reimplementing the blocking calls in terms of nonblocking calls is
    smart.
-   `OSDReactor` looks like it could be adapted, at least the public
    interface, into LogicalOSD once we made it less PG specific.
-   In principle it's a good idea. A LogicalOSD would have to be bound
    closely to the DataSetInterface it worked with since they're two
    halves of a queueing mechanism.
-   The futures stuff definitely isn't naïve. We need to understand
    the blockers and other details.  The idea of having a future yield
    when it needs to wait for something is a good one.
-   It uses `std::list` though.

## Why librados is not wonderful ##

Not that we hate RADOS, we just like Objecter way more:
-   Does not support read and write in same op. Neither does RADOS, to
    be fair, but we plan to fix that.
-   Takes a giant lock with every operation. Yuck.
-   Has its own 'callback' interface
-   Its handling of asynchronous operations seems very heavyweight and
    unnatural.
-   Hides the internal structure of RADOS operations
-   Does not expose object locator in a useful way
-   Does way too many allocations
-   The dimensioning of the interface is weird, like binding the IoCtx
    to a pool