Ceph Developers,
We've put together a few of the main ideas from our previous work in a
brief form that we hope people will be able to digest, consider, and
debate. We'd also like to discuss them with you at Ceph Next this
Tuesday.
Thank you.
---8<---
We have been looking at improvements to Ceph, particularly RADOS,
while focusing on flexibility (allowing users to do more things)
and performance. We have come up with a few proposals with these two
things in mind. Sessions and read-write transactions aim to allow
clients to batch up multiple operations in a way that is safe and
correct, while allowing clients to gain the advantages of atomic
read-write operations without having to lock. Sessions also provide
a foundation for flow-control which ultimately improves performance
by preventing an OSD from being ground into uselessness under a
storm of impossible requests. The CLS proposal is a logical follow-on
from the read-write proposal, as we attempt to address some problems
of correctness that exist now and consider how to integrate the
facility into an asynchronous world.
Flexible Placement, as you would expect from the name, is about
allowing users more control, as is Flexible Semantics. They both
have profound performance implications, as tuning placement to better
match a workload can increase throughput, and relaxed consistency can
decrease latency. The proposed Interfaces are meant to support both as
well as work currently being done to allow an asynchronous OSD and to
hide details like locking and thread pools so that backends can be
written with different forms of concurrency and load-balancing
across processors.
Finally, Map Partitioning is not directly related to code paths within
the OSD itself, but does affect everything that can be done with Ceph.
People are beginning to run into limits on how large a Ceph cluster can
grow and how many ways it can be partitioned, and both of these problems
fundamentally derive from the way the OSD map is handled by the monitors.
There are also some notes at the end. They are not critical, but if you
find yourself asking "What were they thinking?" the notes might help.
# Sessions and Read-Write #
From `ReplicatedPG.cc`:
```c++
// Write operations aren't allowed to return a data payload because
// we can't do so reliably. If the client has to resend the request
// and it has already been applied, we will return 0 with no
// payload. Non-deterministic behavior is no good. However, it is
// possible to construct an operation that does a read, does a guard
// check (e.g., CMPXATTR), and then a write. Then we either succeed
// with the write, or return a CMPXATTR and the read value.
…
if (ctx->op_t->empty() || result < 0) {
  …
  if (ctx->pending_async_reads.empty()) {
    complete_read_ctx(result, ctx);
  } else {
    in_progress_async_reads.push_back(make_pair(op, ctx));
    ctx->start_async_reads(this);
  }
  return;
}
…
// issue replica writes
ceph_tid_t rep_tid = osd->get_tid();
RepGather *repop = new_repop(ctx, obc, rep_tid);
issue_repop(repop, ctx);
eval_repop(repop);
```
As you can see, if we have any writes (all mutations end up in the
`op_t` transaction), we just flat out don't do the requested read
operations. If we don't have any writes, we perform the read
operations and return. This is justified in the comment above because
of the non-deterministic behavior of resent read-write operations.
This is not an unsolved problem and we can bootstrap a solution on our
existing `Session` infrastructure.
## An upgraded session ##
Behold, the OSD's `Session`:
```c++
struct Session : public RefCountedObject {
  EntityName entity_name;
  OSDCap caps;
  int64_t auid;
  ConnectionRef con;
  WatchConState wstate;
  …
};
```
This structure exists once for every connection to the OSD. Where it
is created depends on who is doing the creation. In the case of
clients (what we're interested in) it occurs in `ms_verify_authorizer`:
```c++
…
isvalid = authorize_handler->verify_authorizer(cct, monc->rotating_secrets,
    authorizer_data, authorizer_reply, name, global_id, caps_info, session_key, &auid);
if (isvalid) {
  Session *s = static_cast<Session *>(con->get_priv());
  if (!s) {
    s = new Session(cct);
    con->set_priv(s->get());
    s->con = con;
    dout(10) << " new session " << s << " con=" << s->con
             << " addr=" << s->con->get_peer_addr() << dendl;
  }
  s->entity_name = name;
  if (caps_info.allow_all)
    s->caps.set_allow_all();
  s->auid = auid;
  …
}
```
To solve this problem, we propose a new data structure,
modelled on NFSv4.1 sessions:
```c++
struct OpSlot {
  uint64_t seq;
  int r;
  MOSDOpReplyRef cached; // Nullable
  bool completed;
};
```
We do not want to give the OSD an unbounded obligation to hang on to
old message replies: that way lies madness. So, the additions to
`Session` we might make are:
```c++
struct Session : public RefCountedObject {
  …
  uint32_t maxslots;            // The maximum number of operations this client
                                // may have in flight at once
  std::vector<OpSlot> slots;    // The vector of in-progress operations
  ceph::timespan slots_expire;  // How long we wait to hear from a
                                // client before the OSD is free to
                                // drop session resources
  ceph::coarse_mono_time last_contact;  // When (by our measure) we
                                        // last received an operation
                                        // from the client
};
```
## Message Additions ##
The OSD needs to communicate this information to the client. The most
useful way to do this is with an addition to `MOSDOpReply`.
```c++
class MOSDOpReply : public Message {
  …
  uint32_t this_slot;
  uint64_t this_seq;
  uint32_t max_slot;
  ceph::timespan timeout;
  …
};
```
This overlaps with the function of the transaction ID, since the
slot/sequence/OSD triple uniquely identifies an operation. Unlike the
transaction ID, this provides consistent semantics and a measure of
flow control.
To match our reply, the `MOSDOp` would need to be amended.
```c++
class MOSDOp : public Message {
  …
  uint32_t this_slot;
  uint64_t this_seq;
  bool please_cache;
  …
};
```
## Operations ##
### Connecting ###
A client, upon connecting to an OSD for the first time, should send a
`this_slot` of 0 and a `this_seq` of 0. If it reconnects to an OSD it
should use the `this_slot` and `this_seq` values from before it lost
its connection. If an OSD has state for a client and receives a
`(slot,seq) = (0,0)` then it should feel free to free any saved state
and start anew.
### OSD Feedback ###
In every `MOSDOpReply` the OSD should set `this_slot` and `this_seq` to
the values from the `MOSDOp` to which it is replying.
More usefully, the OSD can inform the client how many operations it is
allowed to send concurrently with `max_slot`. The client must **not**
send a slot value higher than `max_slot`. (The OSD should error if it
does.)
The OSD may increase the number of operations allowed in-flight
if it has capacity by increasing `max_slot`. If it finds itself
lacking capacity, it may decrease `max_slot`. If it does, the client
should respect the new bound. (The OSD may free the rescinded slots as
soon as the client sends another `MOSDOp` on a slot on which the new
`max_slot` has already been sent, since that shows the client has seen
the new bound.)
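One way the shrinking rule might be tracked is sketched below. It is only an illustration under stated assumptions: `SlotTable`, `told_max`, `shrink_to`, `on_reply` and `on_new_op` are hypothetical names, `max_slot` is treated as the highest usable slot index, and slot numbers are assumed to have been validated before these calls.

```c++
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical per-session slot bookkeeping (not existing Ceph code).
struct SlotTable {
  std::vector<uint64_t> seq;       // per-slot sequence numbers
  std::vector<uint32_t> told_max;  // max_slot carried by the last reply on each slot
  uint32_t max_slot;               // highest slot index currently granted

  explicit SlotTable(uint32_t initial_max)
    : seq(initial_max + 1, 0),
      told_max(initial_max + 1, initial_max),
      max_slot(initial_max) {}

  // The OSD is short on capacity: lower the bound, but keep the state for now.
  void shrink_to(uint32_t new_max) {
    if (new_max < max_slot) max_slot = new_max;
  }

  // A reply is going out on `slot`; returns the max_slot value to put in it.
  uint32_t on_reply(uint32_t slot) {
    if (slot < told_max.size()) told_max[slot] = max_slot;
    return max_slot;
  }

  // A new operation arrived on `slot`. If the last reply on this slot already
  // carried the current bound, the client has seen it, so the rescinded
  // slots can be released.
  void on_new_op(uint32_t slot) {
    if (told_max[slot] == max_slot && seq.size() > max_slot + 1u) {
      seq.resize(max_slot + 1u);
      told_max.resize(max_slot + 1u);
    }
  }
};
```

This sketch only handles a bound that decreases. If `max_slot` can rise and later fall back to an earlier value, comparing values is no longer enough and something like a generation counter would be needed.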
If the client sends a `this_seq` lower than the one held for a slot by
the OSD, the OSD should error. If it is more than one greater than the
held value, the OSD should error.
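The slot and sequence checks described in this section could be expressed roughly as follows. This is a minimal sketch, not proposed code: `SlotCheck` and `check_slot_seq` are hypothetical names, `max_slot` is taken to be the highest slot index the client may use, and the initial `(0,0)` connect message is assumed to be handled separately.

```c++
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical classification of an incoming (this_slot, this_seq) pair.
enum class SlotCheck { NewOp, Resend, BadSlot, BadSeq };

// held_seq[i] is the sequence number the OSD currently holds for slot i.
inline SlotCheck check_slot_seq(const std::vector<uint64_t>& held_seq,
                                uint32_t max_slot,
                                uint32_t this_slot, uint64_t this_seq) {
  // The client must not use a slot above the advertised bound.
  if (this_slot > max_slot || this_slot >= held_seq.size())
    return SlotCheck::BadSlot;
  const uint64_t held = held_seq[this_slot];
  if (this_seq < held) return SlotCheck::BadSeq;      // older than what we hold
  if (this_seq == held) return SlotCheck::Resend;     // same operation again
  if (this_seq - held == 1) return SlotCheck::NewOp;  // next operation
  return SlotCheck::BadSeq;                           // skipped ahead
}
```

Writing the "one greater" test as `this_seq - held == 1` rather than `this_seq == held + 1` avoids wrapping when `held` is at its maximum value.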
### Caching ###
The client is in an excellent position to know whether it **requires**
the output of a previous operation of mixed reads and writes on
resend, or whether it merely needs the status on resend. Thus, we let
the client set `please_cache` to request that the OSD store a
reference to the reply it sends in the appropriate `OpSlot`.
The OSD is in an excellent position to know how loaded it is. It can
calculate a bound on how large a given reply will be before executing
it. Thus, the OSD can send an error if the client has requested it
cache something larger than it feels comfortable caching.
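A minimal sketch of such an admission check follows. It assumes, purely for illustration, that the reply size can be bounded by summing the requested lengths of the reads in the operation. `reply_size_bound`, `admit_op` and the `budget` parameter are hypothetical and do not correspond to existing OSD code.

```c++
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

// Upper bound on the reply payload: the sum of the requested read lengths,
// saturating instead of wrapping on overflow. (Assumed model of a compound op.)
inline uint64_t reply_size_bound(const std::vector<uint64_t>& read_lengths) {
  const uint64_t limit = std::numeric_limits<uint64_t>::max();
  uint64_t total = 0;
  for (uint64_t len : read_lengths) {
    if (len > limit - total) return limit;
    total += len;
  }
  return total;
}

// Returns false when the OSD should reject the operation before executing it:
// the client asked for caching and the bound exceeds what the OSD is willing
// to hold (`budget`, a hypothetical per-slot limit in bytes).
inline bool admit_op(bool please_cache, uint64_t bound, uint64_t budget) {
  return !please_cache || bound <= budget;
}
```

Because the check uses only the request, it can run before any reads or writes are applied, so a rejected operation has no side effects.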
Assuming no errors, the behavior, for any slot, is this: If the client
sends an `MOSDOp` with a `this_seq` one greater than the current value
of `OpSlot::seq`, that represents a new operation. Increment
`OpSlot::seq`, clear `OpSlot::completed` and begin the operation. When
the operation finishes, set `OpSlot::completed`. If `please_cache` has been
set, store the `MOSDOpReply` in `OpSlot::cached`. Otherwise simply store the
result code in `OpSlot::r`.
If the client sends an `MOSDOp` with a `this_seq` equal to
`OpSlot::seq` and `OpSlot::completed` is false, drop the request. (We
will reply when it completes.) If it has completed, send the stored
`OpSlot::cached` reply if there is one, otherwise send a reply
with just `OpSlot::r`.
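The per-slot behavior above can be sketched as a small state machine. This is an illustration under assumptions, not the proposed implementation: `Reply` is a stand-in for `MOSDOpReplyRef`, and `Action`, `want_cache`, `on_request` and `on_complete` are hypothetical. Clearing a stale cached reply when a new operation starts is also an assumption, made so that a resend never returns the previous operation's payload.

```c++
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>
#include <utility>

using Reply = std::shared_ptr<std::string>;  // stand-in for MOSDOpReplyRef

struct OpSlot {
  uint64_t seq = 0;
  int r = 0;
  Reply cached;             // null when nothing is cached
  bool completed = false;
  bool want_cache = false;  // hypothetical: please_cache of the current op
};

enum class Action { Start, Drop, ReplyCached, ReplyStatus, Error };

// Decide what to do with an incoming MOSDOp on this slot.
inline Action on_request(OpSlot& s, uint64_t this_seq, bool please_cache) {
  if (this_seq == s.seq) {                   // resend of the current operation
    if (!s.completed) return Action::Drop;   // the reply goes out on completion
    return s.cached ? Action::ReplyCached : Action::ReplyStatus;
  }
  if (this_seq > s.seq && this_seq - s.seq == 1) {  // new operation
    s.seq = this_seq;
    s.completed = false;
    s.cached.reset();                        // assumption: drop the old reply
    s.want_cache = please_cache;
    return Action::Start;
  }
  return Action::Error;                      // stale or skipped sequence
}

// Record the outcome when the operation finishes.
inline void on_complete(OpSlot& s, int result, Reply reply) {
  s.completed = true;
  s.r = result;  // kept in both cases; harmless when a full reply is cached
  if (s.want_cache) s.cached = std::move(reply);
}
```

A resend that arrives while the operation is still running yields `Drop`; after completion it yields `ReplyCached` or `ReplyStatus` depending on whether the client asked for caching.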
### Reconnection ###
Currently the `Session` is destroyed on reset and a new one is created
on authorization. In our proposed system the `Session` will not be
destroyed on reset. Instead, it will be moved to a structure where it
can be looked up, and destroyed once `timeout` has elapsed since the
last message received.
On connection, the OSD should first look up a `Session` keyed
on the entity name and create one if that fails.
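The lookup-or-create and expiry behavior might look something like the sketch below. It is deliberately simplified and every name in it is hypothetical: `SessionRegistry` keeps connected and parked sessions in one map distinguished by a flag rather than moving them between structures, the `Session` here is a toy stand-in, and time is passed in explicitly so the logic is easy to follow.

```c++
#include <cassert>
#include <chrono>
#include <cstddef>
#include <map>
#include <memory>
#include <string>

using Clock = std::chrono::steady_clock;

struct Session {  // toy stand-in for the OSD's Session
  std::string entity_name;
  Clock::time_point last_contact;
  bool connected = false;
};

class SessionRegistry {
  std::map<std::string, std::shared_ptr<Session>> by_name_;
  Clock::duration timeout_;
public:
  explicit SessionRegistry(Clock::duration timeout) : timeout_(timeout) {}

  // On connection: reuse the session for this entity name, or create one.
  std::shared_ptr<Session> connect(const std::string& name, Clock::time_point now) {
    auto& s = by_name_[name];
    if (!s) {
      s = std::make_shared<Session>();
      s->entity_name = name;
    }
    s->connected = true;
    s->last_contact = now;
    return s;
  }

  // On connection reset: keep the session, but mark it as parked.
  void reset(const std::string& name) {
    auto it = by_name_.find(name);
    if (it != by_name_.end()) it->second->connected = false;
  }

  // Drop parked sessions not heard from for longer than the timeout.
  // Returns the number of sessions dropped.
  std::size_t expire(Clock::time_point now) {
    std::size_t dropped = 0;
    for (auto it = by_name_.begin(); it != by_name_.end();) {
      if (!it->second->connected && now - it->second->last_contact > timeout_) {
        it = by_name_.erase(it);
        ++dropped;
      } else {
        ++it;
      }
    }
    return dropped;
  }
};
```

A client that reconnects before expiry gets its old session, and with it its slot state, back. One that returns after expiry starts from a fresh session.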