Re: Attempt to rethink log-based replication in Ceph on fast IO path

Hi Roman,
I’ve often thought idly about what we could do if we gave up on the
overbearing strict consistency Ceph provides and built something more
specifically suited to serving up block devices, so it’s very cool to
see somebody’s gotten more organized about it and written their
thoughts down. I for one would be happy to discuss this topic at
Cephalocon!

I’ve read through this proposal with varying levels of attention so I
may have missed something, but there are a few important points I
think you’ve missed in your writeup that need addressing, and a few
choices you’ve made that you’ll want to be ready to defend. :)

The first and most important one is this: while a system like you
describe could be *Ceph*, it is fundamentally and absolutely not
*RADOS*. The transactions RADOS provides in its ordered log are not
solely about recovery; we expose transactions and that ordered log in
great detail to the user over the network protocol with operations
like object append, object classes, compare-and-swap, the entire omap
interface, and even by returning object versions on every write
operation! These operations aren’t just novelties; RGW, RBD, and
CephFS metadata schemes all rely on them implicitly and explicitly —
and even for the “simple” RBD data objects you’re targeting, the
changes you describe break how RADOS does snapshots. (I think the RBD
data journal also relies on a number of these, but I’m not certain.)
RADOS actually *is* a primitive database, not just a disk, and I don’t
think we can usefully separate operations into the “management” and
“data” division you suggest. (Among other things, any write op has the
potential to change the object’s size and that needs to be ordered
with truncate operations.)

Now, making these changes isn’t necessarily bad if we want to develop
a faster but less ridiculously-consistent storage system to better
serve the needs of the interfaces that actually get deployed — I have
long found it a little weird that RADOS, a strictly-consistent
transactional object store, is one of the premier providers of virtual
block-device IO and S3 storage. But if that’s the goal, we should
embrace being not-RADOS and be willing to explicitly take much larger
departures from it than just “in the happy path we drop ordering and
fall back to backfill if there’s a problem”.

The second big point is that if you want to have a happy path and a
fallback ordered path, you’ll need to map out in a lot more detail how
those interact and how the clients and OSDs switch between them. Ideas
like this have come up before, but almost every one (or literally every
one?) has had a fatal flaw that prevented it from actually being safe.

Thirdly, you talk a lot about guaranteeing READ RECENCY and rejecting
IOs from clients when they arrive at replicas in different orders —
but I don’t see how you detect that case is occurring. If every client
op has a timestamp associated with it, you need to remember all the
client ops that might conflict with each other and then reject them as
needed, which sounds an awful lot like an ordered log to me! You can
probably do something with in-memory op logs and having the object
primary tell replicas when an op is fully acked, but I haven’t thought
it through…
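
To make that concrete, here's the kind of thing I'm picturing (a rough
sketch with entirely hypothetical names, and the trimming step is
exactly the part I haven't thought through):

    # Each replica remembers the newest write stamp per (object, block)
    # so reordered client ops can be rejected without a persistent log.
    class InMemoryOpWindow:
        def __init__(self):
            self.last_stamp = {}   # (obj, block) -> newest applied stamp

        def apply_write(self, obj, block, stamp):
            """Reject a write that arrives after a newer one has landed."""
            key = (obj, block)
            if stamp <= self.last_stamp.get(key, -1):
                return "RETRY"   # this replica saw the ops out of order
            self.last_stamp[key] = stamp
            return "OK"

        def fully_acked(self, obj, block, stamp):
            """The object primary says all replicas applied this op, so we
            forget the entry; a straggler older op arriving after this
            point is precisely the dangerous case."""
            if self.last_stamp.get((obj, block)) == stamp:
                del self.last_stamp[(obj, block)]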

In terms of design choices you’ve suggested, I’m not really sure why
you want to do client-side replication. It is frequently undesirable
since OSDs tend to have lower latency and more bandwidth to their
peers than the clients do to the OSDs; this system still has to do
acks to a new per-object primary *anyway*; and the usual design reason
for client-side replication is that it lets the data store
daemons be much simpler, but these OSDs are actually getting *more*
complicated than our existing ones! Client-side replication op latency
is going to be

    MAX(client network hop, client data transmission * 3)
      + op permission checking + op processing time
      + disk commit time + client network hop

whereas for primary-copy replication it will be

    client data transmission + op permission checking
      + MAX(OSD network hop, OSD data transmission * 2)
      + op processing time + disk commit time
      + OSD network hop + network hop
How those formulas compare in practice will vary but is often better
for the OSD replication — which is certainly much simpler to work with
and reason about.
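
To make the comparison concrete, here's a toy calculation; every number
below is a made-up assumption, purely to show how the terms combine:

    # Toy plug-in of the two latency formulas above. All figures are
    # invented assumptions (milliseconds), not measurements.
    hop = 0.10    # one network hop
    xmit = 0.05   # transmitting the payload once
    perm, proc, disk = 0.02, 0.05, 0.50   # permission check, processing, commit

    client_side = max(hop, xmit * 3) + perm + proc + disk + hop
    primary_copy = (xmit + perm + max(hop, xmit * 2)
                    + proc + disk + hop + hop)

    print(f"client-side replication:  {client_side:.2f} ms")
    print(f"primary-copy replication: {primary_copy:.2f} ms")

With the disk commit dominating, the two come out nearly the same here;
the balance shifts with payload size and network speed.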

In a similar vein, I’m not sure what the point is of distributing
reads out by having per-object read primaries. We already distribute
reads by PG, and I guess the larger number of objects might fix
balance problems if we have them, but that seems like it’s pushing the
argument. Moreover it complicates the design by forcing replicas to
keep track of more state and reduces cache efficiency of the cluster
since we need at least some PG metadata in memory in more places.

And I think this is more of a nit, but we really, really don’t want
client clocks to be involved in any decision-making at all, even if we
hope it’s just going to be for uncommon conflict resolutions. Sure, it
is theoretically possible to get good clock sync if you control the
environment, but it’s incredibly finicky. For instance, here’s a
ticket where a monotonic clock is somehow going backwards in time:
tracker.ceph.com/issues/43365 o_0
Luckily I don’t think the conflict resolution you’re using timestamps
for actually needs them; we can probably design a logical clock of
some description that will be safer and reasonably simple.
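
For example (purely my own sketch, not a worked-out design), a
Lamport-style counter gives every participant the same view of request
order without trusting any wall clock:

    # Minimal Lamport-clock sketch. Stamps are (counter, actor_id) pairs,
    # so ordering never depends on wall-clock time, and ties on the
    # counter are broken by the actor id.
    class LamportClock:
        def __init__(self, actor_id):
            self.actor_id = actor_id
            self.counter = 0

        def stamp(self):
            """Stamp a locally generated event, e.g. a new write request."""
            self.counter += 1
            return (self.counter, self.actor_id)

        def observe(self, remote_stamp):
            """Merge a stamp seen on an incoming message."""
            self.counter = max(self.counter, remote_stamp[0]) + 1
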
-Greg

On Thu, Jan 23, 2020 at 3:49 PM Roman Penyaev <rpenyaev@xxxxxxx> wrote:
>
> Hi Sage and all,
>
> I would like to share some of my scattered thoughts regarding IO
> performance improvements in Ceph. I touch on topics such as the
> replicated log model, atomicity and full journaling on the objectstore
> side, and how these constraints can be relaxed on the IO path in order
> to squeeze the maximum out of a hard drive while still respecting
> strong data consistency between replicas.
>
> Would be great to have a chance to discuss this in Seoul.
>
>
> Proposal in a few lines (for time saving)
> -----------------------------------------
>
> 1. IO and management requests follow different paths:
>
>     1. Slow path for management requests (object creation/deletion,
>        locks, etc.) involves the replicated log and the primary-copy
>        model, as it is right now.
>
>     2. Fast IO path (read, write requests) bypasses the replicated log
>        and makes clients responsible for sending data to each replica
>        in a group (PG); I refer to this as hybrid client-driven
>        replication further in the text.
>
> 2. For the sake of simplicity the client takes the fast IO path only
>     when the group of replicas (PG) is in an active and healthy state
>     (active+clean in terms of Ceph); otherwise the client falls back to
>     the slow path. Thus, the client is not involved in the disaster
>     recovery process (that's why hybrid).
>
> 3. Fast IO path (taken only if the group of replicas (PG) is in an
>     active and healthy state):
>
>     1. Reads of objects happen from different primaries in a group
>        (PG), in order to spread the read load.
>
>     2. Writes are distributed by clients themselves to all replicas in
>        a group (PG), avoiding one extra network hop.
>
> 4. Objectstore becomes journal-less and behaves similarly to RAID 1,
>     sacrificing atomicity and erasure coding (due to the "write hole"
>     problem) for write performance, which becomes close to the write
>     performance of a bare hard drive, while still keeping strong data
>     consistency between replicas.
>
>
> !!!!! CAUTION: many letters !!!!!
>
>
> Introduction
> ------------
>
> The most impressive IO performance numbers among products which
> provide data redundancy are certainly delivered by RAID, and that is
> no surprise: you just write a block to several disks in parallel,
> nothing more. Perfect.
>
> Then why can't Ceph achieve the same performance as RAID 1 on the IO
> path, doing the plain old write-in-parallel thing, but supplemented
> with scalability, CRUSH object placement and all the other beloved
> things which make Ceph so powerful?
>
> The answer to this question lies in the replication design used by
> Ceph, namely log-based replication (PG log in Ceph terms), which is
> commonly used in storage and database systems and well described in
> many publications [1, 2, 3, 4]. A sequential replicated log indeed
> resolves issues related to atomicity, consistency, isolation and
> durability (the famous ACID acronym), especially in the DBMS world
> where transactions are involved. But Ceph is storage, not a database.
>
> Modern hard drives do not provide atomicity, strict write ordering or
> durability. Hardware breaks all the rules and relaxes the constraints
> for the sake of performance, and software has to deal with it. Every
> modern filesystem can survive crashes (journals, COW, soft updates and
> other approaches are used), and every application which cares about
> data consistency and can't rely on the weak guarantees of the POSIX
> file API (any DBMS, for example) uses the same methods (journals, COW,
> etc.) to avoid inconsistency. Why then should a storage solution, which
> works below a filesystem and on top of the hardware, provide such
> strong ACID properties for IO?
>
> So what is the log-based replication at the core of Ceph? Very
> simple: duplication (mirroring) of a log which consists of records
> describing storage operations. In order to duplicate something we have
> to be sure that the object we duplicate has not been changed on the
> way, i.e. that copy-paste actually does what it is supposed to do
> without nasty surprises. When we have a log of storage operations we
> need not only to deliver a recent record to a replica without
> corruption (and a wrong order of operations in a log is a corruption)
> but also to invoke the operation described in that log record. Not
> only that: a new record in the log and the operation described by that
> record should happen in one transaction, atomically; either we see the
> record and the operation was performed, or there is no update in the
> log and no consequences of the operation are visible.
>
> Having such a wonderful log (which obviously can survive a crash or
> even a nuclear strike) we can very easily answer The Most Important
> question for a highly scalable, self-healing storage product: what
> happened on the other replicas while one of them took a nap (crashed)?
> In other words, we need such a complicated mechanism for only one
> thing: to tell the exact data difference (e.g. in blocks of data and
> operations), i.e. what has changed since a crash.
>
> Short quiz: can Ceph stay Ceph, but without any log-based replication?
> Yes: throw it away, no need to know the difference, just copy the
> whole bunch of data between replicas on any disaster recovery. Very
> slow on resyncing, but fast on operations, and simple. I'm, of
> course, exaggerating, but that is not so far from the truth.
>
> Talking about log-based replication, I have not yet mentioned why it
> actually impacts IO performance. Here is why:
>
> 1. Each update operation on a group of replicas (PG in terms of Ceph)
> has to be strictly ordered (the log should stay in the same order on
> all replicas), i.e. no other operation is allowed until an operation
> which has just started reaches the hard drive, with all the cache
> flushes involved (those who have poked around the code know the
> notorious pglock, which protects the replication log from corruption
> and keeps it identical on all replicas).
>
> 2. Severe restrictions on how the objectstore should actually be
> implemented: data and metadata should be fully journaled, because a
> log record of a replicated operation is stored along with the data in
> an atomic manner. And what is wrong with a journal in terms of
> performance? Nothing is wrong, it is just incredibly slow. (As a hint
> for the curious: enable the "data=journal" option for the ext4
> filesystem, or "allow_cow" for xfs, or just run any load which performs
> random writes on BTRFS (COW is always enabled), and yes, I promise, you
> will feel the difference immediately.)
>
> 3. Communication with a group of replicas (PG) happens only through a
> primary one, where the primary is responsible for keeping the log in
> sync with the other replicas. This primary-copy model [7] increases IO
> latency by adding one extra network hop.
>
> All of the above about log-based replication in Ceph is about strong
> consistency, not about performance. But can we have both? And what
> actually is strong consistency? According to the formal description,
> "The protocol is said to support strong consistency if all accesses
> are seen by all parallel processes (or nodes, processors, etc.) in the
> same order (sequentially)" [6]. In simple words: one thread writes
> data to the same offset of a file, other threads read from the same
> offset; if strong consistency is respected, readers never observe
> writes in the wrong order, e.g.:
>
>
>    Writer             Reader 1          Reader 2          Reader 3
>    --------------     --------------    --------------    --------------
>    write(file, "A");
>                       b0 = read(file);  b0 = read(file);
>                                         b1 = read(file);
>    write(file, "B");
>                                                           b0 = read(file);
>                       b1 = read(file);                    b1 = read(file);
>
> In this example, readers are able to see a block of a file in the
> following states:
>
>    Reader 1     Reader 2      Reader 3
>    ---------    ----------    ----------
>    b0 == "A"    b0 == "A"     b0 == "B"
>    b1 == "B"    b1 == "A"     b1 == "B"
>
>
> But what should never be observed and can be treated as a corruption
> is the following:
>
>    b0 == "B"
>    b1 == "A"
>
>
> Having that, we can develop a simple rule: writes are not blocked and
> can come in any order; once a reader observes a change in a block,
> further reads of this block see the change. Most likely I am
> describing a wheel here which has some vivid name and was deeply
> buried in some publication in the early '70s, but I do not care and
> call the rule READ RECENT. The READ RECENT rule contains one splendid
> property: if a read has never happened and the order of writes is
> unknown, writes can be reordered regardless of their actual order.
> Practical example: there is a distributed storage where writes are
> unordered by nature; a full outage has happened and now replicas
> contain different blocks of data at the same offsets; a read of a
> block has never happened after it was updated, thus we are able to
> bring the replicas in sync by choosing the block from *any* replica.
> Marvelous. I will return to this highly important property when I
> start describing the new replication model.
>
> It turns out that strong consistency is not about strict request
> ordering in the whole group of replicas (PG); it is only about the
> order of reads against writes to the same block of data. That's it.
> So the question is left unanswered: can we have both strong
> consistency and high IO performance?
>
> In order to answer this question, it is necessary to state once again
> the main requirements for the IO replication model which remove the
> strict restrictions imposed by log-based replication, positively
> affect performance and do not contradict strong consistency:
>
> 1. Get rid of the extra hop on the fast IO path by doing client-driven
> replication, at least when the group of replicas (PG) is in an active
> state and not doing any sort of disaster recovery.
>
> 2. Write requests to different offsets of an object are not strictly
> ordered and can be submitted to the hard drive in parallel.
>
> 3. Get rid of any journaling on the fast IO path on the objectstore
> side if possible.
>
> These are the major constraint relaxations which can bring storage IO
> performance close to bare hard drive bandwidth while keeping strong
> consistency. In the next chapter I will cover all the requirements and
> show how they can be implemented in practice.
>
>
> Separation of management requests from IO requests
> --------------------------------------------------
>
> Requests in a storage cluster can be divided into two kinds:
> management requests, which are object creation/deletion/listing,
> attribute setting/getting, lock management, etc.; and IO requests,
> which read or modify the content of an object. The most prevalent
> requests are IO read and write requests, whose latency affects the
> overall performance of the entire cluster. The proposed request
> separation also cleanly separates metadata from data, which should be
> treated differently on the objectstore side; namely, no journal should
> be involved in data modification (this will be discussed in detail
> below).
>
> Separating IO from management requests makes it possible to have
> different primary replicas in a group of replicas (PG). For example,
> in order to distribute the read load, a primary can be chosen by
> object id using consistent hashing (e.g. a hash ring as proposed in
> [8]) instead of having a single primary. Write requests require
> replication, and as was mentioned earlier a client-driven replication
> model can be chosen in order to avoid the latency of an extra hop.
>
> Having different paths for the two types of requests can be
> summarised as follows:
>
> 1. Management requests go through a persistent primary replica and
> log-based replication keeps strict ordering of operations in the
> entire group (PG). No changes to the original Ceph architecture are
> proposed here.
>
> 2. Read object requests are sent to different primaries according to
> some consistent hashing rule [8] (a sketch follows this list).
>
> 3. Write requests can follow two IO paths. On one of the paths the
> client is fully responsible for request replication and delivers
> requests to all replicas in a group (PG); this path is always taken
> when the group of replicas (PG) is in an active and healthy state. The
> second IO path always goes through a primary, so all requests are
> equally ordered (as is implemented in Ceph right now). This path is
> taken in order to simplify all the corner cases when a group of
> replicas (PG) is not in an active state and is resynchronizing.
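>
> A minimal sketch of how a per-object read primary could be picked
> inside a PG (hypothetical code, not how [8] implements it; clients and
> OSDs apply the same deterministic rule, so no coordination is needed):
>
>    import hashlib
>
>    def read_primary(object_id: str, acting_set: list) -> str:
>        """Deterministically pick the replica of a PG that serves
>        reads for this object."""
>        digest = hashlib.sha256(object_id.encode()).digest()
>        return acting_set[int.from_bytes(digest[:8], "big") % len(acting_set)]
>
>    # Example with a made-up acting set of three OSDs:
>    print(read_primary("rbd_data.1234.0000000000000007",
>                       ["osd.3", "osd.11", "osd.42"]))
>
> A plain modulo like this reshuffles primaries whenever the acting set
> changes, but the fast path is only taken while the PG is active and
> healthy, so that may be acceptable.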
>
>
> Hybrid client-driven replication
> --------------------------------
>
> Client-driven replication is quite self-descriptive: the client is
> responsible for data delivery to the replicas in a group (PG),
> avoiding the extra network hop which exists in the primary-copy model.
> If the client communicates with the replicas in a group (PG) directly,
> there is no longer a central point of synchronization, so the one
> issue that needs to be considered is the possibility of write requests
> from several clients arriving at the replicas in different orders:
>
>
>    Client 1                Client 2
>    --------                --------
>    sends A to replica1     sends B to replica1
>    sends A to replica2     sends B to replica2
>
>
>    Replica 1        Replica 2
>    ---------        ---------
>    A                B
>    B                A
>
>    ** replica1 writes B to the disk and then overwrites B with A
>       replica2 writes A to the disk and then overwrites A with B
>
>
> In this example both clients access the same block of the same object,
> but the write requests get reordered differently because of the
> network. If this happens, the replicas can contain different data, and
> they are out of sync. This issue has very bad consequences: future
> reads are undefined, especially if one of the replicas crashes and
> clients start reading from another. This is definitely data
> corruption. In order to solve the issue, the order of writes to the
> same offset should be synchronized between replicas. Synchronization
> can be performed by marking each request with a timestamp, on the
> assumption that time flows equally on all clients. Sub-microsecond
> time synchronization is not a problem [9], especially when there is a
> central point for all members: a distributed state machine cluster,
> namely the cluster of monitors in terms of Ceph.
>
> Having time synchronized, the client marks write requests with a
> timestamp. According to the Thomas Write Rule [10], outdated requests
> are discarded by timestamp-based concurrency control, keeping the
> correct order. However, in the case of client-driven replication it is
> impossible to discard a change on all replicas simultaneously.
> Instead, an outdated request should be repeated for all replicas in a
> group (PG). Here is an example (a sketch of the replica-side check
> follows these examples):
>
>
>    Client 1                  Client 2
>    --------                  --------
>    marks A with stamp 1      marks B with stamp 2
>    sends A to replica1       sends B to replica1
>    sends A to replica2       sends B to replica2
>
>
>    Replica 1        Replica 2
>    ---------        --------------
>                     -- on disk --
>    A                B
>    -- on disk --
>    B                A
>
>
>    ** replica1 writes B to the disk and rejects A with RETRY
>       replica2 writes A to the disk and then overwrites A with B
>
>
> Request A is not written by replica1, because it is older than request
> B, which was applied recently. Instead replica1 replies to request A
> with a RETRY error, which forces the client to mark request A with a
> new timestamp and resend it to the whole group of replicas (PG):
>
>
>    Client 1
>    --------
>    marks A with stamp 3
>    sends A to replica1
>    sends A to replica2
>
>
>    Replica 1        Replica 2
>    ---------        ---------
>    A                A
>                     -- on disk --
>                     B
>    -- on disk --
>    B                A
>
>
>    ** replica1 overwrites B with A
>       replica2 overwrites B with A
>
>
> The next question arises: why do we need to repeat the write operation
> on all replicas, rather than only on those which replied with the
> error? Let's consider overlapping data (here I do not consider the
> reason why clients send overlapping write requests concurrently;
> obviously this is a bug on the client side, but we nevertheless keep
> the replicas in sync):
>
>
>    Client 1 marks AAAA with stamp 1
>    Client 2 marks BBB with stamp 2
>    Client 3 marks C with stamp 3
>
>
>    Replica 1        Replica 2
>    ---------        ---------
>      C     3
>     BBB    2          C     3
>    AAAAA   1        AAAAA   1
>    ---------        ---------   below the line is the state on disk
>    ABCBA            AACAA
>
>
>    ** replica1 writes AAAA, then overwrites the middle with BBB and
>       then with C
>
>
>       replica2 writes AAAA and overwrites the middle with C;
>       BBB request is rejected
>
>
> Client2 receives the RETRY error and has to repeat BBB:
>
>
>    Client 2 marks BBB with stamp 4
>
>
>    Replica 1        Replica 2
>    ---------        ---------
>     BBB   4          BBB     4
>    -- on disk --
>      C    3
>                     -- on disk --
>     BBB   2           C      3
>    AAAAA  1         AAAAA    1
>    --------------   --------------   below the line is the state on disk
>    ABBBA            ABBBA
>
>
> On the second retry the blocks on the replicas become synchronized and
> the write request is considered completed and persistent on disk.
> Since concurrent writes should never happen with well-written client
> software (I expect distributed filesystems to use lock primitives in
> order to prevent concurrent access) I do not expect any performance
> degradation from frequent write retries, so the current algorithm acts
> as a protection which keeps the replicas in sync.
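>
> To make the rejection rule concrete, here is a minimal sketch of the
> replica-side check (hypothetical names, one timestamp per block for
> brevity; a real objectstore would need a more compact structure):
>
>    # Timestamp-based concurrency control on a replica, in the spirit
>    # of the Thomas Write Rule: a write older than the newest write
>    # already applied to the same block is answered with RETRY rather
>    # than silently dropped, because the client must repeat it on
>    # *all* replicas to keep them in sync.
>    class BlockStore:
>        def __init__(self):
>            self.stamps = {}   # block index -> stamp of newest applied write
>            self.data = {}     # block index -> payload
>
>        def write(self, block, payload, stamp):
>            if stamp <= self.stamps.get(block, -1):
>                return "RETRY"          # outdated: client re-stamps, resends
>            self.stamps[block] = stamp
>            self.data[block] = payload  # goes straight to disk, no journal
>            return "OK"
>
> Replaying the first example above: replica1 applies B (stamp 2) and
> rejects A (stamp 1) with RETRY; the client re-stamps A with stamp 3
> and resends it to both replicas, which then converge on A.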
>
> I would like to emphasize several points about the algorithm described
> above:
>
> 1. Time synchronization on clients does not need to be very accurate;
> even hundreds-of-milliseconds accuracy is enough. The described
> time-based algorithm does not depend on actual physical time, nor does
> it depend on the real (absolute) order of writes sent by concurrent
> clients. Instead the algorithm forces the distributed cluster members
> to have a single view of the request order, which is enough to decide
> whether a request should be rejected or executed.
>
> 2. Hybrid client-driven replication does not involve clients in the
> data recovery process, leaving this job to the replicas (exactly as it
> is right now). Here I describe the fast IO path only when the group of
> replicas (PG) is in an active and fully synchronized state. When the
> group of replicas (PG) is in a resynchronization state, or the client
> sends management-type requests (not IO), then for the sake of
> simplicity all communication goes through the main primary replica, as
> is implemented in Ceph right now.
>
>
> Crash of a client in the middle of a replication
> ------------------------------------------------
>
> Client-driven replication, in contrast with the primary-copy model,
> has another major issue which has to be considered in detail: the
> crash of a client in the middle of a replication, that is, when a
> client sends a write request to one of the replicas in a group (PG)
> and then crashes, leaving the other replicas out of sync. Since there
> is no central point which controls the whole replication process, the
> group of replicas has no way to know whether the write request was
> confirmed by the whole group or whether something has happened to the
> client and special action should be taken.
>
> The issue can be solved by having non-primary replicas immediately
> confirm a successfully completed write request to the primary one. As
> was mentioned earlier, each object has its own primary which serves
> reads, thus write confirmations for a modified object are sent to a
> particular primary, which counts the confirmations for the modified
> object. If confirmations do not come from all the replicas in a group,
> then after a timeout elapses the primary can start synchronizing the
> object on its own.
>
> In order to be sure that confirmations are accounted for correctly,
> each of them is marked with the timestamp taken from the source write
> request, so reordered and outdated confirmations can be discarded.
>
> I would like to stress that the accounting of confirmations happens
> only in RAM at runtime and does not involve any disk operations. If
> one of the replicas crashes after a client crash, then the whole group
> of replicas (PG) changes its state, further write requests take
> another path and are forwarded to the main primary in the group, and
> object synchronization is started anyway.
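>
> A sketch of that in-RAM accounting (hypothetical names; timers,
> failure paths and confirmation de-duplication omitted):
>
>    # Per-object read primary counts write confirmations in memory
>    # only. If confirmations from the full group don't arrive before a
>    # timeout, the primary resynchronizes the object itself.
>    import time
>
>    class AckTracker:
>        def __init__(self, group_size, timeout=5.0):
>            self.group_size = group_size
>            self.timeout = timeout
>            self.pending = {}   # (obj, block, stamp) -> (acks, deadline)
>
>        def on_write_seen(self, obj, block, stamp):
>            key = (obj, block, stamp)
>            self.pending.setdefault(key, (0, time.monotonic() + self.timeout))
>
>        def on_confirmation(self, obj, block, stamp):
>            key = (obj, block, stamp)
>            acks, deadline = self.pending.get(key, (0, 0))
>            acks += 1
>            if acks == self.group_size:
>                self.pending.pop(key, None)   # block is consistent again
>            else:
>                self.pending[key] = (acks, deadline)
>
>        def expired(self):
>            """Writes whose clients probably died; the primary must
>            resynchronize these blocks itself."""
>            now = time.monotonic()
>            return [k for k, (_, dl) in self.pending.items() if dl < now]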
>
>
> The READ RECENT rule in action
> ------------------------------
>
> In client-driven replication the order in which replicas receive
> requests and update blocks is out of our control, so readers can
> receive inconsistent data (also called a weak consistency model [8]).
> To ensure that a block, once read, will never return to its previous
> value (the READ RECENT rule mentioned earlier), all attempts to read
> an inconsistent block (one not confirmed by all replicas) have to be
> delayed or rejected with an explicit error. Consider this scenario:
>
> 1. Client1 fans out write requests to all replicas.
>
> 2. The write request reaches only the primary; client1 crashes.
>
> 3. A concurrent client2 reads the same block; since the primary
> replica has applied the write request from client1, client2 reads the
> requested block successfully and observes the change made by client1.
>
> 4. The primary replica crashes and a new primary is selected; the
> resync process in the group (PG) is started.
>
> 5. Client2 reads the same block again from the new primary, but since
> no other replica observed the change client1 wrote to the previous
> primary, client2 receives old data.
>
> This scenario breaks the READ RECENT rule and introduces data
> corruption. In order to guarantee that the block is consistent, the
> read in step 3 should be delayed or rejected until all replicas
> confirm that the block is persistent. This can be achieved by the same
> confirmation mechanism described earlier: a block is treated as
> inconsistent if it was modified and not all confirmations have been
> received.
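>
> With a structure like the AckTracker sketch above, the read-side gate
> is a simple membership check (again just a sketch; read_from_disk is a
> hypothetical stand-in for the actual objectstore read):
>
>    def read_from_disk(obj, block):
>        """Stand-in for the real objectstore read path (hypothetical)."""
>        return b"..."
>
>    def handle_read(tracker, obj, block):
>        """Delay or reject reads of blocks not yet confirmed by the
>        whole group, so the READ RECENT rule cannot be broken."""
>        blocked = any((o, b) == (obj, block)
>                      for (o, b, _stamp) in tracker.pending)
>        if blocked:
>            return "EAGAIN"   # or park the read until confirmations arrive
>        return read_from_disk(obj, block)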
>
> I do not expect any performance degradation for generic read loads and
> can quote Sage here: “Reads in shared read/write workloads are usually
> unaffected by write operations. However, reads of uncommitted data can
> be delayed until the update commits. This increases read latency in
> certain cases, but maintains a fully consistent behavior for
> concurrent read and write operations in the event that all OSDs in the
> placement group simultaneously fail and the write ack is not
> delivered” [7].
>
> Another issue, which has to be investigated in detail, is a full
> outage of a group of replicas (PG). When the replicas come back, some
> of the objects can be out of sync, and without a versioned log there
> is no way to distinguish which blocks were updated most recently and
> thus have a higher timestamp. As was mentioned earlier, the READ
> RECENT rule has one property: if a block has not been read since its
> modification, then it is irrelevant which version of the block is
> taken as the master copy. Thus, if the read of an inconsistent block
> is delayed until it is persistent on all replicas, any replica in the
> group may be a "donor" of this particular block.
>
> From the statement above it follows that non-atomic or partial writes
> can happen. Indeed, for a filesystem or an application which acts as a
> cluster client, this requires special handling of a write failure,
> such as log replay, to deal with the inconsistencies left behind. But
> it can assume that each block has just one value on all replicas in a
> group (PG), so the group remains synchronized and future reads won't
> report different content if the primary replica which serves reads
> changes.
>
>
> Journal-less objectstore for IO path
> ------------------------------------
>
> Write atomicity requires certain support from the underlying hardware
> or software. Different techniques like journals, COW or soft updates
> can be used in order to guarantee atomicity, but without doubt this
> feature comes at the high price of IO performance degradation. Since
> every modern filesystem, or a DBMS application itself, can take care
> of data inconsistency, the atomicity constraint on the objectstore can
> be relaxed.
>
> In order to provide performance close to a bare hard drive, each
> replica has to submit writes to the underlying hardware immediately,
> without any requirements for request ordering or data atomicity. As
> was mentioned earlier, the mechanism of delaying reads of
> non-persistent blocks makes synchronization of objects after a
> complete shutdown possible without knowing where the most recent
> update is. However, a full synchronization can take days if an object
> is big enough. The problem of a long resync can be solved by keeping a
> bitmap of the possibly-out-of-sync blocks of an object. The algorithm
> can be described in just one sentence: "Bits are set before a write
> commences, and are cleared when there have been no writes for a
> while." [11]
>
> In a highly distributed storage solution, where a group of replicas
> (PG) can consist of different replicas over the whole life of a
> cluster and the members of a group are constantly changing, it is
> important not only to track the out-of-sync blocks of an object, but
> also to keep versions of such changes, so as to answer the following
> question with minimal computational cost: what has changed in the
> object between versions N and M?
>
> With versions of block changes in mind, one important modification to
> the bitmap algorithm can be proposed: the bitmap of out-of-sync blocks
> is stored in a file with the timestamp of the first write request in
> its name; after a certain number of changed blocks, each replica
> rotates the bitmap file, so that a new bitmap file is created with the
> timestamp of the first write request which initiated the bitmap
> rotation. The major difference from what RAID 1 does is that bits are
> never cleared; a new bitmap file is created instead. Bits of modified
> but not yet persistent blocks (not all confirmations received from the
> replicas) migrate to the new bitmap file, in order to guarantee that
> block synchronization will take place in the future in case of a
> possible replica failure.
>
> As the timestamps of client write requests always move forward and the
> time on clients is synchronized (a major requirement of the proposed
> hybrid client-driven replication model), each replica will have a
> similar view of the block changes, so the question about the data
> difference between two versions can be easily answered.
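>
> The bookkeeping might look roughly like this (hypothetical names and
> thresholds; on a real replica each bitmap would live in its own file
> named after its first timestamp):
>
>    # Write-intent bitmaps that are rotated instead of cleared: each
>    # bitmap records the blocks dirtied since its first write stamp,
>    # so "what changed between versions N and M" is the union of the
>    # bitmaps covering that range.
>    class RotatedBitmaps:
>        def __init__(self, rotate_after=1024):
>            self.rotate_after = rotate_after
>            self.bitmaps = []   # list of (first_stamp, set of dirty blocks)
>
>        def mark_dirty(self, block, stamp, unconfirmed=()):
>            """Set the bit *before* the write commences; rotate when
>            the current bitmap is full, carrying unconfirmed bits
>            forward so their resync is never lost."""
>            if not self.bitmaps or len(self.bitmaps[-1][1]) >= self.rotate_after:
>                self.bitmaps.append((stamp, set(unconfirmed)))
>            self.bitmaps[-1][1].add(block)
>
>        def diff_since(self, stamp):
>            """Blocks possibly out of sync since version `stamp`
>            (conservative: includes the bitmap straddling the stamp)."""
>            changed = set()
>            for i, (first, blocks) in enumerate(self.bitmaps):
>                nxt = self.bitmaps[i + 1][0] if i + 1 < len(self.bitmaps) else None
>                if nxt is None or nxt > stamp:
>                    changed |= blocks
>            return changed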
>
> The proposed rotation of bitmap files can occupy extra disk space on a
> replica, especially if new files are created and never deleted.
> Deleting bitmap files of old versions is not a problem: N files can be
> kept; the rotation process spawns a new bitmap file with a recent
> timestamp while the file with the oldest timestamp is removed. If for
> some reason the changes of an object version whose bitmap file was
> removed need to be restored, then a full object resynchronization
> should be performed.
>
>
> Summary
> -------
>
> The proposed changes should relax a lot of constraints on the fast IO
> path in the current Ceph implementation while keeping strong data
> consistency between replicas, and bring the performance of a single
> client close to bare hard drive bandwidth.
>
> Even if there are logical holes in the model described above (there
> are), I still believe that eliminating them will not be impossible.
>
> PS. And yes, because of the "write hole" problem [12], erasure-coded
> replication won't survive a full group outage without atomicity
> guarantees on the objectstore side. Sorry for that.
>
> --
> Roman
>
> [1] R. Golding, "Weak-consistency group communication and membership",
> PhD thesis, University of California, Santa Cruz, 1992.
> [2] Petersen et al., "Flexible Update Propagation for Weakly
> Consistent Replication", Proc. of the 16th ACM Symposium on Operating
> Systems Principles (SOSP), 1997.
> [3] M. Rabinovich, N. Gehani, A. Kononov, "Scalable Update Propagation
> in Epidemic Replicated Databases", Advances in Database Technology -
> EDBT'96, Lecture Notes in Computer Science Vol. 1057, Springer,
> pp. 207-222.
> [4] G. Wuu, A. Bernstein, "Efficient Solutions to the Replicated Log
> and Dictionary Problems", Proceedings of the third ACM Symposium on
> Principles of Distributed Computing, August 1984, pp. 233-242.
> [6] https://en.wikipedia.org/wiki/Strong_consistency
> [7] Sage A. Weil, "Ceph: Reliable, Scalable, and High-Performance
> Distributed Storage", PhD thesis, University of California, Santa
> Cruz, 2007.
> [8] Jiayuan Zhang, Yongwei Wu, Yeh-Ching Chung, "PROAR: A Weak
> Consistency Model for Ceph".
> [9] Precision Time Protocol (PTP/IEEE 1588).
> [10] R. H. Thomas, "A majority consensus approach to concurrency
> control for multiple copy databases", ACM Trans. Database Syst.,
> vol. 4, no. 2, pp. 180-209, 1979.
> [11] Cluster support for MD/RAID 1, https://lwn.net/Articles/674085/
> [12] https://en.wikipedia.org/wiki/RAID