Design discussion about replicated persistent write-back cache in librbd.

Hi all,

Thanks to Jason, Josh and others, we discussed the replicated persistent
write-back cache during the last CDM. This email continues that
discussion with more detail about error handling.
The following describes the background and the handling of each error
case; any comments are welcome.

Current implementation:
======================

A persistent write-back cache [1] is implemented in librbd. It
provides an LBA-based, ordered write-back cache that uses NVDIMM as the
cache medium.
The data layout on the cache device is split into three parts: a
header, a vector of log entries, and customer data.

The customer data part stores all of the customer data.

Every update request (write, discard, etc.) is mapped to a log entry,
and these entries are stored sequentially in the vector of log entries.
The vector acts like a ring buffer and is reused.

The header part records the overall information about the cache pool,
in particular the head and the tail. The head indicates the first valid
entry in the log-entry vector, and the tail indicates the next free
entry.
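
As a rough illustration of the layout described above, here is a
minimal C++ sketch; the field names and types are only illustrative and
are not the actual on-media format of [1]:

  // Illustrative layout only -- not the real on-media format.
  #include <cstdint>

  struct PoolHeader {
    uint64_t head;         // index of the first valid (oldest) log entry
    uint64_t tail;         // index of the next free log entry
    uint64_t num_entries;  // capacity of the log-entry ring
    uint64_t data_offset;  // where the customer-data area begins
  };

  struct LogEntry {
    uint64_t image_offset; // image offset this write/discard covers
    uint64_t length;
    uint64_t data_offset;  // where the payload lives in the data area
    uint8_t  op;           // write, discard, ...
  };

  // The log entries form a ring buffer that is reused:
  //   next_index = (index + 1) % num_entries;
  // entries are appended at the tail and retired from the head.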

Replicated write-back cache
=======================

The above is the current implementation of the persistent write-back
cache in librbd: the data is stored on the local server as a single
copy. To improve redundancy, we plan to add more copies across
different servers, i.e. a client-side replicated write-back cache built
on NVDIMM + RDMA.

Besides librbd, replica daemon services will be started on other
servers to manage the NVDIMM devices there. When librbd starts and a
persistent write-back cache is required, it allocates a cache pool on
the local NVDIMM device. Meanwhile, it asks the replica daemons to
allocate remote replica copies. After initialization, the replica
daemons register the replica pools and expose them through RDMA
connections. All of the cache metadata is stored as part of the rbd
image’s metadata. librbd then sets up RDMA connections with the
corresponding replica daemons and accesses the data.
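
A minimal sketch of this start-up sequence, assuming purely
hypothetical helper types and functions (none of them are existing
librbd interfaces):

  #include <cstddef>
  #include <string>
  #include <vector>

  // Hypothetical types that only model the description above.
  struct LocalPool     { /* mmap()ed region on the local NVDIMM */ };
  struct RdmaConn      { /* RDMA connection handle */ };
  struct ReplicaHandle { std::string address; RdmaConn *conn = nullptr; };

  // Hypothetical helpers, declared only to make the flow readable.
  LocalPool allocate_local_pool(std::size_t pool_size);
  std::vector<ReplicaHandle> request_replica_pools(std::size_t pool_size,
                                                   int copies);
  void store_cache_metadata(const LocalPool &local,
                            const std::vector<ReplicaHandle> &replicas);
  RdmaConn *rdma_connect(const std::string &address);

  void open_replicated_cache(std::size_t pool_size) {
    // 1. Allocate the cache pool on the local NVDIMM device.
    LocalPool local = allocate_local_pool(pool_size);

    // 2. Ask the replica daemons to allocate replica pools with the
    //    same size and layout on their NVDIMM devices.
    std::vector<ReplicaHandle> replicas =
        request_replica_pools(pool_size, /*copies=*/2);

    // 3. Store the cache metadata (replica locations, pool ids, ...)
    //    as part of the rbd image's metadata.
    store_cache_metadata(local, replicas);

    // 4. Connect to each replica daemon over RDMA so that the replica
    //    pools can be read and written directly.
    for (ReplicaHandle &r : replicas) {
      r.conn = rdma_connect(r.address);
    }
  }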

With NVDIMM + RDMA, all copies have exactly the same data layout and
data. The basic idea is to register the NVDIMM memory through RDMA and
then use RDMA reads/writes to access the data, which does not require
CPU involvement on the remote servers. These parts will use the RPMA
library [2].

When an update request arrives, librbd caches it in the local NVDIMM
and meanwhile persists it at the same position in the remote replica
copies.
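
A rough sketch of that write path, reusing the hypothetical types from
the sketches above; remote_write()/remote_flush() stand in for
RPMA-style one-sided operations and are not real RPMA function names:

  // Hypothetical helpers: persist into the local pool, and one-sided
  // RDMA write/flush to a replica pool without remote CPU involvement.
  std::size_t local_write_data(LocalPool &local, const void *buf,
                               std::size_t len);
  std::size_t local_append_entry(LocalPool &local, const LogEntry &entry);
  void remote_write(ReplicaHandle &r, std::size_t offset,
                    const void *buf, std::size_t len);
  void remote_flush(ReplicaHandle &r);

  void persist_update(LocalPool &local,
                      std::vector<ReplicaHandle> &replicas,
                      const LogEntry &entry,
                      const void *payload, std::size_t len) {
    // 1. Persist the payload and the log entry in the local NVDIMM
    //    pool (e.g. pmem copy plus flush/drain).
    std::size_t data_off  = local_write_data(local, payload, len);
    std::size_t entry_off = local_append_entry(local, entry);

    // 2. Write the same bytes to the same offsets in every replica
    //    pool, then flush so they are durable on the remote NVDIMM.
    for (ReplicaHandle &r : replicas) {
      remote_write(r, data_off, payload, len);
      remote_write(r, entry_off, &entry, sizeof(entry));
      remote_flush(r);
    }
    // In this sketch the request completes only once all copies are
    // durable.
  }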

The rest of this email focuses on how to handle the various failure
scenarios.

1. Librbd crashes or local NVDIMM error

Since the local cache pool is mmap()ed into the librbd process, librbd
crashes when an error happens in the NVDIMM, so an NVDIMM error is
handled the same way as a librbd crash.

Once the librbd process crashes, the RDMA connections to the replicas
are lost. The replica daemons monitor the connection status. Once they
detect the disconnection and the timeout has passed, they try to
acquire the exclusive lock of the rbd image. Only one replica daemon
can acquire the exclusive lock; it then flushes the cached data to the
OSDs. Once the flush is complete, it does the following (see the sketch
after this list):

a. The cache metadata of the image is updated to “none” and the
exclusive lock is released.
b. The other replica daemons are notified to release their cache pools.
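
A sketch of that recovery path in the winning replica daemon;
ReplicaDaemon and its methods are hypothetical names used only to spell
out the steps above:

  #include <string>

  struct ReplicaDaemon {
    // Hypothetical operations matching the steps described above.
    bool try_acquire_exclusive_lock(const std::string &image);
    void flush_cache_to_osds(const std::string &image);
    void clear_cache_metadata(const std::string &image);
    void release_exclusive_lock(const std::string &image);
    void notify_peers_release_pools(const std::string &image);
  };

  void recover_after_client_crash(ReplicaDaemon &d, const std::string &image) {
    // Only one replica daemon can win the image's exclusive lock.
    if (!d.try_acquire_exclusive_lock(image)) {
      return;  // another daemon (or a live librbd) owns the image
    }
    // Flush the replica's cached data to the OSDs.
    d.flush_cache_to_osds(image);
    // a. Mark the image's cache metadata as "none" and drop the lock.
    d.clear_cache_metadata(image);
    d.release_exclusive_lock(image);
    // b. Ask the other replica daemons to release their cache pools.
    d.notify_peers_release_pools(image);
  }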

2. Librbd restarts

When the librbd process restarts, its corresponding replica daemons
detect that the RDMA connection is lost, wait for a while, and then try
to flush.

To prevent unnecessary flushes by the replica daemons, the timeout can
be configured by users. Only after waiting for this timeout do the
replica daemons start to flush the cached data.
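
A small sketch of that timeout gate in the replica daemon's connection
monitor, reusing the ReplicaDaemon sketch from case 1; the option name
and helpers are made up for illustration:

  #include <chrono>
  #include <string>

  struct ConnectionMonitor {
    std::chrono::seconds flush_timeout;  // user-configured timeout,
                                         // e.g. a hypothetical
                                         // "replica_cache_flush_timeout"
    // Returns true if librbd reconnects before the timeout expires.
    bool wait_for_reconnect(const std::string &image,
                            std::chrono::seconds timeout);
  };

  void on_connection_lost(ConnectionMonitor &m, ReplicaDaemon &d,
                          const std::string &image) {
    // A quick librbd restart reconnects in time: no flush is needed
    // and the existing cache pools stay in place.
    if (m.wait_for_reconnect(image, m.flush_timeout)) {
      return;
    }
    // Otherwise fall through to the flush path sketched in case 1.
    recover_after_client_crash(d, image);
  }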

3. Replica daemon crashes

As above, the timeout needs to be defined. When librbd detects the
disconnection, it tries to re-establish the connection. If this still
fails after the timeout, it starts to look for a new replica and syncs
the data to it.

Based on our tests, it takes about 1 s to sync 1 GB of data through
two ports of a 100 Gb/s connection.
The failover time includes 1) the time to detect the error, 2) the
timeout, 3) the time to allocate a new replica copy, and 4) the time to
sync the data. The overall time will not exceed 300 s (the common IO
timeout).
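
As an illustrative budget (the numbers are only an example): with a
20 GB cache pool, detecting the failure may take a few seconds, a
configured timeout of ~30 s follows, allocating the new replica pool
takes a few more seconds, and syncing 20 GB at the measured ~1 GB/s
takes roughly 20 s, so the total stays around one minute, well under
the 300 s limit.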

If the failed replica daemon recovers in time, librbd checks the data
integrity by comparing the pool headers. If the data is in sync, IO
handling resumes.
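
A minimal sketch of that integrity check, reusing the illustrative
PoolHeader fields from the layout sketch above:

  // If the replica's header still matches the local one, nothing was
  // appended or retired while the replica was away, so IO handling can
  // resume on it.
  bool replica_in_sync(const PoolHeader &local, const PoolHeader &remote) {
    return local.head == remote.head &&
           local.tail == remote.tail &&
           local.num_entries == remote.num_entries;
  }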

4. RDMA connections between librbd and replica daemons lost

If only the connections are lost, the replica daemons try to acquire
the exclusive lock. Since the exclusive lock is still held by librbd,
the acquisition fails for the replica daemons, and as a result no flush
happens.

librbd also detects the disconnection; its behavior is the same as in
the ‘replica daemon crashes’ case above.


[1] https://github.com/ceph/ceph/pull/35060
[2] https://github.com/pmem/rpma


-- 
Best wishes
Lisa