Hi all,

Thanks to Jason, Josh and others, we discussed the replicated persistent write-back cache during the last CDM. This email continues that discussion with details about error handling. The following describes the background and how each error case is handled; any comments are welcome.

Current implementation
======================
A persistent write-back cache [1] is implemented in librbd. It provides an LBA-based, ordered write-back cache using NVDIMM as the cache medium.

The data layout on the cache device is split into three parts: a header, a vector of log entries, and customer data. The customer data part stores all of the customer data. Every update request (write, discard, etc.) is mapped to a log entry, and these entries are stored sequentially in the vector of log entries. The vector acts as a ring buffer and is reused repeatedly. The header part records the overall information about the cache pool, in particular the head and tail: the head indicates the first valid entry in the log, and the tail indicates the next free entry.

Replicated write-back cache
===========================
The above is the current implementation of the persistent write-back cache in librbd; the data is stored on the local compute server as a single copy. To improve redundancy, we plan to add more copies across different servers: a replicated write-back cache on the client side using NVDIMM + RDMA.

Besides librbd, replica daemon services will run on other servers and manage the NVDIMM devices on those servers. When librbd starts and a persistent write-back cache is required, it allocates a cache pool on the local NVDIMM device and, at the same time, talks with the replica daemons to allocate remote replica copies. After initialization, the replica daemons register the replica pools and expose them through RDMA connections. All the cache metadata is stored as part of the rbd image's metadata. librbd sets up RDMA connections with the corresponding replica daemons and accesses the data over them.

With NVDIMM + RDMA, all the copies have exactly the same data layout and data. The basic idea is to register the NVDIMM for RDMA and then use RDMA read/write to access the data, which does not require the involvement of the CPUs on the remote servers. These parts will use the RPMA library [2]. When an update request comes, librbd caches the request in the local NVDIMM and, at the same time, persists it at the same position in the remote replica copies.
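For reference, here is a minimal sketch of the layout that every copy shares, as described under "Current implementation" above. All names, field widths, and the ring size below are illustrative assumptions, not the actual structures from [1].

#include <cstdint>

// Illustrative assumption for the ring size.
constexpr uint64_t LOG_ENTRY_COUNT = 1u << 20;

struct PoolHeader {
  uint64_t head;   // index of the first valid (oldest) log entry
  uint64_t tail;   // index of the next free log entry
  // ... pool size, flags, image association, etc.
};

struct LogEntry {
  uint64_t image_offset;   // LBA of the write/discard within the image
  uint64_t length;
  uint64_t data_offset;    // where the payload sits in the data area
  uint64_t sequence;       // preserves write ordering
  uint8_t  op;             // write, discard, ...
};

// The cache pool as laid out on the NVDIMM: header, ring of log entries,
// then the customer data area. This struct only describes the layout; the
// pool is accessed via mmap, never instantiated on the stack.
struct CachePoolLayout {
  PoolHeader header;
  LogEntry   log[LOG_ENTRY_COUNT];
  // customer data area follows the log entries on the device
};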
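And a similarly rough sketch, assuming the librpma write/flush API [2], of how one cached update could be mirrored to a replica. Because every copy shares the same layout, the entry and its payload go to the same byte offsets in the remote pool as in the local one. The Replica struct, the offset parameters, and the omission of completion-queue handling are assumptions for illustration, not the proposed implementation.

#include <librpma.h>
#include <cstddef>

// Illustrative only: one replica = one RDMA connection plus the registered
// local cache pool and the remote pool exposed by the replica daemon.
struct Replica {
  struct rpma_conn      *conn;
  struct rpma_mr_local  *local_pool;   // mmapped local NVDIMM cache pool
  struct rpma_mr_remote *remote_pool;  // replica's pool, identical layout
};

// Mirror a log entry and its payload that were just persisted locally.
int mirror_update(Replica &r,
                  size_t data_off,  size_t data_len,
                  size_t entry_off, size_t entry_len)
{
  // RDMA-write the customer data into the replica's NVDIMM; no remote CPU
  // involvement is needed.
  int ret = rpma_write(r.conn, r.remote_pool, data_off,
                       r.local_pool, data_off, data_len,
                       RPMA_F_COMPLETION_ON_ERROR, nullptr);
  if (ret)
    return ret;

  // Make the payload durable on the remote persistent memory before the
  // log entry that references it becomes valid.
  ret = rpma_flush(r.conn, r.remote_pool, data_off, data_len,
                   RPMA_FLUSH_TYPE_PERSISTENT,
                   RPMA_F_COMPLETION_ON_ERROR, nullptr);
  if (ret)
    return ret;

  // RDMA-write the log entry itself to the same slot in the remote ring.
  ret = rpma_write(r.conn, r.remote_pool, entry_off,
                   r.local_pool, entry_off, entry_len,
                   RPMA_F_COMPLETION_ON_ERROR, nullptr);
  if (ret)
    return ret;

  // Flush the entry and request a completion; waiting on the connection's
  // completion queue before acknowledging the write is omitted here.
  return rpma_flush(r.conn, r.remote_pool, entry_off, entry_len,
                    RPMA_FLUSH_TYPE_PERSISTENT,
                    RPMA_F_COMPLETION_ALWAYS, nullptr);
}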
This email focuses on how to handle the following failure scenarios.

1. Librbd crashes or local NVDIMM error

As the local cache pool is mmapped into the librbd application, the librbd process crashes when an error happens in the NVDIMM, so an NVDIMM error is handled the same way as a librbd crash. Once the librbd process crashes, the RDMA connections to the replicas are lost. The replica daemons monitor the connection status. Once they detect the disconnection and a timeout has passed, they try to acquire the exclusive lock of the rbd image. Only one replica daemon can get the exclusive lock, and it starts to flush the cached data to the OSDs. Once the flush is complete, it does the following:
a. The cache metadata of the volume is updated to none and the exclusive lock is released.
b. The other replica daemons are notified to release their cache pools.

2. Librbd restarts

When the librbd process restarts, its corresponding replica daemons detect that the RDMA connection is lost, wait some time, and try to flush. To prevent unnecessary flushes by the replica daemons, a timeout can be configured by users; only after that timeout has passed do the replica daemons start to flush the cached data.

3. Replica daemon crashes

As above, a timeout needs to be defined. When librbd detects the disconnection, it tries to re-establish the connection. If it still fails after the timeout, it finds a new replica and syncs the data to it. Based on our tests, it takes about 1s to sync 1G of data through two ports of a 100Gb/s connection. The failover time includes 1) the time to detect the error, 2) the timeout, 3) the time to allocate a new replica copy, and 4) the time to sync the data. The overall time will not exceed 300s (the common IO timeout). If the failed replica daemon recovers in time, librbd checks data integrity by comparing the pool headers; if the data is in sync, IO handling resumes.

4. RDMA connections between librbd and replica daemons are lost

If only the connections are lost, the replica daemons try to acquire the exclusive lock. As the exclusive lock is held by librbd, the replica daemons fail to get it, so no flush happens. librbd also detects the disconnection; its behavior is the same as in the 'Replica daemon crashes' case above.

[1] https://github.com/ceph/ceph/pull/35060
[2] https://github.com/pmem/rpma

--
Best wishes
Lisa