Hi all,

Darrick and I have been working on designing a new ioctl, FICLONERANGE2.
The following text attempts to explain our needs and the reasoning
behind this decision.

Contents
--------
1. Problem Statement
2. Proof of Concept
3. Proposed Solution
4. User Interface
5. Testing Plan

1. Problem Statement
--------------------

One of our VM cluster management products needs to snapshot KVM image
files so that they can be restored in case of failure. Snapshotting is
done by redirecting VM disk writes to a sidecar file and using reflink
on the disk image, specifically the FICLONE ioctl as used by
"cp --reflink". Reflink locks the source and destination files while it
operates, which means that reads from the main VM disk image are
blocked, causing the VM to stall. When an image file is heavily
fragmented, the copy process could take several minutes. Some of the VM
image files have 50-100 million extent records, and duplicating that
much metadata locks the file for 30 minutes or more. Having activities
suspended for such a long time in a cluster node could result in node
eviction. A node eviction occurs when the cluster manager determines
that the VM is unresponsive. One of the criteria for determining that a
VM is unresponsive is the failure of filesystems in the guest to respond
for an unacceptably long time. In order to solve this problem, we need
to provide a variant of FICLONE that releases the file locks
periodically to allow reads to occur as vmbackup runs. The purpose of
this feature is to allow vmbackup to run without causing downtime.

2. Proof of Concept
-------------------

Doing reflink in chunks enables the kernel to drop the file lock between
chunks, allowing IO to proceed. Here we test this approach using a fixed
chunk size of 1MB.
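
For illustration, a minimal sketch of what such a chunked copy could look
like from userspace, using the existing FICLONERANGE ioctl. This is not
necessarily the tool used for our measurements; the helper names
chunk_len() and clone_in_chunks() are made up for this example, the
caller is assumed to have opened both files and to know the source size,
and error handling is reduced to returning -1 with errno set.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/fs.h>	/* FICLONERANGE, struct file_clone_range */

#define CHUNK_SIZE (1024ULL * 1024ULL)	/* 1MB, as in the test above */

/* Length of the chunk starting at 'off' in a file of 'size' bytes. */
static uint64_t chunk_len(uint64_t off, uint64_t size, uint64_t chunk)
{
	uint64_t left = size - off;

	return left < chunk ? left : chunk;
}

/*
 * Reflink the first 'size' bytes of src_fd into dst_fd, one chunk per
 * ioctl call. Each call takes and releases the file locks on its own,
 * so other IO can run between chunks.
 *
 * Returns 0 on success, -1 with errno set on failure.
 */
static int clone_in_chunks(int src_fd, int dst_fd, uint64_t size,
			   uint64_t chunk)
{
	uint64_t off = 0;

	/* A zero chunk would turn into src_length == 0 ("clone to EOF"). */
	if (chunk == 0) {
		errno = EINVAL;
		return -1;
	}

	while (off < size) {
		struct file_clone_range fcr = {
			.src_fd = src_fd,
			.src_offset = off,
			.src_length = chunk_len(off, size, chunk),
			.dest_offset = off,
		};

		if (ioctl(dst_fd, FICLONERANGE, &fcr) < 0)
			return -1;

		off += fcr.src_length;
	}

	return 0;
}
```

A caller would fstat() the source, then call
clone_in_chunks(src_fd, dst_fd, st.st_size, CHUNK_SIZE). The chunk size
should be a multiple of the filesystem block size; as far as we know,
only the final chunk, which ends at the source EOF, may be unaligned.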

Testing a tool that does this on a heavily fragmented image gives us the
following execution times:

Number of extents in the test file - 419746, size=150GB

    command                            Time
    cp --reflink                       18s
    Fixed chunk copy (1MB chunks)      20s

We also performed these tests while simulating a read/write workload on
the image using fio. The fio results obtained are shown below.

Using "cp --reflink"
    read : io=497732KB, bw=4141.3KB/s, iops=1035, runt=120188msec
        lat (msec): min=41, max=18240, avg=467.09, stdev=1079.23
    write: io=498528KB, bw=4147.1KB/s, iops=1036, runt=120188msec
        lat (msec): min=44, max=17257, avg=520.01, stdev=1144.76

Using chunk based copy with chunk size 1MB
    read : io=617476KB, bw=5136.8KB/s, iops=1284, runt=120209msec
        lat (msec): min=6, max=3849, avg=385.28, stdev=487.71
    write: io=617252KB, bw=5134.9KB/s, iops=1283, runt=120209msec
        lat (msec): min=7, max=3850, avg=411.95, stdev=512.18

These results demonstrate that periodically dropping the file lock
reduces IO latency on a heavily fragmented file. Our tests show a max IO
latency of roughly 17-18s with a regular reflink copy, versus just under
4s with the chunk based copy.

3. Proposed Solution
--------------------

The command "cp --reflink" currently uses the FICLONE ioctl, which does
not have an option to provide a chunk size. Using the existing
FICLONERANGE ioctl would allow us to perform a chunk based copy as shown
above.

    case FICLONE:
        return ioctl_file_clone(filp, arg, 0, 0, 0);

    case FICLONERANGE:
        return ioctl_file_clone_range(filp, argp);

However, we can improve on this method by implementing a time based
copy, in which we perform as much work as possible in a given time
period. For example, we could do 15s of work before releasing the file
locks (recall a node eviction occurs after ~30s).

In order to implement a time based copy, we will need to pass additional
arguments through the ioctl. Because the struct used by FICLONERANGE is
already full, we are not able to add any new fields. Therefore, we need
to implement a new ioctl.
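
To illustrate the "already full" point: the four existing fields account
for every byte of struct file_clone_range, and the size of the argument
structure is encoded in the ioctl number itself, so growing the struct
would change the ioctl number anyway. A small check, assuming the uapi
header <linux/fs.h> is available; FCR_FIELD_BYTES and the helper names
are made up for this example:

```c
#include <assert.h>
#include <stddef.h>
#include <linux/fs.h>		/* FICLONERANGE, struct file_clone_range */
#include <linux/types.h>	/* __s64, __u64 */

/* Bytes used by src_fd, src_offset, src_length and dest_offset. */
#define FCR_FIELD_BYTES (sizeof(__s64) + 3 * sizeof(__u64))

/* Fails to compile if the struct ever gains padding or spare fields. */
_Static_assert(sizeof(struct file_clone_range) == FCR_FIELD_BYTES,
	       "file_clone_range has no room for new fields");

/* Bytes in the struct not claimed by an existing field. */
static size_t fcr_spare_bytes(void)
{
	return sizeof(struct file_clone_range) - FCR_FIELD_BYTES;
}

/* Argument size encoded in the FICLONERANGE ioctl number. */
static size_t fcr_ioctl_arg_size(void)
{
	return _IOC_SIZE(FICLONERANGE);
}
```

Both helpers are only there to make the numbers visible: there are no
spare bytes, and the ioctl number is tied to the current struct size.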

The proposed solution is to define a new ioctl FICLONERANGE2 which
differs from FICLONERANGE in two ways:

(1) FICLONERANGE2 will implement a time budget. There are two ways we
    can do this:

    (a) Define a flag that permits kernel exits (with -ERESTARTSYS) on
        regular signals. This is the least invasive to the kernel, since
        we already have mechanisms for queuing and checking for signals.
        This would not replace the current behavior of returning with
        -EINTR on fatal signals.

    (b) Add an explicit time budget field to the FICLONERANGE2 arguments
        structure and plumb that through the kernel calls.

(2) FICLONERANGE2 will return the work completion status. There are two
    ways we can do this:

    (a) Add the amount of work done to the pos fields and subtract the
        amount of work done from the length field. This "cursor" like
        operation would set up userspace to call the kernel again if the
        request was only partially filled without having to update
        anything. This might be tricky given the "length==0" trick that
        means "reflink to the source file's EOF".

    (b) Provide an explicit field in the args structure to return the
        amount of work done and require userspace to adjust the
        pos/length fields.

4. User Interface
-----------------

The current arguments structure for FICLONERANGE is shown below.

    struct file_clone_range {
        __s64 src_fd;
        __u64 src_offset;
        __u64 src_length;
        __u64 dest_offset;
    };

The new FICLONERANGE2 arguments structure will likely be larger.
Depending on the chosen implementation, we may need several additional
fields.

    __u64 flags;
    __u64 time_budget_ms;
    __u64 work_done;

5. Testing Plan
---------------

The fstests suite already contains tests for the existing clone
functionality. These tests can be found under the following groups:

    clone        - FICLONE/FICLONERANGE ioctls
    clone_stress - stress testing FICLONE/FICLONERANGE

We will also need to create additional tests for the new FICLONERANGE2
ioctl.

- Write a test case that performs a FICLONERANGE2 copy with a time
  budget.
  If our implementation allows FICLONERANGE2 to be restarted after a
  signal interruption, we can test this by creating a loop and setting
  up a signal via alarm(2) or timer_create(2).
- Write a test case that generates a file with many extents and tests
  that the kernel exits with partial completion when given a very short
  time budget.

Comments and feedback appreciated!

Catherine