proposal: enhance 'cp --reflink' to expose ioctl_ficlonerange


Hi all,

Darrick and I have been working on the design of a new ioctl, FICLONERANGE2.
The following text explains our needs and the reasoning behind this proposal.


Contents
--------
1. Problem Statement
2. Proof of Concept
3. Proposed Solution
4. User Interface
5. Testing Plan


1. Problem Statement
--------------------

One of our VM cluster management products needs to snapshot KVM image files
so that they can be restored in case of failure. Snapshotting is done by
redirecting VM disk writes to a sidecar file and using reflink on the disk
image, specifically the FICLONE ioctl as used by "cp --reflink". Reflink
locks the source and destination files while it operates, which means that
reads from the main VM disk image are blocked, causing the VM to stall. When
an image file is heavily fragmented, the copy can take several minutes. Some
of the VM image files have 50-100 million extent records, and duplicating
that much metadata locks the file for 30 minutes or more. Suspending
activity for that long on a cluster node can result in node eviction, which
occurs when the cluster manager determines that the VM is unresponsive. One
criterion for that determination is that filesystems in the guest fail to
respond for an unacceptably long time. To solve this problem, we need a
variant of FICLONE that releases the file locks periodically so that reads
can proceed while vmbackup runs. The goal of this feature is to let vmbackup
run without causing downtime.

2. Proof of Concept
-------------------
Doing reflink in chunks enables the kernel to drop the file lock between
chunks, allowing IO to proceed. We tested this approach with a tool that
clones in fixed 1MB chunks. Running it on a heavily fragmented image gives
the following execution times:

Test file: 419746 extents, size=150GB

Command                          Time
cp --reflink                     18s
Fixed chunk copy (1MB chunks)    20s

We also performed these tests while simulating a read/write workload on the
image using fio. The fio results for each copy method are shown below.

Using "cp --reflink"
read : io=497732KB, bw=4141.3KB/s, iops=1035, runt=120188msec
   lat (msec): min=41, max=18240, avg=467.09, stdev=1079.23 
 
write: io=498528KB, bw=4147.1KB/s, iops=1036, runt=120188msec
   lat (msec): min=44, max=17257, avg=520.01, stdev=1144.76 

Using chunk based copy with chunk size 1MB 
read : io=617476KB, bw=5136.8KB/s, iops=1284, runt=120209msec
   lat (msec): min=6, max=3849, avg=385.28, stdev=487.71 
 
write: io=617252KB, bw=5134.9KB/s, iops=1283, runt=120209msec
   lat (msec): min=7, max=3850, avg=411.95, stdev=512.18

These results demonstrate that periodically dropping the file lock reduces
IO latency on a heavily fragmented file. Our tests show a max IO latency of
about 18s (18240 msec for reads) with regular reflink copy and just under 4s
(3850 msec for writes) with chunk based copy.
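
For reference, the chunk based copy measured above can be approximated in
userspace with the existing FICLONERANGE ioctl. The following is a minimal
sketch, not the tool used for the measurements: the helper name
clone_in_chunks() and its parameters are assumptions made for illustration.
It assumes "len" is the source file size and "chunk" is a multiple of the
filesystem block size, so that only the final request may end unaligned at
the source EOF.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/fs.h>	/* FICLONERANGE, struct file_clone_range */

/*
 * Hypothetical helper: clone the first `len` bytes of src_fd into dst_fd,
 * issuing one FICLONERANGE call per chunk so that file locks are only held
 * for the duration of a single chunk.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int clone_in_chunks(int src_fd, int dst_fd, uint64_t len, uint64_t chunk)
{
	uint64_t off = 0;

	if (chunk == 0) {
		errno = EINVAL;
		return -1;
	}
	while (off < len) {
		uint64_t n = len - off < chunk ? len - off : chunk;
		struct file_clone_range r = {
			.src_fd = src_fd,
			.src_offset = off,
			.src_length = n,
			.dest_offset = off,
		};

		/* Invariant: a request is never empty, so it cannot be
		 * mistaken for the "length==0 means to EOF" case. */
		assert(n > 0 && n <= chunk);

		/* The ioctl is issued on the destination file descriptor. */
		if (ioctl(dst_fd, FICLONERANGE, &r) < 0)
			return -1;
		off += n;
	}
	return 0;
}
```

A caller would typically obtain "len" from fstat(2) on the source and pass
1 << 20 as the chunk size to match the 1MB chunks used in the tests.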

3. Proposed Solution
--------------------
The command "cp --reflink" currently uses the FICLONE ioctl, which does not
have an option to provide a chunk size. Using the existing FICLONERANGE ioctl
would allow us to perform a chunk based copy as shown above.

case FICLONE:
	return ioctl_file_clone(filp, arg, 0, 0, 0);
 
case FICLONERANGE:
	return ioctl_file_clone_range(filp, argp);

However, we can improve on this method by implementing a time based copy, in
which we perform as much work as possible in a given time period. For example,
we could do 15s of work before releasing the file locks (a node eviction
occurs after ~30s). To implement a time based copy, we will need to pass
additional arguments through the ioctl. Because the struct used by
FICLONERANGE has no spare fields, we cannot extend it. Therefore, we need to
implement a new ioctl.

The proposed solution is to define a new ioctl FICLONERANGE2 which differs
from FICLONERANGE in two ways:

(1) FICLONERANGE2 will implement a time budget. There are two ways we can do this:
	(a) Define a flag that permits kernel exits (with -ERESTARTSYS) on
	regular signals. This is the least invasive to the kernel, since we
	already have mechanisms for queuing and checking for signals. This
	would not replace the current behavior of returning with -EINTR on fatal
	signals.
	(b) Add an explicit time budget field to the FICLONERANGE2 arguments
	structure and plumb that through the kernel calls.
(2) FICLONERANGE2 will return the work completion status. There are two ways we
can do this:
	(a) Add the amount of work done to the pos fields and subtract the
	amount of work done from the length field. This cursor-like operation
	would let userspace call the kernel again, without updating anything,
	if the request was only partially filled. This might be tricky given
	the "length==0" trick that means "reflink to the source file's EOF".
	(b) Provide an explicit field in the args structure to return the
	amount of work done and require userspace to adjust the pos/length fields.
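
To make the "length==0" difficulty in option (2a) concrete, here is a small
sketch of the cursor update. The struct clone_cursor and cursor_advance()
are hypothetical names that only mirror the pos/length fields; "src_size" is
an assumed extra input that is needed to tell when a "to EOF" request has
finished.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirror of the pos/length fields of the clone arguments. */
struct clone_cursor {
	uint64_t src_offset;
	uint64_t src_length;	/* 0 means "reflink to the source file's EOF" */
	uint64_t dest_offset;
};

/*
 * Apply `done` bytes of completed work to the cursor, as option (2a) would.
 * Returns true if another call is needed to finish the request.
 */
static bool cursor_advance(struct clone_cursor *c, uint64_t done,
			   uint64_t src_size)
{
	c->src_offset += done;
	c->dest_offset += done;

	if (c->src_length == 0) {
		/* "To EOF" request: the length stays 0, so completion can
		 * only be detected by comparing against the file size. */
		return c->src_offset < src_size;
	}

	/* Bounded request: the length shrinks by the work done. */
	assert(done <= c->src_length);
	c->src_length -= done;

	/* A bounded request that is now complete has length 0, which is
	 * indistinguishable from a fresh "to EOF" request. The caller has
	 * to rely on the return value instead of the field. */
	return c->src_length != 0;
}
```

The ambiguity in the last branch is the reason a separate completion
indicator (a flag or the explicit field of option (2b)) may be preferable.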

4. User Interface
-----------------
The current arguments structure for FICLONERANGE is shown below.

struct file_clone_range {
	__s64 src_fd;
	__u64 src_offset;
	__u64 src_length;
	__u64 dest_offset;
};

The new FICLONERANGE2 arguments structure will likely be larger. Depending
on the chosen implementation, we may need several additional fields.

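	__u64 flags;		/* option (1a): allow exit on regular signals */
	__u64 time_budget_ms;	/* option (1b): explicit time budget */
	__u64 work_done;	/* option (2b): amount of work completed */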
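
If all three candidate fields were adopted, the arguments structure might
look like the sketch below. The name file_clone_range2 and the field order
are assumptions for illustration only; the final layout depends on which of
the options above is chosen.

```c
#include <assert.h>
#include <stddef.h>
#include <linux/types.h>	/* __s64, __u64 */

/* Hypothetical layout: the existing fields followed by the new ones. */
struct file_clone_range2 {
	__s64 src_fd;
	__u64 src_offset;
	__u64 src_length;
	__u64 dest_offset;
	__u64 flags;		/* e.g. permit exit on regular signals */
	__u64 time_budget_ms;	/* time allowed before returning */
	__u64 work_done;	/* bytes cloned, written by the kernel */
};

/* All members are 64-bit, so no padding is expected and the structure has
 * the same size on 32-bit and 64-bit userspace. */
_Static_assert(sizeof(struct file_clone_range2) == 56,
	       "unexpected padding in file_clone_range2");
_Static_assert(offsetof(struct file_clone_range2, flags) == 32,
	       "new fields should follow the existing ones");
```

Keeping the first four members identical to struct file_clone_range would
let the kernel share most of the argument handling between the two ioctls.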

5. Testing Plan
---------------

The fstests suite already contains tests for the existing clone functionality.
These tests can be found under the following groups:

clone - FICLONE/FICLONERANGE ioctls
clone_stress - stress testing FICLONE/FICLONERANGE

We will also need to create additional tests for the new FICLONERANGE2 ioctl.

- Write a test case that performs a FICLONERANGE2 copy with a time budget.
  If our implementation allows FICLONERANGE2 to be restarted after a signal
  interruption, we can test this by creating a loop and setting up a signal
  via alarm(2) or timer_create(2).
- Write a test case that generates a file with many extents and tests that
  the kernel exits with partial completion when given a very short time budget.
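
The restart loop in the first test case could be structured roughly as
follows. This is a sketch under assumptions: FICLONERANGE2 does not exist
yet, so fake_clone_step() stands in for the ioctl call, and all names here
are hypothetical. It relies on the usual signal semantics, where a call
interrupted with -ERESTARTSYS is seen by userspace as EINTR when the handler
was installed without SA_RESTART.

```c
#include <assert.h>
#include <errno.h>

/* One clone attempt: returns 0 when finished, -1 with errno set otherwise. */
typedef int (*clone_step_fn)(void *ctx);

/* Re-issue the step for as long as it is interrupted; count the restarts. */
static int clone_until_done(clone_step_fn step, void *ctx, unsigned *restarts)
{
	for (;;) {
		if (step(ctx) == 0)
			return 0;
		if (errno != EINTR)
			return -1;	/* real failure, do not retry */
		(*restarts)++;
	}
}

/* Stand-in for the proposed ioctl: fails with EINTR a fixed number of
 * times, as if a timer signal had arrived, and then succeeds. */
struct fake_clone {
	unsigned interrupts_left;
};

static int fake_clone_step(void *ctx)
{
	struct fake_clone *f = ctx;

	if (f->interrupts_left > 0) {
		f->interrupts_left--;
		errno = EINTR;
		return -1;
	}
	return 0;
}

/* Example check: three interruptions must result in three restarts. */
static void check_restart_loop(void)
{
	struct fake_clone f = { .interrupts_left = 3 };
	unsigned restarts = 0;

	assert(clone_until_done(fake_clone_step, &f, &restarts) == 0);
	assert(restarts == 3);
}
```

In a real fstests case, the step function would issue the new ioctl and the
interruptions would come from alarm(2) or timer_create(2) as described above.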


Comments and feedback appreciated!

Catherine




