File snapshot design proposals

Pranith Kumar Karampuri <pkarampu@xxxxxxxxxx> · Thu, 8 Sep 2016 13:29:42 +0530

hi,
       Doing file snapshots is becoming important in the context of providing virtual block devices to containers, so that we can take a snapshot of a block device, switch between snapshots, etc. There have been some attempts at designing such a solution, and this is a very early look at the solutions proposed so far. Please let us know what you think of them, and feel free to add any other solutions you may have for this problem.

Assumptions:

- Snap a single file

- File is accessed by a single client (VM images, block store for containers, etc.)

Because a single client accesses the file/image, reads/writes (and other modification FOPs) can act on a version number of the file, read, say, as part of lookup and passed to the other FOPs in the xdata.
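To make the version-in-xdata idea concrete, here is a minimal client-side sketch. The names `FileHandle`, `lookup`, `build_write_fop` and the `"file-version"` xdata key are hypothetical, and plain dicts stand in for real FOPs and xdata; it only shows the version being read once at lookup and attached to every modification FOP.

```python
from dataclasses import dataclass


@dataclass
class FileHandle:
    """Hypothetical client-side state for a file with a single client."""
    path: str
    version: int = 0  # learned at lookup time


def lookup(path: str, server_versions: dict) -> FileHandle:
    # The version is read once as part of lookup; with a single client
    # there is no other writer that could change it behind our back.
    return FileHandle(path, server_versions.get(path, 0))


def build_write_fop(fh: FileHandle, offset: int, data: bytes) -> dict:
    # Every modification FOP carries the cached version in its xdata
    # (hypothetical key), so the brick can decide whether to preserve
    # the block it is about to modify.
    return {
        "fop": "WRITE",
        "path": fh.path,
        "offset": offset,
        "data": data,
        "xdata": {"file-version": fh.version},
    }
```

Because only one client writes, the cached version stays valid until a snapshot bumps it.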

1) Doing file snapshots using shards (suggested by Shyam; I have tried to keep his text as is):
If a block of such a file is written with a higher version, the brick xlators can copy the block, tag the new copy with the new version, and leave the older version as is.
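A minimal sketch of this brick-side behaviour, assuming each block keeps one in-memory copy per version. The `VersionedBlock` class is hypothetical, not GlusterFS code, and rejecting a write tagged with an older version than the block already has is our assumption; the proposal does not say what happens in that case.

```python
class VersionedBlock:
    """Hypothetical brick-side block that keeps one copy per version."""

    def __init__(self, size: int):
        self.size = size
        self.copies: dict[int, bytearray] = {}  # version -> block contents

    def latest_version(self):
        return max(self.copies) if self.copies else None

    def write(self, version: int, offset: int, data: bytes) -> None:
        if offset < 0 or offset + len(data) > self.size:
            raise ValueError("write exceeds block boundary")
        latest = self.latest_version()
        if latest is None:
            # Never-written (sparse) block: nothing to preserve.
            self.copies[version] = bytearray(self.size)
        elif version > latest:
            # First write at a higher version: copy the block and leave
            # the older copy untouched for the snapshot.
            self.copies[version] = bytearray(self.copies[latest])
        elif version < latest:
            # Assumption: a write tagged with an older version is stale.
            raise PermissionError("stale version")
        self.copies[version][offset:offset + len(data)] = data
```

Note that the block never learns that a snapshot was taken; the copy happens only when a write with a newer version arrives.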

This means that, to snap such a file, only the first shard needs a higher version number, and the client operating on the file needs to be updated with this version (usually that client is the one taking the snap, but not necessarily). To update the client we can leverage the granted lease: revoke it, forcing the client to reacquire the lease by visiting the first shard (if we need to coordinate the client's writes after the snap, this may well be a must).
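The snap flow above can be sketched as follows, assuming a hypothetical `FirstShard` that owns both the file version and the lease, and a hypothetical `Client` that re-reads the version whenever its lease is revoked. Real lease revocation is asynchronous; here it is a direct method call.

```python
class FirstShard:
    """Hypothetical holder of the file's current version and its lease."""

    def __init__(self):
        self.version = 1
        self.lease_holder = None

    def acquire_lease(self, client) -> int:
        self.lease_holder = client
        return self.version  # the client learns the version with the lease

    def take_snapshot(self) -> int:
        snap = self.version
        self.version += 1  # only the first shard's version is bumped
        if self.lease_holder is not None:
            self.lease_holder.lease_revoked()
            self.lease_holder = None
        return snap  # the preserved (older) version


class Client:
    def __init__(self, first_shard: FirstShard):
        self.first_shard = first_shard
        self.version = None

    def lease_revoked(self) -> None:
        self.version = None  # must revisit the first shard before writing

    def version_for_write(self) -> int:
        if self.version is None:
            self.version = self.first_shard.acquire_lease(self)
        return self.version
```

After `take_snapshot()` the client's next write reacquires the lease and picks up the new version, so no other shard needs to be touched at snap time.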

The bottom line is that a shard does not know a snap was taken; only when a data modification operation is sent to the shard does it act to preserve the older block.

This leaves blocks of various versions on disk; when an older snap (version) is deleted, the corresponding blocks are freed.

A sparse block is never versioned in this method, i.e. if a shard did not exist when the snap was taken, no version of it is preserved, and it remains an empty/sparse block.
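The deletion and sparse-block rules can be sketched with a plain dict per shard. This is a hypothetical model, not GlusterFS code: it assumes a block copy tagged `t` serves every version from `t` until the next copy, and that the active version is always present in `live_versions`. A copy is freed only when no remaining version still resolves to it, and a block never written before a snap simply has no copy for that version.

```python
def resolve(copies: dict, version: int):
    """Return the tag of the copy that `version` sees, or None if sparse."""
    tags = [t for t in copies if t <= version]
    return max(tags) if tags else None


def delete_version(blocks: dict, live_versions: set, victim: int) -> int:
    """Drop `victim` and free block copies no live version still needs.

    `blocks` maps block index -> {version tag: bytes}. Returns the number
    of copies freed. Hypothetical model for illustration only.
    """
    live_versions.discard(victim)
    freed = 0
    for index in list(blocks):
        copies = blocks[index]
        needed = {resolve(copies, v) for v in live_versions}
        for tag in list(copies):
            if tag not in needed:
                del copies[tag]
                freed += 1
        if not copies:
            del blocks[index]  # block is sparse again for every version
    return freed
```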

Pros: Good distribution of the shards across servers and efficient use of the available space.
Cons: Difficult to provide data locality for applications that may demand it.

2) Doing a file snapshot using sparse files:
This is loosely inspired by the granular data self-heal idea we wanted to implement in afr, where each block/shard used in the file is represented logically by a bitmap stored either as an xattr or in a metafile. So there is no physical division of the file into shards. When a snapshot is taken, a new sparse file of the same size is created, and new writes on the file are redirected to this new file instead of the original, thus preserving the old file. When a write is performed on the new file, we note which block is going to be written, copy that block out of the older version, overwrite the buffer with the new data, write it to the new version, and mark the block as used in the xattr/metafile.
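A minimal sketch of this write path, using real sparse files (created with `truncate`) but keeping the bitmap in memory rather than in an xattr or metafile. `SparseVersion`, `take_snapshot` and the 4 KiB `BLOCK` size are hypothetical, writes are limited to a single in-bounds block, and the read fall-through to older versions is our assumption, since the proposal only describes writes.

```python
BLOCK = 4096  # hypothetical block size; one bitmap bit per block


class SparseVersion:
    """One version of a file: a sparse data file plus a used-block bitmap.

    Hypothetical sketch. The bitmap lives in memory here; the proposal
    would persist it as an xattr or in a metafile.
    """

    def __init__(self, path: str, size: int, parent=None):
        self.path, self.size, self.parent = path, size, parent
        nblocks = (size + BLOCK - 1) // BLOCK
        self.used = bytearray((nblocks + 7) // 8)
        with open(path, "wb") as f:
            f.truncate(size)  # same size as before, but no blocks allocated

    def is_used(self, b: int) -> bool:
        return bool(self.used[b // 8] & (1 << (b % 8)))

    def mark(self, b: int) -> None:
        self.used[b // 8] |= 1 << (b % 8)

    def read_block(self, b: int) -> bytes:
        # Walk back to the newest version that actually holds block b
        # (assumption: reads fall through like this).
        v = self
        while v.parent is not None and not v.is_used(b):
            v = v.parent
        with open(v.path, "rb") as f:
            f.seek(b * BLOCK)
            return f.read(BLOCK)

    def write(self, offset: int, data: bytes) -> None:
        if not data:
            return
        b = offset // BLOCK
        end = offset + len(data)
        if end > self.size or (end - 1) // BLOCK != b:
            raise ValueError("sketch handles in-bounds, single-block writes only")
        buf = bytearray(self.read_block(b))  # copy out from older version if needed
        start = offset - b * BLOCK
        buf[start:start + len(data)] = data  # overwrite the buffer
        with open(self.path, "r+b") as f:
            f.seek(b * BLOCK)
            f.write(buf)  # write to the new version
        self.mark(b)  # record the block as used


def take_snapshot(current: SparseVersion, new_path: str) -> SparseVersion:
    # New writes go to the returned version; `current` is left untouched.
    return SparseVersion(new_path, current.size, parent=current)
```

Every version is a full-size (if sparse) file on the same brick, which is what makes locality easy and space balancing across servers hard.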

Pros: Easier to provide data locality for applications that may demand it.
Cons: Inefficient use of the available space; we may end up with uneven usage across the servers in the cluster.

3) Doing file snapshots using the reflink functionality provided by the underlying FS:
When a snapshot request comes in, we simply reflink the current file to a new version, and new writes are redirected to this new version of the file.
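On Linux the reflink step corresponds to the `FICLONE` ioctl (see ioctl_ficlone(2)), issued on the destination fd with the source fd as argument. A minimal sketch, assuming a Linux host; `reflink_snapshot` is a hypothetical helper. It reports failure instead of silently falling back to a full copy, since missing reflink support is exactly this approach's limitation.

```python
import fcntl
import os

# Linux FICLONE ioctl number; Python 3.12+ exposes it as fcntl.FICLONE,
# older versions need the raw value _IOW(0x94, 9, int).
FICLONE = getattr(fcntl, "FICLONE", 0x40049409)


def reflink_snapshot(current_path: str, new_path: str) -> bool:
    """Reflink `current_path` to `new_path`; new writes then go to new_path.

    Returns False when the clone fails, e.g. EOPNOTSUPP on a filesystem
    without reflink support or EXDEV across mounts. Linux only.
    """
    with open(current_path, "rb") as src, open(new_path, "xb") as dst:
        try:
            # Share the source's extents with the new file; no data is copied.
            fcntl.ioctl(dst.fileno(), FICLONE, src.fileno())
            return True
        except OSError:
            pass
    os.unlink(new_path)  # don't leave an empty half-made version behind
    return False
```

On XFS (with reflink enabled) or btrfs this should succeed; on ext4 or tmpfs it returns False.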

Pros: Easiest of the three to implement; easier to provide data locality for applications that may demand it.
Cons: FS-specific, i.e. it will not work on disk filesystems that don't support file snapshots (reflinks). It also has the same problem as 2) above, i.e. inefficient use of the available space, and we may end up with uneven usage across the servers in the cluster.

-- 
Pranith

_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-devel