On Tue, Mar 1, 2016 at 12:37 PM, Prasanna Kumar Kalever <pkalever@xxxxxxxxxx> wrote:
> Hello Gluster,
>
> Introducing a new file-based snapshot feature in gluster, built on the
> reflink feature that will be available in xfs in a couple of months
> (downstream).
>
> What is a reflink?
> You have surely used softlinks and hardlinks every day!
> A reflink supports transparent copy-on-write, unlike soft/hardlinks, which
> is what makes it useful for snapshotting. A reflink points to the same data
> blocks used by the actual file (the blocks are shared between the real file
> and the reflink file, hence it is space efficient), but it uses a different
> inode number, so it can carry different permissions for the same data
> blocks. Reflinks may look similar to hardlinks, but they are more space
> efficient and can handle all operations that can be performed on a regular
> file, unlike hardlinks, which are limited to unlink().
>
> Which filesystems support reflinks?
> I think Btrfs was the first to implement them, xfs is now trying hard to
> make them available, and in the future we may see them in ext4 as well.
>
> You can get a feel for reflinks from the following tutorial:
> https://pkalever.wordpress.com/2016/01/22/xfs-reflinks-tutorial/
>
> POC in gluster: https://asciinema.org/a/be50ukifcwk8tqhvo0ndtdqdd?speed=2
>
> How are we doing it?
> Currently there is no dedicated system call that gives a handle on
> reflinks, so I decided to go with an ioctl call using the XFS_IOC_CLONE
> command.
>
> In the POC I have used setxattr/getxattr to create/delete/list snapshots.
> The restore feature will use setxattr as well.
>
> We could add a dedicated fop, but since Fuse does not understand it, we
> will manage with a setxattr at the Fuse mount point; from the client side
> it will be a fop down to the posix xlator and then an ioctl to the
> underlying filesystem. Planning to expose APIs for create, delete, list
> and restore.
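
Just to make sure I follow the brick-side flow: is the create roughly
equivalent to the sketch below? The ".fsnap/" path handling, naming and error
handling here are only illustrative, not the proposed layout, and FICLONE is
simply the generic name for the same ioctl as XFS_IOC_CLONE.

/* Minimal sketch of a brick-side snapshot create: clone the file's data into
 * a read-only file under ".fsnap/" using the reflink ioctl. */
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
#include <linux/fs.h>                    /* FICLONE, on newer headers */

#ifndef FICLONE
#define FICLONE _IOW(0x94, 9, int)       /* same value as XFS_IOC_CLONE */
#endif

static int snap_create(const char *file, const char *snapname)
{
    char snappath[PATH_MAX];
    int src, dst, ret;

    /* Snapshots live beside the file, in a hidden ".fsnap/" directory. */
    snprintf(snappath, sizeof(snappath), ".fsnap/%s.%s", file, snapname);
    mkdir(".fsnap", 0755);               /* EEXIST ignored for brevity */

    src = open(file, O_RDONLY);
    if (src < 0)
        return -1;

    /* The snapshot file gets its own inode and read-only permissions;
     * it shares data blocks with the source until either side is written. */
    dst = open(snappath, O_CREAT | O_EXCL | O_WRONLY, 0444);
    if (dst < 0) {
        close(src);
        return -1;
    }

    /* Clone all data blocks of src into dst: copy-on-write, no data copy. */
    ret = ioctl(dst, FICLONE, src);

    close(src);
    close(dst);
    if (ret < 0)
        unlink(snappath);
    return ret;
}

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <file> <snapname>\n", argv[0]);
        return 1;
    }
    if (snap_create(argv[1], argv[2]) < 0) {
        perror("snap_create");
        return 1;
    }
    return 0;
}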
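
And from the client side, I assume the trigger will be something like a
setxattr on the file at the Fuse mount, along these lines? The xattr key name
below is just a placeholder, since the interface is not finalised yet, and the
mount path and file name are made up for the example.

/* Hypothetical client-side trigger for the xattr-based snapshot interface. */
#include <stdio.h>
#include <string.h>
#include <sys/xattr.h>

int main(void)
{
    const char *file = "/mnt/glustervol/vm1.img";   /* file on a Fuse mount */
    const char *snap = "before-upgrade";            /* snapshot name */

    /* The setxattr travels as a normal fop down to the posix xlator,
     * which would turn it into the reflink ioctl shown above. */
    if (setxattr(file, "glusterfs.fsnap.create", snap, strlen(snap), 0) < 0) {
        perror("snapshot create");
        return 1;
    }
    return 0;
}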

> Are these snapshots internal or external?
> We will have a separate file each time we create a snapshot. The snapshot
> file will have a different inode number and will be read-only. All these
> files are kept in a ".fsnap/" directory inside the parent directory where
> the snapshotted/actual file resides, so they are not visible to the user
> (even with the ls -a option, just like USS).
>
> *** We can always restore to any snapshot available in the list, and the
> best part is that we can delete any snapshot between snapshot1 and
> snapshotN, because all of them are independent ***
>
> It is the application's duty to ensure the consistency of the file before
> it requests a snapshot; for example, for a VM image snapshot it is the
> hypervisor that should freeze the I/O and then request the snapshot.
>
> Integration with gluster: (initial state, needs more investigation)
>
> Quota:
> Since the snapshot files reside in the ".fsnap/" directory inside the same
> directory as the actual file, they fall under the same user's quota :)
>
> DHT:
> As said, the snapshot files will reside in the same directory as the actual
> file, in a ".fsnap/" directory.
>
> Re-balancing:
> The simplest solution could be to copy the actual file as a whole, then for
> the snapshot files rsync only the deltas and recreate the snapshot history
> by repeating the snapshot sequence after each snapshot-file rsync.
>
> AFR:
> Mostly the same as a write fop (inodelks and quorum). There would be no way
> to recover or recreate a snapshot on a node (brick, to be precise) that was
> down while the snapshot was taken and comes back later.
>
> Disperse:
> Taking the inodelk and snapshotting the file on each of the bricks should
> work.
>
> Sharding:
> Assume we have a file split into 4 shards. If the snapshot-create fop is
> sent to all the subvols holding the shards, that should be sufficient:
> every shard will have a snapshot of its own state.
> The snap-list fop should be sent only to the main subvol where shard 0
> resides.
> Delete of a snap should be similar to create.
> Restore would be a little more difficult because the metadata of the file
> needs to be updated in the shard xlator.
> <Needs more investigation>
> Also, with sharding the bricks have a gfid-based flat filesystem, so the
> snaps created will also land in the shard directory; quota is therefore not
> straightforward and needs additional work in this case.
>
> How can we make it better?
> Discussion page: http://pad.engineering.redhat.com/kclYd9TPjr

This link is not accessible externally. Could you move the contents to a public location?

> Thanks to "Pranith Kumar Karampuri", "Raghavendra Talur", "Rajesh Joseph",
> "Poornima Gurusiddaiah" and "Kotresh Hiremath Ravishankar" for all the
> initial discussions.
>
> -Prasanna

_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-devel