Re: Bricks as BTRFS

Ric Wheeler <rwheeler@xxxxxxxxxx> · Fri, 26 Sep 2014 17:57:32 -0400

On 09/26/2014 03:40 PM, James wrote:
On Fri, Sep 26, 2014 at 3:15 PM, Ric Wheeler <rwheeler@xxxxxxxxxx> wrote:
On 09/26/2014 01:58 PM, James wrote:
On Thu, Sep 25, 2014 at 2:53 AM, Venky Shankar <vshankar@xxxxxxxxxx> wrote:
Hey folks,

Wanted to check if anyone out here uses BTRFS as the backend filesystem
for GlusterFS (and is willing to share their experiences[1]). We're
planning to explore some of its features and put them to use for
GlusterFS. This was discussed briefly during the weekly meeting on
#gluster-meeting[2].

To start with, we plan to explore data/metadata checksumming (+ scrubbing)
and subvolumes to "offload" the work to BTRFS. The mentioned features
would help us with BitRot detection[3] and OpenStack Manila use cases
respectively (though there are various other nifty things one would want
to do with them).

Thanks in advance!

Hey,

I couldn't make the meeting, but I am interested in BTRFS. I added
this in puppet-gluster a bunch of months ago as a feature branch.

https://bugzilla.redhat.com/show_bug.cgi?id=1094860

I just pushed it to git master.

https://github.com/purpleidea/puppet-gluster/commit/6c962083d8b100dcaeb6f11dbe61e6071f3d13f0

The reason I want btrfs support is that I want glusterfs to eventually
be able to support reflinks across gluster volumes. There is a strong
use case for this feature.

Let me know if this helps!
Cheers,
James

Reflinks in btrfs (or ocfs2) need to be between files in the same Linux
kernel instance of btrfs. Effectively, we have two inodes backed by the
same physical blocks.

It won't, in general, be useful for reflinks across volumes....

Regards,

Ric

Agreed... Which is why this isn't a trivial thing for GlusterFS to do,
but we've discussed certain mechanisms to emulate this behaviour
across a Gluster volume. For example:

* If the reflink causes the file to be on the same brick, just reflink.
* If the reflink causes the file to be on a different brick, then
  reflink to self, and put a pointer to that original brick.
* If we want to reflink across volumes, then it's tricky, because fuse
  would have to pass this information through and down to the
  filesystem.
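
The three cases above can be sketched as a small dispatch function. This
is only an illustration of one reading of that list, not anything
GlusterFS implements: the Location type, the plan_reflink name and the
returned action strings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    """Hypothetical description of where a file lives."""
    volume: str
    brick: str


def plan_reflink(src: Location, dst: Location) -> str:
    """Pick an action for a reflink request (illustrative names only)."""
    if src.volume != dst.volume:
        # Cross-volume: no mechanism proposed yet, so report it as unsupported.
        return "unsupported-cross-volume"
    if src.brick == dst.brick:
        # Same brick: the backend filesystem can reflink directly.
        return "reflink"
    # Different brick in the same volume: reflink next to the source and
    # leave a pointer on the destination brick (one reading of the list).
    return "reflink-on-source-brick-plus-pointer"
```

For example, plan_reflink(Location("vol0", "b1"), Location("vol0", "b2"))
picks the pointer case, while two locations on the same brick pick a
plain reflink.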

The winning use case for this feature is that someone could back up
and restore petabytes of data "virtually instantly". This is already
possible within a single volume, but I'd like to scale this to a
distributed-replicated data store.

It's not clear to me why you would do anything but "try reflink" and let it
either fail or succeed.

Effectively, from a kernel point of view, reflink is just a copy offload method.
The default should be to attempt it and, if it is not supported, return the
failure to the invoking application, which then falls back to a full copy.
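
As a rough sketch of that "try reflink, else full copy" pattern from the
application side, assuming Linux and Python: the FICLONE request number
below is the value from <linux/fs.h> on common architectures such as
x86 and ARM (it is hard-coded here as an assumption), and the helper
name copy_with_reflink is made up for this example.

```python
import fcntl
import shutil

# FICLONE from <linux/fs.h>, i.e. _IOW(0x94, 9, int). Assumed value for
# x86/ARM; some architectures encode ioctl numbers differently.
FICLONE = 0x40049409


def copy_with_reflink(src, dst):
    """Try to reflink src to dst; on any failure do a full copy.

    Returns "reflink" or "copy" so the caller knows which path ran.
    """
    with open(src, "rb") as fsrc, open(dst, "wb") as fdst:
        try:
            # Ask the filesystem to share src's blocks with dst.
            fcntl.ioctl(fdst.fileno(), FICLONE, fsrc.fileno())
            return "reflink"
        except OSError:
            # Not supported here (wrong filesystem, different mounts, ...):
            # fall through to an ordinary byte-for-byte copy. A real I/O
            # problem will surface again from the copy itself.
            pass
        shutil.copyfileobj(fsrc, fdst)
    return "copy"
```

Only the "reflink" path is fast; the "copy" path runs at normal file
copy speed.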

Reflink for backup is really a bad idea, since you will not actually have made
a second copy of the blocks: if the disk fails (even partially!) you might lose
data. Where reflink is not supported, you will still need to do a full file
copy, which means normal file operation speed for the backup and restore.

Ric

_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://supercolony.gluster.org/mailman/listinfo/gluster-users