Re: Progress on adding support for SEEK_DATA and SEEK_HOLE

On 07/06/2015 01:15 AM, Niels de Vos wrote:
On Wed, Jul 01, 2015 at 09:41:19PM +0200, Niels de Vos wrote:
On Wed, Jul 01, 2015 at 07:15:12PM +0200, Xavier Hernandez wrote:
On 07/01/2015 08:53 AM, Niels de Vos wrote:
On Tue, Jun 30, 2015 at 11:48:20PM +0530, Ravishankar N wrote:


On 06/22/2015 03:22 PM, Ravishankar N wrote:


On 06/22/2015 01:41 PM, Miklos Szeredi wrote:
On Sun, Jun 21, 2015 at 6:20 PM, Niels de Vos <ndevos@xxxxxxxxxx> wrote:
Hi,

it seems that there could be a reasonable benefit for virtual machine
images on a FUSE mountpoint when SEEK_DATA and SEEK_HOLE would be
available. At the moment, FUSE does not pass lseek() on to the
userspace
process that handles the I/O.

Other filesystems that do not (need to) track the position in the
file-descriptor are starting to support SEEK_DATA/HOLE. One example is
NFS:

https://tools.ietf.org/html/draft-ietf-nfsv4-minorversion2-38#section-15.11

I would like to add this feature to Gluster, and am wondering if there
are any reasons why it should/could not be added to FUSE.
I don't see any reason why it couldn't be added.  Please go ahead.

Thanks for bouncing the mail to me Niels, I would be happy to work on
this. I'll submit a patch by Monday next.



Sent a patch @
http://thread.gmane.org/gmane.comp.file-systems.fuse.devel/14752
I've tested it with some skeleton code in gluster-fuse to handle lseek().
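For reference, something along these lines can be used from userspace to exercise the new path once the kernel forwards lseek() to the filesystem (this is just a sketch of the generic SEEK_DATA/SEEK_HOLE semantics, not the skeleton code mentioned above):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Print the data extents of a file by hopping between SEEK_DATA and
 * SEEK_HOLE.  On a FUSE mount this only works once lseek() is passed
 * through to the userspace filesystem. */
int main(int argc, char **argv)
{
    if (argc != 2)
        return 1;

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    off_t end = lseek(fd, 0, SEEK_END);
    off_t pos = 0;

    while (pos < end) {
        off_t data = lseek(fd, pos, SEEK_DATA);   /* first data byte at/after pos */
        if (data < 0)
            break;                                /* ENXIO: only a trailing hole left */
        off_t hole = lseek(fd, data, SEEK_HOLE);  /* end of that data extent */
        printf("data extent: %lld - %lld\n", (long long)data, (long long)hole);
        pos = hole;
    }

    close(fd);
    return 0;
}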

Ravi also sent his patch for glusterfs-fuse:

   http://review.gluster.org/11474

I have posted my COMPLETELY UNTESTED patches to their own Gerrit topic
so that we can easily track the progress:

   http://review.gluster.org/#/q/status:open+project:glusterfs+branch:master+topic:wip/SEEK_HOLE

My preference is to share things early so that everyone can follow the
progress (and knows where to find the latest patches). Assistance in
testing, reviewing and improving is welcome! There are some outstanding
things, like seek() for ec and sharding, and probably more.

This was all done at the suggestion of Christopher (kripper) Pereira,
to improve the handling of sparse files (like most VM images).

I've posted the patch for ec in the same Gerrit topic:

     http://review.gluster.org/11494/

Thanks!

It has not been tested, and some discussion will be needed about whether
it's really necessary to send the request to all subvolumes.

The lock and the xattrop are absolutely needed. Even if we send the request
to only one subvolume, we need to know which ones are healthy (to avoid
sending the request to a brick that could have invalid hole information).
This could have been done in open, but since NFS does not issue open calls,
we cannot rely on that.

Ok, yes, that makes sense. We will likely have SEEK as an operation in
NFS-Ganesha at one point, and that will use the handle-based gfapi
functions.

Once we know which bricks are healthy we could opt for sending the request
only to one of them. In this case we need to be aware that even healthy
bricks could have different hole locations.

I'm not sure if I understand what you mean, but that likely has to do
with the fact that I don't know much about ec. I'll try to think it
through later this week.

The only thing that would need to be guaranteed is that the offset of
the hole/data is safe. The whole purpose is to improve the handling of
sparse files; this does not need to be perfect. The holes themselves are
not important, but the non-holes are.

When a sparse file (think VM image) is copied, the goal is to not read
the holes, which would only return NUL bytes. If the calculated start or
end of a hole is not exact, that is not a fatal issue. Reading and
backing up a series of NUL bytes before/after the hole should be
acceptable.

A drawing can probably explain things a little better.


                         lseek(SEEK_HOLE)
                           |       |
                   perfect |       | acceptable
                     match |       | match
                           |       |
      .....................|.......|.....................
      :file                |       |                    :
      : .----------------. v       v           .------. :
      : | DATA DATA DATA | NUL NUL NUL NUL NUL | DATA | :
      : '----------------'                 ^   '------' :
      :                                    |   ^        :
      .....................................|...|.........
                                           |   |
                                acceptable |   | perfect
                                     match |   | match
                                           |   |
                                         lseek(SEEK_DATA)
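
To make the "acceptable match" point concrete, here is a rough
sparse-copy loop (my own illustration, not code from the patches): if
SEEK_DATA/SEEK_HOLE are off by a few bytes in the safe direction, the
loop simply copies a few extra NUL bytes; nothing is lost as long as
real data is never reported as part of a hole.

#define _GNU_SOURCE
#include <unistd.h>

/* Copy only the data extents of 'in' into 'out', skipping holes.
 * 'out' is expected to be an empty, freshly created file. */
static int copy_sparse(int in, int out)
{
    char buf[65536];
    off_t end = lseek(in, 0, SEEK_END);
    off_t pos = 0;

    while (pos < end) {
        off_t data = lseek(in, pos, SEEK_DATA);
        if (data < 0)
            break;                               /* only a hole remains */
        off_t hole = lseek(in, data, SEEK_HOLE);

        /* Seeking past the current end of 'out' leaves a hole there
         * instead of writing NUL bytes. */
        if (lseek(out, data, SEEK_SET) < 0)
            return -1;

        for (off_t off = data; off < hole; ) {
            size_t want = sizeof(buf);
            if ((off_t)want > hole - off)
                want = (size_t)(hole - off);
            ssize_t n = pread(in, buf, want, off);
            if (n <= 0)
                return -1;
            if (write(out, buf, (size_t)n) != n)
                return -1;
            off += n;
        }
        pos = hole;
    }

    /* Keep the original size even when the file ends in a hole. */
    return ftruncate(out, end);
}

A caller would typically open the destination with O_CREAT|O_TRUNC and
rely on the final ftruncate() to reproduce a trailing hole.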


I have no idea how ec can figure out the offset of holes/data, that
would be interesting to know. Is it something that is available in a
design document somewhere?

EC splits the file into chunks of 512 * #data bricks bytes. Each brick receives a 512-byte fragment of each chunk. These fragments are the minimal units of data: each one is either a hole or contains data, but never a mix (if part of a fragment should be a hole, it is filled with 0's). This means that the backend filesystems can only have data/holes aligned to offsets that are multiples of 512 bytes.
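
Just to illustrate the alignment this implies, a rough sketch of the
offset mapping (the identifiers are mine, not ec's actual helpers; k is
the number of data bricks):

#include <stdint.h>

#define EC_FRAGMENT_SIZE 512

/* A file offset maps to a brick offset by dividing by the number of
 * data bricks: each brick holds one 512-byte fragment per chunk of
 * 512 * k bytes of the file. */
static uint64_t ec_file_to_brick(uint64_t file_off, unsigned k)
{
    return file_off / k;
}

/* A hole/data boundary reported by a brick can only be trusted at
 * fragment granularity, so the corresponding file offset is rounded
 * down to the chunk boundary (a multiple of 512 * k). */
static uint64_t ec_brick_to_file(uint64_t brick_off, unsigned k)
{
    return (brick_off / EC_FRAGMENT_SIZE) * EC_FRAGMENT_SIZE * (uint64_t)k;
}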

From reading some other information and your explanation, I will need to change the logic that detects data/holes. I'll update the patch as soon as possible.


My inclination is to have the same consistency for the seek() FOP as for
read(). The same locking and health-checks would apply. Does that help?

What provides consistency to read() is the initial check done just after the locking. I think this is enough to choose one healthy brick, so I'll also update the patch to use only a single brick for seek() instead of sending the request to multiple bricks.

Even if the data/hole positions can differ between healthy bricks, the bricks that have data where others have holes must contain 0's (otherwise they wouldn't be healthy). So I think it's not that important to query multiple bricks to obtain more accurate information.

Xavi
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-devel


