Wrong mail-id used earlier, please refer below.

-------- Original Message --------
Subject: Re: [RFC] Block Device Xlator Design
Date: Wed, 11 Jul 2012 16:24:24 +0530
From: Amar Tumballi <atumball@xxxxxxxxxx>
To: M. Mohan Kumar <mohan@xxxxxxxxxx>
CC: Shishir Gowda <sgowda@xxxxxxxxxx>, gluster-devel@xxxxxxxxxx
I posted GlusterFS server xlator patches to enable exporting block devices (currently only logical volumes) as regular files on the client side a couple of weeks ago. Here is the link to the patches: http://review.gluster.com/3551. I would like to discuss the design of this xlator. The current code uses the lvm2-devel library to find the list of logical volumes for the given volume group (in the BD xlator each volume file exports one volume group; in the future we may extend this to export multiple volume groups if needed). The init routine of the BD xlator constructs an internal data structure holding the list of all logical volumes in the VG.
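Roughly, the init path does something like the following (a simplified, untested sketch using liblvm2app; the VG name, error handling and layout here are illustrative, not the actual patch code):

#include <stdio.h>
#include <inttypes.h>
#include <lvm2app.h>

int
main (int argc, char *argv[])
{
        const char         *vg_name = (argc > 1) ? argv[1] : "vg0"; /* placeholder VG */
        lvm_t               handle  = NULL;
        vg_t                vg      = NULL;
        struct dm_list     *lvs     = NULL;
        struct lvm_lv_list *lvl     = NULL;

        handle = lvm_init (NULL);                   /* default LVM system directory */
        if (!handle)
                return 1;

        vg = lvm_vg_open (handle, vg_name, "r", 0); /* open the VG read-only */
        if (!vg) {
                lvm_quit (handle);
                return 1;
        }

        lvs = lvm_vg_list_lvs (vg);                 /* all LVs in this VG */
        if (lvs) {
                dm_list_iterate_items (lvl, lvs)
                        printf ("%s %" PRIu64 "\n",
                                lvm_lv_get_name (lvl->lv),
                                lvm_lv_get_size (lvl->lv));
        }

        lvm_vg_close (vg);
        lvm_quit (handle);
        return 0;
}

The real xlator keeps these names/sizes in its private structure instead of printing them.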
Went through the patchset, and it looks fine. One of the major things to take care of is that the build should not fail, and it should not assume that the lvm2-devel library is always present. Hence it should have corresponding checks in configure.ac to handle that situation. (ref: you can look at how the libibverbs-devel dependency is handled)
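Something along these lines in configure.ac, for example (a rough sketch only; the option and conditional names are illustrative, not copied from the existing ibverbs check):

dnl Make the BD xlator optional and detect liblvm2app at configure time.
AC_ARG_ENABLE([bd-xlator],
              AS_HELP_STRING([--enable-bd-xlator], [build the block device xlator]))

BUILD_BD_XLATOR=no
if test "x$enable_bd_xlator" != "xno"; then
   AC_CHECK_LIB([lvm2app], [lvm_init], [HAVE_LVM2APP=yes], [HAVE_LVM2APP=no])
   AC_CHECK_HEADERS([lvm2app.h], [], [HAVE_LVM2APP=no])
   if test "x$HAVE_LVM2APP" = "xyes"; then
      BUILD_BD_XLATOR=yes
   fi
fi

dnl Skip the xlator instead of breaking the build when lvm2-devel is absent.
AM_CONDITIONAL([ENABLE_BD_XLATOR], [test "x$BUILD_BD_XLATOR" = "xyes"])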
When an open request comes in, the corresponding open interface in the BD xlator opens the intended LV using the path /dev/<vg-name>/<lv-name>. This path is actually a symbolic link to /dev/dm-<x>. Is my assumption about having this /dev/<vg-name>/<lv-name> path right? Will it always work?
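In other words, the open path is constructed roughly like this (a stripped-down sketch, not the actual xlator code; the VG/LV names are placeholders):

#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <unistd.h>

int
main (void)
{
        const char *vg_name = "vg0";   /* placeholder VG name */
        const char *lv_name = "lv0";   /* placeholder LV name */
        char        devpath[PATH_MAX];
        int         fd = -1;

        snprintf (devpath, sizeof (devpath), "/dev/%s/%s", vg_name, lv_name);

        /* /dev/<vg>/<lv> is a symlink to /dev/dm-<N>; open() simply follows it. */
        fd = open (devpath, O_RDWR);
        if (fd < 0) {
                perror (devpath);
                return 1;
        }

        close (fd);
        return 0;
}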
This should be fine. One concern here is how we keep track of gfid-to-path mappings; with proper resolution in place there, we can guarantee the behavior.
Also, if there is a request to create a file (which in turn has to create an LV on the server side), the lvm2 API is used to create a logical volume in the given VG, but with a pre-determined size of one logical extent, because the create interface does not take size as a parameter, while a size is required to create a logical volume. In a typical VM disk image scenario, qemu-img first creates a file and then uses truncate to set the required file size, so this should not be an issue with that kind of usage.
I think creat() followed by ftruncate() should just work fine too.
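i.e. something like the following from the application's side (the path and size are just placeholders):

#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int
main (void)
{
        const char *path = "/mnt/bd-volume/vm-disk.img"; /* placeholder mount/file */
        off_t       size = 1024 * 1024 * 1024;           /* 1 GiB, just for illustration */
        int         fd   = -1;

        fd = creat (path, 0644);          /* creates the file (a one-extent LV on the brick) */
        if (fd < 0) {
                perror ("creat");
                return 1;
        }

        if (ftruncate (fd, size) < 0) {   /* grows it to the requested size */
                perror ("ftruncate");
                close (fd);
                return 1;
        }

        close (fd);
        return 0;
}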
But there are other issues in the BD xlator code as of now. The lvm2 API does not support resizing an LV or creating a snapshot of an LV, but there are tools available to do the same, so the BD xlator code forks and executes the required binary to achieve that functionality; i.e. when truncate is called on a BD xlator volume, it results in running the lvresize binary with the required parameters. I have checked with the lvm2-devel mailing list about their plans to support LV resizing and snapshot creation and am waiting for responses. Is it okay to rely on external binaries to create a snapshot of an LV and resize it?
It is OK to call external binaries (security issues are present, but that is a different topic of discussion). Two things to take care of here:
1. As Avati rightly mentioned, utilize the runner ('libglusterfs/src/run.h') interface (see the sketch below).
2. If you are expecting/waiting on the return value of these binaries, then we have to make sure we have a mechanism to handle hang situations.
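For example, something along these lines with run.h (an untested sketch; the lvresize path, helper name and arguments are only illustrative):

#include "run.h"

/* Hypothetical helper: resize /dev/<vg>/<lv> to new_size (e.g. "10G"). */
static int
bd_resize_lv (const char *vg_name, const char *lv_name, const char *new_size)
{
        runner_t runner = {0, };
        int      ret    = -1;

        runinit (&runner);
        runner_add_args (&runner, "/sbin/lvresize", "-L", new_size, NULL);
        runner_argprintf (&runner, "/dev/%s/%s", vg_name, lv_name);

        /* runner_run() forks, execs and waits; non-zero means the child failed. */
        ret = runner_run (&runner);

        return ret;
}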
Also, when an LV is created out of band, for example using the gluster CLI to create an LV (I am working on the gluster CLI patches to create LVs and copy/snapshot LVs), the BD xlator will not be aware of these changes. I am looking into whether the 'notify' feature of xlators can be used to notify the BD xlator to create an LV or snapshot, instead of doing it from the gluster management xlators. I have sent a mail to gluster-devel asking for some more information about this.
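If the notify route works out, the hook would look roughly like this (a purely hypothetical sketch; there is no existing event for out-of-band LV creation, so this only logs and forwards whatever event arrives):

#include "xlator.h"
#include "defaults.h"

int32_t
notify (xlator_t *this, int32_t event, void *data, ...)
{
        /* A real implementation would refresh the cached LV list (or create
         * the LV/snapshot) when an agreed-upon event arrives; no such event
         * exists yet, so this just logs and passes everything along. */
        gf_log (this->name, GF_LOG_DEBUG, "received event %d", event);

        return default_notify (this, event, data);
}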
Refer to Shishir's response for this.

Hope this serves as an initial review comment.

Regards,
Amar