"M.Mohan Kumar" <mohan@xxxxxxxxxx> writes: Patches posted to Gerrit http://review.gluster.com/4809 Also added interested people in the CC list. > bd: [RFC] posix/multi-brick support to BD xlator > > Current BD xlator (block backend) has a few limitations such as > * Creation of directories not supported > * Supports only single brick > * Does not use extended attributes (and client gfid) like posix xlator > * Creation of special files (symbolic links, device nodes etc) not > supported > > Basic limitation of not allowing directory creation is blocking > oVirt/VDSM to consume BD xlator as part of Gluster domain since VDSM > creates multi-level directories when GlusterFS is used as storage > backend for storing VM images. > > To overcome these limitations a new BD xlator with following improvements > is suggested. > > * New hybrid BD xlator that handles both regular files and block device files > * The volume will have both POSIX and BD bricks. Regular files are > created on POSIX bricks, block devices are created on the BD brick (VG) > * BD xlator leverages exiting POSIX xlator for most POSIX calls and > hence sits above the POSIX xlator > * Block device file is differentiated from regular file by an extended attribute > * The xattr 'trusted.glusterfs.bd' (BD_XATTR) plays a role in mapping a > posix file to Logical Volume (LV). > * When a client sends a request to set BD_XATTR on a posix file, a new > LV is created and mapped to posix file. So every block device will > have a representative file in POSIX brick with 'trusted.glusterfs.bd.path' > (BD_XATTR_PATH) set. > * Here after all operations on this file results in LV related operations. > > For example opening a file that has BD_XATTR_PATH set results in opening the LV > block device, reading results in reading the corresponding LV block device. > > New BD xlator code is placed in xlators/storage/bd directory. It also > disables existing bd-map xlator (ie you cant create a gluster volume to > use bd_map xlator). 
> In the next version, however, support for the bd_map xlator will be
> retained.
>
> When the BD xlator gets a request to set BD_XATTR via a setxattr call, it
> creates an LV [1], and information about this LV is placed in the xattr of
> the posix file. The xattr 'trusted.glusterfs.bd.path' is used to store the
> "vg_name/lv_name" mapping in the posix file.
>
> Usage:
>
> Server side:
> [root@host1 ~]# gluster volume create bdvol device vg host1:/vg1 host2:/vg2
> This creates a distributed gluster volume 'bdvol' with Volume Group vg1 on
> host1 and Volume Group vg2 on host2. [2]
>
> [root@host1 ~]# gluster volume start bdvol
>
> Client side:
> [root@node ~]# mount -t glusterfs host1:/bdvol /media
> [root@node ~]# touch /media/posix
> This creates a regular posix file 'posix' in either the host1:/vg1 or the
> host2:/vg2 brick.
>
> [root@node ~]# mkdir /media/image
> [root@node ~]# touch /media/image/lv1
> This also creates a regular posix file, 'lv1', in either the host1:/vg1 or
> the host2:/vg2 brick.
>
> [root@node ~]# setfattr -n "trusted.glusterfs.bd" -v 1 /media/image/lv1
> [root@node ~]#
> The setxattr above results in creating a new LV in the corresponding
> brick's VG, and it sets 'trusted.glusterfs.bd.path' to the value 'vg1/lv1'
> (assuming it is created in host1:vg1).
>
> [root@node ~]# truncate -s5G /media/image/lv1
> This results in resizing LV 'lv1' to 5G.
>
> Todos/Fixme:
> [1] LV name generation: the LV name is generated from the full path of the
>     posix file, but it may exceed the LV name length limit. So a unique LV
>     name of limited length has to be generated from the full path.
> [2] As of now the posix brick directory is assumed from the VG name.
>     Enhance the gluster volume creation CLI command to mention the VG name,
>     similar to this:
>     # gluster volume create bdvol host1:/<brick1>:vg1 host2:/<brick2>:vg2
>     <brick1> is a standard posix brick, and :vg1 specifies that VG 'vg1'
>     has to be mapped to that brick path.
>     The syntax will be
>     # gluster volume create <volname> <NEW-BRICK[:VG]>
>
>     The second ':' in the brick path is used to differentiate between a
>     posix and a BD volume file and to specify the associated VG.
> [3] The new BD xlator code does not work with the open-behind xlator. In
>     order to work with open-behind, the BD xlator would have to use the
>     same approach as the posix xlator, where a <brick-path>/.glusterfs/gfid
>     file is opened when an fd is needed. But exposing the posix brick path
>     to the BD xlator may not be a good idea.
> [4] The BD xlator does not do BD-specific operations in the readv/writev
>     fops; these could be forwarded to posix readv/writev if the posix
>     xlator could also handle opening BD devices, i.e. when an open request
>     comes for a BD-mapped posix file, the posix_open routine would open
>     both the posix file and the BD (in this case the LV) and embed the
>     posix_fd and bd_fd_t structures in the fd_t structure. Later
>     readv/writev would then result in reading/writing to the BD (and/or the
>     posix file). This also solves the open-behind issue with the BD xlator,
>     but it needs changes in the posix xlator code.
> [5] When a new brick is added, a file may be moved to the new brick; with a
>     BD volume this should also move the mapped LV to the new brick's VG.
> [6] In a BD volume, if the VG is served by a SAN, it is suggested to place
>     the posix brick directory on the same SAN, so that data (LVs) and
>     metadata (posix files) are stored in the same SAN.
> [7] Some fops copy inode->gfid to loc->gfid before forwarding the request
>     to the posix xlator.
> [8] Add dm-thin support.
> [9] Add support for full and linked clones of LV images.
> [10] Retain support for the bd_map xlator.


_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxx
https://lists.nongnu.org/mailman/listinfo/gluster-devel
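Todos [1] and [2] above can be sketched in a few lines (illustrative Python, not the xlator's C code; the 64-character limit, the hash scheme, and the function names are all assumptions for the sketch, the real limit would come from lvm2):

```python
import hashlib
import os

# Hypothetical limit; LVM restricts both name length and the allowed
# character set, so real code should take these from lvm2 headers.
MAX_LV_NAME = 64

def lv_name_for_path(posix_path: str) -> str:
    """Todo [1]: derive a deterministic, length-limited LV name from
    the full posix file path.  A readable basename prefix is kept and
    a truncated SHA-1 of the whole path provides uniqueness."""
    digest = hashlib.sha1(posix_path.encode()).hexdigest()[:16]
    base = os.path.basename(posix_path)
    # Replace characters LVM does not accept in LV names
    safe = "".join(c if c.isalnum() or c in "._+-" else "_" for c in base)
    prefix_len = MAX_LV_NAME - len(digest) - 1
    return f"{safe[:prefix_len]}-{digest}"

def parse_brick_spec(spec: str):
    """Todo [2]: split a <host>:<brick-path>[:vg] spec on the second
    ':'.  Returns (host, brick_path, vg_or_None); a missing VG means a
    plain posix brick."""
    host, rest = spec.split(":", 1)
    if ":" in rest:
        brick_path, vg = rest.split(":", 1)
    else:
        brick_path, vg = rest, None
    return host, brick_path, vg
```

For example, `parse_brick_spec("host1:/brick1:vg1")` yields `("host1", "/brick1", "vg1")`, while a plain brick spec without a second ':' yields a VG of None, so the CLI could tell posix and BD bricks apart from the same argument form.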