bd: [RFC] posix/multi-brick support to BD xlator

The current BD xlator (block backend) has a few limitations, such as:

* Creation of directories is not supported
* Only a single brick is supported
* Extended attributes (and the client gfid) are not used, unlike the posix xlator
* Creation of special files (symbolic links, device nodes etc.) is not supported

The basic limitation of not allowing directory creation blocks oVirt/VDSM from consuming the BD xlator as part of a Gluster domain, since VDSM creates multi-level directories when GlusterFS is used as the storage backend for storing VM images.

To overcome these limitations, a new BD xlator with the following improvements is suggested:

* A new hybrid BD xlator that handles both regular files and block device files
* The volume will have both POSIX and BD bricks. Regular files are created on POSIX bricks; block devices are created on the BD brick (VG)
* The BD xlator leverages the existing POSIX xlator for most POSIX calls and hence sits above the POSIX xlator
* A block device file is differentiated from a regular file by an extended attribute
* The xattr 'trusted.glusterfs.bd' (BD_XATTR) plays a role in mapping a posix file to a Logical Volume (LV)
* When a client sends a request to set BD_XATTR on a posix file, a new LV is created and mapped to that posix file. So every block device will have a representative file in the POSIX brick with 'trusted.glusterfs.bd.path' (BD_XATTR_PATH) set
* Hereafter all operations on this file result in LV-related operations. For example, opening a file that has BD_XATTR_PATH set opens the LV block device, and reading the file reads from the corresponding LV block device

The new BD xlator code is placed in the xlators/storage/bd directory. It also disables the existing bd_map xlator (i.e. you can't create a gluster volume that uses the bd_map xlator), but support for the bd_map xlator will be retained in the next version.

When the BD xlator gets a request to set BD_XATTR via a setxattr call, it creates an LV [1] and places information about this LV in the xattr of the posix file: the xattr 'trusted.glusterfs.bd.path' maps "vg_name/lv_name" to the posix file (see the sketch after the usage example below).

Usage:

Server side:

[root@host1 ~]# gluster volume create bdvol device vg host1:/vg1 host2:/vg2

This creates a distributed gluster volume 'bdvol' with Volume Group vg1 on host1 and Volume Group vg2 on host2. [2]

[root@host1 ~]# gluster volume start bdvol

Client side:

[root@node ~]# mount -t glusterfs host1:/bdvol /media
[root@node ~]# touch /media/posix

This creates the regular posix file 'posix' in either the host1:/vg1 or the host2:/vg2 brick.

[root@node ~]# mkdir /media/image
[root@node ~]# touch /media/image/lv1

This likewise creates the regular posix file 'lv1' in either the host1:/vg1 or the host2:/vg2 brick.

[root@node ~]# setfattr -n "trusted.glusterfs.bd" -v 1 /media/image/lv1
[root@node ~]#

The above setxattr results in creating a new LV in the corresponding brick's VG, and it sets 'trusted.glusterfs.bd.path' to the value 'vg1/lv1' (assuming the LV is created in host1:vg1).

[root@node ~]# truncate -s5G /media/image/lv1

This resizes LV 'lv1' to 5G.
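For reference, the setxattr-to-LV-creation step can be pictured with the lvm2app library, which exposes VG/LV management to C code. The snippet below is only a minimal sketch, not the actual xlator fop; the helper name bd_create_lv and its parameters are assumptions:

#include <stdint.h>
#include <lvm2app.h>

/* Minimal sketch: create a linear LV of the given size (in bytes)
 * in the named VG, roughly what the BD xlator would do on seeing
 * a setxattr of BD_XATTR. */
static int
bd_create_lv (const char *vg_name, const char *lv_name, uint64_t size)
{
        lvm_t handle;
        vg_t  vg;
        lv_t  lv;
        int   ret = -1;

        handle = lvm_init (NULL);       /* NULL = default LVM system dir */
        if (!handle)
                return -1;

        vg = lvm_vg_open (handle, vg_name, "w", 0);
        if (!vg)
                goto out;

        lv = lvm_vg_create_lv_linear (vg, lv_name, size);
        if (lv)
                ret = 0;

        lvm_vg_close (vg);
out:
        lvm_quit (handle);
        return ret;
}

On success the xlator would then set BD_XATTR_PATH ("vg_name/lv_name") on the representative posix file so that later fops can map the file back to the device.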
Todos/Fixme:

[1] LV name generation: the LV name is generated from the full path of the posix file, but this can exceed the LV name length limit. So a unique LV name of limited length has to be generated from the full path (see the sketch after this list).

[2] As of now the posix brick directory is assumed from the VG name. The gluster volume creation CLI command should be enhanced to mention the VG name, similar to this:

# gluster volume create bdvol host1:/<brick1>:vg1 host2:/<brick2>:vg2

<brick1> is a standard posix brick and :vg1 specifies that VG 'vg1' has to be mapped to that brick path. The syntax will be:

# gluster volume create <volname> <NEW-BRICK[:VG]>

The second ':' in the brick path differentiates a BD volume file from a posix one and specifies the associated VG.

[3] The new BD xlator code does not work with the open-behind xlator. In order to work with open-behind, the BD xlator would have to use an approach similar to the posix xlator, where a <brick-path>/.glusterfs/<gfid> file is opened when an fd is needed. But exposing the posix brick path to the BD xlator may not be a good idea.

[4] The BD xlator does no BD-specific work in the readv/writev fops, and these could be forwarded to posix readv/writev if the posix xlator could also handle opening a BD device. That is, when an open request comes in for a BD-mapped posix file, the posix_open routine would open both the posix file and the BD (in this case the LV) and embed the posix_fd and bd_fd_t structures in the fd_t structure, so that later readv/writev calls read from/write to the BD (and/or the posix file). This would also solve the open-behind issue with the BD xlator, but it needs changes in the posix xlator code.

[5] When a new brick is added, a file may be moved to the new brick; with a BD volume the mapped LV should also be moved to the new brick's VG.

[6] If the VG of a BD brick is served by a SAN, it is suggested that the posix brick directory also come from the same SAN, so that data (LVs) and metadata (posix files) are stored on the same SAN.

[7] Some fops copy inode->gfid to loc->gfid before forwarding the request to the posix xlator.

[8] Add dm-thin support.

[9] Add support for full and linked clones of LV images.

[10] Retain support for the bd_map xlator.
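For todo [1], one possible direction (a sketch only, not a committed design) is to derive a fixed-length LV name by hashing the full posix path; the helper bd_lv_name, the FNV-1a hash and the LV_NAME_MAX limit below are all illustrative assumptions:

#include <stdio.h>
#include <stdint.h>

#define LV_NAME_MAX 64          /* stays well under LVM's name-length limit */

/* Hash the full posix path (FNV-1a, chosen only for illustration)
 * into a fixed-length LV name. A hash alone cannot guarantee
 * uniqueness, so a collision check against existing LVs in the VG
 * would still be needed. */
static void
bd_lv_name (const char *path, char *name, size_t len)
{
        uint64_t h = 0xcbf29ce484222325ULL;     /* FNV-1a offset basis */

        for (; *path; path++) {
                h ^= (unsigned char)*path;
                h *= 0x100000001b3ULL;          /* FNV-1a prime */
        }
        snprintf (name, len, "lv-%016llx", (unsigned long long)h);
}

int
main (void)
{
        char name[LV_NAME_MAX];

        bd_lv_name ("/image/lv1", name, sizeof (name));
        printf ("%s\n", name);
        return 0;
}

Combining such a hash with (part of) the file's gfid would keep the generated name both bounded in length and practically unique.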