Re: GFID2 - Proposal to add extra byte to existing GFID

Xavier Hernandez <xhernandez@xxxxxxxxxx> · Mon, 15 May 2017 18:48:44 +0200

Hi Amar, 
On May 15, 2017 2:15 PM, Amar Tumballi <atumball@xxxxxxxxxx> wrote:

>

>

>

> On Tue, Apr 11, 2017 at 2:59 PM, Amar Tumballi <amarts@xxxxxxxxx> wrote:

>>

>> Comments inline.

>>

>> On Mon, Dec 19, 2016 at 1:47 PM, Xavier Hernandez <xhernandez@xxxxxxxxxx> wrote:

>>>

>>> On 12/19/2016 07:57 AM, Aravinda wrote:

>>>>

>>>>

>>>> regards

>>>> Aravinda

>>>>

>>>> On 12/16/2016 05:47 PM, Xavier Hernandez wrote:

>>>>>

>>>>> On 12/16/2016 08:31 AM, Aravinda wrote:

>>>>>>

>>>>>> Proposal to add one more byte to GFID to store "Type" information.

>>>>>> Extra byte will represent the type (directory: 00, file: 01, symlink: 02

>>>>>> etc)

>>>>>>

>>>>>> For example, if a directory GFID is f4f18c02-0360-4cdc-8c00-0164e49a7afd

>>>>>> then, GFID2 will be 00f4f18c02-0360-4cdc-8c00-0164e49a7afd.

>>>>>>

>>>>>> Changes to Backend store

>>>>>> ------------------------

>>>>>> Existing: .glusterfs/gfid[0:2]/gfid[2:4]/gfid

>>>>>> Proposed: .glusterfs/gfid2[0:2]/gfid2[2:4]/gfid2[4:6]/gfid2

>>>>>>

>>>>>> Advantages:

>>>>>> -----------

>>>>>> - Automatic grouping in .glusterfs directory based on file Type.

>>>>>> - Easy identification of Type by looking at GFID in logs/status output

>>>>>>   etc.

>>

>>

>> Above two will be good enough points to bump up the priority for the feature.

>>  

>>>>>>

>>>>>> - Crawling(Quota/AFR): List of directories can be easily fetched by

>>>>>>   crawling `.glusterfs/gfid2[0:2]/` directory. This enables easy

>>>>>>   parallel Crawling.

>>

>>

>> With the current design, we still have to do a distributed readdir() to get all 

>> the entries in the directory. This layout change, along with proposed 

>> DHT2/EHT/DHT2+ (name for me doesn't matter here) layout, where directory 

>> entries would be created in just one place should enhance the performance overall.

>>  

>>>>>>

>>>>>> - Quota - Marker: Marker translator can mark xtime of current file and

>>>>>>   parent directory. No need to update xtime xattr of all directories

>>>>>>   till root.

>>>>>> - Geo-replication: - Crawl can be multithreaded during initial sync.

>>>>>>   With marker changes above it will be more effective in crawling.

>>>>>>

>>  

>>>>>>

>>>>>> Please add any more advantages.

>>>>>>

>>>>>> Disadvantages:

>>>>>> --------------

>>>>>> Functionality is not changed with the above change except the length

>>>>>> of the ID. I can't think of any disadvantages except the code changes

>>>>>> to accommodate this change. Let me know if I missed anything here.

>>>>>

>>>>>

>>>>> One disadvantage is that 17 bytes is a very ugly number for

>>>>> structures. Compilers will add paddings that will make any structure

>>>>> containing a GFID noticeably bigger. This will also cause trouble on

>>>>> all binary formats where a GFID is used, making them incompatible. One

>>>>> clear case of this is the XDR encoding of the gluster protocol.

>>>>> Currently a GFID is defined this way in many places:

>>>>>

>>>>>         opaque gfid[16]

>>>>>

>>>>> This seems to make it quite complex to allow a mix of gluster versions

>>>>> in the same cluster (for example in the middle of an upgrade).
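A rough illustration of the padding concern. This is only a sketch, not gluster code: the two structures and the `ino` field are hypothetical, and the sizes printed depend on the native alignment rules of the platform, as reported by Python's ctypes. The XDR figure relies on fixed-length opaque data being padded to a multiple of four bytes.

```python
import ctypes

class Inode16(ctypes.Structure):
    # Hypothetical structure: a 16-byte GFID followed by a 64-bit field.
    _fields_ = [("gfid", ctypes.c_uint8 * 16), ("ino", ctypes.c_uint64)]

class Inode17(ctypes.Structure):
    # The same structure with the GFID extended to 17 bytes.
    _fields_ = [("gfid", ctypes.c_uint8 * 17), ("ino", ctypes.c_uint64)]

def xdr_opaque_len(n):
    """Encoded size of a fixed-length XDR opaque[n], padded to a multiple of 4."""
    return (n + 3) // 4 * 4

# In-memory sizes: one extra GFID byte costs more than one byte overall,
# because the 64-bit field has to stay aligned.
print(ctypes.sizeof(Inode16), ctypes.sizeof(Inode17))

# On-the-wire sizes for opaque gfid[16] and a would-be opaque gfid[17].
print(xdr_opaque_len(16), xdr_opaque_len(17))
```

The second line shows why an old peer expecting 16 bytes and a new peer sending a padded 17-byte field would disagree about where the next field starts.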

>>

>>

>> Totally agree with Xavier here. Not in support of adding one more byte.

>>  

>>>>>

>>>>>

>>>>> What about this alternative approach:

>>>>>

>>>>> Based on RFC 4122 [1], which describes the format of a UUID, we can

>>>>> define a new structure for new GFID's using the same length.

>>>>>

>>>>> Currently all GFID's are generated using the "random" method. This

>>>>> means that all GFID have this structure:

>>>>>

>>>>>         xxxxxxxx-xxxx-Mxxx-Nxxx-xxxxxxxxxxxx

>>>>>

>>>>> Where N can be 8, 9, A or B, and M is 4.

>>>>>

>>>>> There are some special GFID's that have a M=0 and N=0, for example the

>>>>> root GFID.
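To make the M and N positions concrete, here is a minimal sketch that reads both digits from the text form of a GFID. The helper name `gfid_fields` is made up for illustration, and the root GFID value used below is assumed to be 00000000-0000-0000-0000-000000000001.

```python
import uuid

def gfid_fields(gfid):
    """Return (M, N) as hex digits from the canonical text form of a GFID."""
    text = str(uuid.UUID(gfid))
    # Canonical form: xxxxxxxx-xxxx-Mxxx-Nxxx-xxxxxxxxxxxx
    return text[14], text[19]

# A freshly generated random (version 4) UUID.
m, n = gfid_fields(str(uuid.uuid4()))
print(m, n)  # M is '4'; N is one of '8', '9', 'a', 'b'

# The directory GFID quoted earlier in this thread.
print(gfid_fields("f4f18c02-0360-4cdc-8c00-0164e49a7afd"))

# The root GFID (value assumed, see above): both fields are zero.
print(gfid_fields("00000000-0000-0000-0000-000000000001"))
```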

>>>>>

>>>>> What I propose is to use a new variant of GFID, for example E or F

>>>>> (officially marked as reserved for future definition) or even 0 to 7.

>>>>> We could use M as an internal version for the GFID structure (defined

>>>>> by ourselves when needed). Then we could use the first 4 or 8 bits of

>>>>> each GFID as you propose, without needing to extend current GFID

>>>>> length nor risking to collide with existing GFID's.

>>>>>

>>>>> If we are concerned about the collision probability (quite small but

>>>>> still bigger than the current version) because we lose some random

>>>>> bits, we could use N = 0..7 and leave M random. This way we get 5 more

>>>>> random bits, from which we could use 4 to represent the inode type.
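The bit counting behind "5 more random bits" can be spelled out as below. This is just the arithmetic, assuming the usual layout of a random UUID: 4 fixed bits for M and 2 fixed bits for the variant in N.

```python
TOTAL_BITS = 128

# Random (version 4) UUID: 4 bits fixed for M, 2 bits fixed for the variant.
random_v4 = TOTAL_BITS - 4 - 2

# Alternative: N = 0..7 fixes only the top bit of N, and M is left random.
random_alt = TOTAL_BITS - 1

gained = random_alt - random_v4      # extra random bits available
after_type = random_alt - 4          # random bits left once 4 encode the type

print(random_v4, random_alt, gained, after_type)
```

So even after spending 4 bits on the inode type, this variant would keep one more random bit than today's GFIDs.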

>>>>>

>>>>> I think this way everything will work smoothly with older versions

>>>>> with minimal effort.

>>>>>

>>>>> What do you think ?

>>>>

>>>> That is a really nice suggestion.

>>>>

>>>> To get the crawling advantages as mentioned above, we need to make

>>>> backend store as .glusterfs/N/gfid[0:2]/gfid[2:4]/gfid

>>>

>>>

>>> That's one possibility. Since N will be 4 bits at most, it won't collide with currently existing subdirectories that represent 8 bits. Or we could use M. It all depends on the exact interpretation we give to each field.

>>>

>>> One suggestion I would make is to define it in a way that we use the minimal number of bits to represent what we need now but leave space for future extensions. For example creating a "reserved" value for the field.

>>>

>

> While discussing this with Aravinda, we realized that if we only change the UUID generation logic, we don't need to worry about version incompatibility.
Yes. That's one of the main advantages of keeping a standard UUID.
>

> Also, I have a question: what are the chances of UUID collision if we take just 3 bits from the first byte?

>

> 000 - Unspecified (can be anything).

> 001 - Directory

> 010 - Regular File

> 011 - Special files (symlink, Block and Char devices, socket files etc).

> {100 - 111} - Reserved.
This cannot be done. Since we are currently using random UUIDs, on average one of every eight existing GFIDs already starts with each of these 3-bit combinations.
Already existing GFIDs will be a problem when updating. The only thing that can avoid the problem is to create new GFIDs in a format that won't collide with existing ones, and this can only be done safely if we use the special fields of the UUID itself.
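A small simulation of the point, as a sketch only. It assumes existing GFIDs behave like random (version 4) UUIDs, and the helper `legacy_gfid` is invented here to stand in for them.

```python
import random
import uuid
from collections import Counter

rng = random.Random(2017)  # fixed seed so the run is repeatable

def legacy_gfid():
    """Stand-in for an existing GFID: a random (version 4) UUID."""
    return uuid.UUID(int=rng.getrandbits(128), version=4)

sample = [legacy_gfid() for _ in range(8000)]

# Top three bits of the first byte, as in the 3-bit proposal.
prefix_counts = Counter(g.bytes[0] >> 5 for g in sample)
for prefix in range(8):
    print(format(prefix, "03b"), prefix_counts[prefix])

# The N digit (high nibble of byte 8) of the same GFIDs.
n_digits = sorted({format(g.bytes[8] >> 4, "x") for g in sample})
print(n_digits)
```

Every 3-bit prefix shows up in roughly an eighth of the sample, so the prefix says nothing about the type of an old GFID. The N digit, on the other hand, stays within 8..b, which is why a new format that uses other values of N can be told apart from existing GFIDs.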
>

> As a side effect, it reduces the number of directories created as metadata inside the .glusterfs directory (it will be 50% of the current load).
Maybe we can find a better way to store the GFIDs using the standard fields instead of relying on the first bits, which is not a valid solution. 
We can think more about this. 
Xavi
>

> -Amar

>  

>>>

>>> Proposal:

>>>

>>> Use N = 00xx for special GFID's, like NULL GFID, or the ones currently used in some places. All these will also have M = 0. All other values of M will be reserved for future extensions.

>>>

>>> Also reserve all other values of N (01xx) for future extensions.

>>>

>>> This gives a lot of space to represent many things in the future if necessary, while keeping current usage compatible with it.

>>>

>>> For this particular case we could use N = 0000 and define M as (this is a mapping of the posix S_IFxxx values):

>>>

>>> M = 0000 Current special GFID's

>>> M = 0001 Fifo (S_IFIFO)

>>> M = 0010 Character Device (S_IFCHR)

>>> M = 0100 Directory (S_IFDIR)

>>> M = 0110 Block Device (S_IFBLK)

>>> M = 1000 Regular File (S_IFREG)

>>> M = 1010 Symbolic Link (S_IFLNK)

>>> M = 1100 Socket (S_IFSOCK)

>>>

>>> M = xx11 \

>>> M = x1x1  | Reserved for future extensions

>>> M = 1xx1  |

>>> M = 111x /
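The table can be checked against the file-type constants in Python's stat module. This sketch assumes the usual Linux values for the S_IFxxx constants, and `type_nibble` is a name made up for the example.

```python
import stat

# File-type constants and the names used in the table above.
TYPES = {
    "Fifo": stat.S_IFIFO,
    "Character Device": stat.S_IFCHR,
    "Directory": stat.S_IFDIR,
    "Block Device": stat.S_IFBLK,
    "Regular File": stat.S_IFREG,
    "Symbolic Link": stat.S_IFLNK,
    "Socket": stat.S_IFSOCK,
}

def type_nibble(mode):
    """Top four bits of the file-type part of st_mode."""
    return stat.S_IFMT(mode) >> 12

for name, mode in TYPES.items():
    print(format(type_nibble(mode), "04b"), name)

# Values not taken by a file type or by M = 0000 remain reserved.
used = {type_nibble(mode) for mode in TYPES.values()} | {0}
reserved = sorted(set(range(16)) - used)
print([format(v, "04b") for v in reserved])
```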

>>>

>>> If we use our own mapping instead of using the same values as the S_IFxxx macros, we can get a more compact representation if needed.

>>>

>>> In this case the directory structure could be .glusterfs/M/gfid[0:2]/gfid[2:4]/gfid. And use M = 0 to put all current existing gfid's, or we could leave existing gfid's in their current location.

>>>

>>> Or we could even have .glusterfs/NM/gfid[0:2]/gfid[2:4]/gfid. This would probably be compatible even with future extensions.
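A sketch of how the backend paths discussed above could be derived from the text form of a GFID. The function names are invented, the new-format GFID is a hypothetical one (the example from this thread with N changed to 0, so that M = 4 reads as "directory"), and writing NM as two hex digits is an assumption; other renderings of the directory name are possible.

```python
import posixpath

def legacy_path(gfid):
    """Current layout: .glusterfs/gfid[0:2]/gfid[2:4]/gfid"""
    return posixpath.join(".glusterfs", gfid[0:2], gfid[2:4], gfid)

def typed_path(gfid, with_n=False):
    """Sketch of .glusterfs/M/... or, with with_n=True, .glusterfs/NM/..."""
    m, n = gfid[14], gfid[19]  # xxxxxxxx-xxxx-Mxxx-Nxxx-xxxxxxxxxxxx
    top = n + m if with_n else m
    return posixpath.join(".glusterfs", top, gfid[0:2], gfid[2:4], gfid)

old_gfid = "f4f18c02-0360-4cdc-8c00-0164e49a7afd"
new_gfid = "f4f18c02-0360-4cdc-0c00-0164e49a7afd"  # hypothetical: N = 0, M = 4

print(legacy_path(old_gfid))
print(typed_path(new_gfid))
print(typed_path(new_gfid, with_n=True))
```

With this layout, listing one top-level directory would give all GFIDs of a single type, which is the crawling benefit mentioned at the start of the thread.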

>>>

>>

>> I would go with only 'M' being considered for the current layout, keeping N for future developments. Even though we are not considering 'N' internally, we can keep the directory name as '00MM' (zero zero M M), so that the backend layout stays compatible if we need to consider N later.

>>

>> One major thing is we need a solid plan for migration from current layout to newer layout.

>>

>> Regards,

>> Amar

>>  

>>>

>>> Xavi

>>>

>>>

>>>

>>>>>

>>>>> Xavi

>>>>>

>>>>> [1] https://www.ietf.org/rfc/rfc4122.txt

>>>>>

>>>>>>

>>>>>> Changes:

>>>>>> ---------

>>>>>> - Code changes to accommodate a 17-byte GFID instead of 16 bytes (read

>>>>>>   and write)

>>>>>> - Migration Tool to upgrade GFIDs in Volume/Cluster

>>>>>>

>>>>>> Let me know your thoughts.

>>>>>>

>>>>>

>>>>

>>>

>>> _______________________________________________

>>> Gluster-devel mailing list

>>> Gluster-devel@xxxxxxxxxxx

>>> http://www.gluster.org/mailman/listinfo/gluster-devel

>>

>>

>>


>

>

>

>

> -- 

> Amar Tumballi (amarts)
