Re: [RFC] nilfs2: xattrs support implementation idea

Ryusuke Konishi <konishi.ryusuke@xxxxxxxxxxxxx> · Sun, 29 Sep 2013 01:00:37 +0900 (JST)

On Thu, 26 Sep 2013 18:07:59 +0400, Vyacheslav Dubeyko wrote:
>> For the tree of xanodes, it seems a bit inefficent because it is
>> implemented over xafile (meta data file), and this introduces
>> threefold tree lookups to access the target disk block:
>> 
>>   xanode tree lookup --> xafile btree lookup --> DAT btree lookup
>> 
>> Simple block chaining over xafile may be better for most cases.
>> 
>> I think the tree-over-xafile implementation is a candidate, but please
>> try considering ideas to replace the the tree based lookup into an
>> index based access over xafile.
>> 
> 
> First of all, in my concept the xafile is aggregate of xanodes. Every
> xanode is accessed by index in xafile. So, namely index based access
> over xafile is used in my implementation.

I said this is "For the tree of xanodes".  You are misunderstanding.
I don't comment on shared xanodes here.

Yes, of course, I know block access over xafile i_xattr can be used
for that index access.  That is what we intended to prepare the
i_xattr field and the reserved extended attribute metadata file
(NILFS_XATTR_INO)

> The xafile as container of xanodes is important concept from the GC
> point of view. Data blocks and xattr blocks have different nature of
> aging. So, mixture of data and xattr blocks can be not GC
> friendly.

Yes, I know.  I don't deny using xafile.  The xafile approach also has
the merit of disk format compatibility.  These are exactly the reason
why original nilfs design is reserving xattr inode number.

> Access to xanodes by means of index over xafile is
> important concept too. Because namely index of xanode in xafile is
> stored as xanode number in header and other structures. So, xafile
> as metadata file hides problem of translation logical index (xanode
> number) into real block number.
> 
> About lookup... On-disk inode keeps xattrs by itself in the inline
> xanode case. So, we don't need to search an inline xanode in such case.
> An inode keeps xanode number (or index of xafile's node) in the case of
> shared xanode. It means that we have metadata file lookup overhead only
> (xafile btree lookup --> DAT btree lookup). An inode keeps xanode number
> of a first xanode of dedicated xanodes' tree. Only this inode possesses
> by such dedicated tree. So, dedicated tree will not big in ordinary
> case.

I am not worrying about lookup of inline xattr or shared xattr.  I
inteded to comment on the tree type xattr.

> The dedicated xanodes' tree can be big if an inode will have very
> big count of xattrs (or xattrs with big content size). Every xanode has
> small reserved space for index keys. But, initially, only first xanode
> is used for storing index keys. And reserved space for index keys is
> growing in the case of necessity to have more index keys. I suppose that
> one index level will be enough for most cases. So, usually, reading of
> first xanode will be enough for getting knowledge about all leaf xanodes
> in dedicated xanodes' tree.

Ok, you are pointing out that the depth of xattr trees is expected to
be small.  That is reasonable.

> Yes, (xafile btree lookup --> DAT btree lookup) can be the reason of
> overhead. But I think that we need to optimize this overhead for all
> use-cases, anyway.

Agreed.

>> For the inline xattr, I think it should not be mandatory.  It should
>> be an option for users who wants faster accesses to the extended
>> attributes and can accept reformat of the file system with 256-bit
>> size inodes for that purpose.
>> 
> 
> Yes, sure. Inline xattrs are optional. If a user hasn't opportunity (or
> desire) for reformat NILFS2 volume then it will not use inline xattrs
> completely.

Okay.

>> > /*
>> >  * struct nilfs_inline_xanode_header - header of inline node
>> >  * @magic: magic number for identification
>> >  * @type: node type
>> >  * @entries: number of entries in node
>> >  * @log_node_size: log2(node size in bytes)
>> 
>> log2(node size in bytes) - 10 ?
>> 
>> 4-bit width log_node_size only can represent 0~15 bytes.
>> 
> 
> I think that xafile's xanode size can be in range: 1KB (1024) - 32KB
> (32768). I suppose that maximum value in 32KB is enough for all cases.
> So, it needs 15 bytes for 32KB node size.

Ok, this is my misunderstanding.

4-bit width log_node_size can represent 2^0 ~ 2^15 bytes (32KB).

>> >  * @flags: node flags
>> >  */
>> > struct nilfs_inline_xanode_header {
>> >         __le16 magic;
>> 
>> Is this magic field necessary?   The reserved field
>> at the end of the current disk inode is valuable.
>> 
>> Setting the same magic number to all inodes seems wasteful.
>> 
> 
> The magic field is useful for consistency checking. And magic number
> keeps for the whole xanode. I don't think that 2-byte value wastes 4KB
> xanode, for example. Moreover, I use magic for detecting presence of
> inline xanode in extended space of on-disk inode. Of course, we can
> leave padding field at the end of on-disk inode untouched. But it needs
> to use another way of detecting inline xanode presence. What do you
> suggest?

I am not denying adding 2-bytes magic number to 4k-bytes blocks on
xafile.

What I am saying is the magic number inserted in nilfs_inode.

i_pad is guaranteed to be zero-filled by mkfs, so you don't have to
add 16-bit magic field; a few bits magic number field is enough for
the inlined xattr header.

You can do the consistency check with the few bit magic number and
whether i_xattr is zero or not. (for compatibility reason, we must
define i_xattr is zero if and only if xattr is not used.)

We agreed that inline xattr should be optional, so, I think moving
nilfs_inline_xanode_header into the extended inode area is a
reasonable option.

Anyway, I want to reserve at least a 16-bit field for inode checksum
functionality.

>> > /*
>> >  * struct nilfs_xattr_inline_key - xattr key for inline node
>> >  * @type: key type
>> >  * @flags: key's flags
>> >  * @reserved: reserved field
>> >  * @name_index: attribute name index
>> >  * @name_hash: name hash
>> >  * @entry_offset: entry offset inside the node
>> >  * @entry_size: size of entry in bytes
>> >  */
>> > struct nilfs_xattr_inline_key {
>> >         u8 type : 2;
>> >         u8 flags : 6;
>> >         u8 reserved;
>> >         __le16 name_index;
>> >         __le32 name_hash;
>> >         __le16 entry_offset;
>> >         __le16 entry_size;
>> > } __packed;
>> 
>> For what purpose is the type field used ?
> 
> The purpose of "type" field is to distinguish different type of xanode's
> headers or keys. Namely, for example, detecting that we have header of
> inline xanode, shared xanode or tree xanode.

That looks to be defined as the purpose of "flags" field.  See the
following definitions:

> /* Xattr key's flags */
> #define NILFS_INLINE_NODE_KEY           0
> #define NILFS_SHARED_XANODE_KEY         1
> #define NILFS_XANODE_INDEX_KEY          2
> #define NILFS_XANODE_LEAF_KEY           3

>> Again, the type field.  What is the difference between the type field
>> and the flags field ?
>> 
> 
> The type field is part of flags field. So, it can be extracted from
> flags field by means of mask.

OK, please reorganize the two fields when you will get rid of
bit-field uses.

>> > struct nilfs_xattr_index_key {
>> >         u8 type : 2;
>> >         u8 flags : 6;
>> >         u8 reserved1;
>> >         __le16 name_index;
>> >         __le32 name_hash;
>> >         __le64 parent;
>> >         __le64 leaf;
>> > } __packed;
>> 
>> Why is the "parent" field required ?
>> 
>> The size of this structure is 24 bytes.
>> It fits in 16 bytes structure if the parent field is eliminable.
>> Otherwise, you need 8 more pad bytes to align on-disk structures 
>> to 8 byte boundary.
> 
> I need to think over the real necessity of "parent" field. But my
> initial intention was to have backpointer on index xanode that describe
> this xanode as leaf.

I know it.  What I was asking about is the necessity of the "parent"
field.

> But now I doubt that keeps this value in an index
> key is a good idea. Maybe, it makes sense to keep this value in xanode's
> header in the case of real necessity in "parent" backpointer. So, I need
> to reflect about it.

Ok, please reconsider it as needed when you proceed implementation.

Regards,
Ryusuke Konishi

--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html