On 06/23/2015 01:33 PM, Darrick J. Wong wrote: > On Mon, Jun 22, 2015 at 08:24:37PM -0700, mingming cao wrote: >> --- >> ext4_btree.h | 146 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ >> 1 file changed, 146 insertions(+) >> create mode 100644 ext4_btree.h >> >> diff --git a/ext4_btree.h b/ext4_btree.h >> new file mode 100644 >> index 0000000..efd5ce3 >> --- /dev/null >> +++ b/ext4_btree.h >> @@ -0,0 +1,146 @@ >> +/* >> + * copyright (C) 2015 Oracle. All rights reserved. >> + * >> + * This program is free software; you can redistribute it and/or >> + * modify it under the terms of the GNU General Public >> + * License v2 as published by the Free Software Foundation. >> + * >> + * This program is distributed in the hope that it will be useful, >> + * but WITHOUT ANY WARRANTY; without even the implied warranty of >> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU >> + * General Public License for more details. >> + * >> + * You should have received a copy of the GNU General Public >> + * License along with this program; if not, write to the >> + * Free Software Foundation, Inc., 59 Temple Place - Suite 330, >> + * Boston, MA 021110-1307, USA. >> + */ >> + >> +#ifndef EXT4_BTREE_H >> +#define EXT4_BTREE_H >> + >> +/* >> + * This btree geometry structure is used to defines how the btree block >> + * layout for different type of btrees. This is used to implement a generic >> + * btree implementation. The record length, value lenth, etc, are variable >> + * based on different btree type, like reflink btree, and other btree use cases. >> + */ >> +struct ext4_btree_geo { >> + __le16 node_len; >> + __le16 header_len; >> + __le16 key_len; >> + __le16 val_len; >> + __le16 rec_len; > (I wonder if these four _len values could be u8, but whatever...) okay... > >> + __le16 max_pairs; >> + __le16 max_records; >> +}; > Does this geometry structure live on disk? The __le prefix suggests that it > does, but the only definition of one of these is the in-core ref_btree_geo, > which means that u16 is sufficient. At the beginning since the prototype is a in memory btree, the geometry structure was not on disk, and declared globally. So I had those values to be u8 before... but I am planning to store those layout info on disk... that's why here I made change to __le, but hasnt really update the rest of code to reflect the geo is stored on disk yet. One question I have is whethere to store the geometry structure into the btree block header. Maybe only for index node... > > Given that the ext4_btree_geo seems to contain all the incore info about the > btree you're working with, it ought to know about the magic number so that it > can compare with whatever it reads off disk. Good point, it should know the magic number to compare with the header, if we decide to keep the geometry incore only.... I started to wonder if we really need to store geometry info on disk... > (Thinking ahead to when we have multiple btrees with different record > definitions and magic numbers.) Agree... >> + >> +/* >> + * Every btree block starts from a header structure, including the index node >> + * and the leaf node. >> + * The index btree node started from this header structure, followed by >> + * (key,value) pairs >> + * Index node: >> + * --------------------------------------------------------------------------- >> + * |header| key | key | key | key|...........| value | value | value | value | >> + * |-------------------------------------------------------------------------- >> + * >> + * And the leaf node begins with ext4_btree_header, and followed by records. >> + * leaf node >> + * * ------------------------------------------------------------------------ >> + * |header| rec | rec | rec | rec |...........| rec | rec | rec | rec | rec | >> + * |------------------------------------------------------------------------- >> + */ >> + >> +#define REF_BTREE_MAGIC 0x52454642 /* "REFB" magic number for refcount btree */ >> + >> +struct ext4_btree_header { >> + __le32 btree_magic; /* type of btree */ >> + __le32 btree_generation; /* generation for this type of btree */ >> + __le32 btree_level; /* the level of this node in the tree */ >> + __le32 btree_numkeys; /* # of records for leaf node*/ >> + __le32 btree_padding1; > I think this padding field results in a hidden u32 gap here? Sorry that was a mistake. I added generation but forget to remove this padding. >> + __le64 btree_blockno; /* remember the blk# of this node */ > (Unfortunate that you have to burn all 64 bits for all btrees, but I remembered > that ext4 lets you set flexbg size to 2^31. Oh well.) true... though this block no is >> + __le64 btree_leftsib; /* used for fast search sibling nodes */ >> + __le64 btree_rightsib; >> + __le64 btree_crc; /* this node checksums */ > Future proofing, I guess? Since we don't use 64-bit crcs right now. yes... that was the intention... I guess that is a little overkill for now. > If you decide that we only need 32 bits for a crc, you could make btree_level a > u16 (since max levels is 8), numkeys a u16, and store the crc32 after that. > Then this on-disk header would only be 40 bytes, instead of 64 bytes. Sounds good to me .. make it more packed... > (Hmm. Maybe you were trying to make things align to a 64B cacheline?) > >> + __le64 btree_padding2; >> +}; >> + >> +# define EXT4_BLOCK_SIZE 4096 > Shouldn't this be the FS blocksize? It'll waste a lot of space on a 64k block > filesystem (reduced fanout opportunities as well as 60K of empty space in each > tree node) and I'm not sure what happens with 1k blocks. Again, sorry this is me being lazy in the prototype here. I will fix it properly with filesystem blocksize ... >> +#define EXT4_BTREE_HEADER_SIZE sizeof(struct ext4_btree_header) >> +#define EXT4_BTREE_NODESIZE EXT4_BLOCK_SIZE >> + >> +struct ext4_btree_key { >> + __le64 blockno; >> +}; >> +#define EXT4_BTREE_KEY_SIZE sizeof(struct ext4_btree_key) >> + >> +struct ext4_btree_val { >> + __le64 blockno; >> +}; >> +#define EXT4_BTREE_VAL_SIZE sizeof(struct ext4_btree_val) >> + >> +struct ext4_btree_rec { >> + struct ext4_btree_key key; >> + __le32 len; >> + __le32 ref_count; >> +}; > Hm. Looking forward to having btrees to track things other than refcount, I > imagine it'll become necessary to parameterize the record definition for each > type of btree? > totally agree. Will need to make it more generic by adding interface to define different type of btree, with own record_data structure and let the btree user to pass those info into the generic code.. > (Though, I guess this record format works well enough for the inode and block > bitmaps, and maybe that's as far as you want to go...) for reflink, block allocation, yes. But I like to make the code more generic if possible >> +#define EXT4_BTREE_REC_SIZE sizeof(struct ext4_btree_rec) >> + >> +#define EXT4_BTREE_MAX_KEY_VAL_PAIRS \ >> + ((EXT4_BTREE_NODESIZE - EXT4_BTREE_HEADER_SIZE) / \ >> + (EXT4_BTREE_KEY_SIZE + EXT4_BTREE_VAL_SIZE)) >> + >> +#define EXT4_BTREE_MIN_KEY_VAL_PAIRS \ >> + (EXT4_BTREE_MAX_KEY_VAL_PAIRS / 2) >> + >> +#define EXT4_BTREE_MAX_RECS \ >> + ((EXT4_BTREE_NODESIZE - EXT4_BTREE_HEADER_SIZE) / \ >> + EXT4_BTREE_REC_SIZE) >> + >> +#define EXT4_BTREE_MIN_RECS \ >> + (EXT4_BTREE_MAX_RECS / 2) >> + >> +#define EXT4_BTREE_MAX_LEVELS 8 >> + >> + >> +/* Index node */ >> +struct ext4_btree_index_node { >> + struct ext4_btree_header node_header; >> + struct ext4_btree_key key[EXT4_BTREE_MAX_KEY_VAL_PAIRS]; >> + struct ext4_btree_val val[EXT4_BTREE_MAX_KEY_VAL_PAIRS]; > (Changing EXT4_BTREE_NODESIZE to depend on the FS blocksize makes it impossible > to use arrays here, oh well...) Hmm.. okay, probably a pointer here. >> +}; >> + >> +/* Record Node */ >> +struct ext4_btree_leaf_node { >> + struct ext4_btree_header node_header; >> + struct ext4_btree_rec rec[EXT4_BTREE_MAX_RECS]; >> +}; >> + >> +struct ext4_btree_node { >> + struct ext4_btree_header node_header; >> +}; >> + >> +struct ext4_btree_root { >> + struct ext4_btree_node *node; >> + struct ext4_btree_geo geo; >> +}; >> + >> +#define ext4_btree_trace printf > printk? > > --D :) this was done in the user-space first. When the code finally all move to kernel will be something like printk Thanks a lot , for your review and comments!:) Mingming -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html