Hi, list, I am working on the directory inode reservation feature now. Here is the detailed description of my understand of the designing, and current implementations. Please give me your comments on this idea. Thanks for your help in advance. Best regards. Coly Li ----------------------------------------------------------------------------- Designing and Implementation of Directory Inode Reservation version 0.1 Coly Li This text explains what is the idea of directory inode reservation, and current designing and implementation for this idea. Andreas Dilger and Danial Phillips developed this idea when ext3 htree was first written, now it is the time to implement it. 1. Issues for current inode allocating Currently ext3 and ext4dev allocate inodes in linear order within each block groups. The linear allocating may causes bad performance when stat or unlink huge number of files under a directory recursively. The reasons are: * Inodes are allocated in linear order, while dentries of files are accessed by hashed order in directory files. The difference in ordering may cause a single inode block in inode table to be submitted multiple times. For example, in hashed order of directory file, the inode of first accessed file is in second inode block, inode of second accessed file is in first inode block, inode of third accessed file is in second inode block ... This will cause each inode block be dirtied and submitted into journal or nature filesystem multiple times, especially in data=writeback mode. * Inodes of files in different sub-directories may be allocated in one inode block. This condition will also cause multiple dirtying and submitting to this inode block, it can not be helpful that even the inodes of same directory are in hashed order. The issue will happen when creating huge number of files under a directory, and even worse when creating huge number of files under multiple directories alternately within one block group. 2. Improve performance by inode reservation for sub-directories Inode reservation for sub-directories means when creating a sub-directory, reserve a number of continuous inodes in inode table for it. When creating new files under the sub-directory, inodes can be allocated from the reserved region. Once the reserved region is full, just find another larger reservd region in inode tables. * First goal, make new file inodes of same directory be allocated from reserved inode region. * Second goal, make new file inodes of same directory be allocated in hashed (like) order from reserved inode region. The first goal can avoid inodes from different sub-directories mixed in one inode blocks. The second goal can try best to make inodes allocating order follow hashed order of dentries in directory file. Both can decrease multiple times for inode block dirting and submitting. 3. Benchmarks for ideal performance improvement A benchmark is done for ideal condition, the improved results are impressive (copy operations are done on differenct harddisk, all the files are 0 byte). Operations are: * cd hdiskA/sub; for i in `seq 1 500000`;do touch `/usr/bin/keygen | head -c 8`;done;done * reboot * time cp -r hdiskA/sub hdiskB/ordered1 * cp -r hdiskB/ordered1 hdiskA/ordered1 * reboot * time cp -r hdiskA/ordered1 hdiskB/ordered2 * reboot * time rm -rf hdiskA/ordered1 * time rm -rf hdiskA/sub Here are the results for different journaling modes: a) data=writeback mode "cp -r hdiskA/sub hdiskB/ordered1" | "cp -r hdiskA/ordered1 hdiskB/ordered2" real 7m17.616s | real 1m8.764s user 0m1.456s | user 0m1.568s sys 0m27.586s | sys 0m26.050s "rm -rf hdiskA/sub" | "rm -rf hdiskA/ordered1" real 9m49.902s | real 0m37.493s user 0m0.220s | user 0m0.076s sys 0m14.377s | sys 0m11.089s b) data=ordered "cp -r hdiskA/sub hdiskB/ordered1" | "cp -r hdiskA/ordered1 hdiskB/ordered2" real 7m57.016s | real 7m46.037s user 0m1.632s | user 0m1.604.s sys 0m25.558s | sys 0m24.902s "rm -rf hdiskA/sub" | "rm -rf hdiskA/ordered1" real 10m17.966s | real 6m32.278s user 0m0.236s | user 0m0.176s sys 0m14.453s | sys 0m12.093s c) data=journal "cp -r hdiskA/sub hdiskB/ordered1" | "cp -r hdiskA/ordered1 hdiskB/ordered2" real 6m54.151s | real 7m7.696s user 0m1.696s | user 0m1.416s sys 0m22.705s | sys 0m23.541s "rm -rf hdiskA/sub" | "rm -rf hdiskA/ordered1" real 10m41.150s | real 7m43.703s user 0m0.188s | user 0m0.216s sys 0m13.781s | sys 0m12.117s From the results we can find that if inodes of same directory are allocated continuously, and in same order of the dentries hashed in directory file, there will be much performance improved. In data=writeback mode, the improvement is astonishing. In data=ordered and data=journal mode, performance of stat can not be improved much, but unlink can be improved around 27% ~ 36%. Andreas Dilger points out that in this benchmark, all files are 0 byte. If allocated data blocks to these files (as in practice), performance improvement can be more obvious. 4. Designing and implementation The designing principle is quite simple. * When a directory is created, reserved a couple of inodes for it. * When a file under this directory is created, allocating inodes from this reserved region. * When allocating an inode in reserved inode region, * If no free inode in the reserved region, find a another double sized continuous inodes region from the inode tables. * If there is no available inodes for inode reservation, use original linear inodes allocator. * Do not change the on disk layout of ext4 filesystem. * Some inodes will serve as magic inodes. There are 2 kinds of magic inode, EXT4_MINODE_TYPE_LASTRES records the offset of latest reserved inode for directory in a inode table of a block group, EXT4_MINODE_TYPE_LINK records the inode number at the head of next reserved inode area for the same directory. Here are the details for implementation. a) Initial number of reserved inodes is 1 block in inode table. When the reserved area is full, looking for a doubled size free inode block for the directory, until there is not available free inode blocks in all block groups, or the size of reserved inode area reachs maximum size of one inode table. b) The last inode in each reserved inode block are EXT4_MINODE_TYPE_LINK magic inode. In this magic inode, records the head inode number of next double sized inode block, and the size for next reserved inode block for the same directory, and whethere there is no next reserved inode block for the directory. c) For each block group, there is a EXT4_MINODE_TYPE_LASTRES magic inode stays in the last inode of last inode block of inode table. This magic inode records the offset of last inode used in reservation within its inode table. When creating a new directory in a block group, the inodes for itself and reserved inodes can be allocated after the offset recorded in the EXT4_MINODE_TYPE_LASTRES magic inode. d) Inode reservation for sub-directory is different to inode/block reservation for root user. The latter one will set all the reserved inode/block in inode/block bitmap as busy, so others can not use the inode or block within the reserved region. In the sub-directory inode reservation, all inodes in reserved region are still free in bitmap, only magic inodes are set as busy in inode bitmap. By this scheme, others also can allocated inodes from the reserved area (e.g. in case of no available reservation inode area for new created sub-directory), but this is the lastest choice. e) Inode of each sub-directory will be the first inode in reserved inode blocks for itself. When a new file is created under this sub-directory, the inode for the new file will be allocated from the reserved area. Firstly, the inode will try to be allocated in a hashed order (same as htree for dentries) other than current linear order. Secondly, if the resulted position is allocated already, find a new position near around. If no near around free inode can be found to allocate, the inode will be allocated by linear order in the reserved inode area. f) In theory, the smallest size of reserved inode area is 1 inode block in inode talbe of a block group, while the largest size of reserved inode area is the whole inode table of a block group (except inodes reserved for root user and all magic inodes). g) When creating a new directory, the target block group is decided by orlov allocator. But a restriction should be patched in the orlov allocator, that is the target block group should also have enough continuous inodes in inode table to be reserved for the new created directory. h) When creating a new file under a directory, the initial target block group is the same block group with the dirctory is located on. But if the inode will be allocated in a reserved inode area on another block group, the target block group will be switched. i) Maigc inode will mark with EXT4_MINODE_MAGIC_STR string, kernel and fsck can verify it by magic string and checksum. j) Mke2fs and fsck should be patched to understand magic inodes for sub-directory inode reservation. 5. Compatibility issues Current designing and implementation can work well with legacy e2fsprogs and ext3/ext4dev kernel code. * If a sub-directory inode reservation enabled ext4 partition is mounted on a ext3 or current ext4dev kernel, the magic inodes will be taken as busy inodes, and no affect to normal inodes allocating or reaping. * If a sub-directory inode reservation enabled ext4 partition is checked by a magic-inode-unknown legacy fsck, the magic inodes will be taken as corrupted inodes and be reaped. For the next mount, the partition will work as a sub-directory inode reservation disabled parition. * For other conditions, kernel will try to fix the error firstly. Only when the error is unrecoverable, kernel will disable the inode reservation automatically. e.g. a incorrect EXT4_MINODE_TYPE_LASTRES inode is found in a sub-directory inode reservation enabled ext4 partition. Kernel will try to fix offset recorded in this magic inode, if this error is unrecoverable, kernel will disable inode reservation, and any new inode will be allocated in traditional linear order. Therefor, even several magic inodes are added into the ext4 ondisk layout, basic format is unchanged. The new format can continue to work with legacy kernel or e2fsprogs. 6. Modification in ext4 source code As Andreas Dilger predicts, about 500 lines C code needed to implement this patch. The patch will be in: * Super block filling code. Kernel needs to check whether inode reservation is enabled in the mounting partition, and try best to fix founded magic inode errors. * Orlov allocator. Add extra restriction to choose the target block group for inode reservation. * Inode allocator for non-directory files. New inodes will not be allocated as the first free inode in inode table of a block group, it will be allocated from reserved inode area of the directory. * Other unpredictable places in ext4 kernel code. e.g. If dynamic inode allocation is accepted, now patch will be made to implement sub-directory inode reservation in dynamic inode allocation. 7. Expected performance improvement I am not confident to make improvement as better as the ideal condition benchmark. How about 50% improvement of the benchmark in ideal condition? Who knows, let me implement it firstly. The benchmakr will give us the result :-) - To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html