On 2014-09-24 17:01, Ryusuke Konishi wrote:
> On Wed, 24 Sep 2014 10:01:05 +0200, Andreas Rohner wrote:
>> On 2014-09-23 18:35, Ryusuke Konishi wrote:
>>> On Tue, 23 Sep 2014 16:21:33 +0200, Andreas Rohner wrote:
>>>> On 2014-09-23 14:47, Ryusuke Konishi wrote:
>>>>> By the way, if you are interested in improving this sort of bad
>>>>> implementation, please consider improving the inode allocator that
>>>>> we can see at nilfs_ifile_create_inode().
>>>>>
>>>>> It always searches for a free inode starting from ino=0. It doesn't
>>>>> use the knowledge of the last allocated inode number (inumber) nor
>>>>> any locality of close-knit inodes such as a file and the directory
>>>>> that contains it.
>>>>>
>>>>> A simple strategy is to start finding a free inode from (inumber of
>>>>> the parent directory) + 1, but this may not work efficiently if the
>>>>> namespace has multiple active directories, and it requires that the
>>>>> inumbers of directories are suitably dispersed. On the other hand,
>>>>> it increases the number of disk reads and also increases the number
>>>>> of inode blocks to be written out if inodes are allocated too
>>>>> discretely.
>>>>>
>>>>> The optimal strategy may differ from that of other file systems
>>>>> because inode blocks are not allocated to static places in nilfs.
>>>>> For example, it may be better if we gather inodes of frequently
>>>>> accessed directories into the first valid inode block (on ifile)
>>>>> for nilfs.
>>>>
>>>> Sure, I'll have a look at it, but this seems to be a hard problem.
>>>>
>>>> Since one inode occupies 128 bytes, a typical 4096-byte block
>>>> contains 32 inodes. We could just allocate every directory inode
>>>> into an empty block with 31 free slots. Then any subsequent file
>>>> inode allocation would first search the 31 slots of the parent
>>>> directory and, if they are full, fall back to a search starting
>>>> with ino 0.
>>>
>>> We can utilize several characteristics of metadata files for this
>>> problem:
>>>
>>> - It supports a read-ahead feature. When the ifile reads an inode
>>>   block, we can expect that several subsequent blocks will be loaded
>>>   into the page cache in the background.
>>>
>>> - The B-tree of NILFS is efficient at holding sparse blocks. This
>>>   means that putting close-knit groups of 32 * n inodes far from
>>>   offset=0 is not so bad.
>>>
>>> - The ifile can now have private variables in the nilfs_ifile_info
>>>   (on-memory) struct. They are available to store context
>>>   information of the allocator without compatibility issues.
>>>
>>> - We can also use the nilfs_inode_info struct of directories to
>>>   store directory-based context of the allocator without losing
>>>   compatibility.
>>>
>>> - The only caller of nilfs_ifile_create_inode() is nilfs_new_inode(),
>>>   and this function knows the inode of the parent directory.
>>
>> Then the only problem is how to efficiently allocate the directories.
>> We could do something similar to the Orlov allocator used by the
>> ext2/3/4 file systems:
>>
>> 1. We spread out first-level directories. Each one gets a full bitmap
>>    block (or half a bitmap block).
>> 2. For the other directories we try to choose the bitmap block of the
>>    parent unless the number of free inodes is below a certain
>>    threshold. Within this bitmap block the directories should also
>>    spread out.
>
> In my understanding, the basic strategy of the Orlov allocator is to
> physically spread out subtrees over cylinder groups. This strategy is
> effective for ext2/ext3/ext4 to mitigate the overhead that comes from
> disk seeks. The strategy increases the locality of data and metadata
> and that of a parent directory and its child nodes, but the same thing
> isn't always true for nilfs, because the real block allocation of the
> ifile and of other files, including directories, is virtualized and
> doesn't reflect the underlying physics (e.g. the relation between LBA
> and seek time) as is.
>
> I think strategy 1 above doesn't make sense, unlike for ext2/3/4.

I know that it is a sparse file and that the blocks can end up anywhere
on disk, independent of the offset in the ifile. I just thought it may
be a good idea to give top-level directories more room to grow. But you
are probably right and it makes no sense for nilfs...
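Just to make sure we mean the same thing by "use the parent's block",
here is a small user-space toy model of the search order I have in
mind. It is plain C with a byte array standing in for the ifile bitmap;
none of the names come from the NILFS code, and it ignores locking, the
palloc group layout and dirty-block accounting:

/*
 * Toy model of the proposed search order: a directory tries to claim a
 * completely empty inode block, a file first tries the block that
 * holds its parent directory's inode, and both fall back to the
 * current scan-from-ino-0 behaviour.  32 inodes per 4K block assumes
 * 128-byte inodes.
 */
#include <stdio.h>
#include <string.h>

#define INODES_PER_BLOCK 32
#define NBLOCKS          1024
#define NINODES          (INODES_PER_BLOCK * NBLOCKS)

static unsigned char used[NINODES];	/* 1 = inode number is allocated */

/* Linear scan for a free slot in [from, to); -1 if the range is full. */
static long find_free(long from, long to)
{
	for (long i = from; i < to; i++)
		if (!used[i])
			return i;
	return -1;
}

/* First slot of a completely empty inode block, or -1 if none is left. */
static long find_empty_block(void)
{
	for (long b = 0; b < NBLOCKS; b++)
		if (!memchr(used + b * INODES_PER_BLOCK, 1, INODES_PER_BLOCK))
			return b * INODES_PER_BLOCK;
	return -1;
}

static long alloc_ino(long parent_ino, int is_dir)
{
	long start = parent_ino - parent_ino % INODES_PER_BLOCK;
	long ino;

	if (is_dir)
		ino = find_empty_block();
	else
		ino = find_free(start, start + INODES_PER_BLOCK);
	if (ino < 0)
		ino = find_free(0, NINODES);	/* today's behaviour */
	if (ino >= 0)
		used[ino] = 1;
	return ino;
}

int main(void)
{
	used[2] = 1;			/* pretend ino 2 is the root directory */
	long dir = alloc_ino(2, 1);	/* new directory under the root */
	long f1 = alloc_ino(dir, 0);	/* two files inside it */
	long f2 = alloc_ino(dir, 0);

	printf("dir=%ld file1=%ld file2=%ld\n", dir, f1, f2);
	return 0;
}

The fallback keeps the behaviour identical to the current allocator
whenever the preferred block is full, so the worst case should not get
any worse.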
>> File inodes will just start a linear search at the parent's inode if
>> there is enough space left in the bitmap.
>>
>>>> This way, if a directory has fewer than 32 files, all its inodes
>>>> can be read in with one single block. If a directory has more than
>>>> 32 files, its inodes will spill over into the slots of other
>>>> directories.
>>>>
>>>> But I am not sure if this strategy would pay off.
>>>
>>> Yes, for small namespaces the current implementation may be enough.
>>> We should first decide how we evaluate the effect of the algorithm.
>>> It may be the scalability of the namespace.
>>
>> It will be very difficult to measure the time accurately. I would
>> suggest simply counting the number of reads and writes on the device.
>> This can easily be done:
>>
>> mkfs.nilfs2 /dev/sdb
>>
>> cat /proc/diskstats > rw_before.txt
>>
>> do_tests
>>
>> extract_kernel_sources
>>
>> ...
>>
>> find /mnt
>>
>> cat /proc/diskstats > rw_after.txt
>>
>> The algorithm with fewer writes and reads wins.
>>
>> I am still not convinced that all of this will pay off, but I will
>> try a few things and see if it works.
>
> How about measuring the following performance?
>
> (1) Latency of inode allocation and deletion in a file system which
>     holds millions of inodes.
>
>     (Can you estimate the cost of the bitmap scan per block and its
>     relation to the number of inodes?)
>
> (2) Time to traverse a real model directory tree such as a full UNIX
>     system tree or the data storage of an actually used Samba server.
>
> What do you think?

Yes, I agree. We definitely need a test case with millions of inodes
that measures allocation latency and traversal time. But I think that
even with millions of inodes both algorithms will be very fast, so any
time measurement will contain a lot of variability from sources we
cannot control, and it will be hard to measure a difference. That is
why I suggested measuring the number of blocks read from and written to
disk, which is easy, repeatable and a good indicator of the performance
of the algorithm.

Regards,
Andreas Rohner
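P.S.: For completeness, this is roughly the throwaway helper I would
use to diff the two diskstats snapshots (nothing NILFS-specific, and
the name is just for illustration; the field positions are the ones
documented in Documentation/iostats.txt, and the counts are in
512-byte sectors):

/*
 * diskstat_delta: print how many sectors were read and written on one
 * device between two saved copies of /proc/diskstats, e.g.
 *
 *   ./diskstat_delta sdb rw_before.txt rw_after.txt
 */
#include <stdio.h>
#include <string.h>

static int read_stats(const char *path, const char *dev,
		      unsigned long long *rd_sec, unsigned long long *wr_sec)
{
	char line[512], name[64];
	unsigned long long rd, wr;
	FILE *fp = fopen(path, "r");

	if (!fp)
		return -1;
	while (fgets(line, sizeof(line), fp)) {
		/* major minor name reads merged rd_sectors ms writes merged wr_sectors ... */
		if (sscanf(line, "%*u %*u %63s %*u %*u %llu %*u %*u %*u %llu",
			   name, &rd, &wr) == 3 && !strcmp(name, dev)) {
			*rd_sec = rd;
			*wr_sec = wr;
			fclose(fp);
			return 0;
		}
	}
	fclose(fp);
	return -1;
}

int main(int argc, char *argv[])
{
	unsigned long long rd0, wr0, rd1, wr1;

	if (argc != 4) {
		fprintf(stderr, "usage: %s <dev> <before> <after>\n", argv[0]);
		return 1;
	}
	if (read_stats(argv[2], argv[1], &rd0, &wr0) ||
	    read_stats(argv[3], argv[1], &rd1, &wr1)) {
		fprintf(stderr, "%s not found in %s or %s\n",
			argv[1], argv[2], argv[3]);
		return 1;
	}
	printf("%s: %llu sectors read, %llu sectors written\n",
	       argv[1], rd1 - rd0, wr1 - wr0);
	return 0;
}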