On Tue, Feb 16, 2016 at 09:39:28AM +0100, David Casier wrote:
> "With this model, filestore rearrange the tree very
> frequently : + 40 I/O every 32 objects link/unlink."
> It is the consequence of parameters :
> filestore_merge_threshold = 2
> filestore_split_multiple = 1
>
> Not of ext4 customization.

It's a function of the directory structure you are using to work
around the scalability deficiencies of the ext4 directory structure.
i.e. the root cause is that you are working around an ext4 problem.

> The large amount of objects in FileStore require indirect access and
> more IOPS for every directory.
>
> If root of inode B+tree is a simple block, we have the same problem
> with XFS

Only if you use the same 32-entries-per-directory constraint. Get rid
of that constraint and start thinking about storing tens of thousands
of files per directory instead. i.e. let the directory structure
handle IO optimisation as the number of entries grows, rather than
imposing artificial limits that prevent it from working efficiently.

Put simply, XFS is more efficient in terms of the average physical IO
per random inode lookup with shallow, wide directory structures than
it will be with a narrow, deep setup that is optimised to work around
the shortcomings of ext3/ext4.

When you use deep directory structures to index millions of files, you
have to assume that any random lookup will require directory inode IO.
When you use wide, shallow directories you can almost guarantee that
the directory inodes will remain cached in memory because they are so
frequently traversed. Hence we never need to do IO for directory
inodes in a wide, shallow config, and so that IO can be ignored.

So let's assume, for ease of maths, that we have 40 byte dirent
structures (~24 byte file names). That means a single 4k directory
block can index approximately 60-70 entries. More than this, and XFS
switches to a more scalable multi-block ("leaf", then "node") format.

When XFS moves to a multi-block structure, the first block of the
directory is converted to a name hash btree that allows finding any
directory entry in one further IO. The hash index is made up of 8 byte
entries, so for a 4k block it can index 500 entries in a single IO.
IOWs, a random, cold cache lookup across 500 directory entries can be
done in 2 IOs. Now let's add a second level to that hash btree - we
have 500 hash index leaf blocks that can be reached in 2 IOs, so now
we can reach 25,000 entries in 3 IOs. And in 4 IOs we can reach 2.5
million entries.

It should be noted that the length of the directory entries doesn't
affect this lookup scalability because the index is based on 4 byte
name hashes. Hence it has the same scalability characteristics
regardless of the name lengths; it is only affected by changes in
directory block size.

If we consider your current "1 IO per directory" config using a 32
entry structure, it's 1024 entries in 2 IOs, 32768 in 3 IOs, and with
4 IOs it's 1 million entries. This assumes we can fit 32 entries in
the inode core, which we should be able to do for the nodes of the
tree, but the leaves with the file entries probably carry full object
names and so are likely to be in block format. I've ignored this and
assumed the leaf directories pointing to the objects are also inline.

IOWs, by the time we get to needing 4 IOs to reach the file store leaf
directories (i.e. > ~30,000 files in the object store), a single XFS
directory is going to have the same or better IO efficiency than your
fixed configuration.
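If you want to play with those numbers yourself, here's a rough
back-of-the-envelope script (plain python, helper names made up for
illustration - nothing here is ceph or XFS code) that models the fixed
32-entries-per-directory hierarchy: entries reachable in N IOs is just
the fanout raised to N, and the directory count for a given file count
is the sum of the tree levels. It reproduces the 32/1k/32k/1m
progression above and the narrow-tree directory counts in the tables
below.

#!/usr/bin/env python3
# Back-of-the-envelope numbers for the fixed 32-entries-per-directory
# hierarchy described above. Purely illustrative.
import math

FANOUT = 32   # entries per directory in the fixed filestore layout

def entries_reachable(io_count, fanout=FANOUT):
    # one IO per tree level, so reach is simply fanout^io_count
    return fanout ** io_count

def directories_needed(file_count, fanout=FANOUT):
    # leaf directories plus all the interior levels above them
    total = 0
    level = math.ceil(file_count / fanout)
    while level > 1:
        total += level
        level = math.ceil(level / fanout)
    return total + 1    # +1 for the root of the hierarchy

for ios in (1, 2, 3, 4):
    print(ios, "IOs:", entries_reachable(ios), "entries reachable")
# -> 32, 1024 (~1k), 32768 (~32k), 1048576 (~1m)

for files in (1000, 100 * 1000, 1000 * 1000, 100 * 1000 * 1000):
    print(files, "files:", directories_needed(files), "directories")
# -> roughly 32, 3200, 32k and 3.2m directories respectively

Change FANOUT to see how quickly the narrow tree blows out the
directory count compared to letting the filesystem index wide
directories for you.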
And we can make XFS even better - with an 8k directory block size,
2 IOs reach 1000 entries, 3 IOs reach a million entries, and 4 IOs
reach a billion entries.

So, in summary, the number of entries that can be indexed in a given
number of IOs:

    IO count        1     2     3     4
    32 entry wide   32    1k    32k   1m
    4k dir block    70    500   25k   2.5m
    8k dir block    150   1k    1m    1000m

And the number of directories required for a given number of files if
we limit XFS directories to 3 internal IOs:

    file count      1k    10k   100k   1m    10m    100m
    32 entry wide   32    320   3200   32k   320k   3.2m
    4k dir block    1     1     5      50    500    5k
    8k dir block    1     1     1      1     11     101

So, as you can see, once you make the directory structure shallow and
wide, you can reach many more entries in the same number of IOs, and
there is a much lower inode/dentry cache footprint when you do so.

IOWs, on XFS you design the hierarchy to provide the necessary
lookup/modification concurrency, as IO scalability as file counts rise
is already efficiently handled by the filesystem's directory
structure. Doing this means the file store does not need to rebalance
every 32 create/unlink operations. Nor do you need to be concerned
about maintaining a working set of directory inodes in cache under
memory pressure - the directory entries become the hottest items in
the cache and so will never get reclaimed.

Cheers,

Dave.
-- 
Dave Chinner
dchinner@xxxxxxxxxx
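P.S. For completeness, the "every 32 objects" rebalance figure quoted
at the top falls out of the filestore split rule - assuming I'm
reading the config reference right, a subdir splits once it holds more
than filestore_split_multiple * abs(filestore_merge_threshold) * 16
objects. A trivial sketch (illustrative values only, not real ceph
code):

#!/usr/bin/env python3
# Where the "+ 40 I/O every 32 objects" trigger comes from, assuming
# the split rule described in the filestore config reference:
#   a subdir splits when it holds more than
#   filestore_split_multiple * abs(filestore_merge_threshold) * 16
# objects. Values below are illustrative, not recommendations.

def split_threshold(split_multiple, merge_threshold):
    return split_multiple * abs(merge_threshold) * 16

# the configuration quoted at the top of this mail:
print(split_threshold(split_multiple=1, merge_threshold=2))    # -> 32

# raising the thresholds lets directories grow into the range XFS
# indexes efficiently before any split happens, e.g.:
print(split_threshold(split_multiple=8, merge_threshold=40))   # -> 5120

i.e. with split_multiple = 1 and merge_threshold = 2 you've told the
file store to split at 32 objects; raising those values is what lets
the directories grow wide enough for XFS to do the indexing work for
you.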