On Thursday, April 20, 2017 10:00:48 PM CEST Alexey Lyashkov wrote:
> Hi All,
>
> I ran some testing on my environment with the large dir patches provided by
> Artem. Each test runs 11 loops, creating 20680000 mknod objects in a normal
> directory and 20680000 in a large directory. The filesystem was reformatted
> before each test, and the files were created in the root directory so that
> inodes and blocks are allocated from GD#0 and up. The journal was an
> internal journal with a size of 4G. The kernel was RHEL 7.2 based with
> Lustre patches.
>
> Test script:
>
> #!/bin/bash
>
> LOOPS=11
>
> for i in `seq ${LOOPS}`; do
>     mkfs -t ext4 -F -I 256 -J size=4096 ${DEV}
>     mount -t ldiskfs ${DEV} ${MNT}
>     pushd ${MNT}
>     /usr/lib/lustre/tests/createmany -m test 20680000 >& /tmp/small-mknod${i}
>     popd
>     umount ${DEV}
> done
>
> for i in `seq ${LOOPS}`; do
>     mkfs -t ext4 -F -I 256 -J size=4096 -O large_dir ${DEV}
>     mount -t ldiskfs ${DEV} ${MNT}
>     pushd ${MNT}
>     /usr/lib/lustre/tests/createmany -m test 206800000 >& /tmp/large-mknod${i}
>     popd
>     umount ${DEV}
> done
>
> The tests were run on two nodes: the first node has storage on a RAID-10 of
> fast HDDs, the second node has an NVMe block device. The current directory
> code gives similar results on both nodes for the first test:
> - HDD node: 56k-65k creates/s
> - SSD node: ~80k creates/s
> But the large_dir test shows a large difference between the nodes:
> - HDD node: creation rate drops to 11k creates/s
> - SSD node: creation rate drops to 46k creates/s
>
> Initial analysis points to several problems:
>
> 0) CPU load isn't high, and perf top says the ldiskfs functions aren't hot
> (2%-3% CPU); most of the time is spent in the directory entry checking
> function.
>
> 1) lookup spends a long time reading a directory block to verify that the
> file does not exist. I think this is due to block fragmentation.
>
> [root@pink03 ~]# cat /proc/100993/stack
> [<ffffffff81211b1e>] sleep_on_buffer+0xe/0x20
> [<ffffffff812130da>] __wait_on_buffer+0x2a/0x30
> [<ffffffffa0899e6c>] ldiskfs_bread+0x7c/0xc0 [ldiskfs]
> [<ffffffffa088ee4a>] __ldiskfs_read_dirblock+0x4a/0x400 [ldiskfs]
> [<ffffffffa08915af>] ldiskfs_dx_find_entry+0xef/0x200 [ldiskfs]
> [<ffffffffa0891b8b>] ldiskfs_find_entry+0x4cb/0x570 [ldiskfs]
> [<ffffffffa08921d5>] ldiskfs_lookup+0x75/0x230 [ldiskfs]
> [<ffffffff811e8e7d>] lookup_real+0x1d/0x50
> [<ffffffff811e97f2>] __lookup_hash+0x42/0x60
> [<ffffffff811ee848>] filename_create+0x98/0x180
> [<ffffffff811ef6e1>] user_path_create+0x41/0x60
> [<ffffffff811f084a>] SyS_mknodat+0xda/0x220
> [<ffffffff811f09ad>] SyS_mknod+0x1d/0x20
> [<ffffffff81645549>] system_call_fastpath+0x16/0x1b
> [<ffffffffffffffff>] 0xffffffffffffffff

I wrote patches for ext4 a long time ago to get better caching for that:
https://patchwork.ozlabs.org/patch/101200/

For FhGFS/BeeGFS we then decided to use a totally different directory layout,
which eliminated the underlying issue for the main requirement of large dirs
entirely. (Personally, I would recommend doing something similar for Lustre -
using hash dirs to store objects has a much too random access pattern once
the file system gets used with many files...)

Also, a caching issue was fixed by Mel Gorman in 3.11 (I didn't check whether
those patches have been backported to any vendor kernel).

Bernd
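
[Editor's sketch, not part of the original report] One way to confirm that the
drop in creation rate in point 0) is bound by synchronous directory-block
reads rather than CPU is to watch the device and the createmany process while
the test runs. This assumes ${DEV} from the test script above and the sysstat
package (iostat/pidstat); the process name "createmany" matches the binary
used in the script:

    # Read rate and latency on the test device; r/s and await should climb
    # as the directory grows if lookups are stalling on directory-block reads.
    iostat -x 1 ${DEV}

    # Per-process read throughput of the createmany task
    # (requires per-task I/O accounting in the kernel):
    pidstat -d 1 -p $(pgrep -x createmany)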
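
[Editor's sketch, not part of the original report] The block-fragmentation
hypothesis in point 1) could be checked directly with debugfs from e2fsprogs,
assuming ${DEV} is the (unmounted) test device and the files were created in
the root directory, as in the script above:

    # Show the root directory's inode, including its block/extent list; many
    # small, non-contiguous extents would confirm that the directory blocks
    # themselves are fragmented on disk.
    debugfs -R "stat /" ${DEV}

    # Dump the htree index of the hashed directory (index levels and the hash
    # ranges mapped to each directory block):
    debugfs -R "htree_dump /" ${DEV}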