On Thursday, April 20, 2017 10:00:48 PM CEST Alexey Lyashkov wrote:
> Hi All,
>
> I ran some testing on my environment with the large dir patches provided by
> Artem. Each test runs 11 loops, creating 20680000 mknod objects in a normal
> directory and 20680000 in a large directory. The filesystem was reformatted
> before each test, and the files were created in the root directory so that
> inodes and blocks are allocated from GD#0 and up. The journal was an
> internal journal with a size of 4G. The kernel was RHEL 7.2 based with
> Lustre patches.
>
> Test script:
>
> #!/bin/bash
>
> LOOPS=11
>
> for i in `seq ${LOOPS}`; do
>     mkfs -t ext4 -F -I 256 -J size=4096 ${DEV}
>     mount -t ldiskfs ${DEV} ${MNT}
>     pushd ${MNT}
>     /usr/lib/lustre/tests/createmany -m test 20680000 >& /tmp/small-mknod${i}
>     popd
>     umount ${DEV}
> done
>
> for i in `seq ${LOOPS}`; do
>     mkfs -t ext4 -F -I 256 -J size=4096 -O large_dir ${DEV}
>     mount -t ldiskfs ${DEV} ${MNT}
>     pushd ${MNT}
>     /usr/lib/lustre/tests/createmany -m test 206800000 >& /tmp/large-mknod${i}
>     popd
>     umount ${DEV}
> done
>
> The tests were run on two nodes: the first node has storage on a RAID-10 of
> fast HDDs, the second node has an NVMe block device. The current directory
> code gives similar results on both nodes for the first test:
> - HDD node: 56k-65k creates/s
> - SSD node: ~80k creates/s
> But the large_dir test shows a large difference between the nodes:
> - HDD node: creation rate drops to 11k creates/s
> - SSD node: creation rate drops to 46k creates/s
>
> Initial analysis points to several problems:
>
> 0) CPU load isn't high, and perf top says the ldiskfs functions aren't hot
> (2%-3% CPU); most of the time is spent in the directory entry checking
> function.
>
> 1) lookup spends a long time reading a directory block to verify that the
> file does not exist. I think this is due to block fragmentation.
>
> [root@pink03 ~]# cat /proc/100993/stack
> [<ffffffff81211b1e>] sleep_on_buffer+0xe/0x20
> [<ffffffff812130da>] __wait_on_buffer+0x2a/0x30
> [<ffffffffa0899e6c>] ldiskfs_bread+0x7c/0xc0 [ldiskfs]
> [<ffffffffa088ee4a>] __ldiskfs_read_dirblock+0x4a/0x400 [ldiskfs]
> [<ffffffffa08915af>] ldiskfs_dx_find_entry+0xef/0x200 [ldiskfs]
> [<ffffffffa0891b8b>] ldiskfs_find_entry+0x4cb/0x570 [ldiskfs]
> [<ffffffffa08921d5>] ldiskfs_lookup+0x75/0x230 [ldiskfs]
> [<ffffffff811e8e7d>] lookup_real+0x1d/0x50
> [<ffffffff811e97f2>] __lookup_hash+0x42/0x60
> [<ffffffff811ee848>] filename_create+0x98/0x180
> [<ffffffff811ef6e1>] user_path_create+0x41/0x60
> [<ffffffff811f084a>] SyS_mknodat+0xda/0x220
> [<ffffffff811f09ad>] SyS_mknod+0x1d/0x20
> [<ffffffff81645549>] system_call_fastpath+0x16/0x1b
> [<ffffffffffffffff>] 0xffffffffffffffff

I wrote patches for ext4 a long time ago to get better caching for that:
https://patchwork.ozlabs.org/patch/101200/

For FhGFS/BeeGFS we then decided to use a totally different directory layout,
which eliminated the underlying issue for the main requirement of large dirs
entirely. (Personally, I would recommend doing something similar for Lustre -
using hash dirs to store objects has a much too random access pattern once
the file system gets used with many files...)

Also, a caching issue was fixed by Mel Gorman in 3.11 (I didn't check whether
those patches have been backported to any vendor kernel).

Bernd
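
[Editor's sketch, not part of the original report] One way to confirm that the
drop in creation rate in point 0) is bound by synchronous directory-block
reads rather than CPU is to watch the device and the createmany process while
the test runs. This assumes ${DEV} from the test script above and the sysstat
package (iostat/pidstat); the process name "createmany" matches the binary
used in the script:

    # Read rate and latency on the test device; r/s and await should climb
    # as the directory grows if lookups are stalling on directory-block reads.
    iostat -x 1 ${DEV}

    # Per-process read throughput of the createmany task
    # (requires per-task I/O accounting in the kernel):
    pidstat -d 1 -p $(pgrep -x createmany)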
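
[Editor's sketch, not part of the original report] The block-fragmentation
hypothesis in point 1) could be checked directly with debugfs from e2fsprogs,
assuming ${DEV} is the (unmounted) test device and the files were created in
the root directory, as in the script above:

    # Show the root directory's inode, including its block/extent list; many
    # small, non-contiguous extents would confirm that the directory blocks
    # themselves are fragmented on disk.
    debugfs -R "stat /" ${DEV}

    # Dump the htree index of the hashed directory (index levels and the hash
    # ranges mapped to each directory block):
    debugfs -R "htree_dump /" ${DEV}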