Linus Torvalds wrote:
> On Sun, Sep 14, 2014 at 7:49 PM, Tetsuo Handa
> <penguin-kernel@xxxxxxxxxxxxxxxxxxx> wrote:
> >
> > I care, at least as of RHEL6 (2.6.32) kernels. Some users are using tmpfs
> > in order to avoid dentry cache bouncing bomb (a series of simple stat() calls
> > take about 10 seconds when there are 300,000,000+ unused dentries, P31 of
> > http://I-love.SAKURA.ne.jp/tomoyo/LCJ2014-en.pdf ). I don't know performance
> > on recent kernels.
>
> Hmm. Probably not hugely different. It's still fairly easy to generate
> a lot of negative dentries. Do you have an actual test program that
> does this and is annoying? Do people actually do that?

OK. I quote comments #5 and #9 of RHBZ 1061562 below. The test program I used
(the source code of a.out) is in comment #5. The customer in comment #9 is now
using tmpfs, and so far I have heard of no problems.

----- comment #5 start -----

At least two enterprise systems which create a lot of dentries are suffering
unexpected system reboots/hangups. One of the systems has about 100GB of RAM,
and its sar report showed that the number of unused dentries was over
300,000,000 just before an unexpected reboot happened. Also, running the
dentry producer program described in comment #0 on that system successfully
triggered another unexpected reboot. Therefore, I am examining the cause of
these unexpected reboots.

Since I had a chance to access a machine with 64GB of RAM this week, I ran a
test using SystemTap and an updated version of the dentry producer program,
both shown below.

---------- a SystemTap command line start ----------
stap -e '
global stat_start;
global shrink_start;
global counter;
probe kernel.function("vfs_stat") {
        counter++;
        stat_start[tid()] = gettimeofday_ns();
}
probe kernel.function("vfs_stat").return {
        t = gettimeofday_ns() - stat_start[tid()];
        if (t >= 1000000)
                printf("%s: vfs_stat(%u) took %u ms\n", execname(), counter, t / 1000000);
}
probe kernel.function("shrink_slab") {
        printf("%s: shrink_slab() started\n", execname());
        shrink_start[tid()] = gettimeofday_ns();
}
probe kernel.function("shrink_slab").return {
        t = gettimeofday_ns() - shrink_start[tid()];
        printf("%s: shrink_slab() took %u ms\n", execname(), t / 1000000);
}
'
---------- a SystemTap command line end ----------

---------- an example of very-fast dentry producer start ----------
#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char *argv[])
{
        unsigned int i = 0;
        struct stat buf;
        char buffer[128] = { };

        while (1) {
                snprintf(buffer, sizeof(buffer) - 1, "%u", i++);
                stat(buffer, &buf);
        }
        return 0;
}
---------- an example of very-fast dentry producer end ----------

The test result showed that a single stat() call takes a few seconds while
kswapd is reclaiming memory in shrink_slab(), and that a series of a few
stat() calls takes more than 10 seconds.
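For reference, the growth of unused dentries while a.out is running can be
watched via /proc/sys/fs/dentry-state; the first two fields of that file are
the total and unused dentry counts. The sketch below is only an illustration
(it is not the tool used for this test) and simply assumes the usual six
integer fields in that file.

---------- an example of dentry-state watcher start ----------
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
        long v[6] = { };
        FILE *fp;

        while (1) {
                fp = fopen("/proc/sys/fs/dentry-state", "r");
                if (fp) {
                        /* first two fields: total and unused dentries */
                        if (fscanf(fp, "%ld %ld %ld %ld %ld %ld",
                                   &v[0], &v[1], &v[2], &v[3], &v[4], &v[5]) == 6)
                                printf("dentry: total=%ld unused=%ld\n", v[0], v[1]);
                        fclose(fp);
                }
                sleep(1);
        }
        return 0;
}
---------- an example of dentry-state watcher end ----------

The SystemTap output from the run follows.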
---------- SystemTap output start ----------
bash: shrink_slab() started
bash: shrink_slab() took 18 ms
bash: shrink_slab() started
bash: shrink_slab() took 0 ms
crond: vfs_stat(9964303) took 10 ms
crond: vfs_stat(79137289) took 6 ms
sa1: vfs_stat(79200631) took 9 ms
bash: vfs_stat(98784249) took 10 ms
run-parts: vfs_stat(98787397) took 5 ms
run-parts: vfs_stat(98791464) took 3 ms
0anacron: vfs_stat(98801077) took 15 ms
0anacron: vfs_stat(98804416) took 10 ms
kswapd0: shrink_slab() started
kswapd1: shrink_slab() started
a.out: vfs_stat(305840783) took 282 ms
a.out: vfs_stat(305840784) took 999 ms
a.out: vfs_stat(305840785) took 335 ms
a.out: vfs_stat(305840786) took 3654 ms
a.out: vfs_stat(305840787) took 3009 ms
a.out: vfs_stat(305840788) took 999 ms
a.out: vfs_stat(305840790) took 1228 ms
a.out: vfs_stat(305847338) took 2 ms
a.out: vfs_stat(305849179) took 3 ms
a.out: vfs_stat(305849870) took 2 ms
a.out: vfs_stat(305850478) took 1 ms
a.out: vfs_stat(305854046) took 1 ms
a.out: vfs_stat(305856251) took 1 ms
a.out: vfs_stat(305864002) took 5 ms
a.out: vfs_stat(305876155) took 6 ms
a.out: vfs_stat(305885622) took 12 ms
a.out: vfs_stat(305886187) took 9 ms
a.out: vfs_stat(305888365) took 2 ms
a.out: vfs_stat(305892807) took 2 ms
a.out: vfs_stat(305896127) took 2 ms
a.out: vfs_stat(305897223) took 2 ms
a.out: vfs_stat(305910449) took 1 ms
a.out: vfs_stat(305911688) took 8 ms
a.out: vfs_stat(305914482) took 12 ms
kswapd1: shrink_slab() took 11810 ms
kswapd1: shrink_slab() started
a.out: vfs_stat(305939079) took 7 ms
a.out: vfs_stat(305943797) took 21 ms
a.out: vfs_stat(305944628) took 12 ms
a.out: vfs_stat(305964155) took 3 ms
a.out: vfs_stat(305965994) took 15 ms
a.out: vfs_stat(305966685) took 15 ms
a.out: vfs_stat(305967124) took 1 ms
a.out: vfs_stat(305973015) took 2 ms
a.out: vfs_stat(305975511) took 4 ms
a.out: vfs_stat(305978080) took 3 ms
a.out: vfs_stat(305982128) took 3 ms
a.out: vfs_stat(305988765) took 4 ms
a.out: vfs_stat(305991667) took 3 ms
a.out: vfs_stat(305995990) took 3 ms
a.out: vfs_stat(306000642) took 1 ms
a.out: vfs_stat(306010369) took 3 ms
a.out: vfs_stat(306023386) took 10 ms
a.out: vfs_stat(306023881) took 9 ms
a.out: vfs_stat(306024374) took 3 ms
a.out: vfs_stat(306033201) took 5 ms
kswapd1: shrink_slab() took 1373 ms
a.out: vfs_stat(306038979) took 4 ms
a.out: vfs_stat(306045321) took 1 ms
a.out: vfs_stat(306048306) took 1 ms
a.out: vfs_stat(306050274) took 5 ms
a.out: vfs_stat(306053855) took 4 ms
a.out: vfs_stat(306054220) took 1 ms
a.out: vfs_stat(306075005) took 8 ms
a.out: vfs_stat(306076840) took 17 ms
a.out: vfs_stat(306085567) took 3 ms
a.out: vfs_stat(306101199) took 1 ms
a.out: vfs_stat(306102299) took 20 ms
a.out: vfs_stat(306109915) took 2 ms
a.out: vfs_stat(306111825) took 14 ms
a.out: vfs_stat(306114245) took 3 ms
a.out: vfs_stat(306114878) took 4 ms
kswapd0: shrink_slab() took 14025 ms
kswapd0: shrink_slab() started
a.out: vfs_stat(306133947) took 29 ms
a.out: vfs_stat(306138301) took 5 ms
a.out: vfs_stat(306144529) took 5 ms
a.out: vfs_stat(306149390) took 5 ms
a.out: vfs_stat(306164696) took 8 ms
a.out: vfs_stat(306165424) took 5 ms
a.out: vfs_stat(306176530) took 1 ms
a.out: vfs_stat(306180342) took 20 ms
a.out: vfs_stat(306187037) took 7 ms
a.out: vfs_stat(306192183) took 12 ms
a.out: vfs_stat(306201618) took 9 ms
a.out: vfs_stat(306203887) took 2 ms
a.out: vfs_stat(306204688) took 4 ms
a.out: vfs_stat(306206134) took 1 ms
a.out: vfs_stat(306207974) took 2 ms
a.out: vfs_stat(306220292) took 12 ms
a.out: vfs_stat(306224972) took 6 ms
a.out: vfs_stat(306231423) took 6 ms
a.out: vfs_stat(306232489) took 1 ms
a.out: vfs_stat(306236307) took 4 ms
a.out: vfs_stat(306238302) took 1 ms
a.out: vfs_stat(306245840) took 1 ms
a.out: vfs_stat(306249541) took 14 ms
a.out: vfs_stat(306251562) took 6 ms
a.out: vfs_stat(306260444) took 2 ms
a.out: vfs_stat(306275088) took 4 ms
a.out: vfs_stat(306278510) took 2 ms
a.out: vfs_stat(306278795) took 3 ms
kswapd0: shrink_slab() took 1693 ms
kswapd0: shrink_slab() started
a.out: vfs_stat(306295322) took 22 ms
a.out: vfs_stat(306307052) took 17 ms
a.out: vfs_stat(306309420) took 1 ms
a.out: vfs_stat(306315293) took 4 ms
a.out: vfs_stat(306325204) took 10 ms
a.out: vfs_stat(306325362) took 17 ms
a.out: vfs_stat(306329728) took 2 ms
a.out: vfs_stat(306334637) took 5 ms
a.out: vfs_stat(306350149) took 7 ms
a.out: vfs_stat(306352355) took 1 ms
a.out: vfs_stat(306354047) took 1 ms
a.out: vfs_stat(306356301) took 1 ms
a.out: vfs_stat(306384456) took 1 ms
a.out: vfs_stat(306388446) took 10 ms
a.out: vfs_stat(306391249) took 2 ms
a.out: vfs_stat(306393632) took 1 ms
a.out: vfs_stat(306407549) took 6 ms
a.out: vfs_stat(306414708) took 2 ms
a.out: vfs_stat(306438538) took 1 ms
a.out: vfs_stat(306447413) took 9 ms
a.out: vfs_stat(306453184) took 2 ms
a.out: vfs_stat(306459036) took 1 ms
kswapd0: shrink_slab() took 1691 ms
---------- SystemTap output end ----------

If a watchdog program does something like

  while (1) {
          read from /proc/$pid/status for process check
          sleep for 1 second for CPU saving
          write to /dev/watchdog for resetting the timer
  }

the watchdog program could be blocked for long enough that it fails to write
to /dev/watchdog within the timeout period.

So far, I suspect that the cause of the unexpected reboots is the watchdog
program being unexpectedly blocked by dcache_lock contention, due to kswapd
calling cond_resched_lock(&dcache_lock) from __shrink_dcache_sb() from
prune_dcache() from shrink_dcache_memory() from shrink_slab(). If my guess is
correct, this is a time bomb for systems with a huge amount of RAM and a huge
number of dentries.

----- comment #5 end -----

----- comment #9 start -----

If I remember correctly, the system is an active/standby cluster, and this
problem occurs on the standby machine, which is almost idle. On the standby
machine, "service drbd status" is issued by the heartbeat software twice per
second. /etc/init.d/drbd uses /bin/sort, and /bin/sort creates temporary
files. Since the standby machine is nearly idle, nothing but dentries consumes
the RAM. Therefore, by the time the uptime reached 120 days, the number of
unused dentries (left behind by the temporary files /bin/sort creates) had
exceeded 300,000,000.

Since the cause of these unused dentries turned out to be /etc/init.d/drbd, I
suggested that the customer create the temporary files on tmpfs (rather than
ext4) by doing

  # echo "export TMPDIR=/dev/shm" >> /etc/default/drbd

as a workaround. Also, I suggested that the customer use a serial console or
netconsole to catch kernel messages (if any) in case of further unexpected
reboots/hangups, since I currently suspect a watchdog-like timeout when kswapd
starts reclaiming memory.

----- comment #9 end -----
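For clarity, the watchdog loop I have in mind in comment #5 would look roughly
like the sketch below; the paths and the pid are placeholders, and the actual
watchdog program is not shown here. The point is that the read of
/proc/$pid/status can stall behind dcache_lock while kswapd is inside
shrink_slab(), so the keep-alive write to /dev/watchdog may arrive only after
the watchdog timeout has already expired.

---------- an example of watchdog-like loop start ----------
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
        char buffer[4096] = { };
        int wd = open("/dev/watchdog", O_WRONLY); /* hardware watchdog device */

        if (wd == -1)
                return 1;
        while (1) {
                /* process check; "1234" is a placeholder pid */
                int fd = open("/proc/1234/status", O_RDONLY);

                if (fd != -1) {
                        read(fd, buffer, sizeof(buffer) - 1);
                        close(fd);
                }
                sleep(1);           /* CPU saving */
                write(wd, "\0", 1); /* reset the watchdog timer */
        }
        return 0;
}
---------- an example of watchdog-like loop end ----------

If the open() or read() of /proc/$pid/status blocks for longer than the
watchdog timeout, the next write to /dev/watchdog comes too late and the
hardware reboots the machine even though nothing else is wrong.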