Adding Al Viro to the Cc list as I believe Steven Whitehouse and Al have
discussed something similar. Please feel free to chime in with your
thoughts, Al.

On 29/11/17 09:17, NeilBrown wrote:
> On Tue, Nov 28 2017, Mike Marion wrote:
>
>> On Tue, Nov 28, 2017 at 07:43:05AM +0800, Ian Kent wrote:
>>
>>> I think the situation is going to get worse before it gets better.
>>>
>>> On a recent Fedora and kernel, with a large map and heavy mount
>>> activity, I see:
>>>
>>> systemd, udisksd, gvfs-udisks2-volume-monitor, gvfsd-trash,
>>> gnome-settings-daemon, packagekitd and gnome-shell
>>>
>>> all go crazy consuming large amounts of CPU.
>>
>> Yep. I'm not even worried about the CPU usage as much (yet; I'm sure
>> it'll be more of a problem as time goes on). We have pretty huge
>> direct maps, and our initial startup tests on a new host with the
>> link vs. file took >6 hours. That's not a typo. We worked with SUSE
>> engineering to come up with a fix, which should've been pushed here
>> some time ago.
>>
>> Then there's shutdowns (and reboots). They also took a long time (on
>> the order of 20+ minutes) because they would walk the entire
>> /proc/mounts "unmounting" things. Also fixed now. That one had
>> something to do with SMP code, since with a single CPU/core it
>> didn't take long at all.
>>
>> We also just got a fix for the SUSE grub2-mkconfig script, fixing its
>> parsing when probing for the root device so that it skips over fstype
>> autofs (the probe_nfsroot_device function).
>>
>>> The symlink change was probably the start; a number of applications
>>> now go directly to the proc file system for this information.
>>>
>>> For large mount tables and many processes accessing the mount table
>>> (probably reading the whole thing, either periodically or on change
>>> notification) the current system does not scale well at all.
>>
>> We use ClearCase in some instances as well, and that's yet another
>> thing adding mounts, and its startup is very slow due to the size of
>> /proc/mounts.
>>
>> It's definitely something that's more than just autofs and probably
>> going to get worse, as you say.
>
> If we assume that applications are going to want to read
> /proc/self/mount* a lot, we probably need to make it faster.
> I performed a simple experiment where I mounted 1000 tmpfs
> filesystems, copied /proc/self/mountinfo to /tmp/mountinfo, then ran
> 4 for-loops in parallel, each catting one of these files to /dev/null
> 1000 times.
>
> On a single-CPU VM:
>   /tmp/mountinfo: each group of 1000 cats took about 3 seconds.
>   /proc/self/mountinfo: each group of 1000 cats took about 14 seconds.
> On a 4-CPU VM:
>   /tmp/mountinfo: 1.5 secs
>   /proc/self/mountinfo: 3.5 secs
>
> Using "perf record" it appears that most of the cost is repeated
> calls to prepend_path, with a small contribution from the fact that
> each read only returns 4K rather than the 128K that cat asks for.
>
> If we could hang a cache off struct mnt_namespace and use it instead
> of iterating the mount table - using rcu and ns->event to ensure
> currency - we should be able to minimize the cost of this increased
> use of /proc/self/mount*.
>
> I suspect that the best approach would be to implement a cache at
> the seq_file level.
>
> One possible problem might be if applications assume that a read
> will always return a whole number of lines (it currently does). To
> be sure we remain safe, we would only be able to use the cache for a
> read() syscall which reads the whole file.
>
> How big do people see /proc/self/mount* getting? What size reads
> does 'strace' show the various programs using to read it?
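One quick way to repeat that comparison, and to see the effect of the
read buffer size (which I come back to below), is a small program
along the lines of the following sketch. It is illustrative only, not
the test described above; the program name, default path, buffer size
and iteration count are all arbitrary:

/*
 * Illustrative only - not the exact test described above.  Re-read a
 * file repeatedly with a caller-chosen buffer size and report the
 * elapsed time, much as "cat file > /dev/null" in a loop does.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "/proc/self/mountinfo";
	size_t bufsz = argc > 2 ? (size_t)atoi(argv[2]) : 4096;
	int iters = argc > 3 ? atoi(argv[3]) : 1000;
	char *buf = malloc(bufsz);
	struct timespec t0, t1;

	if (!buf)
		return 1;
	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (int i = 0; i < iters; i++) {
		int fd = open(path, O_RDONLY);
		if (fd < 0)
			return 1;
		while (read(fd, buf, bufsz) > 0)
			;	/* discard, as cat > /dev/null does */
		close(fd);
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);
	printf("%d reads of %s with %zu-byte buffer: %.2f seconds\n",
	       iters, path, bufsz,
	       (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
	free(buf);
	return 0;
}

Running it against /proc/self/mountinfo and against a copy in /tmp,
with 4096 vs 131072 byte buffers, should show the same kind of gap as
the cat loops above.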
Buffer size almost always has a significant impact on I/O, so that's
likely a big factor, but the other aspect of this is notification of
changes.

The risk is that improving the I/O efficiency might just allow a
higher rate of processing of change notifications, with symptoms
similar to what we have now.

The suggestion is that a system which allows for incremental
(diff-type) update notification is needed for mount table propagation
to scale well. That implies some as-yet-undefined user <-> kernel
communication protocol; the sketch below shows the full-re-read
pattern it would need to replace.

Ian
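For reference, the pattern in question - what every watcher of the
mount table has to do today - is roughly the following. The poll(2)
behaviour of /proc/self/mounts is documented in proc(5) (pollable
since Linux 2.6.15, with changes reported as POLLERR/POLLPRI); the
rest of the sketch is illustrative:

/*
 * Sketch of a present-day mount-table watcher: block until the table
 * changes, then re-read and re-parse the entire thing.
 */
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	static char buf[128 * 1024];
	int fd = open("/proc/self/mounts", O_RDONLY);

	if (fd < 0)
		return 1;
	for (;;) {
		ssize_t n, total = 0;

		/* Re-read the whole table; a real consumer re-parses it. */
		lseek(fd, 0, SEEK_SET);
		while ((n = read(fd, buf, sizeof(buf))) > 0)
			total += n;
		printf("re-read %zd bytes of mount table\n", total);

		/* Block until the mount table changes again. */
		struct pollfd pfd = { .fd = fd, .events = POLLPRI };
		if (poll(&pfd, 1, -1) < 0)
			return 1;
		/* a change shows up as POLLERR|POLLPRI, not POLLIN */
	}
}

Every change event costs every watcher a full read and re-parse, so
the total work grows with watchers x table size x event rate. An
incremental interface would replace the re-read step with a bounded
read of only the entries that changed since the last event.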