On 3/21/16 8:24 PM, Ian Kent wrote: > On Mon, 2016-03-21 at 18:08 -0400, Jeff Mahoney wrote: >> On 3/20/16 11:44 PM, Ian Kent wrote: >>> On Fri, 2016-03-18 at 17:25 -0400, Jeff Mahoney wrote: >>>> Commit d3ce65e05d1 (autofs-5.0.3 - allow directory create on NFS >>>> root), >>>> introduced a bug where contained_in_local_fs would *always* return >>>> true. >>>> >>>> Since the goal of contained_in_local_fs is only to determine if >>>> the >>>> path would be created on a local file system, it's enough to check >>>> a >>>> few things directly without parsing the mount table at all. >>>> >>>> We can check via stat calls: >>>> 1) If parent is the root directory >>>> 2) If the parent is located on the root file system >>>> 3) If the parent is located within a file system mounted >>>> via a block device >>>> >>>> Large numbers of direct mounts when using /proc/self/mounts >>>> instead >>>> of /etc/mtab result in very slow startup time. In testing, we >>>> observed >>>> ~8k mounts taking over 6 hours to complete. >>> >>> When this was originally done the mtab was (usually) an actual file >>> so >>> the autofs mounts weren't present. So mostly it was fairly small. >>> >>> But now, with the symlinking of the mtab to a proc filesystem, this >>> is a >>> big problem. >>> >>>> >>>> This is due to reading /proc/self/mounts for every path component >>>> of >>>> every mount. For my test case, that turns out to be about 119k >>>> times >>>> for a total of just under 400 GB read. If it were a flat file, it >>>> would be also be slow, but /proc/self/mounts is dynamically >>>> generated >>>> every time it's read. >>>> >>>> By avoiding reading /proc/self/mounts many times, startup time >>>> with 8k >>>> direct mounts drops from ~6 hours to about 25 seconds. >>> >>> Yeah, I get that. >>> >>> In some tests done before the symlinking became the norm I saw >>> direct >>> mount tables of over 20k entries taking between 20-30 seconds or a >>> bit >>> more (can't quite remember now, it was a long time ago) but nothing >>> like >>> this. >> >> Yep, I figured you'd remember what was up since it's in the README, >> but >> the context is mostly for a complete commit log entry. >> >>> I thought that was bad so this really needs to be fixed. >>> >>> I am puzzled why I haven't seen other bug reports. >> >> Yeah, I usually see all of the autofs bugs reported against our >> products >> and it's the first time I've seen it too. >> >>>> >>>> Signed-off-by: Jeff Mahoney <jeffm@xxxxxxxx> >>>> --- >>>> daemon/automount.c | 47 >>>> ++++++++++++++++++++++++++++++++++++++++++++--- >>>> include/mounts.h | 1 - >>>> lib/mounts.c | 45 --------------------------------------- >>>> ---- >>>> -- >>>> 3 files changed, 44 insertions(+), 49 deletions(-) >>>> >>>> --- a/daemon/automount.c >>>> +++ b/daemon/automount.c >>>> @@ -98,6 +98,46 @@ static int umount_all(struct autofs_poin >>>> >>>> extern struct master *master_list; >>>> >>>> +#define SYSFS_PATTERN "/sys/dev/block/%d:%d" >>>> +static int contained_in_local_fs(const char *parent) >>>> +{ >>>> + char buf[128]; >>>> + struct stat st, root; >>>> + int ret; >>>> + >>>> + /* The root directory is always considered local, even if >>>> it's not. */ >>>> + if (!*parent) >>>> + return 1; >>>> + >>>> + ret = stat("/", &root); >>>> + if (ret != 0) { >>>> + logerr("error: stat failed on /: %s", >>>> strerror(errno)); >>>> + return 0; >>>> + } >>>> + >>>> + ret = stat(parent, &st); >>>> + if (ret) >>>> + return 0; >>>> + /* >>>> + * If the parent is on the root file system, it's local. >>>> + */ >>>> + if (root.st_dev == st.st_dev) >>>> + return 1; >>> >>> I'm struggling to remember now. >>> >>> I think the problem was only when the root itself was an NFS file >>> system >>> and automount was trying to create a directory. >> >> That sounds like the problem that the patch that caused it to always >> return true was supposed to solve. >> >> Git tells me that contained_in_local_fs matches up with: >> commit 8293b030944eb402893a008a7d2ed1f8969ebee8 >> Author: Ian Kent <raven@xxxxxxxxxxxxxxxx> >> Date: Wed Oct 25 15:51:13 2006 +0800 >> >> - deal with changed semantics of mkdir in 2.6.19. >> >> Does that ring a bell? Google associates it with a patch called >> "don't >> create remote dirs" which makes sense on its own but I can't find >> anything in the kernel changelog between v2.6.18 and v2.6.19 that >> stands >> out as changed mkdir semantics that would necessitate the work being >> done here. > > There was a log going on at that time, it was around the initial v5 > development. > > I had started cataloguing the patches for each release at that time but > by the look of the path order there's nothing else related that went in > with it. Not only that, at that time, my change log entries were either > non-existent or less than stellar. > > You can see the patch series for each release at: > https://www.kernel.org/pub/linux/daemons/autofs/v5/ > > I think this all arose from a discussion on the NFS mailing list due to > problems when trying to create mount point directories that resulted in > something like "automount shouldn't be creating directories on NFS file > systems". So the changed semantics were probably wrt. NFS and not the > VFS. > > But the same logic extends to remote file systems in general. > >> >> At the point where that commit was added, the test for an autofs >> parent >> wasn't in the caller. That searched through the mount table to see if >> there was a match and if it was autofs. The statfs call was >> originally >> in the loop and was rightly moved out of it since statfs() doesn't >> actually require that its path argument be a mountpoint. So that just >> makes the loop test whether it's a block device. Sort of, since it >> uses >> /[^/] as a pattern to mean block device. Later that was extended to >> include UUID/LABEL and also with the patch that was intended to match >> NFS roots. > > There's not really enough context, we'll just need to resolve it based > on what's sensible. > >> >> So, I suppose the question is this: if that patch was accepted based >> on >> its intent, it should be ok to write to a remote file system if it's >> the >> root file system, right? The root file system should be assumed to be >> private, but once we follow a path off of the root file system we >> shouldn't write anymore. > > The mount table scan has to go, whatever we come up with is going to get > committed with the next bunch of patches. > >> >>>> + >>>> + /* >>>> + * If the sysfs node for the block device exists, it's >>>> local. >>>> + */ >>>> + ret = snprintf(buf, sizeof(buf), SYSFS_PATTERN, >>>> + major(st.st_dev), minor(st.st_dev)); >>>> + if (ret != 0) { >>>> + logerr("error: not enough space for sysfs path"); >>>> + return 0; >>>> + } >>> >>> This is the only bit that concerns me. >>> >>> Not because I think it's wrong, on the contrary, it looks ok. >>> >>> I just can't think of cases where it might not cover what's needed. >> >> I realized shortly after posting the patch that anything using >> anonymous >> device nodes would fail the lookup test since they don't register a >> node >> in /sys/dev/block (nor should they.) This includes btrfs since it >> uses >> anonymous devices to account for the fact that virtualizes the file >> system across multiple devices and also needs separate namespace for >> inode numbers between snapshots. Ubifs also uses an anonymous device, >> but I've never actually seen an ubifs file system in the wild. > > Or perhaps pass down the statfs buffer and check for remote fs magic > numbers. That's got to be a smaller list and should be much less likely > to change. > >> Overlayfs will use an anonymous device for directories since they're >> also virtualized. That's a separate issue, though, since it reports >> the >> device as 'none' in the mount table so direct mounts don't work on >> overlayfs yet anyway. >> >>> After all, it's already been checked if the parent is an autofs >>> filesystem which should take care of the bulk of cases except for >>> longer >>> direct mount paths, but will often not be the case for offset >>> mounts. >>> >>> For example a mount entry with offset mounts like: >>> >>> <direct or indirect key> / server:/path \ >>> /other server:/otherpath \ >>> ... >>> >>> But the intermediate path components of a direct mount path (and >>> offset >>> mount path) must already exist if within a remote mount. So the >>> existenc >>> e case, checked before this, should cover these cases. >>> >>> Let me think about it a little longer. >> >> Sure. It seems like the solution, whatever it ends up being, will >> simplify this code a lot. It seems like it boils down to "don't >> create >> directories on a remote file system unless it's the root file system." >> >> As noted above, my patch is incomplete. I think a complete solution >> would look like this: >> >> 1/ Test whether the parent is autofs. >> 2/ Test whether the parent resides on a type of file system known to >> be >> local. This is mostly because we'll need to at least test for btrfs, >> and if we're already testing file system type, we might as well test >> for >> the common local file systems as well. >> 3/ Test whether the parent resides on the root file system. >> 4/ Test whether the file system is mounted on a block device using the >> /sys/dev/block lookup. This is mostly a fallback for the test in (2) >> being imperfect WRT new file systems. >> >> Is this missing any other cases? > > The only bit that looks tricky is the test for a block device, the rest > covers many of the cases. > > And the point is to prohibit mount point directory creation on remote > file systems, so if we get that far then just checking the magic number > of of the statfs struct we already have for known remote file systems > should be enough and be a fairly short list. Ok, that's an easy enough change to code up. I'll post an updated patch. -Jeff -- Jeff Mahoney SUSE Labs
Attachment:
signature.asc
Description: OpenPGP digital signature