Andrew Vagin <avagin@xxxxxxxxx> writes:

> On Mon, Aug 03, 2015 at 04:30:54PM -0500, Eric W. Biederman wrote:
>>
>> Add a new field mnt_escape_count in nameidata, initialize it to 0 and
>> cache the value of read_mnt_escape_count in nd->mnt_escape_count.
>>
>> This allows a single check in path_connected in the common case where
>> either the mount has had no escapes (mnt_escape_count == 0) or there
>> has been an escape and it has been validated that the current path
>> does not escape.
>>
>> To keep the cache valid nd->mnt_escape_count must be set to 0 whenever
>> the nd->path.mnt changes or when nd->path.dentry changes such that
>> the connectedness of the previous value of nd->path.dentry does
>> not imply the connectedness of the new value of nd->path.dentry.
>>
>> Various locations in fs/namei.c are updated to set
>> nd->mnt_escape_count to 0 as necessary.
>>
>> Signed-off-by: "Eric W. Biederman" <ebiederm@xxxxxxxxxxxx>
>> ---
>>  fs/namei.c | 26 +++++++++++++++++++++++---
>>  1 file changed, 23 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/namei.c b/fs/namei.c
>> index bccd3810ff60..79a5dca073f5 100644
>> --- a/fs/namei.c
>> +++ b/fs/namei.c
>> @@ -514,6 +514,7 @@ struct nameidata {
>>  	struct nameidata *saved;
>>  	unsigned root_seq;
>>  	int dfd;
>> +	unsigned mnt_escape_count;
>>  };
>>
>>  static void set_nameidata(struct nameidata *p, int dfd, struct filename *name)
>> @@ -572,12 +573,13 @@ static bool path_connected(struct nameidata *nd)
>>  	struct vfsmount *mnt = nd->path.mnt;
>>  	unsigned escape_count = read_mnt_escape_count(mnt);
>>
>> -	if (likely(escape_count == 0))
>> +	if (likely(escape_count == nd->mnt_escape_count))
>>  		return true;
>
> The size of mnt_escape_count is only 4 bytes. It looks like it is
> possible to perform UINT_MAX / 2 operations in a reasonable time and
> get the same value of mnt_escape_count; path_connected() will return
> true, but the path may already be detached. What do you think about
> this?

It is an interesting question.

For the locking on rename we have something that looks like:

	mutex_lock(&...s_vfs_mutex);
	mutex_lock(&p2->d_inode->i_mutex);
	mutex_lock(&p1->d_inode->i_mutex);
	read_seqlock_excl(&mount_lock);
	escape_count++;
	write_seqlock(&rename_lock);
	write_seqcount_begin(&dentry->d_seq);
	write_seqcount_begin_nested(&target->d_seq);
	write_seqcount_end_nested(&target->d_seq);
	write_seqcount_end(&dentry->d_seq);
	write_sequnlock(&rename_lock);
	escape_count++;
	read_sequnlock_excl(&mount_lock);
	mutex_unlock(&p1->d_inode->i_mutex);
	mutex_unlock(&p2->d_inode->i_mutex);
	mutex_unlock(&...s_vfs_mutex);

That is at least 16 serialized cpu operations per rename, so reaching
overflow would take at least 16 * 2**32 cpu operations.  On a 4GHz cpu,
16 * 2**32 operations would take roughly 16 seconds.  In practice I
think it would take noticeably more than 16 seconds to perform that
many renames, as there is a lot more going on than just the locking I
singled out.

A pathname lookup taking 16 seconds seems absurd, but perhaps it is
possible in the worst case.  The maximum length of a path that can be
passed into path_lookup is 4096 bytes.  For a lookup to be problematic
there must be at least as many instances of ".." as there are of any
other path component, so each pair of a minimum length path element and
a ".." element ("a/../") takes at least 5 bytes, which in 4096 bytes
leaves room for about 819 such pairs.  If every one of those 819 pairs
triggered a disk seek, then at 100 seeks per second a path name lookup
could potentially take 8 seconds.  MAXSYMLINKS is 40.
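Spelling that back of the envelope arithmetic out as a trivial
userspace program, so the numbers are easy to check (every constant
below is one of the rough assumptions above, nothing here is measured):

#include <stdio.h>

int main(void)
{
	/* Rough assumptions taken from the estimates above. */
	const double ops_per_rename = 16;           /* serialized cpu operations per rename */
	const double renames        = 4294967296.0; /* 2**32, enough to wrap a 32bit count */
	const double clock_hz       = 4e9;          /* 4GHz cpu, one operation per cycle */
	const double path_max       = 4096;         /* longest path accepted by path_lookup */
	const double pair_bytes     = 5;            /* "a/../" - one short name plus one ".." */
	const double seeks_per_sec  = 100;          /* spinning rust */
	const int    maxsymlinks    = 40;           /* MAXSYMLINKS */

	double wrap_secs   = ops_per_rename * renames / clock_hz; /* the "roughly 16 seconds" */
	double pairs       = path_max / pair_bytes;                /* ~819 */
	double lookup_secs = pairs / seeks_per_sec;                /* ~8 seconds */

	printf("time to wrap the count with renames: %.1f seconds\n", wrap_secs);
	printf("\"..\" pairs in one 4096 byte path:    %.0f\n", pairs);
	printf("worst case seek bound lookup:         %.1f seconds\n", lookup_secs);
	printf("worst case with %d symlinks:          %.0f seconds\n",
	       maxsymlinks, maxsymlinks * lookup_secs);
	return 0;
}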
So with a truly pathological set of file names you could perhaps make
that 40 * 8 = 320 seconds.  Ick.

To get some real figures I performed a few renames on my 2.5GHz laptop
with ext4 on an SSD.  Performing a simple rename back and forth in the
same directory (which involves much less than is required to rename a
file out of a mount point; a rough sketch of the sort of loop I mean is
appended after my signature) I get the following timings:

	renames		time
	200,000		1.2s
	2,000,000	17.2s
	20,000,000	205.6s
	40,000,000	418.5s

At that kind of speed I would expect 4,000,000,000 renames to take
41850 seconds, or 11.625 hours.

Going the other way, on an old system with spinning rust for a hard
drive I created 1000 nested directories all named "a" and timed how
long it would take to stat a pathological pathname.  With a simple
pathname, without symlinks involved, the worst case I have been able to
trigger is a 0.3 second path name lookup time.  Not being satisfied
with that, I managed to create a file about 83,968 directories deep,
with a set of 40 symlinks set up to get me there in one stat call, the
first symlink being 8192 directories deep.  The worst case time I have
been able to measure from stat -L on the symlink that leads to that
file is 14.5 seconds.

If my numbers are at all representative I don't see a realistic
possibility of being able to perform enough renames to roll over a
32bit mnt_escape_count during a path name lookup.  Even my most
optimistic estimate required 16 seconds to perform the renames, and I
have not been able to get any pathname lookup to run that long.

At the same time I do think it is probably worth making the escape
count an unsigned long, because that is practically free and it
removes any theoretical concern on 64bit.

Am I missing anything?

Eric
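P.S. This is not the exact program I ran, just a minimal sketch of the
sort of same directory rename loop behind the timings above (the file
names, the default iteration count, and the use of clock_gettime are
arbitrary choices):

/*
 * Minimal sketch of a same directory rename benchmark: rename one file
 * back and forth and report the elapsed wall clock time.
 *
 *   cc -O2 -o rename-bench rename-bench.c
 *   ./rename-bench 1000000
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	long i, iterations = (argc > 1) ? atol(argv[1]) : 100000;
	struct timespec start, end;
	int fd;

	/* Create the file that will be renamed back and forth. */
	fd = open("a", O_CREAT | O_WRONLY, 0600);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	close(fd);

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (i = 0; i < iterations; i++) {
		if (rename("a", "b") || rename("b", "a")) {
			perror("rename");
			return 1;
		}
	}
	clock_gettime(CLOCK_MONOTONIC, &end);

	/* Each iteration performs two renames. */
	printf("%ld renames: %.1fs\n", 2 * iterations,
	       (end.tv_sec - start.tv_sec) +
	       (end.tv_nsec - start.tv_nsec) / 1e9);
	return 0;
}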