On Thu, Apr 23, 2015 at 05:45:44PM +1000, NeilBrown wrote: > follow_link calls link_path_walk -> walk_component -> lookup_fast which sets > nd->seq. Is that not enough? I guess not when nd_jump_link is called. Is > that what I missed? No. Potential nd_jump_link() callers are just refusing to go there in lazy mode, end of story. That's not the problem; see below. > One thing that is clear to me is that I don't really understand all the > handling of 'seq' numbers, making me unable to comment usefully. > I'll try to go through the current code and you current patch with that issue > in mind and make sure I do understand it. Then I'll try to comment. OK, here's the basic theory behind the lazy pathwalk: * during the entire exercise we never drop rcu_read_lock, therefore any objects that have all references to them removed before RCU-delayed freeing (inodes, dentries, superblocks and vfsmounts are among such) that we might find in process won't be freed until after we are done. * invariant we are keeping: at some point between the beginning of walk and now the pathname traversed so far would lead to nd->path, with nd->seq and nd->inode equal to ->d_seq and ->d_inode of the dentry in question. * path_init() arranges for that to be true in the beginning of the walk. * symlinks aside, walk_component() preserves that. + normal name components go through lookup_fast(), where we have __d_lookup_rcu() find a child of current nd->path with matching name and (atomically) picks ->d_seq of that child, which had the name matching our component. Atomicity is provided by ->d_lock on child. Then we proceed to pick ->d_inode (and verify that ->d_seq has not changed, thus making sure that ->d_inode value at the moment when the name matched had been the same and child is still hashed. Then we verify that parent's ->d_seq has not changed, guaranteeing that parent hadn't been moved or unhashed from the moment we'd found it until after we'd found its child. Assuming nothing's mounted on top of that thing, and there's no problem with ->d_revalidate()), that's it - we have new nd->seq, nd->path and nd->inode satisfying our invariant for longer piece of pathname. + "." needs nothing - we just stay where we are + ".." is handled by follow_dotdot_rcu(), which in the normal case (no mountpoint crossing) picks ->d_parent of where we are, fetches its ->d_inode and ->d_seq and verifies that our directory still hadn't changed _its_ ->d_seq. The last part guarantees that it hadn't been moved since the time we'd found it, and thus its ->d_parent had remained unchanged at least until that verification. Therefore, it remained pinned all along, and it ->d_inode had remained stable, including the moment when we fetched ->d_seq. Which means that the value we had picked *and* its ->d_inode and ->d_seq would satisfy the invariant for the longer piece of pathname. + mountpoint crossing towards leaves is handled in __follow_mount_rcu(); it is simple (->mnt_root never changes and is always pinned, stabilizing its ->d_inode), but we might need to worry about automount points *and* need to make sure that we stop right there if mount_lock has been bumped. See commit b37199e6 for the details on the last part - basically, making sure that false negatives from __lookup_mnt() won't end up with hard error when we walk into whatever had been under the mountpoint we'd missed. + mountpoint crossing towards root (in follow_dotdot_rcu()) is similar, but there we obviously don't care about automounts. Looking at it now, it might make sense to recheck mount_lock there as well, though - potential danger is to hit the moment when mnt_parent and mnt_mountpoint are out of sync, leaving us with mismatched vfsmount,dentry pair. Generally, that will be caught when we try to leave lazy mode (legitimize_mnt() will fail) or to cross towards leaves, but the next crossing towards root might be bogus as well, and we could end up with unwarranted hard error. Should be very hard to hit, but it's easy enough to check *and* backport, so it looks like a good idea for -stable. Linus, do you have any objections against adding if (read_seqretry(&mount_lock, nd->m_seq)) goto failed; right after if (!follow_up_rcu(&nd->path)) break; in follow_dotdot_rcu()? It's a very narrow race with mount --move and most of the time it ends up being completely harmless, but it's possible to construct a case when we'll get a bogus hard error instead of falling back to non-lazy walk... OTOH, anyone doing that will get something inherently timing-dependent as the result, lazy mode or not. I'm in favour of adding that check, but IMO it's not a critical problem. * if we find that we can't continue in lazy mode because some verification fails, we just fail with -ECHILD. However, in cases when our current position might be fine, but the next step can't be done in lazy mode, we attempt to fall back to non-lazy without full restart that would be caused by -ECHILD. That's what unlazy_walk() is. Of course, if we reach the end of walk we need to leave the lazy mode as well (complete_walk()). Either can fail, and such a failure means restart from scratch in non-lazy mode. We need to grab references on vfsmount and dentry (or two dentries, when we have parent and child to deal with). The interesting part is vfsmount - we need to make sure we won't interfere with umount(2), having walked in that sucker *after* umount(2) has checked that it's not busy. See legitimize_mnt() for details; basically, we have umount(2) mark the "I've verified it's not busy and it's going to be killed no matter what" with MNT_SYNC_UMOUNT and rely on RCU delays on that path - if we find one of those, we undo the increment of its refcount we'd just done, without having dropped rcu_read_lock(). Full-blown mntput() is done only on mismatches that had _not_ been marked that way. BTW, something similar to the above probably needs to be turned into coherent text, either in parallel to Documentation/filesystems/path-lookup.txt, or as update to it. The reason why walk_component() in your call chain won't suffice is simple - it will fail this /* * This sequence count validates that the parent had no * changes while we did the lookup of the dentry above. * * The memory barrier in read_seqcount_begin of child is * enough, we can use __read_seqcount_retry here. */ if (__read_seqcount_retry(&parent->d_seq, nd->seq)) return -ECHILD; in lookup_fast(), just before updating nd->seq to new value. Basically, it has no way to tell if parent has been buggered to hell and back. It's not _that_ hard to prevent - we just need to stop discarding the parent's seq number in "need to follow a symlink" case of walk_component(). Will take some massage, but not too much... -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html