On Thu, 5 Mar 2015 06:05:20 +0000 Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote: > On Thu, Mar 05, 2015 at 04:21:21PM +1100, NeilBrown wrote: > > Hi Al (and others), > > > > I wonder if you could look over this patchset. > > It allows RCU-walk to follow symlinks in many common cases, > > thus removing a surprising performance hit caused by using symlinks. > > > > The last could of patches make changes to XFS and NFS to support > > this but I haven't forwarded to the relevant lists yet. > > If/when the early code meets with approval I'll do that. > > > > The first patch almost certainly needs to be changed. I originally > > wrote this code when filesystems could see inside nameidata. > > It is now opaque so the simplest solution was to provide an > > accessor function. > > Maybe I should as a 'flags' arg to ->follow_link?? Or have > > ->follow_link and ->follow_link_rcu ?? > > What do you suggest? > > Umm... Some observations: > * now ->follow_link() can be called in RCU mode, which means > that it can race with fs shutdown; not a problem, except that now it > joins ->lookup() et.al. in "if some data structure is needed in RCU > case of that, make sure it's not destroyed without an RCU delay somewhere > between the entry into ->kill_sb() and destruction. So inodes and dentries and associated private data should already be safe. And s_fs_info can be used if it is freed by e.g. kfree_rcu (like autofs) but not if just kfree (like ext3). xfs_fs_put_super() directly frees the 'xfs_mount', which xfs_readlink accesses. I guess that needs to be fixed. > * highmem pages in symlinks: that BS shouldn't be allowed at > all. Just make sure that at least for those filesystems symlink inodes > get mapping_set_gfp_mask(&inode->i_data, GFP_KERNEL) and be done with that. page_getlink() already uses kmap(), implying that highmem pages are supported. All I'm doing is making sure that my page_getlink_rcu() doesn't fail horribly if the page is a highmem page. If a filesystem needs improved follow_link performance on a highmem machine, then setting the gfp_mask as you suggest is probably a good idea, but I don't really want to impose that on filesystems if I don't need to. And at present I don't. So I'd rather leave it to the filesystem maintainer, or someone who discovers a need. > * are you sure that security_inode_follow_link() is OK to call in > RCU mode? No. avc_has_perm() doesn't look RCU safe, even without auditing enabled. At the very least we'll need to pass a "lookup_rcu" flag in there. > * what warranties are you giving for the lifetime of strings > passed to nd_set_link()? Right now it's "should not be freed until the > matching ->put_link()"; what happens for RCU mode? The same.... For XFS, we kmalloc a buffer GFP_ATOMIC and copy into that. Then put_link() kfrees it. For filesystems with the symlink in the page cache, we get a reference to the page (which is a bit heavy-handed for RCU-walk, but much less so than the current code) and drop the reference in ->put_link. For filesystems with a short symlink in the inode, we just provide a pointer to that... How long can we expect that to be around? I cannot see any provision for keeping those inodes in memory while we follow the symlink... What am I missing? In any case, if there is a reference held on the inode for ref-walk, then presumably complete_walk() will take a reference on that same inode when dropping out of rcu-walk.... I hope. So I think the rules here are unchanged. > * really nasty one: creat(2) on a dangling symlink. What's to > preserve the last component if you get into that symlink in RCU mode? As above - we will have a counted reference to whatever holds the text of the symlink. > > TBH, I'm less than fond of passing nameidata to ->follow_link() at all, > flags or no flags. We could kill current->link_count and > current->total_link_count, replacing them with one void * current->nameidata > and taking counters into struct nameidata itself. Have places like e.g. > kern_path_locked() do > struct nameidata nd, *saved = set_nameidata(&nd); > ... > set_nameidata(saved); > with set_nameidata(p) doing this: > old = current->nameidata; > current->nameidata = p; > if (p) { > if (!old) { > p->link_count = 0; > p->total_link_count = 0; > } else { > p->link_count = old->link_count; > p->total_link_count = old->total_link_count; > } > } > return old; > > Then nd_set_link() et.al. would use current->nameidata instead of an > explicitly passed pointer and ->follow_link() instances wouldn't need > that opaque pointer passed to them at all. Sounds interesting - I might try it. Would ->follow_link() than get a 'flags' argument, or would "nd_is_rcu()" reference current->nameidata->flags ?? Thanks, NeilBrown
Attachment:
pgp2ciPTSYTne.pgp
Description: OpenPGP digital signature