[RFC] dupfs semantics

Al Viro <viro@xxxxxxxxxxxxxxxxxx> · Mon, 3 Nov 2014 06:16:32 +0000

FWIW, now it can be done - with fairly limited changes we *can*
implement something similar to Plan 9 dupfs (#d) and *BSD fdescfs.  I.e.
a filesystem with one directory, with contents depending on which process
is looking there, files corresponding to opened descriptors of the process
in question.  So far it looks like our /proc/self/fd (or /dev/fd), but
there's one really important difference - open("/fd/0", 0) on Plan 9 doesn't
reopen your stdin, it is equivalent to dup(0).  In other words, you get
an extra reference to the corresponding file, not fresh open of the same
underlying object.

	In a lot of situations it makes for better semantics.  Example:
; cat >a.sh <<'EOF'
read i
diff -u "$i" /dev/stdin
EOF
; (echo a.sh; cat a.sh) > data
; cat data |sh a.sh
; <data sh a.sh

--- a.sh        2014-11-02 23:31:03.000000000 -0500
+++ /dev/stdin  2014-11-02 23:32:41.000000000 -0500
@@ -1,2 +1,3 @@
+a.sh
 read i 
 diff -u "$i" /dev/stdin
;

... and we have a different behaviour when fed from pipe and when redirected
from file.  Similar to that,
; cat >a.sh <<'EOF'
read i
grep "$i" -
EOF
; cat >b.sh <<'EOF'
read i
grep "$i" /dev/stdin
EOF
; cat >data <<'EOF'
a
a
b
EOF
; cat data | sh a.sh
a
; cat data | sh b.sh

So far, so good - "-" in grep arguments means stdin, so we could expect to get
the same behaviour.  Except that it breaks on redirects -
; sh a.sh <data
a
; sh b.sh <data
a
a
;

In other words, in a situation where you have a program that expects a filename
and you want to feed it to/from a preexisting descriptor, our semantics is bloody
inconvenient.  Worse, it simply fails when the descriptor in question happens to
be something like a socket, eventfd, etc. - regular files, devices, directories
and pipes work, everything else is SOL.  We *can't* reopen a socket - a lot
of logic in net/* assumes that there's only one struct file over a given socket.
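
The same difference can be seen without the shell.  What follows is only a
userspace sketch, assuming Linux with /proc mounted; the helper names are made
up for illustration.  It reads one line from a file, then compares the offset
seen through dup() with the offset seen through a fresh open of
/proc/self/fd/<n>, and finally tries to reopen a socket the same way.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <unistd.h>

/* Temp file holding "a\na\nb\n" with the first line already consumed. */
static int make_half_read_file(void)
{
	char path[] = "/tmp/dupfs-demo-XXXXXX";
	char buf[2];
	int fd = mkstemp(path);

	if (fd < 0)
		return -1;
	unlink(path);		/* the open descriptor keeps the file alive */
	if (write(fd, "a\na\nb\n", 6) != 6 ||
	    lseek(fd, 0, SEEK_SET) != 0 ||
	    read(fd, buf, 2) != 2) {
		close(fd);
		return -1;
	}
	return fd;
}

/* Offset as seen through dup(): same struct file, shared position. */
static long offset_via_dup(int fd)
{
	int d = dup(fd);
	long off;

	if (d < 0)
		return -1;
	off = (long)lseek(d, 0, SEEK_CUR);
	close(d);
	return off;
}

/* Offset as seen through a reopen: new struct file, position starts at 0. */
static long offset_via_reopen(int fd)
{
	char path[64];
	long off;
	int r;

	snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
	r = open(path, O_RDONLY);
	if (r < 0)
		return -1;
	off = (long)lseek(r, 0, SEEK_CUR);
	close(r);
	return off;
}

/* errno from reopening a socket via /proc/self/fd; 0 if the open succeeds. */
static int socket_reopen_errno(void)
{
	char path[64];
	int sv[2], r, err = 0;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
		return -1;
	snprintf(path, sizeof(path), "/proc/self/fd/%d", sv[0]);
	r = open(path, O_RDWR);
	if (r < 0)
		err = errno;
	else
		close(r);
	close(sv[0]);
	close(sv[1]);
	return err;
}
```

With the first line consumed, offset_via_dup() should report 2 and
offset_via_reopen() should report 0, which is why "sh b.sh <data" above sees
the file from the beginning.  The socket case should come back with ENXIO.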

The reason why we really couldn't do it with dup-style semantics was that
our ->open() takes struct file and returns 0 on success and -E<something>
on error.  There's no way to return a different file *and* we have too many
instances of ->open() to change the method's signature.

FWIW, FreeBSD got away with a horrible hack - they stash the descriptor
number in their equivalent of task_struct and pull off a rather brittle and
ugly trick to pick it up in their kern_openat().  The ->open() side is
        /*
         * XXX Kludge: set td->td_proc->p_dupfd to contain the value of the file
         * descriptor being sought for duplication. The error return ensures
         * that the vnode for this device will be released by vn_open. Open
         * will detect this special error and take the actions in dupfdopen.
         * Other callers of vn_open or VOP_OPEN will simply report the
         * error.
         */
        ap->a_td->td_dupfd = VTOFDESC(vp)->fd_fd;       /* XXX */
        return (ENODEV);
and the other end is
                /*
                 * Handle special fdopen() case. bleh.
                 *
                 * Don't do this for relative (capability) lookups; we don't
                 * understand exactly what would happen, and we don't think
                 * that it ever should.
                 */
                if (nd.ni_strictrelative == 0 &&
                    (error == ENODEV || error == ENXIO) &&
                    td->td_dupfd >= 0) {
                        error = dupfdopen(td, fdp, td->td_dupfd, flags, error,
                            &indx);
                        if (error == 0)
                                goto success;
                }
 
                goto bad;
Bleh, indeed...

I hadn't looked at Solaris source.  Plan 9 probably has it the easiest way -
their ->open() does, in our terms, take a pointer to struct file (Chan * for
them) and return another such pointer, normally the one it had been given.
If it decides to return a different one - no problem, just drop what you've
got, grab an extra reference to something else and return that.  End of
story.
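
To make the contrast concrete, here is a small userspace mock of the two
calling conventions.  None of it is kernel or Plan 9 code: struct file,
file_get(), file_put() and both open hooks are hypothetical stand-ins, just
enough to show why a hook returning int has no way to say "use that other file
instead", while a hook returning a pointer does.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct file (or Chan): only a refcount. */
struct file {
	int refcount;
};

static struct file *file_get(struct file *f)
{
	f->refcount++;
	return f;
}

static void file_put(struct file *f)
{
	f->refcount--;
}

/*
 * int-returning hook, in the style of our ->open(): it is handed the freshly
 * allocated file and can only report 0 or -E<something>.  Even if it knows
 * about 'existing', the caller will go on using 'fresh'.
 */
static int open_int_style(struct file *fresh, struct file *existing)
{
	(void)fresh;
	(void)existing;
	return 0;
}

/*
 * Pointer-returning hook, in the style described for Plan 9: normally return
 * what was given; to substitute, drop it and return an extra reference to
 * something else.
 */
static struct file *open_ptr_style(struct file *fresh, struct file *existing)
{
	if (!existing)
		return fresh;
	file_put(fresh);
	return file_get(existing);
}
```

In the substitution case the fresh file loses its reference and the existing
one gains one, which is exactly dup() bookkeeping.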

Now, changing our ->open() is obviously far too much churn.  Fortunately,
we have ->atomic_open() with only 8 instances in the entire tree, none
of them in drivers.  That can be changed without too much PITA.  There are
several possible calling conventions; my preference would be
	old	new
	0	file it has been given
	1	NULL
	-E...	ERR_PTR(-E...)
	-----	an extra reference to preexisting file
letting the caller deal with freeing the unused one in the last case, but
that's not particularly interesting - I'd go with whichever variant ends up
with the best code in callers (path_openat->do_last->lookup_open->atomic_open).
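
A rough userspace mock of how a caller might sort out those four cases.  This
is a sketch under assumptions, not the proposed patch: the ERR_PTR helpers
below are local imitations of the kernel ones, and struct file, enum outcome
and classify() are hypothetical names.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct file: only a refcount. */
struct file {
	int refcount;
};

/* Local imitations of the kernel's ERR_PTR helpers. */
#define MAX_ERRNO 4095
static void *ERR_PTR(long err) { return (void *)(intptr_t)err; }
static long PTR_ERR(const void *p) { return (long)(intptr_t)p; }
static int IS_ERR(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

enum outcome { OPENED_GIVEN, NOT_OPENED, FAILED, OPENED_OTHER };

/*
 * Classify what a pointer-returning ->atomic_open() handed back.
 * 'given' is the file the caller passed in, 'ret' is the return value.
 */
static enum outcome classify(struct file *given, struct file *ret,
			     struct file **to_use, int *err)
{
	*to_use = NULL;
	*err = 0;

	if (IS_ERR(ret)) {		/* old: -E... */
		*err = (int)PTR_ERR(ret);
		return FAILED;
	}
	if (!ret)			/* old: 1 */
		return NOT_OPENED;
	if (ret == given) {		/* old: 0 */
		*to_use = ret;
		return OPENED_GIVEN;
	}
	/* New case: the method already grabbed a reference to a preexisting
	 * file, so the caller only has to drop the one it no longer needs. */
	given->refcount--;
	*to_use = ret;
	return OPENED_OTHER;
}
```

Only the last branch is new work for the caller; the first three map one to
one onto the existing return values.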

Getting open() to hit ->atomic_open() is also pretty easy - just don't hash
those dentries and that's it.  Considering that different processes are
going to see different things in that directory, that's the only sane variant
anyway.  They won't live for long either - the normal way to pin a dentry down
for a long time is open(), and in this case open will *not* do that.  FWIW,
they can all share the same inode - it won't be accessed, anyway (we just
need to supply ->getattr(), which takes a dentry).

IOW, it's quite doable - I'm putting together a branch with minimal variant
of that thing and so far it shapes out reasonably well.  The interesting
part is what should be done in corner cases.

Everyone agrees that read/write access should be a subset of what the existing
file has been opened for; that much is obvious.  However, what about the
other bits?  Everyone appears to agree upon ignoring O_TRUNC.  FreeBSD ignores
O_APPEND as well (and Plan 9 doesn't have it at all); we might do the same,
or we might fail on mismatches.  I'd rather ignore it completely - if our
stdout is opened with O_APPEND | O_WRONLY, I would expect opening /fd/1
(or wherever it might be mounted) with O_WRONLY to succeed.
O_DIRECT is another one - should we ignore mismatches?
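
For the part everyone agrees on, the check could look roughly like the sketch
below.  It is userspace C with hypothetical names (acc_bits(),
dupfs_check_flags()), and -EACCES is an assumed error code.  Everything
outside O_ACCMODE, including O_TRUNC, O_APPEND and O_DIRECT, is simply ignored
here, which is one possible answer to the questions above rather than a
settled one.

```c
#include <errno.h>
#include <fcntl.h>

/* Turn an open(2) access mode into read (1) / write (2) bits. */
static unsigned int acc_bits(int flags)
{
	switch (flags & O_ACCMODE) {
	case O_RDONLY:
		return 1;
	case O_WRONLY:
		return 2;
	case O_RDWR:
		return 3;
	default:
		return 0;	/* not a plain read/write mode */
	}
}

/*
 * 0 if 'requested' asks for no more access than 'existing' already has,
 * -EACCES otherwise.  Only the access mode is compared.
 */
static int dupfs_check_flags(int existing, int requested)
{
	unsigned int have = acc_bits(existing);
	unsigned int want = acc_bits(requested);

	if (!want || (want & ~have))
		return -EACCES;
	return 0;
}
```

With this, stdout opened O_APPEND | O_WRONLY can be reopened with plain
O_WRONLY, while asking for O_RDWR on a read-only descriptor is refused.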

Another interesting question is what to do with chmod, etc. on those suckers.
Plan 9 EPERMs on that; FreeBSD in effect turns it into chmod of the target
file.   Note that stat() is *not* forwarded to the target file in either
of those, so chmod() hitting the target is inconsistent (and possibly risky
as well).

A minor twist is statfs() behaviour on that one - FreeBSD is putting rlimit
in f_files and the number of descriptors you could open until you run afoul
of rlimit into f_ffree.  Cute, but not too interesting, IMO...

A really interesting bit is ctl files on Plan 9 - /fd/<n>ctl there is a mix
of our readlink /proc/self/fd/<n> and /proc/self/fdinfo/<n>.  Fairly easy
to implement, the question is what should layout of their contents be...

Any permission checks ought to be skipped in the case where a preexisting file gets
returned by ->atomic_open(), IMO - all checks ought to be done in the method
itself (and in this case they are limited to "don't ask for more than it's
already opened for").

Comments?