Re: [PATCH REPOST] docs: Add struct file refcounting and SCM_RIGHTS mess info

Al Viro <viro@xxxxxxxxxxxxxxxxxx> · Fri, 12 Apr 2019 22:53:36 +0100

On Thu, Apr 11, 2019 at 02:23:08PM -0600, Jonathan Corbet wrote:
> +======================
> +Lifecycles and locking
> +======================
> +
> +This manual aspires to cover the lifecycles of VFS objects and the locking
> +that protects them.
> +
> +Reference counting for file structures
> +======================================
> +
> +(The following text derives from `this email from Al Viro
> +<https://lwn.net/ml/linux-fsdevel/20190207040058.GW2217@xxxxxxxxxxxxxxxxxx/>`_).
> +
> +The :c:type:`struct file` type represents an open file in the kernel.  Its

That wants a separate section in the beginning, with overview of notions for
which the word "file" gets misused ;-)  I have something, but it's too much
of a rant at the moment (mostly regarding the unknown linguistic genius
somewhere in the bowels of POSIX committees, responsible for the term
"file description" used along with and in contrast to "file descriptor").

> +lifetime is controlled by a simple reference count (f_count) in that
> +structure.  References are obtained with functions like fget(), fdget(),
> +and fget_raw(); they are returned with fput().

... and fdput().  BTW, are you sure that "returned" is the right verb here?
Would "released" or "dropped" be less overloaded?

The set is actually
	fget{,_raw}()		fput()
	fdget{,_raw}()		fdput()
	fdget_pos()		fdput_pos()
	get_file()		fput()
	get_file_rcu()		fput()		[not for general use]

Not sure if it's worth putting here, but _raw variants are the ones that
accept O_PATH files and fdget_pos() is "with XSI 2.9.7-inflicted exclusion".

> +.. FIXME we should have kerneldoc comments for those functions
> +

> +Descriptor tables
> +-----------------

That wants an overview of WTF those are and what are they for; again, fodder
for the overview section in the beginning...

> +execve() does almost the same thing: if the pre-exec descriptor table is
> +shared, it will switch to a new copy first.  In case of success the
> +reference to the original table is dropped, in case of failure we revert to
> +the original and drop the copy.  Note that handling of close-on-exec is
> +done in the *copy*; the original is unaffected, so failing in execve() does
> +not disrupt the descriptor table.

The last part is actually a bit misleading - close-on-exec is handled only after
the point of no return on execve(), so any execve() failure either happens
before we get to doing that, or the caller gets killed.

What that really gives is that descriptor table sharing is terminated by
execve(), even if it had been across the thread group boundary, and those
who used to share it are unaffected by our close-on-exec (everyone in the
same thread group will be simply killed by execve(); however, other processes
might be sharing the damn thing as well and _they_ will keep going - with
the original table).

> +The SCM_RIGHTS datagram option with Unix-domain sockets can be used to
> +transfer a file descriptor, and its associated struct file reference, to
> +the receiving process.

"transfer X and Y to Z" implies that both X and Y are transferred; in reality
only Y is...   I'm not sure how to put it in one sentence, though ;-/

Perhaps something along the lines of "to pass opened files between the
processes" might be better, with details explained after that?  Those who
understand what SCM_RIGHTS is will be fine either way, but I had a bad
experience explaining what SCM_RIGHTS was to people who'd never looked
into that thing before.  And most of the problems had been precisely
because of confusion re what gets passed around...  The variant I'd
found to work best is along the lines of

################
	Sender has an opened file and wants to give it to recepient.
Of course, telling recepient "the file I want you to use is
descriptor 42" is useless - _their_ descriptor 42 has nothing to do
with yours, to start with.  What's needed is
	* getting the opened file reference from sender's descriptor
table into recepients'
	* telling the recepient where in _their_ descriptor table
has it ended up.
	The way it's done is
1) sender specifies which of its opened files should be passed (by
their locations in sender's descriptor table, i.e. their file descriptors).
2) references to these opened files are attached to a packet and sent
over.
3) the packet is received; unused slots in recepient's descriptor table
are chosen (same as open(2), pipe(2), etc. would have done),
opened file references from the packet are inserted into those slots
and their numbers (i.e. file descriptors for recepient to use) are
given to recepient.  They are more than likely to be different
from whatever descriptors sender used, of course; it's the opened
files they refer to that got passed.
	That requires cooperation from recepient, of course - if
nothing else, they need to tell recvmsg(2) where to put that
array of descriptors.  Details are more than slightly clumsy;
look around in csmg_data(3) for sad boilerplate.
################

I think it's too verbose to use here, though, and I've no idea how to do it
in more concise form...

[snip]

> +Note that, in both cases, the leaks are caused by loops passing through

s/loops/loops in the graph of references/, maybe?

> +some SCM_RIGHTS datagrams that can never be received.  So locating those,
> +removing them from the queues they sit in and then discarding the suckers,
> +is enough to resolve the situation. Furthermore, in both cases the loop
> +passes through the unix_sock of something that got sent over in an
> +SCM_RIGHTS datagram.  So we can do the following:

I wonder if the subsequent discussion of the garbage collector guts (as
opposed to the need to have it + "the strategy is to find a set of undeliverable
SCM_RIGHTS datagrams and kick them out, breaking the unreachable loops")
would better off living separately...