Re: Orangefs ABI documentation

Al Viro <viro@xxxxxxxxxxxxxxxxxx> · Fri, 22 Jan 2016 07:11:47 +0000

On Fri, Jan 15, 2016 at 04:46:09PM -0500, Mike Marshall wrote:
> Al...
> 
> We have been working on Orangefs ABI (protocol between
> userspace and kernel module) documentation, and improving
> the code along the way.
> 
> We wish you would look at the documentation progress, and
> look at the way that .write_iter is implemented now... plus
> the other improvements. We know that every problem you
> pointed out hasn't been worked on yet, but your feedback
> would let us know if we're headed in the right direction.
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux.git
> for-next
> 
> The improved (but not yet officially released) userspace
> function that does the writev to the Orangefs pseudo device is in:
> 
> http://www.orangefs.org/svn/orangefs/branches/trunk.kernel.update/
> src/io/dev/pint-dev.c
> 
> You may decide we don't deserve an ACK yet, but since we are in
> the merge window I hope you can look at it...

write_iter got much better, but there are serious problems with locking.
For example, just what will happen if orangefs_devreq_read() picks an op
for copying to daemon, moves it to hash, unlocks everything and starts
copy_to_user() at the time when submitter of that op (sleeping
in wait_for_matching_downcall()) gets a signal and buggers off?  It calls
orangefs_clean_up_interrupted_operation(), removes the sucker from hash
and proceed to free the damn thing.  Right under ongoing copy_to_user().
Better yet, should that copy_to_user() fail, we proceed to move an already
freed op back to request list.  And no, get_op()/put_op() pair won't be
enough - submitters just go ahead an call op_release(), freeing the op,
refcount be damned.  They'd need to switch to put_op() for that to work.
Note that the same applies for existing use of get_op/put_op mechanism -
if that signal comes just as the orangefs_devreq_write_iter() has picked
the op from hash, we'll end up freeing the sucker right under copy_from_iter()
despite get_op().

What's more, a failure of that copy_to_user() leaves us in a nasty situation -
how do we tell whether to put the request back to the list or just quietly
drop it?  Check the refcount?  But what if the submitter (believing, and
for a good reason, that it had removed the sucker from hash) hadn't gotten
around to dropping its reference?

Another piece of fun assumes malicious daemon; sure, it's already a FUBAR
situation, but userland shouldn't be able to corrupt the kernel memory.
Look: daemon spawns a couple of threads.  One of those issues read() on
your character device, with just 16 bytes of destination mmapped and filled
with zeroes.  Another spins until it sees ->tag becoming non-zero and
immediately does writev() now that it has the tag value.  What we get is
submitter seeing op serviced and proceeding to free it while ->read() hits
the copy_to_user() failure and reinserts the already freed op into request
list.

I'm not sure we have any business inserting the op to hash until we are
done with copy_to_user() and know we won't fail.  Would simplify the
failure recovery as well...

I think we ought to flag the op when submitter decides to give up on it
and have the return-to-request-list logics check that (under op->lock),
returning it to the list only in case it's not flagged *and* decrementing
refcount before dropping op->lock in that case - we are guaranteed that
this is not the last reference.  In case it's flagged, we drop ->op_lock,
do put_op() and _not_ refile the sucker to request list.

I'm not sure if this
                /* NOTE: for I/O operations we handle releasing the op
                 * object except in the case of timeout.  the reason we
                 * can't free the op in timeout cases is that the op
                 * service logic in the vfs retries operations using
                 * the same op ptr, thus it can't be freed.
                 */
                if (!timed_out)
                        op_release(op);
is safe, while we are at it...  I suspect we'd be better off if we did
put_op() there.  Unconditionally.  After the entire
	if (op->downcall.type == ORANGEFS_VFS_OP_FILE_IO)
thing, so that the total change of refcount is zero in all cases.  At least
that makes for regular lifetime rules (with s/op_release/put_op/ in submitters)

	The thing I really don't understand is the situation with
wait_for_cancellation_downcall().  The comments in there flat-out
contradict the code - if called with signal already pending (as the
comment would seem to indicate), we might easily leave and free
the damn op without it ever being seen by daemon.  What's intended there?
And what happens if we reach schedule_timeout() in there (i.e. if we get
there without a signal pending) and operation completes successfully?
AFAICS, you'll get -ETIMEDOUT, not that the caller even looked at that
return value...  What's going on there?

	Another unpleasantness is that your locking hierarchy leads to
to races in orangefs_clean_up_interrupted_operation(); you take op->lock,
look at the state, decide what needs to be locked to remove the sucker from,
drop op->lock, lock the relevant data structure and proceed to search op in
it, removing it in case of success.  Fun, but what happens if it gets refiled
as soon as you drop op->lock?

	I suspect that we ought to make refiling conditional on the same
"submitter has given up on that one" flag *and* move hash lock outside of
op->lock.  IOW, once that flag had been set, only submitter can do anything
with op->list.  Then orangefs_clean_up_interrupted_operation() could simply
make sure that flag had been set, drop op->lock, grab the relevant spinlock
and do list_del().  And to hell with "search and remove if it's there"
complications. 

	One more problem is with your open_access_count; it should be
tristate, not boolean.  As it is, another open() of your character
device is possible as soon as ->release() has decremented open_access_count.
IOW, purge_inprogress_ops() and purge_waiting_ops() can happen when we
already have reopened the sucker.  It should have three states - "not opened",
"opened" and "shutting down", the last one both preventing open() and
being treated as "not in service".

	Input validation got a lot saner (and documentation is quite welcome).
One place where I see a problem is listxattr -
                if (total + new_op->downcall.resp.listxattr.lengths[i] > size)
                        goto done;
is *not* enough; you need to check that it's not going backwards (i.e. isn't
less than total).  total needs to be size_t, BTW - ssize_t is wrong here.

	Another issue is that you are doing register_chrdev() too early -
it should be just before register_filesystem(); definitely without any
failure exits in between.

	Debugfs stuff is FUBAR - it's sufficiently isolated, so I hadn't
spent much time on it, but for example debugfs_create_file() in
orangefs_debugfs_init() fails, we'll call orangefs_debugfs_cleanup()
right there.  On module exit it will be called again, with
debugfs_remove_recursive(debug_dir) called *again*.  At the very least it
should clear debug_dir in orangefs_debugfs_cleanup().  Worse,
orangefs_kernel_debug_init() will try to create an object in freed
directory...

	I've dumped some of the massage into vfs.git#orangefs-untested, but
the locking and lifetime changes are not there yet; I'll add more of that
tomorrow.  I would really appreciate details on wait_for_cancellation_downcall;
that's the worst gap I have in understanding that code.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html