Re: orangefs: NAK until the ABI is documented (was Re: updated orangefs tree at kernel.org)

Al Viro <viro@xxxxxxxxxxxxxxxxxx> · Mon, 12 Oct 2015 20:16:38 +0100

On Mon, Oct 12, 2015 at 12:20:42PM -0400, Mike Marshall wrote:
>  > ... the ABI is ... utterly... "idiosyncratic"...
> 
> Al, do you mean idiosyncratic in a "distinctive and special" way,
> or a "strange and bizarre" way? <g>

The latter, but much stronger ;-)

> When I first started working with the kernel module, I grew
> to think of the service_operation as the event horizon to
> a black hole, but I figured that was just me...

*snort*

Anything you get back is, by definition, from before the event horizon...

> First off, Al, I want to thank you for spending so much
> time on the review. You have pointed out numerous ways
> to make the kernel module better. Most importantly, though,
> I believe you are saying (and I'm purposely restating
> your words here to make sure we're on the same page) that there
> will be no ack until we provide a usable detailed definition of
> what is in every upcall and downcall and ioctl between
> the kernel module and the userspace daemon. Not only do we need,
> for example, to specify how LOOKUP_FOLLOW needs to be provided
> in a lookup request, but it would be nice to also say why.

LOOKUP_FOLLOW is a separate story - the issue here is that you are
exposing internal details of the pathname resolution to the userland
and that's essentially a demand to keep them stable from now on; however,
I'm seriously at loss as to _what_ to keep stable.  "This call of ->lookup()
gets LOOKUP_FOLLOW in flags" doesn't map to any natural semantics I can
think of; it might accidentally happen to match something useful, but I
don't see what it is.

Your userland code does something rather odd with it, but it's buried inside
your state machine and I don't understand it well enough to tell what's going
on.  Is it trying to follow symlinks on its own upon lookup?  If so, it's
_very_ bogus.  If nothing else, userland code doesn't have enough information
to do that - it has no idea what the originating process is chrooted to,
for starters.  As the result, having your filesystem mounted inside a chroot
jail would open a nice way for non-priveleged process to break out - just
create an absolute symlink somewhere on that filesystem and go through it.
Again, I'm not sure what that userland code is doing, so it (hopefully)
might be something completely different.

So yes, I really do want details on this one, but for a different reason.

What I want from API documentation is something along the lines of "writev()
on /dev/pvfs2-req is used to pass responses to the requests made by the
kernel side; the data structure being passed is represented in the following
way <...>; representations do not statisfying constraints <...> are to be
rejected".

What you do with the payload _after_ it's been transferred is a separate
story; it's also nice to have, but right now we don't even have the calling
conventions, nevermind the description of behaviour.

How is one supposed to review the marshalling code?  Sure, I can tell that
if you have the iov[4].iov_len greater than .trailer_size, you end up with
uninitialized tail of vmalloc'ed buffer.  I can also see that your userland
code doesn't ever pass a positive .trailer_size with iov[4].iov_len different
from it.  What's the proper fix?  To declare that they must always be equal,
and reject mismatches, as you already do for inequality in other direction?
Or should we quietly zero-pad the buffer?  Either will work for your current
userland code, but only you can tell which is the right fix - I don't know
what the older versions used to do and I don't know what changes are planned.

And yes, I understand the desire to separate a potentially large payload
kernel-side; is there any deep reason why one doesn't simply use .trailer_size
to decide how much in the end of the area being transferred should go there?

Are there any plans for having the first 4 segments shorter than the full
16 + sizeof(struct pvfs2_downcall_s)?  If so, what's the minimal amount to
be expected in all cases?  I'm not nitpicking here - having e.g. short
packets ending at (non-zero) .status would work with the current kernel
code and isn't inconceivable as a planned change.

For that matter, what's .type for?  It looks like a selector for that
union you have in the end, but it is not used that way - originating
upcall is used for that instead.  Is it just a junk, or is it a currently
unimplemented sanity check (if .type in downcall must be the same as in
matching upcall)?

> You're not promising an ack in return, but we can't go further
> until we provide the definition.

*shrug*

Sure, I can try and pull it out of you guys bit by bit instead (see above
for some initial questions), but I suspect that it would be faster and less
painful for everyone involved if you just describe the calling conventions
and we'll see what might be missing.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html