Re: orangefs: NAK until the ABI is documented (was Re: updated orangefs tree at kernel.org)

Mike Marshall <hubcap@xxxxxxxxxxxx> · Tue, 13 Oct 2015 13:46:54 -0400

 > *shrug*

Al, I guess what I wrote sounded pi**y, but I promise my attitude
is 180 degrees away from that... I'm just trying to focus on what is
most important... you've identified numerous issues, all of which
need to be addressed, but the state of the API used by userspace
is a show stopper: it needs documented, and then probably improved...

When we have something coherent started, we'll send it your way
so you can help make sure we're on the right track...

Thanks!

-Mike

On Mon, Oct 12, 2015 at 3:16 PM, Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote:
> On Mon, Oct 12, 2015 at 12:20:42PM -0400, Mike Marshall wrote:
>>  > ... the ABI is ... utterly... "idiosyncratic"...
>>
>> Al, do you mean idiosyncratic in a "distinctive and special" way,
>> or a "strange and bizarre" way? <g>
>
> The latter, but much stronger ;-)
>
>> When I first started working with the kernel module, I grew
>> to think of the service_operation as the event horizon to
>> a black hole, but I figured that was just me...
>
> *snort*
>
> Anything you get back is, by definition, from before the event horizon...
>
>> First off, Al, I want to thank you for spending so much
>> time on the review. You have pointed out numerous ways
>> to make the kernel module better. Most importantly, though,
>> I believe you are saying (and I'm purposely restating
>> your words here to make sure we're on the same page) that there
>> will be no ack until we provide a usable detailed definition of
>> what is in every upcall and downcall and ioctl between
>> the kernel module and the userspace daemon. Not only do we need,
>> for example, to specify how LOOKUP_FOLLOW needs to be provided
>> in a lookup request, but it would be nice to also say why.
>
> LOOKUP_FOLLOW is a separate story - the issue here is that you are
> exposing internal details of the pathname resolution to the userland
> and that's essentially a demand to keep them stable from now on; however,
> I'm seriously at loss as to _what_ to keep stable.  "This call of ->lookup()
> gets LOOKUP_FOLLOW in flags" doesn't map to any natural semantics I can
> think of; it might accidentally happen to match something useful, but I
> don't see what it is.
>
> Your userland code does something rather odd with it, but it's buried inside
> your state machine and I don't understand it well enough to tell what's going
> on.  Is it trying to follow symlinks on its own upon lookup?  If so, it's
> _very_ bogus.  If nothing else, userland code doesn't have enough information
> to do that - it has no idea what the originating process is chrooted to,
> for starters.  As the result, having your filesystem mounted inside a chroot
> jail would open a nice way for non-priveleged process to break out - just
> create an absolute symlink somewhere on that filesystem and go through it.
> Again, I'm not sure what that userland code is doing, so it (hopefully)
> might be something completely different.
>
> So yes, I really do want details on this one, but for a different reason.
>
> What I want from API documentation is something along the lines of "writev()
> on /dev/pvfs2-req is used to pass responses to the requests made by the
> kernel side; the data structure being passed is represented in the following
> way <...>; representations do not statisfying constraints <...> are to be
> rejected".
>
> What you do with the payload _after_ it's been transferred is a separate
> story; it's also nice to have, but right now we don't even have the calling
> conventions, nevermind the description of behaviour.
>
> How is one supposed to review the marshalling code?  Sure, I can tell that
> if you have the iov[4].iov_len greater than .trailer_size, you end up with
> uninitialized tail of vmalloc'ed buffer.  I can also see that your userland
> code doesn't ever pass a positive .trailer_size with iov[4].iov_len different
> from it.  What's the proper fix?  To declare that they must always be equal,
> and reject mismatches, as you already do for inequality in other direction?
> Or should we quietly zero-pad the buffer?  Either will work for your current
> userland code, but only you can tell which is the right fix - I don't know
> what the older versions used to do and I don't know what changes are planned.
>
> And yes, I understand the desire to separate a potentially large payload
> kernel-side; is there any deep reason why one doesn't simply use .trailer_size
> to decide how much in the end of the area being transferred should go there?
>
> Are there any plans for having the first 4 segments shorter than the full
> 16 + sizeof(struct pvfs2_downcall_s)?  If so, what's the minimal amount to
> be expected in all cases?  I'm not nitpicking here - having e.g. short
> packets ending at (non-zero) .status would work with the current kernel
> code and isn't inconceivable as a planned change.
>
> For that matter, what's .type for?  It looks like a selector for that
> union you have in the end, but it is not used that way - originating
> upcall is used for that instead.  Is it just a junk, or is it a currently
> unimplemented sanity check (if .type in downcall must be the same as in
> matching upcall)?
>
>> You're not promising an ack in return, but we can't go further
>> until we provide the definition.
>
> *shrug*
>
> Sure, I can try and pull it out of you guys bit by bit instead (see above
> for some initial questions), but I suspect that it would be faster and less
> painful for everyone involved if you just describe the calling conventions
> and we'll see what might be missing.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html