Re: orangefs: NAK until the ABI is documented (was Re: updated orangefs tree at kernel.org)

Mike Marshall <hubcap@xxxxxxxxxxxx> · Mon, 12 Oct 2015 12:20:42 -0400

 > ... the ABI is ... utterly... "idiosyncratic"...

Al, do you mean idiosyncratic in a "distinctive and special" way,
or a "strange and bizarre" way? <g>

When I first started working with the kernel module, I grew
to think of the service_operation as the event horizon to
a black hole, but I figured that was just me...

First off, Al, I want to thank you for spending so much
time on the review. You have pointed out numerous ways
to make the kernel module better. Most importantly, though,
I believe you are saying (and I'm purposely restating
your words here to make sure we're on the same page) that there
will be no ack until we provide a usable detailed definition of
what is in every upcall and downcall and ioctl between
the kernel module and the userspace daemon. Not only do we need,
for example, to specify how LOOKUP_FOLLOW needs to be provided
in a lookup request, but it would be nice to also say why.

You're not promising an ack in return, but we can't go further
until we provide the definition.

 > Once [the userspace interface is] in the tree, it's cast in
 > stone, so we'd better get it right and we need it explicitly
 > documented.

Along these lines, we've done some well-intentioned "thrashing
around" (not Linus' favorite development model) towards ensuring
that an upstream kernel module will continue to work into the
future. The "khandle" code is an attempt to ensure that things
will continue to work when userspace transitions to 128 bit
uuids for handles, instead of the current 64 bit handles - we
use handles in the kernel module as inode numbers. And I added
an ioctl that allows the kernel module to become aware of
userspace's ever changing list of debug modes when the
userspace daemon is first started. Our intent to keep the
upstream kernel module working with userspace, current and future,
is also reflected in the upstream kernel modules' handling of
the the protocol version check - if userspace sees that the
kernel module is the upstream version then charge ahead because
we've planned not to change the protocol.

Anyhow, Martin and I plan to work on the protocol definition,
which may/might/probably-will lead to changes in the code to make
the protocol more sane. This will probably lead to us missing the
next merge window. And, as we proceed, we might ask for
feedback on what we are doing from Al and fs-devel...

Al, does this sound like how you think we should focus our
energy at this point?

-Mike

On Sat, Oct 10, 2015 at 7:10 PM, Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote:
> On Sat, Oct 10, 2015 at 03:31:57AM +0100, Al Viro wrote:
>>       * atrocious userland API for downcalls (response to everything
>> other than readdir must come in 4-element iovec array passed to writev(),
>> no matter where the segment boundaries are; response to readdir must come
>> in 5-element iovec array passed to writev(), the boundaries among the
>> first 4 segments do not matter, the 5th segment covering exactly the payload).
>> IMO that needs to be fixed before we merge the damn thing - it's really too
>> ugly to live.  I would really like to hear from somebody familiar with the
>> userland side - what responses does it actually submit?  The validation
>> kernel-side is basically inexistent, and while I suspect that we could handle
>> the actually sent responses much cleaner, the full set of everything that
>> would be accepted by the current code is a bloody mess and would be much
>> harder to handle in a clean way.  What's more, the response layouts are
>> messy, and if it is still possible to change that API, we'd be much better off
>> if we cleaned them up.
>
> Another API bogosity: when the 5th segment is present, successful writev()
> returns the sum of sizes of the first 4.  Userland side just ignores that -
> everything positive is treated as "everything's written".
>
> Another piece of fun: header too large by less than sizeof(long) => print
> a warning, leave op->downcall completely uninitialized and proceed.
>
> Below is what the kernel side is actually doing:
>         * less than 4 segments or more than 5 => whine and fail
>         * if concatenation of the first 4 segments is longer than
> 16 + sizeof(struct pvfs2_downcall_s) + sizeof(long) => whine and fail
>         * if concatenation of the first 4 segments is longer than
> 16 + sizeof(struct pvfs2_downcall_s) by no more than sizeof(long) => whine
> and proceed with garbage.
>         * if concatenation of the first 4 segments is no longer than
> 16 + sizeof(struct pvfs2_downcall_s), it is zero-padded to that size,
> then the first 16 bytes are interpreted as 32bit protocol version (ignored)
> + 32bit magic (checked for one specific value) + 64bit tag (used to find the
> matching request).  The rest is copied into op->downcall.
>         * if the 32bit value 4 bytes into op->downcall is zero and 64bit
> value following it is non-zero, the latter is interpreted as the size of
> trailer data.
>         * if there's no trailer, the 5th segment (if present) is completely
> ignored.  Otherwise it must be present and is expected to contain the trailer
> data.
>         * if the 5th segment is longer than expected trailer data => whine
> and fail (note that if the amount of expected trailer data is zero a non-empty
> 5th segment is quietly ignored).
>         * if trailer expected, vmalloc a buffer for it and copy the 5th
> segment contents in there; in case of the 5th segment being *shorter* than
> expected trailer data, leave the rest of the buffer uninitialized (i.e.
> expose the contents of random previously freed pages).
>         * if vmalloc fails, act as if status (32bit at offset 5 into
> op->downcall) had been -ENOMEM and don't look at the 5th segment at all.
>         * in all cases ignore the amount of data taken from the 5th segment;
> the return value is the sum of sizes of the first 4.
>
>         What the current userland side appears to sends is 4-segment
> version/magic/tag/struct pvfs2_downcall_s + possibly the 5th segment covering
> the readdir results.  Trailer size in struct pvfs2_downcall_s actually is
> equal to the size of the 5th segment if present and 0 otherwise.
>         The 4th segment (struct pvfs2_downcall_s) contains a bunch of garbage
> never used by the kernel (pointers and misalignment gaps).
>         How much of that is guaranteed is, of course, known only to the
> orangefs folks.  I only looked at the current version of the userland code;
> hadn't checked the older ones and have no idea what kind of changes might
> be planned.
>
>         Folks, userland interfaces need to be documented (and preferably -
> contain less ugliness).  At that point I think that we need the damn API
> written up and reviewed, or it's a strong NAK.  Once it's in the tree, it's
> cast in stone, so we'd better get it right and we need it explicitly
> documented.  "Whatever orangefs userland code does" doesn't cut it, especially
> since your protocol version check is inexistent.  Digging through the entire
> history of your userland code and trying to deduce what would and what would
> not be OK to change kernel-side is far past reasonable.
>
>         I'm not asking for the description of semantics for individual requests
> and responses (it would be nice to have as well, but that's a separate story),
> but the rules for data marshalling are really needed.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html