On Sat, Oct 10, 2015 at 03:31:57AM +0100, Al Viro wrote: > * atrocious userland API for downcalls (response to everything > other than readdir must come in 4-element iovec array passed to writev(), > no matter where the segment boundaries are; response to readdir must come > in 5-element iovec array passed to writev(), the boundaries among the > first 4 segments do not matter, the 5th segment covering exactly the payload). > IMO that needs to be fixed before we merge the damn thing - it's really too > ugly to live. I would really like to hear from somebody familiar with the > userland side - what responses does it actually submit? The validation > kernel-side is basically inexistent, and while I suspect that we could handle > the actually sent responses much cleaner, the full set of everything that > would be accepted by the current code is a bloody mess and would be much > harder to handle in a clean way. What's more, the response layouts are > messy, and if it is still possible to change that API, we'd be much better off > if we cleaned them up. Another API bogosity: when the 5th segment is present, successful writev() returns the sum of sizes of the first 4. Userland side just ignores that - everything positive is treated as "everything's written". Another piece of fun: header too large by less than sizeof(long) => print a warning, leave op->downcall completely uninitialized and proceed. Below is what the kernel side is actually doing: * less than 4 segments or more than 5 => whine and fail * if concatenation of the first 4 segments is longer than 16 + sizeof(struct pvfs2_downcall_s) + sizeof(long) => whine and fail * if concatenation of the first 4 segments is longer than 16 + sizeof(struct pvfs2_downcall_s) by no more than sizeof(long) => whine and proceed with garbage. * if concatenation of the first 4 segments is no longer than 16 + sizeof(struct pvfs2_downcall_s), it is zero-padded to that size, then the first 16 bytes are interpreted as 32bit protocol version (ignored) + 32bit magic (checked for one specific value) + 64bit tag (used to find the matching request). The rest is copied into op->downcall. * if the 32bit value 4 bytes into op->downcall is zero and 64bit value following it is non-zero, the latter is interpreted as the size of trailer data. * if there's no trailer, the 5th segment (if present) is completely ignored. Otherwise it must be present and is expected to contain the trailer data. * if the 5th segment is longer than expected trailer data => whine and fail (note that if the amount of expected trailer data is zero a non-empty 5th segment is quietly ignored). * if trailer expected, vmalloc a buffer for it and copy the 5th segment contents in there; in case of the 5th segment being *shorter* than expected trailer data, leave the rest of the buffer uninitialized (i.e. expose the contents of random previously freed pages). * if vmalloc fails, act as if status (32bit at offset 5 into op->downcall) had been -ENOMEM and don't look at the 5th segment at all. * in all cases ignore the amount of data taken from the 5th segment; the return value is the sum of sizes of the first 4. What the current userland side appears to sends is 4-segment version/magic/tag/struct pvfs2_downcall_s + possibly the 5th segment covering the readdir results. Trailer size in struct pvfs2_downcall_s actually is equal to the size of the 5th segment if present and 0 otherwise. The 4th segment (struct pvfs2_downcall_s) contains a bunch of garbage never used by the kernel (pointers and misalignment gaps). How much of that is guaranteed is, of course, known only to the orangefs folks. I only looked at the current version of the userland code; hadn't checked the older ones and have no idea what kind of changes might be planned. Folks, userland interfaces need to be documented (and preferably - contain less ugliness). At that point I think that we need the damn API written up and reviewed, or it's a strong NAK. Once it's in the tree, it's cast in stone, so we'd better get it right and we need it explicitly documented. "Whatever orangefs userland code does" doesn't cut it, especially since your protocol version check is inexistent. Digging through the entire history of your userland code and trying to deduce what would and what would not be OK to change kernel-side is far past reasonable. I'm not asking for the description of semantics for individual requests and responses (it would be nice to have as well, but that's a separate story), but the rules for data marshalling are really needed. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html