> Sure, the kernel won't take that (the op with the matching tag is gone
> already), but the data is stored into shared memory *before* the writev()
> on the control device that would pass the response to the kernel, so it
> still gets overwritten. Right under decoding readdir()...

The readdir buffer isn't a shared buffer like the IO buffer is. The readdir
buffer is preallocated when the client-core starts up, though. The kernel
module picks which readdir buffer slot the client-core fills, but gets back
a copy of that buffer - the trailer.

Unless the kernel module isn't managing the buffer slots properly, the
client-core shouldn't have more than one upcall on hand that specifies any
particular buffer slot. The "kill -9" on an ls (or whatever) might lead to
such mismanagement, but since readdir decoding happens on a discrete copy
of the buffer slot that was filled by the client-core, it doesn't seem to
me like it could be overwritten during a decode...

I believe there's nothing in userspace that guarantees that readdirs are
replied to in the same order they are received...

-Mike

On Wed, Feb 10, 2016 at 11:44 AM, Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote:
> On Tue, Feb 09, 2016 at 11:13:28PM +0000, Al Viro wrote:
>> On Tue, Feb 09, 2016 at 10:40:50PM +0000, Al Viro wrote:
>>
>> > And the version in orangefs-2.9.3.tar.gz (your Frankenstein module?) is
>> > vulnerable to the same race. 2.8.1 isn't - it ignores signals on the
>> > cancel, but that means waiting for the cancel to be processed (or timed
>> > out) on any interrupted read() before we return to userland. We can
>> > return to that behaviour, of course, but I suspect that offloading it
>> > to something async (along with freeing the slot used by the original
>> > operation) would be better from a QoI point of view.
>>
>> That breakage had been introduced between 2.8.5 and 2.8.6 (at some point
>> during the spring of 2012). AFAICS, all versions starting with 2.8.6 are
>> vulnerable...
>
> BTW, what about a kill -9 delivered to a readdir in progress? There's no
> cancel for those (and AFAICS the daemon will reject a cancel on anything
> other than FILE_IO), so what's to stop another thread from picking the
> same readdir slot and getting (daemon-side) two of them spewing into the
> same area of shared memory? Is it simply that daemon-side the shared
> memory on readdir is touched only upon request completion in the
> completely serialized process_vfs_requests()? That doesn't seem to be
> enough - suppose the second readdir request completes (daemon-side)
> first, its results get packed into a shared memory slot and it is
> reported to the kernel, which proceeds to repack and copy that data to
> userland. In the meanwhile, the daemon completes the _earlier_ readdir
> and proceeds to pack its results into the same slot of shared memory.
> Sure, the kernel won't take that (the op with the matching tag is gone
> already), but the data is stored into shared memory *before* the writev()
> on the control device that would pass the response to the kernel, so it
> still gets overwritten. Right under decoding readdir()...
>
> Or is there something in the daemon that would guarantee that readdir
> responses happen in the same order in which it had picked the requests?
> I'm not familiar enough with that beast (and the overall control flow in
> there is, er, not the most transparent I've seen), so I might be missing
> something, but I don't see anything obvious that would guarantee such
> ordering.
>
> Please, clarify.
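
To make the mechanism Mike describes concrete, here is a minimal C sketch
of decoding from a discrete copy of a buffer slot. This is not the actual
OrangeFS kernel code; the names (struct readdir_slot, decode_readdir_reply,
decode_dirents, SLOT_SIZE) are all hypothetical. The point is only that the
decoder snapshots the slot before working on it, so a later write into the
slot cannot corrupt a decode already in flight:

#include <stdlib.h>
#include <string.h>

#define SLOT_SIZE (128 * 1024)	/* hypothetical per-slot size */

/* One preallocated readdir buffer slot, filled by the client-core. */
struct readdir_slot {
	char data[SLOT_SIZE];
};

/*
 * Decode a completed readdir upcall from a private snapshot of the
 * slot (the "trailer"): a second response landing in the same slot
 * can clobber the slot itself, but not the copy being decoded.
 */
static int decode_readdir_reply(struct readdir_slot *slot, size_t len)
{
	char *trailer;

	if (len > SLOT_SIZE)
		return -1;

	trailer = malloc(len);
	if (!trailer)
		return -1;

	memcpy(trailer, slot->data, len);	/* snapshot first ... */

	/* ... then decode at leisure from the private copy, e.g.
	 * decode_dirents(trailer, len);  (hypothetical helper)      */

	free(trailer);
	return 0;
}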
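
And a toy model of the daemon-side ordering hazard Al raises, again with
invented names (complete_readdir, shared_slot) rather than the real
client-core code. A lock serializes the daemon's own completions, but
nothing orders them against the kernel's consumption of an earlier
response for the same slot, which is exactly the window Al points at:

#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define SLOT_SIZE 64	/* toy size for the model */

static char shared_slot[SLOT_SIZE];	/* one shared-memory readdir slot */
static pthread_mutex_t slot_mutex = PTHREAD_MUTEX_INITIALIZER;

/*
 * Daemon-side completion path for one readdir targeting shared_slot.
 */
static void complete_readdir(const char *results, unsigned long tag)
{
	pthread_mutex_lock(&slot_mutex);

	/* Step 1: the results land in shared memory *before* ...    */
	strncpy(shared_slot, results, SLOT_SIZE - 1);

	/* Step 2: ... the response is passed to the kernel.  If the
	 * kernel already accepted a later response for this slot and
	 * is still working with its contents, step 1 above has just
	 * overwritten them.                                          */
	printf("writev: readdir response, tag %lu -> kernel\n", tag);

	pthread_mutex_unlock(&slot_mutex);
}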