Re: [PATCH 3/4] IB/uverbs: ex_query_device: check request's comp_mask

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Le jeudi 29 janvier 2015 à 12:14 -0700, Jason Gunthorpe a écrit :
> On Thu, Jan 29, 2015 at 07:35:14PM +0100, Yann Droneaud wrote:
> 
> > Unfortunately, the userspace don't get the size of the returned data:
> > it's only a single "write()" syscall after all.
> 
> A write syscall that behaves nothing like write() actually should, so
> I don't see why we can't have
> 
> resp_len = write(fd,inout_buf,sizeof(input_len));
> 
> Returning 0 from write makes no sense at all.
> 

0 is not the result of the write() syscall, as for extended uverbs
I've ensure the return value of the write() syscall was the size
it was given. See
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/drivers/infiniband/core/uverbs_main.c?id=v3.19-rc6#n599

    705                 err = uverbs_ex_cmd_table[command](file,
    706                                                    &ucore,
    707                                                    &uhw);
    708 
    709                 if (err)
    710                         return err;
    711 
    712                 return written_count;

See commit f21519b23c1 ("IB/core: extended command: an improved
infrastructure for uverbs commands")

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=f21519b23c1b6fa25366be4114ccf7fcf1c190f9

> In the fullness of your patchset it will maintain the invariant that
> resp_len <= sizeof(input_len)
> 

I don't catch your point: the response can be bigger than the request.

> Which seems OK to me considering what we have to work with, and a
> significantly better choice than 0.
> 
> > So the kernel is left with the comp_mask in the response to express
> > the returned size.
> 
> It was never the intent that the size should be computed from
> comp_mask. If the size is necessary it must be explicit.
> 
> In this instance if the size is not returned then libibverbs will have
> to zero the entire user buffer before passing it to the kernel,
> because it has to ensure any tail for the user app is 0'd.
> 

The proposed patch ensure the integrity of the response regarding
comp_mask: if a bit is set in response's comp_mask that means the
related fields are presents (and valid).

So before parsing the response fields, userspace have to check
response's comp_mask: fields access must be protected by correct
check on comp_mask ... but it might be useful for the userspace
developer to clear the response buffer just in case he/she decided
to be lazy with the check.

> Remember for santity we want comp mask bits for things that can't be 0

For me, it's better if a bit is set in response's comp_mask by the 
kernel when the kernel have written something in the related fields 
even if the those fields are all 0.

> and we want 0 for things that are not set.
> 
> struct ib_query_device_ex res;
> ibv_query_device_ex(..,res,sizeof(res));
> 
> printf("%u",res.foo_cap); // 0 if unsupported is OK
> if (res.comp_mask & COMP_BAR)
>   printf("%u",res.bar_thingy);  // 0 has meaning, must use COMP_BAR
> 
> Ideally we want to minimize the number of COMP bits because it is a
> huge burden on the end user to work with them.
> 

Sure. So I think comp_mask is just an overly complicated way of
expressing the version and the size of the response.

> > > The purpose of the output comp_mask is to allow drivers to declare
> > > they do not support the new structure members, and comp_mask bits
> > > should only be used with new structure members do not have a natural
> > > 'null' value.
> 
> > It's not (yet) about drivers as the output's comp_mask (in the 
> > response), is set by uverbs layer.
> > 
> > Do you think we have to enforce a 1 to 1 relation between the request 
> > comp_mask's bits and the response comp_mask's bits ?
> 
> In the query context I would be happy enough to just treat the in
> bound buffer as uninitialized buffer space, but certainly generally
> speaking the comp_mask+size should refer to the structure - so
> input/output are not directly related.
> 

OK.

> > > Further, we need to distinguish cases where additional data is
> > > mandatory or optional.
> > > 
> > > query_device is clearly optional, the only purpose the input comp mask
> > > serves is to reduce expensive work in the driver by indicating that
> > > some result bits are not needed.
> > 
> > That's not how I've understand the purpose of the request's comp_mask
> > after reading the presentation: request's comp_mask describe the content
> > of the request. Then that additional content can trigger the presence 
> > of additional fields in the response if needed.
> 
> Agreed - what I described was an abuse that some early Mellanox
> patches for a query_ex included and it was just a shortcut to avoid
> defining a dedicated input structure. It seems that same scheme was
> copied into these new patches.
> 

OK

> > >  It is perfectly OK for the kernel to
> > > ignore the input comp mask, and OK for userspace to typically request
> > > all bits. (indeed, I think this is a pretty silly optimization myself,
> > > and the original patch that motivated this was restructured so it is
> > > not needed)
> > > 
> > 
> > It's not at all OK to ignore request's comp_mask: it has to be checked
> > to find if there's a feature requested the kernel cannot fullfil, and 
> > any unknown bit is a possible feature request. So the kernel has to 
> > reject the request as a whole as it don't know how to process it.
> 
> In the context of the above scheme the input structure was just this:
> 
> struct query_input
> {
>   u64 requested_output;
> };
> 
> ie it wasn't actually a comp_mask, it just overlapped the comp_mask
> bytes on output.
> 
> Such a use was never explicitly documented, and IIRC was never
> actually included in libibverbs.
> 

OK

> Unless someone has a strong reason we need to do this I am inclined to
> think that your interpretation is the better one, and we could always
> add a requested_output to the input someday if it became urgent.
> 

Sure one field can be added later: in such case, one bit of request's
comp_mask will gain the meaning: "request_output field is present in the
request".

> In any event, you are right, we can't ingore the input bytes today and
> expect to give them meaning tomorrow.
> 

Sure.

> > As we don't know yet how we would extend the extended QUERY_DEVICE
> > uverbs, the kernel have to refuse any random value.
> > 
> > http://blog.ffwll.ch/2013/11/botching-up-ioctls.html
> 
> or the kernel treats QUERY_DEVICE as an output only function and never
> inspects the in/out buffer at all.
> 

It's something we could think of but the extended uverbs is already
merge with the comp_mask field in the request and it cost only 4 bytes,
while allowing us to make mistake on the first iteration of the extended
QUERY_DEVICE uverb (provided we manage to add the check for the field
being 0).

> > > > Thanks for the link to the presentation.
> > > 
> > > If I recall the presentation is old and had some flaws that were
> > > addressed before things made it into libibverbs..
> > 
> > I have to have a look to this part of libibverbs: I'm not sure
> > the extended QUERY_DEVICE is already implemented.
> 
> The patches turned out to be unnecessary and were dropped, IIRC.
>  

OK.

> > > Thank you for looking at this, it is very important that this scheme
> > > is used properly, and it is very easy to make mistakes. I haven't had
> > > time to review any of these new patches myself.
> 
> > I hope you would find some time to review my latest patchset so that
> > we don't miss a corner case. It's starting to become urgent.
> 
> I have and will, thank you
> 

:)

Thanks.

Regards.

-- 
Yann Droneaud
OPTEYA


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html




[Index of Archives]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Photo]     [Yosemite News]     [Yosemite Photos]     [Linux Kernel]     [Linux SCSI]     [XFree86]
  Powered by Linux