Le jeudi 29 janvier 2015 à 12:14 -0700, Jason Gunthorpe a écrit : > On Thu, Jan 29, 2015 at 07:35:14PM +0100, Yann Droneaud wrote: > > > Unfortunately, the userspace don't get the size of the returned data: > > it's only a single "write()" syscall after all. > > A write syscall that behaves nothing like write() actually should, so > I don't see why we can't have > > resp_len = write(fd,inout_buf,sizeof(input_len)); > > Returning 0 from write makes no sense at all. > 0 is not the result of the write() syscall, as for extended uverbs I've ensure the return value of the write() syscall was the size it was given. See http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/drivers/infiniband/core/uverbs_main.c?id=v3.19-rc6#n599 705 err = uverbs_ex_cmd_table[command](file, 706 &ucore, 707 &uhw); 708 709 if (err) 710 return err; 711 712 return written_count; See commit f21519b23c1 ("IB/core: extended command: an improved infrastructure for uverbs commands") http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=f21519b23c1b6fa25366be4114ccf7fcf1c190f9 > In the fullness of your patchset it will maintain the invariant that > resp_len <= sizeof(input_len) > I don't catch your point: the response can be bigger than the request. > Which seems OK to me considering what we have to work with, and a > significantly better choice than 0. > > > So the kernel is left with the comp_mask in the response to express > > the returned size. > > It was never the intent that the size should be computed from > comp_mask. If the size is necessary it must be explicit. > > In this instance if the size is not returned then libibverbs will have > to zero the entire user buffer before passing it to the kernel, > because it has to ensure any tail for the user app is 0'd. > The proposed patch ensure the integrity of the response regarding comp_mask: if a bit is set in response's comp_mask that means the related fields are presents (and valid). So before parsing the response fields, userspace have to check response's comp_mask: fields access must be protected by correct check on comp_mask ... but it might be useful for the userspace developer to clear the response buffer just in case he/she decided to be lazy with the check. > Remember for santity we want comp mask bits for things that can't be 0 For me, it's better if a bit is set in response's comp_mask by the kernel when the kernel have written something in the related fields even if the those fields are all 0. > and we want 0 for things that are not set. > > struct ib_query_device_ex res; > ibv_query_device_ex(..,res,sizeof(res)); > > printf("%u",res.foo_cap); // 0 if unsupported is OK > if (res.comp_mask & COMP_BAR) > printf("%u",res.bar_thingy); // 0 has meaning, must use COMP_BAR > > Ideally we want to minimize the number of COMP bits because it is a > huge burden on the end user to work with them. > Sure. So I think comp_mask is just an overly complicated way of expressing the version and the size of the response. > > > The purpose of the output comp_mask is to allow drivers to declare > > > they do not support the new structure members, and comp_mask bits > > > should only be used with new structure members do not have a natural > > > 'null' value. > > > It's not (yet) about drivers as the output's comp_mask (in the > > response), is set by uverbs layer. > > > > Do you think we have to enforce a 1 to 1 relation between the request > > comp_mask's bits and the response comp_mask's bits ? > > In the query context I would be happy enough to just treat the in > bound buffer as uninitialized buffer space, but certainly generally > speaking the comp_mask+size should refer to the structure - so > input/output are not directly related. > OK. > > > Further, we need to distinguish cases where additional data is > > > mandatory or optional. > > > > > > query_device is clearly optional, the only purpose the input comp mask > > > serves is to reduce expensive work in the driver by indicating that > > > some result bits are not needed. > > > > That's not how I've understand the purpose of the request's comp_mask > > after reading the presentation: request's comp_mask describe the content > > of the request. Then that additional content can trigger the presence > > of additional fields in the response if needed. > > Agreed - what I described was an abuse that some early Mellanox > patches for a query_ex included and it was just a shortcut to avoid > defining a dedicated input structure. It seems that same scheme was > copied into these new patches. > OK > > > It is perfectly OK for the kernel to > > > ignore the input comp mask, and OK for userspace to typically request > > > all bits. (indeed, I think this is a pretty silly optimization myself, > > > and the original patch that motivated this was restructured so it is > > > not needed) > > > > > > > It's not at all OK to ignore request's comp_mask: it has to be checked > > to find if there's a feature requested the kernel cannot fullfil, and > > any unknown bit is a possible feature request. So the kernel has to > > reject the request as a whole as it don't know how to process it. > > In the context of the above scheme the input structure was just this: > > struct query_input > { > u64 requested_output; > }; > > ie it wasn't actually a comp_mask, it just overlapped the comp_mask > bytes on output. > > Such a use was never explicitly documented, and IIRC was never > actually included in libibverbs. > OK > Unless someone has a strong reason we need to do this I am inclined to > think that your interpretation is the better one, and we could always > add a requested_output to the input someday if it became urgent. > Sure one field can be added later: in such case, one bit of request's comp_mask will gain the meaning: "request_output field is present in the request". > In any event, you are right, we can't ingore the input bytes today and > expect to give them meaning tomorrow. > Sure. > > As we don't know yet how we would extend the extended QUERY_DEVICE > > uverbs, the kernel have to refuse any random value. > > > > http://blog.ffwll.ch/2013/11/botching-up-ioctls.html > > or the kernel treats QUERY_DEVICE as an output only function and never > inspects the in/out buffer at all. > It's something we could think of but the extended uverbs is already merge with the comp_mask field in the request and it cost only 4 bytes, while allowing us to make mistake on the first iteration of the extended QUERY_DEVICE uverb (provided we manage to add the check for the field being 0). > > > > Thanks for the link to the presentation. > > > > > > If I recall the presentation is old and had some flaws that were > > > addressed before things made it into libibverbs.. > > > > I have to have a look to this part of libibverbs: I'm not sure > > the extended QUERY_DEVICE is already implemented. > > The patches turned out to be unnecessary and were dropped, IIRC. > OK. > > > Thank you for looking at this, it is very important that this scheme > > > is used properly, and it is very easy to make mistakes. I haven't had > > > time to review any of these new patches myself. > > > I hope you would find some time to review my latest patchset so that > > we don't miss a corner case. It's starting to become urgent. > > I have and will, thank you > :) Thanks. Regards. -- Yann Droneaud OPTEYA -- To unsubscribe from this list: send the line "unsubscribe linux-api" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html