On Wed, Aug 26, 2020 at 12:49:03AM +0000, Saleem, Shiraz wrote: > > Subject: Re: [PATCH rdma-next 01/10] RDMA: Restore ability to fail on PD > > deallocate > > > > On 25/08/2020 16:44, Jason Gunthorpe wrote: > > > On Tue, Aug 25, 2020 at 04:32:57PM +0300, Gal Pressman wrote: > > >>> For uverbs it will go into an infinite loop in > > >>> uverbs_destroy_ufile_hw() if destroy doesn't eventually succeed. > > >> > > >> The code breaks the loop in such cases, why infinite loop? > > > > > > Oh, that is a bug, it should WARN_ON when that happens, because the > > > driver has triggered a permanent memory leak. > > > > Well, a WARN_ON won't do much good if you're stuck in an infinite loop :), the > > break is definitely needed there. > > > > >>> For kernel it will trigger WARN_ON's and then a permanent memory leak. > > >>> > > >>>> I agree that drivers shouldn't fail destroy commands, but you > > >>>> know.. bugs/errors happen (especially when dealing with hardware), > > >>>> and we have a way to propagate them, why do it for only some of the > > drivers? > > >>> > > >>> There is no way to propogate them. > > >>> > > >>> All destroy must eventually succeed. > > >> > > >> There is no way to propagate them on process cleanup, but the destroy > > >> verbs have a return code all the way back to libibverbs, which we can > > >> use for error propagation. > > > > > > It is sort of OK for a driver to fail during RDMA_REMOVE_DESTROY. > > > > > > All other reason codes must eventually succeed. > > > > > >> The cleanup flow can either ignore the return value, or we can add > > >> another parameter that explicitly means the call shouldn't fail and > > >> all allocated memory/state should be freed. > > > > > > I don't really see the value to return the error code to userspace, it > > > would require churning all the drivers and all the destroy functions > > > to pass the existing reason in. > > > > > > Since all the details of the FW failure reason are lost to some EINVAL > > > (or already logged to dmesg) I don't see much point. > > > > Right, as always, the error code would probably not contain much information, but > > there's a big difference between returning error code X/Y vs returning success > > instead of an error. To me that just feels wrong, at least in cases where we can > > prevent that. > > > > The API is quite confusing now. If drivers are not expected to fail the destroy > and there is no way to propagate the device failures, then the return type should be a void. > > Do we really want to have mixed/ambiguous definition of the API to support the quirks of one type of device? This is in-kernel API and it can be imperfect, because we are not bounded to not-break-userspace rule. I don't like the current situation either, just don't know how to solve it differently. Thanks > > Shiraz >