On Thu, Oct 01, 2020 at 04:38:55PM +0200, Greg KH wrote: > On Thu, Oct 01, 2020 at 11:33:34AM -0300, Jason Gunthorpe wrote: > > On Thu, Oct 01, 2020 at 02:14:23PM +0200, Greg KH wrote: > > > On Thu, Oct 01, 2020 at 08:58:47AM -0300, Jason Gunthorpe wrote: > > > > On Thu, Oct 01, 2020 at 01:05:51PM +0200, Greg KH wrote: > > > > > > > > > You have to be _VERY_ careful after calling > > > > > ancillary_device_initialize(), as now you can not just free up the > > > > > memory if something goes wrong before ancillary_device_add() is called, > > > > > right? > > > > > > > > I've looked at way too many versions of this patch and related. This > > > > is the only one so far that I didn't find various bugs on the error > > > > cases. > > > > > > But you haven't seen the callers of this function. Without this > > > documented, you will have problems. > > > > I've seen the Intel irdma, both versions of the SOF stuff and an > > internal mlx5 patch.. > > > > Look at the SOF example, it has perfectly paired error unwinds. Each > > function has unwind that cleans up exactly what it creates. Every > > 'free' unwind is paired with an 'alloc' in the same function. Simple. > > Easy to audit. Easy to correctly enhance down the road. > > > > This is the common kernel goto error design pattern. > > But that's where people get this wrong. People get everything wrong :( At least this pattern is easy to notice and review. > Once device_initialize() is called, the "free" can not be called, > something else must be, device_put(). Yep! However, with the one step device_register() pattern code usually makes this class of mistake: https://elixir.bootlin.com/linux/latest/source/drivers/firewire/core-device.c#L722 'goto skip_unit' does kfree() on something that already has been device_initialized(). This is a real bug because this code called dev_set_name() on line 713 and not doing the put_device() leaked the name allocation. I think < v10 had this mistake. dev_set_name() is a common error, here is another version: https://elixir.bootlin.com/linux/latest/source/drivers/dma/idxd/cdev.c#L226 This correctly gets the switch to put_device() after device_register(), but it calls kfree on line 220 after dev_set_name(). This leaks memory too. Something like v16 of this series had this bug as well. BTW, want a patch to add a kref_read(dev->kref) == 0 warning to dev_set_name() ? This seems pretty common, these were the first two random choices from LXR I checked :\ > Sure, but without a real user that _NEEDS_ this two-step process, let's > not include it. Why bake complexity into the system from the start that > is never used? It just needs to not have these common error unwind bugs :( Jason