Re: RFC: chasing all idr_remove() misses

Dan Carpenter <dan.carpenter@xxxxxxxxxx> · Sat, 16 Nov 2024 22:43:30 +0300

Matthew Wilcox maintains IDR so it's really up to him.  I don't think adding
a WARN_ON_ONCE() in idr_remove() for NULL returns is a bad idea but we could
hide it behind a #if DEBUG_IDR or something and try run syzkaller on it first.

regards,
dan carpenter

On Sat, Nov 16, 2024 at 06:45:37PM +0100, Alexandre Ferrieux wrote:
> Hi,
> 
> In the recent fix of u32's IDR leaks:
> 
>   73af53d82076 net: sched: cls_u32: Fix u32's systematic failure to free IDR
>                entries for hnodes.
> 
> ... one side remark is that the problem went unnoticed for 7 years due to the
> NULL result from idr_remove() being ignored at this call site.
> 
> Now, a cursory grep over the whole Linux tree shows 306 out of 386 call sites
> (excluding those hidden in macros, if any) don't bother to extract the value
> returned by idr_remove().
> 
> Indeed, a failed IDR removal is "mostly harmless" since IDs are not pointers so
> the mismatch is detectable (and is detected, returning NULL). However, in racy
> situations you may end up killing an innocent fresh entry, which may really
> break things a bit later. And in all cases, a true bug is the root cause.
> 
> So, unless we have reasons to think cls_u32 was the only place where two ID
> encodings might lend themselves to confusion, I'm wondering if it wouldn't make
> sense to chase the issue more systematically:
> 
>  - either with WARN_ON[_ONCE](idr_remove()==NULL) on each call site individually
> (a year-long endeavor implying tens of maintainers)
> 
>  - or with WARN_ON[_ONCE] just before returning NULL within idr_remove() itself,
> or even radix_tree_delete_item() (quicker but possibly disruptive)
> 
>  - a variant of the latter being to do it only for harsh bug-hunting builds (the
> ones typically used by patrolling bots)
> 
> Opinions ?