Re: [PATCH rdma-rc 0/4] RDMA mlx4/mlx5 fixes

Doug Ledford <dledford@xxxxxxxxxx> · Wed, 17 Jan 2018 15:31:16 -0500

On Fri, 2018-01-12 at 09:58 -0700, Jason Gunthorpe wrote:
> On Fri, Jan 12, 2018 at 09:34:02AM -0600, Chien Tin Tung wrote:
> > On Fri, Jan 12, 2018 at 05:06:02PM +0200, Leon Romanovsky wrote:
> > > On Fri, Jan 12, 2018 at 08:15:24AM -0600, Chien Tin Tung wrote:
> > > > On Fri, Jan 12, 2018 at 07:58:38AM +0200, Leon Romanovsky wrote:
> > > > > Hi,
> > > > > 
> > > > > There is a small set of fixes targeted for rdma-rc.
> > > > 
> > > > For RC?  Are these fixing regressions?  We are already in RC7.
> > > 
> > > Jason was clear last time, he wants to work like Dave: fixes always
> > > go in, regardless of where we are in the -rc cycle.
> > 
> > I won't claim to know Dave's process since I don't directly submit
> > patches to Netdev.  However given Dave's history with the kernel,
> > I highly doubt he would accept patches late in RC cycle that would
> > jeopardize a kernel release.
> 
> Well, my definition of 'fixes' is pretty high. For instance fixing a
> performance regression is not a 'fix', IMHO.
> 
> Fixing a user triggerable out-of-bound oops would be though - as in
> these modern times I'm sure someone smart can escalate stuff like that
> to a full security compromise of a trusted boot system..
> 
> This is why I keep asking for good commit messages for -rc
> patches. Explain why you think this deserves to be in -rc, in terms
> other people can understand:
> 
>  - Can userspace trigger an oops or memory corruption? How?
>  - Can the kernel oops or memory corrupt in an actual demonstrated way
>    (not theoretical)?
>  - Did this actually get hit during real world testing?
>  - Could this escalate to some kind of security issue for a
>    trusted-boot kernel?
>  
> etc.
> 
> So, when I've said 'fixes should go to rc', I mean fixes that qualify
> as important here:
> 
> ...  Bugs that have always existed are not regressions, so
>      only push these kinds of fixes if they are important.  ...
> 
> and I haven't meant 'just anything with a Fixes: tag'.
> 
> If I think I can't defend why your patch is 'important' to Linus,
> then the likelihood of taking it decreases as we move along the rc
> cycle. It is up to the submitter to write a good commit message.
> 
> Doug also likes to see patches with a small LOC late in the cycle.

Indeed.  The larger the line count, the harder it is to guarantee that
you aren't just introducing a new bug.  At some point, the devil you
know is better than the devil you don't.  So, if the bug in question is
an oops, a locking issue resulting in deadlock, or a use after free
that can result in anything from nothing up to silent data corruption
or an oops, then those obviously are the ones that I'm going to take
even late in the rc cycle.  Even if their LOC count is higher than I
like, if testing shows it solves the problem, I'll take it.

For lesser issues, I'm more likely to take it late if the problem is
obvious and the fix is small so that we can have a high degree of
certainty that the fix won't introduce any other problems.

> eg if we look at the recent patches:
>  - Two security related issues for SRP
>  - A locking issue leading to user triggerable memory corruption
>    (eg add/remove rxe devices in parallel with netlink queries)
>  - In-kernel oops under error situations potentially triggerable
>  - Real world oops caused by SELinux checking, hit during testing
>  - An issue that damages our potential future ABI compatibility in uverbs
>  - User triggerable kernel data corruption due to missing locking
> 
> etc..
> 
> and some counter examples that went to -next:
> 
> - 'RDMA/cma: Fix rdma_cm path querying for RoCE'
>   The uAPI has been broken here since v4.12 and nobody noticed
>   The use case in user space is very obscure
>   Probably would have taken this to -rc around rc1/2/3
> - 'net/mlx5: Fix race for multiple RoCE enable'
>   This has been broken since v4.13
>   Unclear from commit message if this is theoretical, user
>   triggerable, or seen in the real world
>   Could have been -rc if the risks were identified as real world
> - 'IB/{hfi1, qib}: Fix a concurrency issue with device name in
>   logging'
>   Seems kinda -rc worthy but the LOC is high, the bug has existed
>   for something like a decade, and the commit message doesn't
>   explain why computing the wrong string is 'important'. Doesn't
>   seem to be an oops or security issue.
> 
> At the very least this is the best thinking I've come up with after
> talking to several people - advice welcome of course.
> 
> Jason

-- 
Doug Ledford <dledford@xxxxxxxxxx>
    GPG KeyID: B826A3330E572FDD
    Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD