Re: [stable/4.14.y PATCH 0/3] mmc: Fix a potential resource leak when shutting down request queue.

Raul Rangel <rrangel@xxxxxxxxxxxx> · Wed, 19 Jun 2019 10:46:25 -0600

On Tue, May 14, 2019 at 11:19:34AM +0200, Greg Kroah-Hartman wrote:
> On Mon, May 13, 2019 at 11:55:18AM -0600, Raul E Rangel wrote:
> > I think we should cherry-pick 41e3efd07d5a02c80f503e29d755aa1bbb4245de
> > https://lore.kernel.org/patchwork/patch/856512/ into 4.14. It fixes a
> > potential resource leak when shutting down the request queue.
> 
> Potential meaning "it does happen", or "it can happen if we do this", or
> just "maybe it might happen, we really do not know?"
It does happen if the AMD SDHCI patches are cherry-picked into 4.14.
https://lkml.org/lkml/2019/5/1/398
It can be mitigated by changing the line in the patch with
`mmc_detect_change(host, 0)` to mmc_detect_change(host, 200)`, but
that's just a workaround to play with the timing so the race condition
doesn't happen.
> 
> > Once this patch is applied, there is a potential for a null pointer dereference.
> > That's what the second patch fixes.
> 
> What is the git id of that upstream fix?
So there is no specific upstream fix. There was a large patch set that
migrated mmc to using blk-mq, so the bug just kind of went away.
https://lwn.net/Articles/739774/ or 0fbfd12518303e9b32ac9fd231439459eac848f9

> 
> > The third patch is just an optimization to stop processing earlier.
> 
> That's not how stable kernels work :(
Oops, I guess we can ignore that patch. It just prevents mmc_init_request
from being called, but it doesn't matter since the 2nd patch actually
checks for NULL now.
> 
> > See https://patchwork.kernel.org/patch/10925469/ for the initial motivation.
> 
> I don't understand the motivation from that link at all :(
> 
> > This commit applies to v4.14.116. It is already included in 4.19. 4.19 doesn't
> > suffer from the null pointer dereference because later commits migrate the mmc
> > stack to blk-mq.
> 
> What are those later commits?
Commit 0fbfd12518303e9b32ac9fd231439459eac848f9 specifically deletes the
code for the 2nd patch. As I said above, the NULL pointer dereference
just kind of went away as part of the blk-mq migration, so there is no
upstream fix :(

> 
> > I tested this patch set by randomly connecting/disconnecting the SD
> > card. I got over 189650 itarations without a problem.
> 
> And if you do not have these patches, on 4.14.y, how many iterations
> cause a problem?  If you just apply the first patch, does that work?
If I apply the AMD SDHCI patches and nothing else, then I can cause a
resource leak within 10 iterations. If I apply just the first patch then
I can cause a NULL pointer error within 10 iterations. If I apply both 1
and 2, then everything works as expected and I can't cause a problem.
> 
> _EVERY_ time we take a patch that is not upstream, something usually is
> broken and needs to be fixed.  We have a long long long history of this,
> so if you want to have a patch that is not upstream applied to a stable
> kernel release, you need a whole lot of justification and explanation
> and begging.  And you need to be around to fix the fallout for when it
> breaks :)

It also looks like 2361bfb055f948eac6583fa3c75a014da84fe554 includes a
fix for 41e3efd07d5a02c80f503e29d755aa1bbb4245de, so that would need to
be cherry picked in.

I guess I should have included a fixes: line in my second patch.

So to summarize:
- cherry-pick 41e3efd07d5a02c80f503e29d755aa1bbb4245de
- cherry-pick 2361bfb055f948eac6583fa3c75a014da84fe554
- apply 2nd patch but add to commit message:
  Fixes: 41e3efd07d5a ("mmc: block: Simplify cleaning up the queue")
- Ignore patch 3 since it's an optimization.
> 
> thanks,
> 
> greg k-h
Thanks again for your time, and sorry for the really late response!

Raul