Re: [PATCH] Kernel OOPS in xen_netbk_rx_action / xenvif_gop_skb

Wei Liu <wei.liu2@xxxxxxxxxx> · Fri, 11 Jul 2014 10:53:07 +0100

On Fri, Jul 11, 2014 at 11:41:22AM +0200, Philipp Hahn wrote:
> Hello Wei Liu,
> 
> On 10.07.2014 14:41, Wei Liu wrote:
> > On Wed, Jul 02, 2014 at 09:45:44AM +0200, Philipp Hahn wrote:
> >> @Wei Liu: You said that the patch is only a quick hack to detect, if my
> >> analysis is correct and a proper fix would be needed. For us the
> >> attached patch works, as the problem does not happen that often and is
> >> hard to reproduce anyway, so spending more time on that issue is
> >> probably not worth it. And that flag doesn't look that ugly.
> ...
> > I agree that we would like to avoid spending too much time on this
> > issue.
> 
> This is also what I'm thinking.
> 
> > Since the problem is confirmed, I think a proper fix will be to
> > reference count vif and prevent it from unmapping the ring before all
> > queued SKBs are consumed.
> 
> vif is already ref-counted; see attached *untested* patch for a start.
> 

What I meant was that, in xenvif_free, we will wait for refcnt to become
0 (or 1?) before freeing vif structure. I think we should do similar
thing before unmapping the ring.

> What I don't like is xenbus_unmap_ring_vfree() potentially printing an
> error on double-free when called from xen_netbk_unmap_frontend_rings().
> 

How can it be called twice? Does it happen before or after applying my
patch?

> > But it might require much more work than that quick hack.
> 
> I'm no network driver/Xen expert, but that sounds like more work for no
> gain: We would still copy packets for a guest, which is already dead. If
> the ring get full, we would probably need to go to sleep and wait for an
> answer we will never get anymore.
> 

That's true, so dropping those packets seems to be a good solution.

> > FWIW this bug doesn't exist in kernel >=3.12.
> 
> That is even one more point for the hack, as there the problem is
> properly fixed and the problem is very obscure to trigger.
> 
> > Would you up for writing a patch? I won't be able to write
> > one in the near future.
> > Further more, you're the only party now can verify a fix.
> 
> I had a look and created the attached patch, which is untested, as I
> currently can't access the faulting system and have been unable to
> reproduce it in my development environment.
> 
> The quick hack now runs on that system for several weeks now without
> problems.
> 

Thanks for testing.

I will need to think further if this hack is really a proper solution. I
will keep you posted if there's any update on this. In the mean time you
can carry that patch in your own tree for a while.

Wei.
--
To unsubscribe from this list: send the line "unsubscribe stable" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html