On Thu, 2015-02-12 at 20:10 +0200, Or Gerlitz wrote: > On 01/23/2015 12:02 AM, Doug Ledford wrote: > > The introduction of garbage collection for neighbors in the ipoib stack > > caused a leak of resources. In particular, an ah can have a refcount > > from a mcast struct, from a neigh struct, and from a path struct. When > > we put an ah ref, the ah gets moved to a dead list to be reaped later if > > the refcount goes to 0. > > > > During ipoib_ib_dev_stop we attempt to reap these ah entries, but since > > that function does not force the neigh garbage collection to be run, it > > can leak any entries held by a neighbor. And even though the code > > attempts to warn us that we will leak entries, that warning is broken > > because the entries we are leaking will not yet be on the dead_ahs list > > because they still have valid references held elsewhere. It is during > > shutdown, when we finally kill the neighbor garbage collection, that our > > final ahs will get put on the dead_ahs list, but after the point at > > which we run the neighbor garbage collection, no attempt is made to reap > > the newly added ahs, and because during ipoib_ib_dev_stop we killed the > > ah_reap task, no timed attempt will be made to clear them either. > > > > Instead, create a an ipoib_flush_ah and ipoib_stop_ah routines to use at > > appropriate times to flush out all remaining ah entries before we shut > > the device down. > > > > Additionally do one final flush of our priv->wq before we start tearing > > down ib resources to make sure all possible queued work is done and > > all possible resources freed. > > > > This is done to prevent resource leaks on shutdown, which manifest with > > this message on ib_ipoib module remove: > > > > <ibdev>: ib_dealloc_pd failed > > Doug, > > If you make this bug fix apply-able before the rest of the series we can > push it to stable > kernels too, can you make it? I'm not sure that's possible. This patch relies upon the fact that we have a per device workqueue to be able to flush safely. Prior to the rest of the patches, we still had one global ipoib_workqueue. Depending on the context in which the device is being downed, attempts to flush might result in a deadlock. > > > > Signed-off-by: Doug Ledford <dledford@xxxxxxxxxx> > > Please add above your signature a "Fixes:" line which specifies which > commit you are fixing, is it this one? > > b63b70d87741 IPoIB: Use a private hash table for path lookup in xmit path Yes, this is the one it fixes. -- Doug Ledford <dledford@xxxxxxxxxx> GPG KeyID: 0E572FDD
Attachment:
signature.asc
Description: This is a digitally signed message part