Hi John,

On Thu, 2017-11-02 at 07:03 -0400, John Ferlan wrote:
> On 11/02/2017 06:20 AM, Cedric Bosdonnat wrote:
> > Hi all,
> >
> > I'm investigating a crash in the storage driver, and it seems the
> > solution is not as obvious as I wanted it to be.
> >
> > Basically what happens is that libvirtd crashes due to the pool
> > refresh thread clearing the list of pool volumes while someone is
> > adding a new volume. Here is the valgrind log I have for that:
> >
> > http://paste.opensuse.org/58733866
>
> What version of libvirt? Your line numbers don't match up with current
> HEAD.

It's 3.8.0.

> > I tried adding a call to virStoragePoolObjLock() in
> > virStorageVolPoolRefreshThread(), since storageVolCreateXML()
> > appears to have it, but I'm getting a deadlock. Here are the waiting
> > threads I got from gdb:
>
> virStoragePoolObjFindByName returns a locked pool obj, so not sure
> where/why you added this...

Oh, I overlooked that... Since there is already a lock on it, the
solution must be elsewhere.

> > http://paste.opensuse.org/45961070
> >
> > Any idea what would be a proper fix for that? My vision of the
> > storage driver's locks is too limited, it seems ;)
>
> From just a quick look...
>
> The virStorageVolPoolRefreshThread will take storageDriverLock, get a
> locked pool object (virStoragePoolObjFindByName), clear out the pool,
> add back the volumes that it knows about, unlock the pool object, and
> call storageDriverUnlock.
>
> If you were adding a new pool... You'd take storageDriverLock, then
> eventually virStoragePoolObjAssignDef would be called and the pool's
> object lock taken and unlocked, and it returns a locked pool object
> which gets unlocked later in cleanup, followed by a storageDriverUnlock.
> If you're adding a new volume object, you'd get a locked pool object
> via virStoragePoolObjFromStoragePool, then, if building, that lock
> gets dropped after increasing the async job count and setting the
> building flag, the volume is built, then the object lock is retaken
> while temporarily holding the storageDriverLock, the async job count
> gets decremented and the building flag cleared, and eventually we fall
> into cleanup, which unlocks the pool again.

Thanks for the details: this is really useful for understanding
globally how the locks work in the storage driver. Maybe it's worth
turning into a doc somewhere?

> So how to fix it - well, it seems to me the storageDriverLock in
> VolCreateXML may be the way, since the refresh thread takes the driver
> lock first, then the pool object second. Perhaps even like the build
> code, where it takes it temporarily while getting the pool object. I'd
> have to think a bit more about it though. Still might be something to
> try, since the Vol Refresh thread takes it while running...

Ok, I'll look into that direction.

> John
>
> Not related to this problem per se, but what may help even more is if
> I can get the storage driver usage of the common object model patches
> completely reviewed, but that's a different problem ;-)... I'll have
> to go look and see if I may have fixed this there. The newer model
> uses hash tables, RW locks, and reduces the storage driver hammer
> lock, but this one condition may not have been addressed.

I can have a go at it and see if I can reproduce the crash with it. Not
sure I'll be the one skilled enough for the full review, though ;)

--
Cedric

--
libvir-list mailing list
libvir-list@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/libvir-list