Hi John,

On Thu, 2017-11-02 at 07:03 -0400, John Ferlan wrote:
> On 11/02/2017 06:20 AM, Cedric Bosdonnat wrote:
> > Hi all,
> >
> > I'm investigating a crash in the storage driver, and it seems the
> > solution is not as obvious as I wanted it to be.
> >
> > Basically what happens is that libvirtd crashes due to the pool
> > refresh thread clearing the list of pool volumes while someone is
> > adding a new volume. Here is the valgrind log I have for that:
> >
> > http://paste.opensuse.org/58733866
>
> What version of libvirt? Your line numbers don't match up with current
> HEAD.

It's 3.8.0.

> > I tried adding a call to virStoragePoolObjLock() in
> > virStorageVolPoolRefreshThread(), since storageVolCreateXML()
> > appears to have it, but I'm getting a deadlock. Here are the waiting
> > threads I got from gdb:
>
> virStoragePoolObjFindByName returns a locked pool obj, so not sure
> where/why you added this...

Oh, I overlooked that... Since there is already a lock on it, the
solution must be elsewhere.

> > http://paste.opensuse.org/45961070
> >
> > Any idea what would be a proper fix for that? My vision of the
> > storage driver's locks is too limited, it seems ;)
>
> From just a quick look...
>
> The virStorageVolPoolRefreshThread will take storageDriverLock, get a
> locked pool object (virStoragePoolObjFindByName), clear out the pool,
> add back the volumes that it knows about, unlock the pool object, and
> call storageDriverUnlock.
>
> If you were adding a new pool... You'd take storageDriverLock, then
> eventually virStoragePoolObjAssignDef would be called and the pool's
> object lock taken and unlocked, and it returns a locked pool object
> which gets unlocked later in cleanup, followed by a storageDriverUnlock.
> If you're adding a new volume object, you'd get a locked pool object
> via virStoragePoolObjFromStoragePool, then, if building, that lock
> gets dropped after increasing the async job count and setting the
> building flag, the volume is built, then the object lock is retaken
> while temporarily holding the storageDriverLock, the async job count
> gets decremented and the building flag cleared, and eventually we fall
> into cleanup, which unlocks the pool again.

Thanks for the details: this is really useful for understanding
globally how the locks work in the storage driver. Maybe it's worth
turning into a doc somewhere?

> So how to fix it - well, it seems to me the storageDriverLock in
> VolCreateXML may be the way, since the refresh thread takes the driver
> lock first, then the pool object second. Perhaps even like the build
> code, where it takes it temporarily while getting the pool object. I'd
> have to think a bit more about it though. Still might be something to
> try, since the Vol Refresh thread takes it while running...

Ok, I'll look into that direction.

> John
>
> Not related to this problem per se, but what may help even more is if
> I can get the storage driver usage of the common object model patches
> completely reviewed, but that's a different problem ;-)... I'll have
> to go look and see if I may have fixed this there. The newer model
> uses hash tables, RW locks, and reduces the storage driver hammer
> lock, but this one condition may not have been addressed.

I can have a go at it and see if I can reproduce the crash with it. Not
sure I'll be the one skilled enough for the full review, though ;)

--
Cedric

--
libvir-list mailing list
libvir-list@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/libvir-list