On 10/17/2018 01:46 PM, Daniel P. Berrangé wrote:
> On Wed, Oct 17, 2018 at 12:52:50PM +0200, Michal Privoznik wrote:
>> On 10/17/2018 11:45 AM, Daniel P. Berrangé wrote:
>>> On Wed, Oct 17, 2018 at 11:06:46AM +0200, Michal Privoznik wrote:
>>>> Trying to use virtlockd to lock metadata turns out to be too big a
>>>> gun. Since we will always spawn a separate process for relabeling,
>>>> we are safe to use thread-unsafe POSIX locks and take virtlockd
>>>> completely out of the picture.
>>>
>>> Note safety of POSIX locks is not merely about other threads.
>>>
>>> If *any* code we invoke from security_dac or security_selinux
>>> were to open() & close() the file we're labelling, we'll
>>> lose the locks.
>>>
>>> This is a little concerning since we call out to a few functions
>>> in src/util that might not be aware they should *never* open()
>>> and close() the path they're given. It looks like we're lucky at
>>> the moment, but this has future gun -> foot -> ouch potential.
>>>
>>> Likewise we also call out to libselinux APIs. I've looked at
>>> the code for the APIs we call and again, right now, we look
>>> to be safe, but there's some risk there.
>>>
>>> I'm not sure whether this is a big enough risk to invalidate this
>>> forking approach or not, though. This kind of problem was the
>>> reason why virtlockd was created to hold locks in a completely
>>> separate process from the code with the critical section.
>>
>> But unless we check for hard links, virtlockd will suffer from the
>> POSIX locks too. For instance, assume that B is a hardlink to A.
>> Then, if a process opens A and locks it, and the same process then
>> opens B and closes it, the lock is released, even though A and B
>> look distinct.
>
> Yes, that is correct - there's an implicit assumption that all VMs
> are using the same path if there are multiple hard links. Fortunately
> this scenario is fairly rare - more common is symlinks, which can be
> canonicalized.
>
>> The problem with virtlockd is that our current virlockspace impl is
>> not aware of different offsets. We can try to make it so, but that
>> would complicate the code a lot. Just imagine that there's a domain
>> already running and it holds a disk content lock over file C. If a
>> user wants to start a domain (which also has disk C), libvirtd will
>> try to lock C again (even though on a different offset) but it will
>> fail to do so because C is already locked.
>
> Yep, we would need intelligence around offsets to avoid opening it
> twice for each offset.
>
>> Another problem with virtlockd that I faced was that with the current
>> implementation the connection to virtlockd is opened in metadataLock(),
>> where the connection FD is dup()-ed to prevent the connection from
>> closing. The connection is then closed in metadataUnlock(), where the
>> dup()-ed FD is closed. This works pretty well, except for fork(). If
>> a fork() occurs between metadataLock() and metadataUnlock(), the
>> duplicated FD is leaked into the child and we can't be certain when
>> the child will close it. The worst case scenario is that the child
>> closes the FD while we are still holding some resources, which
>> results in virtlockd killing libvirtd (= the registered owner of the
>> locks).
>
> I don't think that's the case actually. If you have an FD that is a
> UNIX/TCP socket that gets dup()-ed, the remote end of the socket won't
> see HUP until all duplicates of the socket are closed. IOW, both the
> parent & child process would need to close their respective dups.

It is the case. I've seen it out in the wild (when starting several
domains at once):

https://www.redhat.com/archives/libvir-list/2018-September/msg01138.html
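To make the open() + close() pitfall discussed earlier in the thread
concrete, here is a minimal standalone sketch (not libvirt code; the
path and byte range are made up) of how a stray open()/close() of the
locked path by the lock-holding process silently drops a classic POSIX
record lock:

    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        /* Hypothetical path, purely for illustration. */
        const char *path = "/tmp/posix-lock-demo";

        int lockfd = open(path, O_RDWR | O_CREAT, 0600);
        if (lockfd < 0)
            return 1;

        /* Take a classic POSIX write lock on the first byte. */
        struct flock fl = {
            .l_type = F_WRLCK,
            .l_whence = SEEK_SET,
            .l_start = 0,
            .l_len = 1,
        };
        if (fcntl(lockfd, F_SETLK, &fl) < 0)
            return 1;

        /* Now imagine a helper elsewhere doing this on the same path: */
        int otherfd = open(path, O_RDONLY);
        if (otherfd >= 0)
            close(otherfd);   /* <-- this close() releases the lock taken
                               *     on lockfd above, without any error */

        /* ... the critical section that follows is no longer protected. */

        close(lockfd);
        return 0;
    }

Classic POSIX locks are tied to (process, inode), so closing *any*
descriptor that refers to the locked file releases them - which is why
a function in src/util quietly doing open() + close() on the path would
be enough to break the critical section.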
>
>> Alternatively, we can switch to OFD locks and have the feature work
>> only on Linux. If any BSD users want the feature, they will have to
>> push their kernel developers to implement some sensible file locking.
>>
>> Also, one of the questions is what we want metadata locking to be. So
>> far I am aiming at having exclusive access to the file that we are
>> chown()-ing (or setfcon()-ing) so that I can put code around it that
>> manipulates the XATTR where the original owner would be stored. This
>> way the metadata lock is held only for a fraction of a second.
>> I guess we don't want the metadata lock to be held through the whole
>> run of a domain, i.e. obtained at domain start and released at domain
>> shutdown. I don't see much benefit in that.
>
> Yes, we must only lock it for the short time that we're reading or
> writing the metadata. We must not hold the lock long term, because
> other threads or processes need access to the metadata while the VM
> is running, for example to deal with validating shared disk labels.
>
> On further reflection, I think we'll be safe enough with what you're
> proposing. For it to go wrong, we would have to have a bug in the code
> which accidentally open()s + close()s the file, and be on an old
> kernel that lacks OFD locks, and two processes would have to be in the
> critical section at the same time in that short < 1 second window.

Cool. Thanks.

Let me just say that I find it disturbing that POSIX doesn't specify
file locks that are usable in a multithreaded app. Sure, we can have
some abstraction layer over lockf() + open() + close(), but it can
hardly be something usable. Or something that every app should do.

Michal
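P.S. For reference, a rough sketch (assuming Linux and glibc; the
helper name metadata_lock_ofd() is made up, not libvirt's actual API)
of what the OFD-lock variant mentioned above could look like. OFD locks
(F_OFD_SETLK*, Linux >= 3.15) belong to the open file description
rather than the process, so they survive an unrelated open() + close()
of the same path and conflict sanely between threads:

    #define _GNU_SOURCE       /* for F_OFD_SETLKW on glibc */
    #include <fcntl.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Hypothetical helper, not libvirt's API: take an exclusive lock on
     * one byte of @path at @offset and return the locked fd (the lock
     * lives as long as this fd stays open), or -1 on error. */
    static int metadata_lock_ofd(const char *path, off_t offset)
    {
        int fd = open(path, O_RDWR);
        if (fd < 0)
            return -1;

        struct flock fl = {
            .l_type = F_WRLCK,
            .l_whence = SEEK_SET,
            .l_start = offset,   /* per-offset locking, cf. the virtlockd
                                  * offsets discussion above */
            .l_len = 1,
            .l_pid = 0,          /* must be 0 for OFD locks */
        };

    #ifdef F_OFD_SETLKW
        /* OFD lock: owned by this open file description, not the
         * process, so an unrelated open()+close() of @path elsewhere in
         * the process does not drop it. */
        if (fcntl(fd, F_OFD_SETLKW, &fl) < 0) {
            close(fd);
            return -1;
        }
    #else
        /* Pre-3.15 kernel or old headers: fall back to a classic POSIX
         * lock, with all the caveats discussed in this thread. */
        if (fcntl(fd, F_SETLKW, &fl) < 0) {
            close(fd);
            return -1;
        }
    #endif

        return fd;
    }

Unlocking is then just close(fd), and because the lock follows the open
file description it is only released once every fd referring to that
description (the parent and any fork() child) has been closed.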