On 11.10.2018 13:20, Daniel P. Berrangé wrote:
> On Mon, Sep 24, 2018 at 10:41:37AM +0300, Nikolay Shirokovskiy wrote:
>> Hi, all.
>>
>> On fat hosts which are capable of running hundreds of VMs, restarting
>> libvirtd makes its services unavailable for a long time if VMs use network
>> filters. In my tests each of 100 VMs has no-promisc [1] and no-mac-spoofing
>> filters, and executing virsh list right after daemon restart takes
>> approximately 140s if no firewalld is running (that is,
>> ebtables/iptables/ip6tables commands are used to configure kernel tables).
>
> Yep, this is not entirely surprising given the huge number of rules we try
> to create.
>
>> The problem is that the daemon does not even start to read from client
>> connections because state drivers are not initialized. Initialization is
>> blocked in state driver autostart, which grabs VM locks. And VM locks are
>> held by the VM reconnection code. Each VM reloads network tables on
>> reconnection, and this reloading is serialized on updateMutex in the gentech
>> nwfilter driver. Working around autostart won't help much because even if
>> state drivers initialize, listing VMs won't be possible since listing takes
>> each VM lock one by one too. However, managing a VM that has passed the
>> reconnection phase will be possible, which takes the same 140s in the worst
>> case.
>
> In the QEMU driver we call qemuProcessReconnectAll(). This spawns one
> thread per VM that needs connecting to.
>
> So AFAICT, state driver initialization should not be blocked. Libvirtd
> should accept commands, but any APIs that touch a virDomainObj will of
> course still be blocked until reconnect is completed.
>
>> Note that this issue only applies if we use a filter configuration that
>> doesn't need IP learning. In the latter case the situation is different
>> because the reconnection code spawns a new thread that applies network
>> rules only after the IP is learned from traffic, and this thread does not
>> grab the VM lock. As a result VMs are manageable, but reloading filters in
>> the background takes approximately those same 140s. I guess managing
>> network filters during this period can have issues too.
>
> I believe it should be possible to still change network filters while this
> reconnect is taking place - it would mean that any VMs affected by the filter
> change would have their iptables rules re-built a second time. However...
>
>> Anyway this situation does not look good, so fixing the described issue by
>> spawning threads even without IP learning does not look nice to me.
>
> This is tricky - we want to know filters are built before we allow the VM
> to start running, so we don't want to spawn a background thread in general.
> IP learning has special code that sets up a minimal safe ruleset before
> spawning the thread and waiting to do the real ruleset creation based on
> the learned IP address. So I don't think it is desirable to change all
> rule creation to run in a background thread.
>
>> What speedup is possible with a conservative approach? First we can remove,
>> for test purposes, the firewall ruleLock, the gentech driver updateMutex and
>> the filter object mutex, which serve no function in the restart scenario.
>> This gives a 36s restart time. The speedup is achieved because the heavy
>> fork/preexec steps now run concurrently.
>
> To me the most obvious speedup is not to run any commands at all.
>
> 99% of the time when we restart libvirtd and re-build network filters
> we are achieving nothing useful. We tear down the rules and replace
> them by exactly the same rules.
>
> The idea behind rebuilding at startup is that someone might have changed
> the config on disk while libvirtd was not running, and we want to ensure
> that the VMs have the correct live config after this.
>
> Alternatively libvirtd has been upgraded to a newer version, and we want
> to rebuild with the new code in case the old code had a bug causing it to
> create incorrect rules.
>
> I think we should look at how to optimize this to be more intelligent.
>
> We could somehow record a hash of what rule contents was used
> originally. Only rebuild the rules if we see the hash of nwfilter
> rules has changed, or if the libvirtd binary has had code changes.
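
That check could be quite cheap. Just to sketch the idea (this is not libvirt
code: the function names and state file handling are made up, and GLib's
checksum API is only used to keep the example short) - hash the instantiated
rules together with something identifying the libvirtd build, record it per
VM, and only rebuild when the recorded value differs:

  /* Rough sketch of the "skip the rebuild unless something changed" check.
   * Not libvirt code: the function names and state file handling are made
   * up, and GLib's checksum API is used only to keep the example short. */

  #include <glib.h>

  /* Hash the serialized rule set together with something identifying the
   * libvirtd build (e.g. the version string or the binary's mtime). */
  static gchar *
  rules_checksum(const gchar *serialized_rules, const gchar *build_id)
  {
      g_autofree gchar *combined =
          g_strconcat(serialized_rules, "\n", build_id, NULL);
      return g_compute_checksum_for_string(G_CHECKSUM_SHA256, combined, -1);
  }

  /* TRUE if the rules recorded for this VM differ from what we would
   * build now, i.e. a rebuild is actually needed. */
  static gboolean
  rules_need_rebuild(const gchar *statefile,
                     const gchar *serialized_rules,
                     const gchar *build_id)
  {
      g_autofree gchar *current = rules_checksum(serialized_rules, build_id);
      g_autofree gchar *saved = NULL;

      if (!g_file_get_contents(statefile, &saved, NULL, NULL))
          return TRUE;           /* nothing recorded -> rebuild to be safe */

      return g_strcmp0(current, g_strstrip(saved)) != 0;
  }

  /* After a successful (re)build, remember what was applied. */
  static void
  rules_record(const gchar *statefile,
               const gchar *serialized_rules,
               const gchar *build_id)
  {
      g_autofree gchar *current = rules_checksum(serialized_rules, build_id);
      g_file_set_contents(statefile, current, -1, NULL);
  }

The reconnect path would call rules_need_rebuild() before tearing anything
down and rules_record() after a successful rebuild, so a plain daemon restart
with unchanged filters would not touch ebtables/iptables at all.
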
>> Next we can try to reduce fork/preexec time. To estimate its contribution
>> alone, let's bring back the above locks. It turns out most of the time is
>> taken by fork itself and by closing 8k (on my system) file descriptors in
>> preexec. Using vfork gives a 2x boost, and so does dropping the mass close.
>> (I checked the mass close contribution because I don't quite understand the
>> purpose of this step - libvirt typically sets the close-on-exec flag on its
>> descriptors). So these two optimizations alone can result in a restart time
>> of 30s.
>
> I don't like the idea of using vfork(). It is already hard to do stuff between
> fork() and execve() safely as you're restricted to async-signal-safe functions.
> vfork() places an even greater number of restrictions on code that can be
> run. It is especially dangerous in threaded programs as only the calling thread
> is suspended.
>
> Unfortunately we cannot rely on close-on-exec. Depending on OS and/or version,
> the close-on-exec setting may or may not be atomic. eg
>
>   open(...., O_CLOEXEC)
>
> vs
>
>   open(....)
>   fcntl(fd, F_SETFD, FD_CLOEXEC)
>
> the latter has a race we'd hit.
>
> In addition libvirt links to a huge number of 3rd party libraries and I have
> essentially zero confidence in them setting O_CLOEXEC correctly all the time.
>
> IOW, the mass close is inescapable IMHO.
>
> The only thing I could see us doing is to spawn some minimalist helper process
> which then spawns iptables as needed. We'd only need the mass-close once for
> the helper process, which can then just spawn iptables without mass-close,
> or simply set a tiny max files ulimit for this helper process so that it
> only 'mass closes' 20 FDs.
>
> We'd have to write the set of desired iptables rules into some data format,
> send it to the helper to execute and wait for results. Probably not too
> difficult to achieve.
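
To make that a bit more concrete, here is a rough sketch of such a helper
(again not a patch proposal - the one-command-per-line protocol, the crude
tokenizer and the missing error handling are purely illustrative):

  /* Rough sketch of a minimalist rule-runner helper.  libvirtd would pay
   * the mass-close / FD hygiene cost once when spawning this process;
   * the helper itself only ever has a handful of FDs open, so it can
   * fork()+exec() iptables/ebtables cheaply.  The line protocol (one
   * whitespace-separated command per line on stdin, one exit-status line
   * per command on stdout) is invented for this example. */

  #include <stdio.h>
  #include <string.h>
  #include <sys/types.h>
  #include <sys/wait.h>
  #include <unistd.h>

  #define MAX_ARGS 64

  int main(void)
  {
      char line[4096];

      while (fgets(line, sizeof(line), stdin)) {
          char *argv[MAX_ARGS];
          int argc = 0;

          /* Crude whitespace tokenizer - real code would want proper
           * quoting or a structured format. */
          for (char *tok = strtok(line, " \t\n");
               tok && argc < MAX_ARGS - 1;
               tok = strtok(NULL, " \t\n"))
              argv[argc++] = tok;
          argv[argc] = NULL;

          if (argc == 0)
              continue;

          pid_t pid = fork();
          if (pid == 0) {
              /* Child: no mass close needed, the helper was started with
               * only stdin/stdout/stderr open. */
              execvp(argv[0], argv);
              _exit(127);
          }

          int status = -1;
          if (pid > 0)
              waitpid(pid, &status, 0);

          /* Report the result so the caller can decide what to run next. */
          printf("%d\n", pid > 0 && WIFEXITED(status) ?
                         WEXITSTATUS(status) : -1);
          fflush(stdout);
      }
      return 0;
  }

libvirtd would keep the pipe pair to this helper open, write the
ebtables/iptables invocations to it while instantiating filters, and read the
exit statuses back where it currently checks command results. Commands whose
output needs to be parsed would either still run directly or need a richer
protocol.
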
>> Unfortunately combining the above two approaches does not give a speedup
>> that is the product of the two. The reason is that, due to concurrency and
>> the high number of VMs (100), the preexec boost does not play a significant
>> role, and using vfork diminishes concurrency as it freezes all parent
>> threads until the child calls execve. So dropping locks and closes gives a
>> 33s restart time, and adding vfork to this gives a 25s restart time.
>>
>> Another approach is to use the --atomic-file option of ebtables
>> (iptables/ip6tables unfortunately do not have one). The idea is to save the
>> table to a file, edit the file, then commit the table back to the kernel. I
>> hoped this could give a performance boost because we don't need to
>> load/store the kernel network table for a single rule update. In order to
>> isolate the approaches I also dropped all ip/ip6 updates, which cannot be
>> done this way. In this approach we cannot drop the firewall ruleLock because
>> no other VM threads should change tables between save/commit. This approach
>> gives a 25s restart time. But this approach is broken anyway, as we cannot
>> be sure another application doesn't change the network table between
>> save/commit, in which case those changes would be lost.
>>
>> All in all, I think we need to move in a different direction. We can add an
>> API to all the binaries and to firewalld to execute many commands in one
>> run. We can pass the commands as arguments or write them into a file which
>> is then given to the binary.
>
> This sounds like the 'iptables-restore' command, which accepts a batch of
> commands in a file. There's also ip6tables-restore and ebtables-restore.
>
> There's no way to execute these via firewalld though.
>
> It is also tricky when we need to check the output of certain commands
> before making a decision on which other commands to run. It would
> certainly speed things up though to load batches of rules at once.
>
> IMHO the most immediately useful thing would be to work on optimization
> to skip re-creating nwfilter rules on startup, unless we detect there
> are likely changes.
>

Thanx! Quite simple and effective indeed.

Nikolay

--
libvir-list mailing list
libvir-list@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/libvir-list