On 09/18/2012 06:03 AM, Andrew Theurer wrote:
> On Sun, 2012-09-16 at 11:55 +0300, Avi Kivity wrote:
>> On 09/14/2012 12:30 AM, Andrew Theurer wrote:
>>
>> > The concern I have is that even though we have gone through changes
>> > to help reduce the candidate vcpus we yield to, we still have a very
>> > poor idea of which vcpu really needs to run.  The result is high cpu
>> > usage in get_pid_task() and still some contention on the double
>> > runqueue lock.  To make this scalable, we either need to
>> > significantly reduce the occurrence of lock-holder preemption, or do
>> > a much better job of knowing which vcpu needs to run (and not
>> > unnecessarily yield to vcpus which do not need to run).
>> >
>> > On reducing the occurrence:  The worst case for lock-holder
>> > preemption is having vcpus of the same VM on the same runqueue.
>> > This guarantees the situation of 1 vcpu running while another [of
>> > the same VM] is not.  To prove the point, I ran the same test, but
>> > with vcpus restricted to a range of host cpus, such that any single
>> > VM's vcpus can never be on the same runqueue.  In this case, all 10
>> > VMs' vcpu-0's are on host cpus 0-4, vcpu-1's are on host cpus 5-9,
>> > and so on.  Here is the result:
>> >
>> > kvm_cpu_spin, and all
>> > yield_to changes, plus
>> > restricted vcpu placement:  8823 +/- 3.20%   much, much better
>> >
>> > On picking a better vcpu to yield to:  I really hesitate to rely on
>> > a paravirt hint [telling us which vcpu is holding a lock], but I am
>> > not sure how else to reduce the candidate vcpus to yield to.  I
>> > suspect we are yielding to way more vcpus than are preempted
>> > lock-holders, and that IMO is just work accomplishing nothing.
>> > Trying to think of a way to further reduce candidate vcpus....
>>
>> I wouldn't say that yielding to the "wrong" vcpu accomplishes nothing.
>> That other vcpu gets work done (unless it is in a pause loop itself)
>> and the yielding vcpu gets put to sleep for a while, so it doesn't
>> spend cycles spinning.  While we haven't fixed the problem, at least
>> the guest is accomplishing work, and meanwhile the real lock holder
>> may get naturally scheduled and clear the lock.
>
> OK, yes, if the other thread gets useful work done, then it is not
> wasteful.  I was thinking of the worst case scenario, where any other
> vcpu would likely spin as well, and the host-side cpu time for
> switching vcpu threads was not all that productive.  Well, I suppose
> it does help eliminate potential lock-holding vcpus; it just doesn't
> seem to be efficient or fast enough.

If we have N-1 vcpus spinwaiting on 1 vcpu, with N:1 overcommit, then
yes, we must iterate over N-1 vcpus until we find Mr. Right.
Eventually its not-a-timeslice will expire and we go through this
again.  If N*t_yield is comparable to the timeslice, we start losing
efficiency.  Because of lock contention, t_yield can scale with the
number of host cpus, so in this worst case we get quadratic behaviour.
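To put rough numbers on that, a toy model (standalone, every constant
in it is invented purely to show the shape; it assumes one PLE exit
scans up to N-1 candidates and that t_yield grows linearly with host
cpu count due to runqueue lock contention):

/* Toy model of the per-spinner scan cost.  All constants are made up;
 * only the scaling matters: scan cost ~ (N-1) * t_yield, with t_yield
 * itself growing with the number of host cpus. */
#include <stdio.h>

int main(void)
{
	const double timeslice_us = 3000.0;      /* assumed not-a-timeslice */
	const double t_yield_per_cpu_us = 2.0;   /* assumed contention cost */

	for (int n = 4; n <= 64; n *= 2) {       /* N vcpus on N host cpus */
		double t_yield_us = t_yield_per_cpu_us * n;
		double scan_us = (n - 1) * t_yield_us;

		printf("N=%2d: one scan ~%6.0f us vs timeslice %4.0f us%s\n",
		       n, scan_us, timeslice_us,
		       scan_us >= timeslice_us ? "  <- losing" : "");
	}
	return 0;
}

The crossover point depends entirely on the made-up constants, but the
shape does not: double the machine and the scan cost quadruples.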
One way out is to increase the not-a-timeslice.  Can we get spinning
vcpus to do that for running vcpus, if they cannot find a
runnable-but-not-running vcpu?  That's not guaranteed to help: if we
boost a running vcpu too much, it will skew how vcpu runtime is
distributed even after the lock is released.

>
>> The main problem with this theory is that the experiments don't seem
>> to bear it out.
>
> Granted, my test case is quite brutal.  It's nothing but
> over-committed VMs which always have some spin lock activity.
> However, we really should try to fix the worst case scenario.

Yes.  And other guests may not scale as well as Linux, so they may
show this behaviour more often.

>
>> So maybe one of the assumptions is wrong - the yielding vcpu gets
>> scheduled early.  That could be the case if the two vcpus are on
>> different runqueues - you could be changing the relative priority of
>> vcpus on the target runqueue, but still remain on top yourself.  Is
>> this possible with the current code?
>>
>> Maybe we should prefer vcpus on the same runqueue as yield_to
>> targets, and only fall back to remote vcpus when we see it didn't
>> help.
>>
>> Let's examine a few cases:
>>
>> 1. spinner on cpu 0, lock holder on cpu 0
>>
>> win!
>>
>> 2. spinner on cpu 0, random vcpu(s) (or normal processes) on cpu 0
>>
>> The spinner gets put to sleep, the random vcpus get to work, and lock
>> contention stays low (no double_rq_lock); by the time the spinner
>> gets scheduled, we might have won.
>>
>> 3. spinner on cpu 0, another spinner on cpu 0
>>
>> Worst case, we'll just spin some more.  Need to detect this case and
>> migrate something in.
>
> Well, we can certainly experiment and see what we get.
>
> IMO, the key to getting this working really well on the large VMs is
> finding the lock-holding vcpu -quickly-.  What I think is happening is
> that we go through a relatively long process to get to that one right
> vcpu.  I guess I need to find a faster way to get there.

pvspinlocks will find the right one, every time.  Otherwise I see no
way to do this.
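Roughly this shape, say.  Everything below is hypothetical (the
hypercall, the owner field, the helper declarations); it is just one
possible form the paravirt hint could take, not an existing ABI:

/* Guest-side sketch: a ticket lock that also publishes the holder's
 * vcpu id, so a waiter that gives up spinning can ask the host for a
 * directed yield to exactly one vcpu instead of the host scanning all
 * N-1 candidates.  Races on owner_vcpu are ignored here; a stale or -1
 * hint just costs one wasted yield. */
#include <stdatomic.h>

#define SPIN_THRESHOLD 1024

struct pvlock {
	atomic_ushort head, tail;    /* classic ticket lock */
	atomic_int owner_vcpu;       /* vcpu id of the holder, -1 if free */
};

void hc_yield_to_vcpu(int vcpu);     /* hypothetical hypercall */
int this_vcpu_id(void);              /* hypothetical helper */

void pv_lock(struct pvlock *lock)
{
	unsigned short me = atomic_fetch_add(&lock->tail, 1);

	while (atomic_load(&lock->head) != me) {
		for (int i = 0; i < SPIN_THRESHOLD; i++)
			if (atomic_load(&lock->head) == me)
				goto got_it;
		/* still contended: tell the host exactly whom to run */
		hc_yield_to_vcpu(atomic_load(&lock->owner_vcpu));
	}
got_it:
	atomic_store(&lock->owner_vcpu, this_vcpu_id());
}

void pv_unlock(struct pvlock *lock)
{
	atomic_store(&lock->owner_vcpu, -1);
	atomic_fetch_add(&lock->head, 1);
}

The hint doesn't have to be perfectly accurate to win; it only has to
beat iterating over every candidate vcpu under the runqueue locks.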