In the PLE handler code, the last_boosted_vcpu (lbv) variable serves as the
reference point from which we start when we enter:

    lbv = kvm->lbv;
    for each vcpu i of kvm
        if i is eligible
            if yield_to(i) is successful
                lbv = i

Currently this variable is per VM, and it is set after we do yield_to(target).
Unfortunately, on a successful yield it may take a little longer than we expect
to come back and set the value (depending on the lag in the rb tree). So when
several PLE handler entries happen before it is set, all of them start from the
same place (and the overall round robin is also slower). The statistical
analysis below also shows that lbv is not very well distributed with the
current approach.

Naturally, the first approach is to update lbv before yield_to, without
bothering about the failure case, to make the round robin fast (this was in
Rik's V4 vcpu_on_spin patch series). But when I did the performance analysis,
in the no-overcommit scenario I saw violent/cascaded directed yields happening,
leading to more CPU wasted in spinning (huge degradation in 1x and improvement
in 3x; I assume this is why it was moved after yield_to in V5 of the
vcpu_on_spin series).

The second approach I tried was:
(1) get rid of the per-KVM lbv variable, and
(2) have everybody who enters the handler start from a random vcpu as the
    reference point (a rough sketch is at the end of this mail).

This gave a good distribution of starting points (and a performance improvement
in the 32-vcpu guest I tested), and IMO it also scales well for larger VMs.

Analysis
=============
Four 32-vcpu guests running, with one of them running kernbench.

PLE handler yield stat is the statistics for the successfully yielded case
(for 32 vcpus).
PLE handler start stat is the frequency of each vcpu index being used as the
starting point (for 32 vcpus).

snapshot1
=============
PLE handler yield stat :
274391  33088  32554  46688  46653  48742  48055  37491
 38839  31799  28974  30303  31466  45936  36208  51580
 32754  53441  28956  30738  37940  37693  26183  40022
 31725  41879  23443  35826  40985  30447  37352  35445

PLE handler start stat :
433590 383318 204835 169981 193508 203954 175960 139373
153835 125245 118532 140092 135732 134903 119349 149467
109871 160404 117140 120554 144715 125099 108527 125051
111416 141385  94815 138387 154710 116270 123130 173795

snapshot2
============
PLE handler yield stat :
1957091  59383  67866  65474 100335  77683  80958  64073
  53783  44620  80131  81058  66493  56677  74222  74974
  42398 132762  48982  70230  78318  65198  54446 104793
  59937  57974  73367  96436  79922  59476  58835  63547

PLE handler start stat :
2555089 611546 461121 346769 435889 452398 407495 314403
 354277 298006 364202 461158 344783 288263 342165 357270
 270887 451660 300020 332120 378403 317848 307969 414282
 351443 328501 352840 426094 375050 330016 347540 371819

So the questions I have in mind are:

1. Do you think randomizing last_boosted_vcpu and getting rid of the per-VM
   variable is better?

2. Can/do we have a mechanism by which we can decide not to yield to a vcpu
   that is doing frequent PLE exits (possibly because it is doing unnecessary
   busy-waits), or to yield_to a better candidate instead?

On a side note: with the pv patches I have tried doing yield_to to a kicked
VCPU in the vcpu_block path, and it is giving some performance improvement.

Please let me know if you have any comments/suggestions.
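
PS: for clarity, here is a minimal userspace sketch of the random
starting-point scan from approach (2) above; it is not the actual
kvm_vcpu_on_spin() code. vcpu_is_eligible() and try_yield_to() are hypothetical
stand-ins for the real eligibility check and yield_to(), stubbed out so the
sketch compiles.

    #include <stdbool.h>
    #include <stdlib.h>

    #define NUM_VCPUS 32

    /* Hypothetical stand-ins for the real checks; trivial stubs here. */
    static bool vcpu_is_eligible(int idx) { (void)idx; return true; }
    static bool try_yield_to(int idx)     { (void)idx; return true; }

    static void ple_handler_random_start(void)
    {
            /*
             * Each handler entry picks its own reference point; no shared
             * per-VM last_boosted_vcpu is read or written.
             */
            int start = rand() % NUM_VCPUS;

            for (int i = 0; i < NUM_VCPUS; i++) {
                    int idx = (start + i) % NUM_VCPUS;

                    if (!vcpu_is_eligible(idx))
                            continue;
                    if (try_yield_to(idx))
                            break;  /* boosted one vcpu, done */
            }
    }

    int main(void)
    {
            ple_handler_random_start();
            return 0;
    }

Because each entry scans from its own random index, concurrent PLE exits no
longer pile up on the same starting vcpu even when the shared variable would
not yet have been updated.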