KY Srinivasan <kys@xxxxxxxxxxxxx> writes:

>> -----Original Message-----
>> From: Vitaly Kuznetsov [mailto:vkuznets@xxxxxxxxxx]
>> Sent: Friday, April 24, 2015 2:05 AM
>> To: Dexuan Cui
>> Cc: KY Srinivasan; Haiyang Zhang; devel@xxxxxxxxxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx
>> Subject: Re: [PATCH 5/6] Drivers: hv: vmbus: distribute subchannels among all vcpus
>>
>> Dexuan Cui <decui@xxxxxxxxxxxxx> writes:
>>
>> >> -----Original Message-----
>> >> From: Vitaly Kuznetsov [mailto:vkuznets@xxxxxxxxxx]
>> >> Sent: Tuesday, April 21, 2015 22:28
>> >> To: KY Srinivasan
>> >> Cc: Haiyang Zhang; devel@xxxxxxxxxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; Dexuan Cui
>> >> Subject: [PATCH 5/6] Drivers: hv: vmbus: distribute subchannels among all vcpus
>> >>
>> >> Primary channels are distributed evenly across all vcpus we have. When the
>> >> host asks us to create subchannels it usually makes us num_cpus-1 offers
>> >
>> > Hi Vitaly,
>> > AFAIK, in the VSP of storvsc, the number of subchannels is
>> > (the_number_of_vcpus - 1) / 4.
>> >
>> > This means for an 8-vCPU guest, there is only 1 subchannel.
>> >
>> > Your new algorithm tends to make the low-numbered vCPUs busier:
>> > e.g., in the 8-vCPU case, assuming we have 4 SCSI controllers:
>> > vCPU0: scsi0's PrimaryChannel (P)
>> > vCPU1: scsi0's SubChannel (S) + scsi1's P
>> > vCPU2: scsi1's S + scsi2's P
>> > vCPU3: scsi2's S + scsi3's P
>> > vCPU4: scsi3's S
>> > vCPU5, 6 and 7 are idle.
>> >
>> > In this special case, the existing algorithm is better. :-)
>> >
>> > However, I do like this idea in your patch, that is, making sure a device's
>> > primary/sub channels are assigned to different vCPUs.
>>
>> Under special circumstances with the current code we can end up
>> having all subchannels on the same vCPU as the primary channel, I guess
>> :-) This is not common, but it is possible.
>>
>> >
>> > I'm just wondering if we should use an even better (and more complex)
>> > algorithm :-)
>>
>> The question here is: does sticking to the current vCPU help? If it
>> does, I can suggest the following (I think I even mentioned it in my
>> PATCH 00): first we try to find a (sub)channel with target_cpu ==
>> current_vcpu, and only when we fail do we fall back to round robin. I'd
>> like to hear K.Y.'s opinion here as he's the original author :-)
>
> Sorry for the delayed response. Initially I had implemented a scheme that would
> pick an outgoing CPU that was closest to the CPU on which the request came (to maintain
> cache locality, especially on NUMA systems). I changed this algorithm to spread the load
> more uniformly as we were trying to improve Linux IOPS on Azure XIO
> (premium storage). We are currently testing
> this code on our Converged Offering - CPS, and I am finding that the perf as measured by IOPS has regressed.
> I have not narrowed down the reason for this regression, and it may very well be the change in the
> algorithm for selecting the outgoing channel. In general, I don't think the logic here needs to be
> exact, but locality (being on the same CPU or within the same NUMA node) is important. Any change
> to this algorithm will have to be validated on different MSFT
> environments (Azure XIO, CPS, etc.).

Thanks, can you please compare two algorithms here:

1) Simple round robin (the one my patch series implements, but with the
   issues fixed; I'll send v2).

2) Try to find a (sub)channel with a matching vCPU and round-robin when
   we fail (I can actually include it in v2).
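To make (2) concrete, here is a rough standalone sketch of the selection
logic I have in mind (this is not the actual vmbus code; the struct and
helper names below are simplified stand-ins for illustration only):

	#include <stdio.h>

	#define NR_CHANNELS 4	/* primary + subchannels of one device */

	struct channel {
		int target_cpu;	/* vCPU this channel is bound to */
	};

	static struct channel channels[NR_CHANNELS] = {
		{ .target_cpu = 0 },	/* primary */
		{ .target_cpu = 1 },	/* subchannels */
		{ .target_cpu = 2 },
		{ .target_cpu = 3 },
	};

	static unsigned int next_channel;	/* round-robin cursor */

	/* Algorithm 1: plain round robin over primary + subchannels. */
	static struct channel *pick_round_robin(void)
	{
		return &channels[next_channel++ % NR_CHANNELS];
	}

	/*
	 * Algorithm 2: prefer a channel whose target_cpu matches the CPU
	 * the request came in on (cache/NUMA locality); fall back to
	 * round robin when no such channel exists.
	 */
	static struct channel *pick_local_first(int current_cpu)
	{
		int i;

		for (i = 0; i < NR_CHANNELS; i++)
			if (channels[i].target_cpu == current_cpu)
				return &channels[i];

		return pick_round_robin();
	}

	int main(void)
	{
		/* A request arriving on vCPU 2 finds its local channel... */
		printf("cpu 2 -> channel on vcpu %d\n",
		       pick_local_first(2)->target_cpu);
		/* ...while vCPU 7 has no local channel and round-robins. */
		printf("cpu 7 -> channel on vcpu %d\n",
		       pick_local_first(7)->target_cpu);
		return 0;
	}

The locking/RCU details of walking the real subchannel list are omitted
here; the point is only the local-first lookup with a round-robin fallback.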
We can later decide something based on these testing results.

> Regards,
>
> K. Y

--
  Vitaly