RE: [PATCH 5/6] Drivers: hv: vmbus: distribute subchannels among all vcpus

KY Srinivasan <kys@xxxxxxxxxxxxx> · Fri, 24 Apr 2015 16:46:52 +0000

> -----Original Message-----
> From: Vitaly Kuznetsov [mailto:vkuznets@xxxxxxxxxx]
> Sent: Friday, April 24, 2015 2:05 AM
> To: Dexuan Cui
> Cc: KY Srinivasan; Haiyang Zhang; devel@xxxxxxxxxxxxxxxxxxxxxx; linux-
> kernel@xxxxxxxxxxxxxxx
> Subject: Re: [PATCH 5/6] Drivers: hv: vmbus: distribute subchannels among
> all vcpus
> 
> Dexuan Cui <decui@xxxxxxxxxxxxx> writes:
> 
> >> -----Original Message-----
> >> From: Vitaly Kuznetsov [mailto:vkuznets@xxxxxxxxxx]
> >> Sent: Tuesday, April 21, 2015 22:28
> >> To: KY Srinivasan
> >> Cc: Haiyang Zhang; devel@xxxxxxxxxxxxxxxxxxxxxx; linux-
> >> kernel@xxxxxxxxxxxxxxx; Dexuan Cui
> >> Subject: [PATCH 5/6] Drivers: hv: vmbus: distribute subchannels among all
> >> vcpus
> >>
> >> Primary channels are distributed evenly across all vcpus we have. When the
> >> host asks us to create subchannels it usually makes us num_cpus-1 offers
> >
> > Hi Vitaly,
> > AFAIK, in the VSP of storvsc, the number of subchannels is
> >  (the_number_of_vcpus - 1) / 4.
> >
> > This means for an 8-vCPU guest, there is only 1 subchannel.
> >
> > Your new algorithm tends to make the lower-numbered vCPUs busier:
> > e.g., in the 8-vCPU case, assuming we have 4 SCSI controllers:
> > vCPU0: scsi0's PrimaryChannel (P)
> > vCPU1: scsi0's SubChannel (S) + scsi1's P
> > vCPU2: scsi1's S + scsi2's P
> > vCPU3: scsi2's S + scsi3's P
> > vCPU4: scsi3's S
> > vCPU5, 6 and 7 are idle.
> >
> > In this special case, the existing algorithm is better. :-)
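
The layout in the example above can be reproduced with a small simulation.
This is only a sketch under assumptions inferred from the example itself,
not code from the patch: primaries are placed round-robin by controller
index, each subchannel goes to the next vCPU after its primary, and the
subchannel count follows the (the_number_of_vcpus - 1) / 4 rule quoted
above. The names storvsc_subchannel_count and assign_channels are
hypothetical.

```python
from collections import defaultdict


def storvsc_subchannel_count(num_vcpus):
    # Assumption taken from the thread: (number of vCPUs - 1) / 4, integer division.
    return (num_vcpus - 1) // 4


def assign_channels(num_vcpus, num_controllers):
    """Return {vcpu: [channel labels]} for the assumed placement rule."""
    load = defaultdict(list)
    subs = storvsc_subchannel_count(num_vcpus)
    for dev in range(num_controllers):
        # Assumed: primary channels are spread round-robin by controller index.
        primary_cpu = dev % num_vcpus
        load[primary_cpu].append(f"scsi{dev}:P")
        # Assumed: each subchannel lands on the next vCPU after the primary.
        for k in range(1, subs + 1):
            load[(primary_cpu + k) % num_vcpus].append(f"scsi{dev}:S")
    # Include idle vCPUs explicitly so the imbalance is visible.
    return {cpu: load.get(cpu, []) for cpu in range(num_vcpus)}


if __name__ == "__main__":
    for cpu, channels in assign_channels(8, 4).items():
        print(f"vCPU{cpu}: {', '.join(channels) or 'idle'}")
```

With 8 vCPUs and 4 controllers this leaves vCPU5, 6 and 7 with no channels,
which is the imbalance described above.
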
> >
> > However, I do like this idea in your patch, that is, making sure a device's
> > primary/sub channels are assigned to different vCPUs.
> 
> Under special circumstances with the current code we can end up with
> all subchannels on the same vCPU as the primary channel, I guess
> :-) This is not something common, but possible.
> 
> >
> > I'm just wondering if we should use an even better (and more complex)
> > algorithm :-)
> 
> The question here is - does sticking to the current vCPU help? If it
> does, I can suggest the following (I think I even mentioned that in my
> PATCH 00): first we try to find a (sub)channel with target_cpu ==
> current_vcpu and only when we fail we do the round robin. I'd like to
> hear K.Y.'s opinion here as he's the original author :-)
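
The suggestion above (prefer a channel whose target_cpu matches the current
vCPU, fall back to round robin otherwise) could look roughly like the
following. This is a minimal sketch, not the driver's code: Channel and
ChannelPicker are hypothetical names, and only target_cpu comes from the
discussion.

```python
from dataclasses import dataclass


@dataclass
class Channel:
    name: str
    target_cpu: int


class ChannelPicker:
    """Pick a local channel if one exists, otherwise rotate through all of them."""

    def __init__(self, channels):
        self.channels = list(channels)
        self._next = 0  # round-robin cursor, used only when no local channel exists

    def pick(self, current_cpu):
        # First pass: a (sub)channel bound to the CPU issuing the request.
        for ch in self.channels:
            if ch.target_cpu == current_cpu:
                return ch
        # Fallback: plain round robin over every channel of the device.
        ch = self.channels[self._next % len(self.channels)]
        self._next += 1
        return ch
```

The round-robin cursor only advances on a miss, so requests from CPUs that
own a channel never disturb the rotation used by the other CPUs.
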

Sorry for the delayed response. Initially I had implemented a scheme that
would pick an outgoing CPU that was closest to the CPU on which the request
came in (to maintain cache locality, especially on NUMA systems). I changed
this algorithm to spread the load more uniformly as we were trying to
improve Linux IOPS on Azure XIO (premium storage). We are currently testing
this code on our Converged Offering (CPS) and I am finding that the perf, as
measured by IOPS, has regressed. I have not yet narrowed down the reason for
this regression, and it may very well be the change in the algorithm for
selecting the outgoing channel. In general, I don't think the logic here
needs to be exact, and locality (being on the same CPU or within the same
NUMA node) is important. Any change to this algorithm will have to be
validated on different MSFT environments (Azure XIO, CPS, etc.).
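
To make the locality preference concrete, here is a rough sketch of a tiered
choice: same CPU first, then same NUMA node, then anything else. It is an
illustration only, not the original implementation; pick_by_locality and the
cpu_to_node map are hypothetical, and the real topology would come from the
kernel.

```python
def pick_by_locality(channel_cpus, cpu_to_node, current_cpu):
    """Return the target CPU of the channel closest to current_cpu."""

    def distance(target_cpu):
        if target_cpu == current_cpu:
            return 0  # same CPU: best cache locality
        if cpu_to_node[target_cpu] == cpu_to_node[current_cpu]:
            return 1  # same NUMA node
        return 2      # remote node

    # min() keeps the first candidate on ties, so the logic is deliberately inexact.
    return min(channel_cpus, key=distance)
```
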

Regards,

K. Y
