Re: Fw: Benchmarking for vhost polling patch

Razya Ladelsky <RAZYA@xxxxxxxxxx> · Wed, 14 Jan 2015 17:01:05 +0200

"Michael S. Tsirkin" <mst@xxxxxxxxxx> wrote on 12/01/2015 12:36:13 PM:

> From: "Michael S. Tsirkin" <mst@xxxxxxxxxx>
> To: Razya Ladelsky/Haifa/IBM@IBMIL
> Cc: Alex Glikson/Haifa/IBM@IBMIL, Eran Raichstein/Haifa/IBM@IBMIL, 
> Yossi Kuperman1/Haifa/IBM@IBMIL, Joel Nider/Haifa/IBM@IBMIL, 
> abel.gordon@xxxxxxxxx, kvm@xxxxxxxxxxxxxxx, Eyal Moscovici/Haifa/IBM@IBMIL
> Date: 12/01/2015 12:36 PM
> Subject: Re: Fw: Benchmarking for vhost polling patch
> 
> On Sun, Jan 11, 2015 at 02:44:17PM +0200, Razya Ladelsky wrote:
> > > Hi Razya,
> > > Thanks for the update.
> > > So that's reasonable I think, and I think it makes sense
> > > to keep working on this in isolation - it's more
> > > manageable at this size.
> > > 
> > > The big questions in my mind:
> > > - What happens if system is lightly loaded?
> > >   E.g. a ping/pong benchmark. How much extra CPU are
> > >   we wasting?
> > > - We see the best performance on your system is with 10usec worth of
> > >   polling.
> > >   It's OK to be able to tune it for best performance, but
> > >   most people don't have the time or the inclination.
> > >   So what would be the best value for other CPUs?
> > 
> > The trade-off between extra cpu waste and throughput gains depends on
> > the polling timeout value (poll_stop_idle).
> > The best value to choose depends on the workload and on the system
> > hardware and configuration.
> > There is nothing that we can say about this value in advance. The
> > system's manager/administrator should use this optimization with the
> > awareness that polling consumes extra cpu cycles, as documented.
> > 
> > > - Should this be tunable from userspace per vhost instance?
> > >   Why is it only tunable globally?
> > 
> > It should be tunable per vhost thread.
> > We can do it in a subsequent patch.
> 
> So I think whether the patchset is appropriate upstream
> will depend exactly on coming up with a reasonable
> interface for enabling and tuning the functionality.
> 

How about adding a new per-device vhost ioctl that sets
poll_stop_idle (the polling timeout)?
This would be aligned with the QEMU "way" of doing things.
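
As a rough sketch only (nothing like this exists in the patch yet), such an
ioctl might be driven from userspace as shown below. The request name, the
request number 0x70 and the 32-bit argument are placeholders chosen for
illustration; only the 0xAF ioctl type is taken from the existing vhost
ioctls.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* 0xAF is the ioctl type already used by the vhost ioctls (VHOST_VIRTIO). */
#define SKETCH_VHOST_VIRTIO 0xAF

/*
 * HYPOTHETICAL request: the name, the number 0x70 and the argument type
 * are assumptions made for this sketch, not part of any posted patch.
 */
#define SKETCH_VHOST_SET_POLL_STOP_IDLE \
	_IOW(SKETCH_VHOST_VIRTIO, 0x70, uint32_t)

/*
 * Ask the vhost device behind vhost_fd to use the given polling timeout.
 * The unit of the timeout is whatever the final patch would define.
 * Returns 0 on success or a negative errno value on failure.
 */
static int sketch_set_poll_stop_idle(int vhost_fd, uint32_t timeout)
{
	if (ioctl(vhost_fd, SKETCH_VHOST_SET_POLL_STOP_IDLE, &timeout) < 0)
		return -errno;
	return 0;
}
```

QEMU would issue this once per vhost device, next to the calls it already
makes to set up the device, so each device can get its own timeout.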

> I was hopeful some reasonable default value can be
> derived from e.g. cost of the exit.
> If that is not the case, it becomes that much harder
> for users to select good default values.
> 

Our suggestion would be to use the maximum (a large enough) value as the
default, so that vhost is polling 100% of the time.
The polling optimization mainly addresses users who want to maximize their
performance, even at the expense of wasting cpu cycles. The maximum value
will produce the biggest impact on performance.
However, a maximum default value is also useful for users who care more
about normalized throughput (throughput per cpu). Such users, interested in
finer tuning of the polling timeout, need to look for an optimal timeout
value for their system. The maximum value serves as the upper limit of the
range that needs to be searched for such an optimal timeout value.
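
To illustrate the search described above: given measurements taken at
several timeout values between 0 and the maximum, the user would pick the
one with the best throughput/cpu ratio. A minimal sketch, in which the
struct layout and function name are made up for illustration and the
measurements themselves are assumed to come from the user's own benchmark
runs:

```c
#include <stddef.h>

/* One benchmark run at a given poll_stop_idle setting (hypothetical layout). */
struct sketch_sample {
	unsigned int timeout;	/* poll_stop_idle value used for the run */
	double throughput;	/* measured throughput, any consistent unit */
	double cpu;		/* cpu consumed during the run, must be > 0 */
};

/*
 * Return the index of the sample with the highest throughput/cpu ratio,
 * or -1 if there is no usable sample. The first best sample wins on ties.
 */
static int sketch_best_sample(const struct sketch_sample *s, size_t n)
{
	int best = -1;
	double best_ratio = 0.0;

	for (size_t i = 0; i < n; i++) {
		double ratio;

		if (s[i].cpu <= 0.0)
			continue;	/* skip unusable measurements */
		ratio = s[i].throughput / s[i].cpu;
		if (best < 0 || ratio > best_ratio) {
			best = (int)i;
			best_ratio = ratio;
		}
	}
	return best;
}
```

Users who only care about raw throughput would skip the search and stay
with the maximum value.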

> There are some cases where networking stack already
> exposes low-level hardware detail to userspace, e.g.
> tcp polling configuration. If we can't come up with
> a way to abstract hardware, maybe we can at least tie
> it to these existing controls rather than introducing
> new ones?
> 

We've spent time thinking about the possible interfaces that
could be appropriate for such an optimization (including tcp polling).
We think that using an ioctl as the interface to "configure" the virtual
device/vhost, in the same manner that e.g. VHOST_NET_SET_BACKEND is used,
makes a lot of sense, and is consistent with the existing mechanism.

Thanks,
Razya

> 
> > > - How bad is it if you don't pin vhost and vcpu threads?
> > >   Is the scheduler smart enough to pull them apart?
> > > - What happens in overcommit scenarios? Does polling make things
> > >   much worse?
> > >   Clearly polling will work worse if e.g. vhost and vcpu
> > >   share the host cpu. How can we avoid conflicts?
> > > 
> > >   For the last two questions, better cooperation with host scheduler
> > >   will likely help here.
> > >   See e.g.
> > >   http://thread.gmane.org/gmane.linux.kernel/1771791/focus=1772505
> > >   I'm currently looking at pushing something similar upstream,
> > >   if it goes in vhost polling can do something similar.
> > > 
> > > Any data points to shed light on these questions?
> > 
> > I ran a simple apache benchmark, with an over commit scenario, where
> > both the vcpu and vhost share the same core.
> > In some cases (c>4 in my testcases) polling surprisingly produced a
> > better throughput.
> 
> Likely because latency is hurt, so you get better batching?
> 
> > Therefore, it is hard to predict how the polling will impact
> > performance in advance.
> 
> If it's so hard, users will struggle to configure this properly.
> Looks like an argument for us developers to do the hard work,
> and expose simpler controls to users?
> 
> > It is up to whoever is using this optimization to use it wisely.
> > Thanks,
> > Razya 
> > 
> 
