On Fri, Dec 14, 2018 at 12:29:54PM +0800, Jason Wang wrote:
>
> On 2018/12/14 4:12 AM, Michael S. Tsirkin wrote:
> > On Thu, Dec 13, 2018 at 06:10:19PM +0800, Jason Wang wrote:
> > > Hi:
> > >
> > > This series tries to access virtqueue metadata through kernel virtual
> > > address instead of copy_user() friends, since they have too much
> > > overhead from checks, spec barriers or even hardware feature toggling.
> > >
> > > Tests show about a 24% improvement on TX PPS. It should benefit other
> > > cases as well.
> > >
> > > Please review
> >
> > I think the idea of speeding up userspace access is a good one.
> > However I think that moving all checks to the start is way too aggressive.
>
> So did packet sockets and AF_XDP. Anyway, sharing the address space and
> accessing it directly is the fastest way. Performance is the major
> consideration when people choose a backend. Compared to a userspace
> implementation, vhost does not have security advantages at any level. If
> vhost is still slow, people will start to develop backends based on
> e.g. AF_XDP.

Let them - what's wrong with that?

> > Instead, let's batch things up but let's not keep them
> > around forever.
> >
> > Here are some ideas:
> >
> > 1. Disable preemption, process a small number of small packets
> >    directly in an atomic context. This should cut latency down
> >    significantly; the tricky part is to only do it on a light load
> >    and disable this for the streaming case, otherwise it's unfair.
> >    This might fail; if it does, just bounce things out to a thread.
>
> I'm not sure what context you mean here. Is this for the TX path of TUN?
> A fundamental difference is that my series targets extremely heavy load,
> not light load; 100% cpu for vhost is expected.

Interesting. You only shared a TCP RR result though.
What's the performance gain in a heavy load case?

> > 2. Switch to unsafe_put_user/unsafe_get_user,
> >    and batch up multiple accesses.
>
> As I said, it only helps if we can batch accesses to two of the three
> different places (avail, descriptor and used). It won't help for batching
> accesses to a single place like used. I'm not even sure this can be done
> considering the case of the packed virtqueue, where we have a single
> descriptor ring.

So that's one of the reasons packed should be faster. A single access
and you get the descriptor, no messy redirects. Somehow your
benchmarking so far didn't show a gain with vhost and packed though -
do you know what's wrong?

> Batching through the unsafe helpers may not help in this case since it's
> equivalent to the safe ones. And this requires non-trivial refactoring of
> vhost, and such refactoring itself may give us a noticeable impact
> (e.g. it may lead to a regression).
>
> > 3. Allow adding a fixup point manually,
> >    such that multiple independent get_user accesses
> >    can get a single fixup (will allow better compiler
> >    optimizations).
>
> So for metadata access, I don't see how what you suggest here can help
> in the case of a heavy workload.
>
> For data access, this may help, but I've tried batching the data copy to
> reduce SMAP/spec barriers in vhost-net and I didn't see a performance
> improvement.
>
> Thanks

So how about we try to figure out what's actually going on?

Can you drop the barriers and show the same gain?

E.g. vmap does not use huge pages IIRC, so in fact it can be slower
than direct access. It's not a magic faster way.
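For reference, here is a minimal sketch of the kind of vmap()-based mapping
being discussed: pin the userspace pages backing a ring, vmap() them, and
let vhost dereference the metadata through a plain kernel pointer instead of
going through copy_user(). This is an illustration only, not code from the
series; the helper name is made up, error handling is minimal, and the exact
get_user_pages_fast()/vmap() signatures vary between kernel versions.

#include <linux/kernel.h>   /* DIV_ROUND_UP() */
#include <linux/mm.h>       /* get_user_pages_fast(), put_page(), offset_in_page() */
#include <linux/vmalloc.h>  /* vmap(), VM_MAP */
#include <linux/slab.h>     /* kmalloc_array(), kfree() */

/*
 * Illustrative sketch only (not taken from the series): map a ring that
 * lives at userspace address 'uaddr' into the kernel's address space so
 * it can be accessed through a plain pointer.  GUP/vmap signatures
 * differ between kernel versions.
 */
static void *vq_map_ring_sketch(unsigned long uaddr, size_t size,
				struct page ***pagesp, int *npagesp)
{
	int npages = DIV_ROUND_UP(offset_in_page(uaddr) + size, PAGE_SIZE);
	struct page **pages;
	void *vaddr;
	int pinned;

	pages = kmalloc_array(npages, sizeof(*pages), GFP_KERNEL);
	if (!pages)
		return NULL;

	/* Pin the user pages so they can't be swapped out or migrated. */
	pinned = get_user_pages_fast(uaddr & PAGE_MASK, npages,
				     1 /* write */, pages);
	if (pinned != npages)
		goto err;

	/*
	 * Build a contiguous kernel virtual mapping.  vmap() uses small
	 * 4K PTEs rather than huge pages, which is the caveat mentioned
	 * above: this is not automatically faster than tuned uaccess.
	 */
	vaddr = vmap(pages, npages, VM_MAP, PAGE_KERNEL);
	if (!vaddr)
		goto err;

	*pagesp = pages;
	*npagesp = npages;
	/* Ring fields can now be read and written through this pointer. */
	return vaddr + offset_in_page(uaddr);

err:
	while (pinned > 0)
		put_page(pages[--pinned]);
	kfree(pages);
	return NULL;
}

The matching teardown would vunmap() the mapping and put_page() each pinned
page; whether the whole scheme wins over copy_user() then comes down to
exactly the barrier/SMAP accounting questioned above.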
> > >
> > > Jason Wang (3):
> > >   vhost: generalize adding used elem
> > >   vhost: fine grain userspace memory accessors
> > >   vhost: access vq metadata through kernel virtual address
> > >
> > >  drivers/vhost/vhost.c | 281 ++++++++++++++++++++++++++++++++++++++----
> > >  drivers/vhost/vhost.h |  11 ++
> > >  2 files changed, 266 insertions(+), 26 deletions(-)
> > >
> > > --
> > > 2.17.1