Re: Performance improvements

Vijay Bellur <vbellur@xxxxxxxxxx> · Thu, 24 Jan 2019 23:53:12 -0800

Thank you for the detailed update, Xavi! This looks very interesting. 

On Thu, Jan 24, 2019 at 7:50 AM Xavi Hernandez <xhernandez@xxxxxxxxxx> wrote:
Hi all,
I've just updated a patch [1] that implements a new thread pool based on a wait-free queue provided by the userspace-rcu library. The patch also includes an auto-scaling mechanism that keeps only as many threads running as the current workload needs.

This new approach has some advantages:
- It's provided globally inside libglusterfs instead of inside an xlator
  This lets the fuse thread and the epoll threads hand a received request off to another thread sooner, wasting less CPU and reacting faster to other incoming requests.
- Adding jobs to the queue used by the thread pool only requires an atomic operation
  This makes the producer side of the queue really fast, with almost no delay.
- Contention is reduced
  The producer side has negligible contention thanks to the wait-free enqueue operation based on an atomic access. The consumer side requires a mutex, but it is held only for a very short time, and the scaling mechanism makes sure that no more threads than needed contend for the mutex.
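
The producer/consumer split described in the list above can be sketched roughly as follows. This is only an illustration in plain C11 atomics, not the userspace-rcu API and not the code from the patch; every name in it (job, job_queue, queue_push, queue_pop, queue_demo) is hypothetical. Enqueue is a single atomic exchange on the tail pointer, while dequeue runs under a mutex.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>

/* All names here are hypothetical; this is not the userspace-rcu API. */
struct job {
    _Atomic(struct job *) next;
    int id;
};

struct job_queue {
    struct job head;               /* dummy node, never handed out */
    _Atomic(struct job *) tail;    /* last node, or &head when empty */
    pthread_mutex_t lock;          /* taken by consumers only */
};

static void queue_init(struct job_queue *q)
{
    atomic_init(&q->head.next, NULL);
    atomic_init(&q->tail, &q->head);
    pthread_mutex_init(&q->lock, NULL);
}

/* Producer side: one atomic exchange, no lock and no retry loop. */
static void queue_push(struct job_queue *q, struct job *j)
{
    atomic_store(&j->next, NULL);
    struct job *prev = atomic_exchange(&q->tail, j);
    atomic_store(&prev->next, j);  /* link the node behind the old tail */
}

/* Consumer side: a short critical section under the mutex. */
static struct job *queue_pop(struct job_queue *q)
{
    pthread_mutex_lock(&q->lock);
    struct job *node = atomic_load(&q->head.next);
    if (node != NULL) {
        struct job *next = atomic_load(&node->next);
        if (next == NULL) {
            /* Looks like the last node: try to reset the queue to empty. */
            struct job *expected = node;
            atomic_store(&q->head.next, NULL);
            if (!atomic_compare_exchange_strong(&q->tail, &expected, &q->head)) {
                /* A producer already swapped the tail; wait for its link. */
                while ((next = atomic_load(&node->next)) == NULL)
                    sched_yield();
                atomic_store(&q->head.next, next);
            }
        } else {
            atomic_store(&q->head.next, next);
        }
    }
    pthread_mutex_unlock(&q->lock);
    return node;   /* NULL means empty, or a push that is not linked yet */
}

/* Single-threaded walk-through; returns 0 when FIFO order is preserved. */
static int queue_demo(void)
{
    struct job_queue q;
    struct job a = { .id = 1 }, b = { .id = 2 }, c = { .id = 3 };

    queue_init(&q);
    queue_push(&q, &a);
    queue_push(&q, &b);
    if (queue_pop(&q) != &a) return 1;
    queue_push(&q, &c);
    if (queue_pop(&q) != &b) return 2;
    if (queue_pop(&q) != &c) return 3;
    if (queue_pop(&q) != NULL) return 4;
    queue_push(&q, &a);            /* reuse after the queue drained */
    if (queue_pop(&q) != &a) return 5;
    return 0;
}
```

Note that in this sketch queue_pop() may briefly report an empty queue while a push is between its two steps; a real thread pool would need a wake-up or retry path for that window.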

This change disables io-threads, since it replaces part of its functionality. However, there are two things that could still be needed from io-threads:
- Prioritization of fops
  Currently, io-threads assigns a priority to each fop, so that some fops are handled before others.
- Fair distribution of execution slots between clients
  Currently, io-threads processes requests from each client in round-robin order.

These features are not implemented right now. If they are needed, probably the best thing to do would be to keep them inside io-threads, but change its implementation so that it uses the global threads from the thread pool instead of its own threads.
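
To make the two io-threads features concrete, here is a simplified model of a scheduler that combines them. It is only a sketch and does not reflect the actual io-threads code: the number of priority levels, the fixed-size per-client rings, strict priority ordering, and all names (sched, sched_submit, sched_next, sched_demo) are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define N_PRIO    3   /* 0 is the highest level; the count is an assumption */
#define N_CLIENTS 4   /* hypothetical fixed number of clients */
#define QCAP      8   /* hypothetical per-client queue capacity */

struct request { int client; int prio; int id; };

struct ring { struct request items[QCAP]; size_t head, count; };

struct sched {
    struct ring q[N_PRIO][N_CLIENTS]; /* one FIFO per (priority, client) */
    size_t next_client[N_PRIO];       /* round-robin cursor per priority */
};

/* Queue a request in its client's FIFO for its priority level. */
static int sched_submit(struct sched *s, struct request r)
{
    if (r.prio < 0 || r.prio >= N_PRIO || r.client < 0 || r.client >= N_CLIENTS)
        return -1;
    struct ring *q = &s->q[r.prio][r.client];
    if (q->count == QCAP)
        return -1;
    q->items[(q->head + q->count) % QCAP] = r;
    q->count++;
    return 0;
}

/* Pick the next request: highest priority first, then rotate over clients. */
static int sched_next(struct sched *s, struct request *out)
{
    for (int p = 0; p < N_PRIO; p++) {
        for (size_t i = 0; i < N_CLIENTS; i++) {
            size_t c = (s->next_client[p] + i) % N_CLIENTS;
            struct ring *q = &s->q[p][c];
            if (q->count == 0)
                continue;
            *out = q->items[q->head];
            q->head = (q->head + 1) % QCAP;
            q->count--;
            s->next_client[p] = (c + 1) % N_CLIENTS; /* next client goes first */
            return 0;
        }
    }
    return -1; /* nothing queued */
}

/* Client 0 floods priority 1; client 1 and client 2 submit one request each. */
static int sched_demo(void)
{
    static const int expected[] = { 5, 1, 4, 2, 3 };
    struct sched s = { 0 };
    struct request r;

    sched_submit(&s, (struct request){ .client = 0, .prio = 1, .id = 1 });
    sched_submit(&s, (struct request){ .client = 0, .prio = 1, .id = 2 });
    sched_submit(&s, (struct request){ .client = 0, .prio = 1, .id = 3 });
    sched_submit(&s, (struct request){ .client = 1, .prio = 1, .id = 4 });
    sched_submit(&s, (struct request){ .client = 2, .prio = 0, .id = 5 });

    for (size_t i = 0; i < sizeof(expected) / sizeof(expected[0]); i++) {
        if (sched_next(&s, &r) != 0 || r.id != expected[i])
            return 1;
    }
    return sched_next(&s, &r) == -1 ? 0 : 1;
}
```

In sched_demo() the higher-priority request 5 is served first, and request 4 from client 1 is served before client 0's second request even though client 0 queued three requests earlier. Strict priority as modeled here can starve lower levels, so a real implementation would likely need some limit on that.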

These features are indeed useful to have and hence modifying the implementation of io-threads to provide this behavior would be welcome.

These tests have shown that the disk was the limiting factor in most cases, so it's hard to tell whether the change has really improved things. There is only one clear exception: self-heal on a dispersed volume completes 12.7% faster. CPU utilization has also dropped drastically:

Old implementation: 12.30 user, 41.78 sys, 43.16 idle,  0.73 wait
New implementation:  4.91 user,  5.52 sys, 81.60 idle,  5.91 wait
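
As a quick worked check of the figures above, the snippet below adds user and sys time for each run and computes the relative drop. It assumes the values are percentages of CPU time, which the message does not state explicitly; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

struct cpu_sample { double user, sys, idle, wait; };

/* Figures copied from the message; assumed to be percentages of CPU time. */
static const struct cpu_sample old_impl = { 12.30, 41.78, 43.16, 0.73 };
static const struct cpu_sample new_impl = {  4.91,  5.52, 81.60, 5.91 };

/* CPU time spent doing work: user plus sys. */
static double busy(struct cpu_sample s)
{
    return s.user + s.sys;
}

/* Relative reduction from 'before' to 'after', in percent. */
static double relative_drop(double before, double after)
{
    return (before - after) / before * 100.0;
}

static void print_cpu_summary(void)
{
    printf("busy: old %.2f, new %.2f, drop %.1f%%\n",
           busy(old_impl), busy(new_impl),
           relative_drop(busy(old_impl), busy(new_impl)));
}
```

With these numbers, user+sys goes from 54.08 to 10.43, a reduction of roughly 80.7%.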

Now I'm running some more tests on NVMe to see the effects of the change when the disk is not the limiting factor. I'll update once I have more data.

Will look forward to these numbers.

Regards,
Vijay 
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-devel