Hi,
this is an update on the work done on performance and consistency during the last few weeks. We'll try to build a complete list of all known issues and track them through this email thread. Please let me know of any performance issue not included in this email so that we can build and track the full list.
New improvements
While testing performance on Red Hat products, we have identified a problem in the way eager-locking was working on replicate volumes in some scenarios (virtualization and database workloads were affected). It caused an unnecessary number of finodelk and fxattrop requests, which increased the latency of write operations.
This has already been fixed with patches [1] and [2].
We have also identified some additional settings that provide better performance for database workloads. A patch [3] to update the default database profile with the new settings has been merged.
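For reference, once the updated profile is available, these settings can be applied to a volume as a group. The group name below is an assumption based on the stock group profiles shipped with glusterfs, and 'myvol' is a placeholder, so please verify it matches your installation:

    # Apply the database settings group to a volume (group name assumed).
    gluster volume set myvol group db-workload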
Combining all these changes (the AFR fix and the new settings), pgbench performance has improved by ~300% on bare metal using NVMe, and a random I/O fio test running on a VM has also improved by more than 300%.
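To give an idea of the kind of test used, a random I/O fio job along these lines is representative (the parameters here are illustrative, not the exact job that was run):

    # Illustrative 4k random read/write test with direct I/O on a mounted volume;
    # all values and the mount path are examples.
    fio --name=randrw --ioengine=libaio --direct=1 --rw=randrw \
        --bs=4k --iodepth=32 --numjobs=4 --size=4g --runtime=300 \
        --time_based --group_reporting --directory=/mnt/myvol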
Known issues
We have identified two issues in fuse mounts:
- Because of SELinux on the client machine, a getxattr request is sent by fuse before each write request. Though it adds some latency, currently this request is answered directly by the fuse xlator when SELinux support is not enabled in gluster (the default setting).
- When fopen-keep-cache is enabled (the default setting), kernel fuse sends stat requests before each read. Even with fopen-keep-cache disabled, fuse still sends half of the stat requests. This has been tracked down to the atime update; however, mounting a volume with noatime doesn't solve the issue because kernel fuse doesn't correctly handle the noatime setting (see the mount example after this list).
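For reference, this is roughly how those client-side settings can be requested at mount time. The exact option syntax is an assumption on my side and may differ between glusterfs versions, so please double-check it before relying on it:

    # Illustrative mount disabling fopen-keep-cache and requesting noatime;
    # 'server1' and 'myvol' are placeholder names, and the option syntax may
    # vary by version. As noted above, kernel fuse ignores noatime anyway.
    mount -t glusterfs -o fopen-keep-cache=off,noatime server1:/myvol /mnt/myvol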
Some other issues have been detected:
- Bad performance of write-behind when stats and writes to the same file are mixed. Right now, when a stat is received, all previously cached writes are flushed before processing the new request. The same happens for a read when it overlaps with a previously cached write. This makes write-behind useless in this scenario.
Note: fuse is currently sending stat requests before reads (see previous known issue), making reads almost as problematic as stat requests.
- Self-heal seems to be slow. It's still being investigated, but there are some indications that we have a considerable amount of contention in io-threads. This contention could be the cause of some other performance issues, but we'll need to investigate this further. There is already some work [4] trying to reduce it.
- 'ls' performance is not good in some cases. When the volume has many bricks, 'ls' performance tends to degrade. We are still investigating the cause, but one important factor is that DHT sends readdir(p) requests to all its subvolumes. This means that 'ls' will run at the speed of the slowest brick, so if any brick has an issue or a spike in load, even a transitory one, it will have a bad impact on 'ls' performance. This can be alleviated by enabling the parallel-readdir and readdir-ahead options (see the example after this list).
Note: There have been reports that enabling parallel-readdir causes some entries to apparently disappear after some time (though they are still present on the bricks). I'm not aware of the root cause yet.
- The number of threads on a server is quite high when multiple bricks are present, even if brick-mux is used. There are some efforts [5] trying to reduce this number.
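As a reference for the readdir options mentioned above, they can be enabled per volume like this ('myvol' is a placeholder volume name; given the reports about parallel-readdir, enable it with care):

    # parallel-readdir builds on readdir-ahead, so both are enabled here.
    gluster volume set myvol performance.readdir-ahead on
    gluster volume set myvol performance.parallel-readdir on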
New features
We have recently started the design [6] of a new caching infrastructure that should provide much better performance, especially for small files and metadata-intensive workloads. It should also provide a safe infrastructure to keep cached information consistent on all clients.
This framework will make caching features available, in an easy and safe way, to any xlator that needs them.
The current thinking is that the existing caching xlators (mostly md-cache, io-cache and write-behind) will probably be reworked into a single complete caching xlator, since this makes things easier.
Any feedback or ideas will be highly appreciated.
Xavi