Top-posting, since my observations are general rather than specific to the problem at hand, and they also cover our ideas for improving things.
Thanks Dmitry for a good thread :-)
I will give short answers to the questions first, and then go into a longer explanation.
Does a single-threaded user app get a huge benefit from more RAM/CPUs? - NO.
So, how is distributed storage performance measured? - By running as many threads (and separate client mounts) as possible, to saturate the network on the servers.
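As a rough illustration (just a sketch; the mount point and job count below are hypothetical and should be adjusted to your setup), aggregate performance is typically measured by running fio with many parallel jobs against the client mount, for example:

    # hypothetical FUSE mount point; raise numjobs until the server network saturates
    fio --name=aggr --directory=/mnt/glustervol --rw=randwrite --bs=4k \
        --size=1g --numjobs=16 --iodepth=16 --ioengine=libaio --group_reporting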
Let's take a longer look at performance:
First of all, when we talk about the performance of local storage vs. network storage vs. distributed storage, multiple things need to be considered:
Local storage (let's say NVMe/SSD): User App -> Kernel (i.e., a syscall) -> access to the drive. (This is one way; the call returns along the same path.)
Network storage (say NFS): User App -> kernel (NFS client, through a syscall) -> network call -> server process (nfsd) -> kernel (syscall on the storage machine) -> access to the drive. (The reverse path also has to be traversed to complete the call.)
Distributed storage (say GlusterFS): User App -> kernel (syscall to FUSE) -> glusterfs client (callback from FUSE) -> network call -> glusterfsd -> kernel (syscall) -> access to the drive (reverse path to complete the call).
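A quick way to feel the extra hops (a sketch; /bricks/brick1 and /mnt/glustervol are hypothetical paths for a brick's local filesystem and the FUSE mount) is to compare per-syscall latency of the same metadata operation on both:

    # -T prints the time spent in each syscall; compare the stat-family calls
    strace -T stat /bricks/brick1/somefile
    strace -T stat /mnt/glustervol/somefile

The stat() on the FUSE mount goes through the client translator stack and the network before the brick even touches the disk, so its latency is visibly higher.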
Historically, the disk and the network were the slowest parts here, so the 'kernel' part was almost non-existent as a bottleneck. Gluster did well with aggregation and gave a linear performance improvement as long as this was true, i.e., as long as your network and disk made up a significant share of the bottleneck in the storage stack. The linear scale-out is still true today with NVMe and faster networks, but the gap between local storage performance and glusterfs performance has widened, mainly because of the extra layers a call now traverses.

What we are observing with 100Gbps networks and NVMe drives is that most of the bottlenecks in the network layer and on the disk are going away, and the bottleneck becomes visible in the way we perform certain operations inside glusterfs itself. Of late, we are noticing that the bottleneck is the number of system calls we make for a single call the user issues. For example, if you enable all the features of gluster, a single open() can translate into tens of calls on the disk (stat(), several getxattr()s, open()), which adds delay. Also, in a process that uses many CPU cores there is a penalty whenever synchronization happens (and being a distributed, multi-threaded, multi-client architecture, glusterfs uses many locks). One way to observe the syscall fan-out is sketched below.
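If you want to see that fan-out on a running volume (a sketch; pgrep is assumed to pick the glusterfsd serving your brick), attach strace to the brick process, perform a single open()/stat() from a client mount, and look at the syscall summary:

    # count syscalls made by the brick process (and its threads); stop with Ctrl-C
    # while this runs, do a single open/stat of a file on the client mount
    strace -f -c -p $(pgrep -o glusterfsd)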
We are working towards a unified caching translator, which would reduce accesses to the disk and therefore reduce the number of system calls we make against it. We are also aware that the network layer is a bottleneck (with XDR formatting and the way we process RPC packets), but taking up network-layer optimizations (and also using RDMA effectively) is a larger task. We are looking for volunteers to pick up this network enhancement work, which would benefit a lot of users.
Now, coming back to the subject: with more CPUs, the same test shows a smaller performance gain because the locks account for a larger share of the time than they do on your laptop. Can you try running the same test while restricting glusterfsd to 4 cores (for example by pinning the brick processes, as sketched below) and see how it compares?
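One way to do that (a sketch; 0-3 is just an example core list, and pgrep is assumed to match only the brick processes of this volume) is to pin the running glusterfsd processes with taskset and then repeat the fio run:

    # restrict each brick process to cores 0-3, then re-run the benchmark
    for pid in $(pgrep glusterfsd); do taskset -cp 0-3 "$pid"; done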
Regards,
Amar
On Fri, Nov 27, 2020 at 11:23 AM Dmitry Antipov <dmantipov@xxxxxxxxx> wrote:
On 11/26/20 8:14 PM, Gionatan Danti wrote:
> So I think you simply are CPU limited. I remember doing some tests with loopback RAM disks and finding that Gluster used 100% CPU (ie: full load on an entire core) when doing 4K random writes. Side
> note: using synchronized (ie: fsync) 4k writes, I only get ~600 IOPs even when running both bricks on the same machine and backing them with RAM disks (in other words, with no network or disk
> bottleneck).
Thanks, it seems you're right. Running a local replica 3 volume on 3x1GB ramdisks, I'm seeing:
top - 08:44:35 up 1 day, 11:51, 1 user, load average: 2.34, 1.94, 1.00
Tasks: 237 total, 2 running, 235 sleeping, 0 stopped, 0 zombie
%Cpu(s): 38.7 us, 29.4 sy, 0.0 ni, 23.6 id, 0.0 wa, 0.4 hi, 7.9 si, 0.0 st
MiB Mem : 15889.8 total, 1085.7 free, 1986.3 used, 12817.8 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 12307.3 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
63651 root 20 0 664124 41676 9600 R 166.7 0.3 0:24.20 fio
63282 root 20 0 1235336 21484 8768 S 120.4 0.1 2:43.73 glusterfsd
63298 root 20 0 1235368 20512 8856 S 120.0 0.1 2:42.43 glusterfsd
63314 root 20 0 1236392 21396 8684 S 119.8 0.1 2:41.94 glusterfsd
So, a 32-core server-class system with a lot of RAM can't perform much faster for an
individual I/O client - it just scales better when there are a lot of clients, right?
Dmitry
________
Community Meeting Calendar:
Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://meet.google.com/cpu-eiue-hvk
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-users