Top-posting, since my observations are general rather than specific to the problem at hand, and they also cover our ideas for improving things.
Thanks Dmitry for a good thread :-)
I will give short answers to the questions first, and then go into a longer explanation.
Does a single-threaded user app get a huge benefit from more RAM/CPUs? - NO.
So, how is distributed storage performance measured? - By running as many threads (and separate client mounts) as possible, to saturate the network on the servers.
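As a rough illustration (just a sketch; the mount point and job count below are hypothetical and should be adjusted to your setup), aggregate performance is typically measured by running fio with many parallel jobs against the client mount, for example:

    # hypothetical FUSE mount point; raise numjobs until the server network saturates
    fio --name=aggr --directory=/mnt/glustervol --rw=randwrite --bs=4k \
        --size=1g --numjobs=16 --iodepth=16 --ioengine=libaio --group_reporting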
Let's take a longer look at performance:
First of all, when we talk about the performance of local storage vs. network storage vs. distributed storage, multiple things need to be considered:
Local storage (let's say NVMe/SSD): User App -> Kernel (i.e., a syscall) -> access to the drive. (This is one way; the call returns along the same path.)
Network storage (say NFS): User App -> kernel (NFS client, through a syscall) -> network call -> server process (nfsd) -> kernel (syscall on the storage machine) -> access to the drive. (The reverse path also has to be traversed to complete the call.)
Distributed storage (say GlusterFS): User App -> kernel (syscall to FUSE) -> glusterfs client (callback from FUSE) -> network call -> glusterfsd -> kernel (syscall) -> access to the drive (reverse path to complete the call).
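A quick way to feel the extra hops (a sketch; /bricks/brick1 and /mnt/glustervol are hypothetical paths for a brick's local filesystem and the FUSE mount) is to compare per-syscall latency of the same metadata operation on both:

    # -T prints the time spent in each syscall; compare the stat-family calls
    strace -T stat /bricks/brick1/somefile
    strace -T stat /mnt/glustervol/somefile

The stat() on the FUSE mount goes through the client translator stack and the network before the brick even touches the disk, so its latency is visibly higher.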
Historically, the disk and the network were the slowest parts here, so the 'kernel' part was almost non-existent as a bottleneck. Gluster did well with aggregation and gave a linear performance improvement as long as this was true, i.e., as long as your network and disk made up a significant share of the bottleneck in the storage stack. The linear scale-out is still true today with NVMe and faster networks, but the gap between local storage performance and glusterfs performance has widened, mainly because of the extra layers a call now traverses.

What we are observing with 100Gbps networks and NVMe drives is that most of the bottlenecks in the network layer and on the disk are going away, and the bottleneck becomes visible in the way we perform certain operations inside glusterfs itself. Of late, we are noticing that the bottleneck is the number of system calls we make for a single call the user issues. For example, if you enable all the features of gluster, a single open() can translate into tens of calls on the disk (stat(), several getxattr()s, open()), which adds delay. Also, in a process that uses many CPU cores there is a penalty whenever synchronization happens (and being a distributed, multi-threaded, multi-client architecture, glusterfs uses many locks). One way to observe the syscall fan-out is sketched below.
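If you want to see that fan-out on a running volume (a sketch; pgrep is assumed to pick the glusterfsd serving your brick), attach strace to the brick process, perform a single open()/stat() from a client mount, and look at the syscall summary:

    # count syscalls made by the brick process (and its threads); stop with Ctrl-C
    # while this runs, do a single open/stat of a file on the client mount
    strace -f -c -p $(pgrep -o glusterfsd)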
We are working towards a unified caching translator, which would reduce accesses to the disk and therefore reduce the number of system calls we make against it. We are also aware that the network layer is a bottleneck (with XDR formatting and the way we process RPC packets), but taking up network-layer optimizations (and also using RDMA effectively) is a larger task. We are looking for volunteers to pick up this network enhancement work, which would benefit a lot of users.
Now, coming back to the subject: with more CPUs, the same test shows a smaller performance gain because the locks account for a larger share of the time than they do on your laptop. Can you try running the same test while restricting glusterfsd to 4 cores (for example by pinning the brick processes, as sketched below) and see how it compares?
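One way to do that (a sketch; 0-3 is just an example core list, and pgrep is assumed to match only the brick processes of this volume) is to pin the running glusterfsd processes with taskset and then repeat the fio run:

    # restrict each brick process to cores 0-3, then re-run the benchmark
    for pid in $(pgrep glusterfsd); do taskset -cp 0-3 "$pid"; done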
Regards,
Amar
On Fri, Nov 27, 2020 at 11:23 AM Dmitry Antipov <dmantipov@xxxxxxxxx> wrote:
On 11/26/20 8:14 PM, Gionatan Danti wrote:
> So I think you simply are CPU limited. I remember doing some tests with loopback RAM disks and finding that Gluster used 100% CPU (ie: full load on an entire core) when doing 4K random writes. Side
> note: using synchronized (ie: fsync) 4k writes, I only get ~600 IOPs even when running both bricks on the same machine and backing them with RAM disks (in other words, with no network or disk
> bottleneck).
Thanks, it seems you're right. Running a local replica 3 volume on 3x1GB ramdisks, I'm seeing:
top - 08:44:35 up 1 day, 11:51, 1 user, load average: 2.34, 1.94, 1.00
Tasks: 237 total, 2 running, 235 sleeping, 0 stopped, 0 zombie
%Cpu(s): 38.7 us, 29.4 sy, 0.0 ni, 23.6 id, 0.0 wa, 0.4 hi, 7.9 si, 0.0 st
MiB Mem : 15889.8 total, 1085.7 free, 1986.3 used, 12817.8 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 12307.3 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
63651 root 20 0 664124 41676 9600 R 166.7 0.3 0:24.20 fio
63282 root 20 0 1235336 21484 8768 S 120.4 0.1 2:43.73 glusterfsd
63298 root 20 0 1235368 20512 8856 S 120.0 0.1 2:42.43 glusterfsd
63314 root 20 0 1236392 21396 8684 S 119.8 0.1 2:41.94 glusterfsd
So, a 32-core server-class system with a lot of RAM can't perform much faster for an
individual I/O client - it just scales better when there are a lot of clients, right?
Dmitry
________
Community Meeting Calendar:
Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://meet.google.com/cpu-eiue-hvk
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-users