Hi Trond, I am wondering if we should consider raising the default maximum NFS read ahead size given the facts I mentioned in my previous e-mail. Thank you. Hazem On 29/03/2021, 17:07, "Mohamed Abuelfotoh, Hazem" <abuehaze@xxxxxxxxxx> wrote: Hello Team, -We have got multiple customers complaining about NFS read performance degradation after they upgraded to kernel 5.4.* -After doing some deep dive and testing we figured out that the reason behind the regression was patch NFS: Optimise the default readahead size[1] Which has been merged to Linux kernels 5.4.* and above. -Our customers are using AWS EC2 instances as client mounting EFS export (which is AWS managed NFSV4 service), I am sharing the results that we got before & after the upgrade given that the NFS server(EFS) should be able to achieve between 250-300MB/sec which the clients can achieve without patch[1] while getting quarter of this speed around 70MB/sec with the mentioned patch merged as seen below. ########################################################################################## Before the upgrade: # uname -r 4.14.225-168.357.amzn2.x86_64 [root@ip-172-31-28-135 ec2-user]# sync; echo 3 > /proc/sys/vm/drop_caches [root@ip-172-31-28-135 ec2-user]# mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport fs-6700f553.efs.eu-west-1.amazonaws.com:/ efs [root@ip-172-31-28-135 ec2-user]# rsync --progress efs/test . test 8,589,934,592 100% 313.20MB/s 0:00:26 (xfr#1, to-chk=0/1) ########################################################################################## After the upgrade using the same client & server: #uname -r; sync; echo 3 > /proc/sys/vm/drop_caches; ./nfs-readahead show /home/ec2-user/efs/;rsync --progress efs/test . 5.4.0-1.132.64.amzn2.x86_64 /home/ec2-user/efs 0:40 /sys/class/bdi/0:40/read_ahead_kb = 128 test 1,073,741,824 100% 68.61MB/s 0:00:14 (xfr#1, to-chk=0/1) -We are recommending[2] EFS users to use rsize=1048576 as mount option for getting the best read performance from their EFS exports given that EC2 to EFS traffic is residing in the same AWS availability zone hence it has low latency and up to 250-300MB/sec throughput however with the mentioned patch merged the customer can’t achieve this throughput after the kernel upgrade because the default NFS read ahead has been decreased from (15*rsize)=15 MB to 128KB so the clients have to manually raise the manually raise the read_ahead_kb parameter from 128 to 15360 to get the same experience they were getting before the upgrade. -We know that the purpose of the mentioned patch was to decrease OS boot time (for netboot users) also decreasing Application start up times in congested & Low throughput networks as mentioned in [3], however this would also cause regression for high throughput & low latency workload especially sequential read workflows. -After doing further debugging we also found that the maximum read ahead size is constant so there is no Autotuning for this configuration even if the client is filling the read ahead window which means any NFS client specially ones using maximum rsize mount option will have to manually tune their maximum NFS read ahead size after the upgrade which in my opinion is some sort of regression from older kernels behaviour. ######################################################################################### After increasing the maximum NFS read ahead size to 15MB it’s clear that read ahead window is expanded as expected and it will be doubled until it reach 15MB. Mar 29 11:25:18 ip-172-31-17-191 kernel: init_ra_size 256 Mar 29 11:25:18 ip-172-31-17-191 kernel: current ra 256 Mar 29 11:25:18 ip-172-31-17-191 kernel: max ra 3840 Mar 29 11:25:18 ip-172-31-17-191 kernel: current ra 59 Mar 29 11:25:18 ip-172-31-17-191 kernel: max ra 32 Mar 29 11:25:18 ip-172-31-17-191 kernel: current ra 512 Mar 29 11:25:18 ip-172-31-17-191 kernel: max ra 3840 Mar 29 11:25:18 ip-172-31-17-191 kernel: current ra 1024 Mar 29 11:25:18 ip-172-31-17-191 kernel: max ra 3840 Mar 29 11:25:18 ip-172-31-17-191 kernel: current ra 2048 Mar 29 11:25:18 ip-172-31-17-191 kernel: max ra 3840 ######################################################################################### With 128KB as maximum NFS read ahead size, the read ahead window size is increasing until it reach the configured maximum Read ahead (128KB). Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32 Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32 Mar 29 11:35:37 ip-172-31-17-191 kernel: init_ra_size 32 Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 40 Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 32 Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32 Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 40 Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32 Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 64 Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32 Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32 Mar 29 11:35:37 ip-172-31-17-191 kernel: init_ra_size 4 Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 4 Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32 Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 64 Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 64 Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 64 Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32 Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 32 Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32 Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 59 Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32 Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 32 Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32 Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 32 Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32 -In my reproduction I used rsync as clarified above and it is always doing read syscalls requesting 256 KB in each call: 15:47:10.780658 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144) = 262144 <0.023749> 15:47:10.805467 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144) = 262144 <0.023739> 15:47:10.830272 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144) = 262144 <0.023664> 15:47:10.854972 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144) = 262144 <0.023814> 15:47:10.879837 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144) = 262144 <0.023625> 15:47:10.904496 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144) = 262144 <0.023645> 15:47:10.929180 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144) = 262144 <0.024072> 15:47:10.954308 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144) = 262144 <0.023669> -Looking into the readahead source code and I can see that readahead is doing some heuristics to determine if the access pattern is sequential or random then it modify the read ahead window(amount of data it will prefetch) accordingly, read ahead can't also read beyond the requested file size, this theoretically means that having Large NFS Max read ahead size (15MB) shouldn't have much impact on performance even with random I/O pattern or with data set consists of small files, the only major impact of having large NFS read ahead size would be some networking congestion or bootup delay with hosts using congested or low throughput networks as illustrated in https://bugzilla.kernel.org/show_bug.cgi?id=204939 & https://lore.kernel.org/linux-nfs/0b76213e-426b-0654-5b69-02bfead78031@xxxxxxxxx/T/. -With patch https://www.spinics.net/lists/linux-nfs/msg75018.html the packet captures are showing the client either asking for 128KB or 256 KB in the NFS READ calls and it can't reach even the 1MB configured rsize mount option this is because the ondemand_readahead which should be responsible for moving and scaling the read ahead window has an if condition which was a part of https://www.mail-archive.com/linux-kernel@xxxxxxxxxxxxxxx/msg1274743.html, this patch actually modified read ahead to issue the maximum of the user request size(rsync is doing 256KB read requests), and the read-ahead max size(128KB by default), but capped to the max request size on the device side(1MB in our case). The latter is done to avoid reading ahead too much, if the application asks for a huge read. this is why with 128KB as read ahead size and application asking for 256KB we never exceed 256KB because this patch actually intended to do that, it avoids limiting the requested data to the maximum read ahead size but we are still limited by the minimum between amount of data application is reading which is 256KB as sync in rsync strace output & bdi->io_pages(256 pages=1MB) as configured in the rsize mount option. -Output after adding some debugging to the kernel showing the size of each variable in the mentioned "if" condition: [ 238.387788] req_size= 64 ------>256MB rsync read requests [ 238.387790] io pages= 256----->1MB as supported by EFS and as configured in the rsize mount option. [ 238.390487] max_pages before= 32----->128 KB read ahead size which is the default. [ 238.393177] max_pages after= 64---->raised to 256 KB because of changes mentioned in [4] "max_pages = min(req_size, bdi->io_pages);" https://elixir.bootlin.com/linux/latest/source/mm/readahead.c#L435 /* * A minimal readahead algorithm for trivial sequential/random reads. */ static void ondemand_readahead(struct readahead_control *ractl, struct file_ra_state *ra, bool hit_readahead_marker, unsigned long req_size) { struct backing_dev_info *bdi = inode_to_bdi(ractl->mapping->host); unsigned long max_pages = ra->ra_pages; unsigned long add_pages; unsigned long index = readahead_index(ractl); pgoff_t prev_index; /* * If the request exceeds the readahead window, allow the read to * be up to the optimal hardware IO size */ if (req_size > max_pages && bdi->io_pages > max_pages) max_pages = min(req_size, bdi->io_pages); ################################################################################################## -With 128KB as default maximum read ahead size the packet capture from the client side is showing the NFSv4 READ calls showing count in bytes moving between 128KB to 256KB. 73403 29 172.31.17.191 -> 172.31.28.161 NFS 394 V4 Call READ StateID: 0xcec2 Offset: 1072955392 Len: 131072 73404 29 172.31.17.191 -> 172.31.28.161 NFS 394 V4 Call READ StateID: 0xcec2 Offset: 1073086464 Len: 262144 73406 29 172.31.28.161 -> 172.31.17.191 NFS 8699 V4 Reply (Call In 73403)[Unreassembled Packet] 73415 29 172.31.17.191 -> 172.31.28.161 NFS 394 V4 Call READ StateID: 0xcec2 Offset: 1073348608 Len: 131072 73416 29 172.31.28.161 -> 172.31.17.191 NFS 8699 V4 Reply (Call In 73404)[Unreassembled Packet] 73428 29 172.31.17.191 -> 172.31.28.161 NFS 394 V4 Call READ StateID: 0xcec2 Offset: 1073479680 Len: 131072 73429 29 172.31.28.161 -> 172.31.17.191 NFS 8699 V4 Reply (Call In 73415)[Unreassembled Packet] 73438 29 172.31.17.191 -> 172.31.28.161 NFS 394 V4 Call READ StateID: 0xcec2 Offset: 1073610752 Len: 131072 73439 29 172.31.28.161 -> 172.31.17.191 NFS 8699 V4 Reply (Call In 73428)[Unreassembled Packet] -nfsstat is showing 8183 NFSV4 READ calls required to read 1GB file. # nfsstat Client rpc stats: calls retrans authrefrsh 8204 0 8204 Client nfs v4: null read write commit open open_conf 1 0% 8183 99% 0 0% 0 0% 0 0% 0 0% open_noat open_dgrd close setattr fsinfo renew 1 0% 0 0% 1 0% 0 0% 2 0% 0 0% setclntid confirm lock lockt locku access 0 0% 0 0% 0 0% 0 0% 0 0% 1 0% getattr lookup lookup_root remove rename link 4 0% 1 0% 1 0% 0 0% 0 0% 0 0% symlink create pathconf statfs readlink readdir 0 0% 0 0% 1 0% 0 0% 0 0% 0 0% server_caps delegreturn getacl setacl fs_locations rel_lkowner 3 0% 0 0% 0 0% 0 0% 0 0% 0 0% secinfo exchange_id create_ses destroy_ses sequence get_lease_t 0 0% 0 0% 2 0% 1 0% 0 0% 0 0% reclaim_comp layoutget getdevinfo layoutcommit layoutreturn getdevlist 0 0% 1 0% 0 0% 0 0% 0 0% 0 0% (null) 1 0% ############################################################################################### -When using 15MB as maximum read ahead size, the client is sending 1MB NFSV4 read requests hence it’s able to read the same 1GB file in 1024 NFS READ calls #uname -r; mount -t nfs4 -o nfsvers=4.1,rsize=1052672,wsize=1048576,hard,timeo=600,retrans=2,noresvport fs-6700f553.efs.eu-west-1.amazonaws.com:/ efs; ./nfs-readahead show /home/ec2-user/efs/ 5.3.9 /home/ec2-user/efs 0:40 /sys/class/bdi/0:40/read_ahead_kb = 15360 #sync; echo 3 > /proc/sys/vm/drop_caches #rsync --progress efs/test . test 1,073,741,824 100% 260.15MB/s 0:00:03 (xfr#1, to-chk=0/1) [root@ip-172-31-17-42 ec2-user]# nfsstat Client rpc stats: calls retrans authrefrsh 1043 0 1043 Client nfs v4: null read write commit open open_conf 1 0% 1024 98% 0 0% 0 0% 0 0% 0 0% open_noat open_dgrd close setattr fsinfo renew 1 0% 0 0% 1 0% 0 0% 2 0% 0 0% setclntid confirm lock lockt locku access 0 0% 0 0% 0 0% 0 0% 0 0% 1 0% getattr lookup lookup_root remove rename link 2 0% 1 0% 1 0% 0 0% 0 0% 0 0% symlink create pathconf statfs readlink readdir 0 0% 0 0% 1 0% 0 0% 0 0% 0 0% server_caps delegreturn getacl setacl fs_locations rel_lkowner 3 0% 0 0% 0 0% 0 0% 0 0% 0 0% secinfo exchange_id create_ses destroy_ses sequence get_lease_t 0 0% 0 0% 2 0% 1 0% 0 0% 0 0% reclaim_comp layoutget getdevinfo layoutcommit layoutreturn getdevlist 0 0% 1 0% 0 0% 0 0% 0 0% 0 0% (null) 1 0% -The packet capture from the client side is showing NFSv4 READ calls with 1MB as read count in bytes when having 15MB as maximum NFS read ahead size. 2021-03-22 14:25:34.984731 9398 172.31.17.42 → 172.31.28.161 NFS 0.000375 V4 Call READ StateID: 0x3640 Offset: 94371840 Len: 1048576 2021-03-22 14:25:34.984805 9405 172.31.17.42 → 172.31.28.161 NFS 0.000074 V4 Call READ StateID: 0x3640 Offset: 95420416 Len: 1048576 2021-03-22 14:25:34.984902 9416 172.31.17.42 → 172.31.28.161 NFS 0.000097 V4 Call READ StateID: 0x3640 Offset: 96468992 Len: 1048576 2021-03-22 14:25:34.984941 9421 172.31.17.42 → 172.31.28.161 NFS 0.000039 V4 Call READ StateID: 0x3640 Offset: 97517568 Len: 1048576 ############################################################################################### -I think there are 2 options to mitigate this behaviour which I am listing below: A)Raising the default maximum NFS read ahead size because the current default 128KB doesn’t seem to be sufficient for High throughout & low latency workload, I strongly believe that the NFS rsize mount option should be used as variable in deciding the maximum NFS read ahead size which was the case before [1] while now it’s always 128KB regardless the utilized rsize mount option. Also I think clients running in High latency & Low throughout environment shouldn’t use 1MB as rsize in their mount options(i.e they should use smaller rsize) because it may increase their suffering even with low maximum NFS read ahead size. B)Adding some logic to read ahead to have some kind of Autotuning (Similar to TCP Autotuning) where the maximum read ahead size can dynamically increase in case the client/reader is constantly filling up/utilizing the read ahead window size. Links: [1] https://www.spinics.net/lists/linux-nfs/msg75018.html [2] https://docs.aws.amazon.com/efs/latest/ug/mounting-fs-nfs-mount-settings.html [3] https://bugzilla.kernel.org/show_bug.cgi?id=204939 [4] https://www.mail-archive.com/linux-kernel@xxxxxxxxxxxxxxx/msg1274743.html Thank you. Hazem Amazon Web Services EMEA SARL, 38 avenue John F. Kennedy, L-1855 Luxembourg, R.C.S. Luxembourg B186284 Amazon Web Services EMEA SARL, Irish Branch, One Burlington Plaza, Burlington Road, Dublin 4, Ireland, branch registration number 908705