Ok, I got that, but based on the facts mentioned in
https://bugzilla.kernel.org/show_bug.cgi?id=204939 it looks like the bad
behaviour was mainly related to using the maximum rsize of 1MB while
readahead was 15MB (per the 15*rsize equation), and in congested,
low-throughput or high-latency networks the client shouldn't be using a
1MB rsize in the first place. My point is that it is reasonable to
calculate the maximum readahead size from the rsize actually in use,
whereas it is now a constant 128KB regardless of the configured rsize.
We have seen multiple customers reporting this regression after
upgrading to kernel 5.4, and I am pretty sure we will see more Linux
clients reporting it as more people move to kernel 5.4 and later.

Thank you.

Hazem

Links
[1] https://bugzilla.kernel.org/show_bug.cgi?id=204939

On 31/03/2021, 15:10, "Trond Myklebust" <trondmy@xxxxxxxxxxxxxxx> wrote:

On Wed, 2021-03-31 at 12:53 +0000, Mohamed Abuelfotoh, Hazem wrote:
> Hi Trond,
>
> I am wondering if we should consider raising the default maximum NFS
> readahead size given the facts I mentioned in my previous e-mail.
>

We can't keep changing the default every time someone tries to measure
a new workload. The change in 5.4 was also measurement based, and was
due to poor performance in cases where rsize/wsize is smaller and
readahead was overshooting.

> Thank you.
>
> Hazem
>
> On 29/03/2021, 17:07, "Mohamed Abuelfotoh, Hazem" <abuehaze@xxxxxxxxxx> wrote:
>
> Hello Team,
>
> -We have got multiple customers complaining about NFS read
> performance degradation after they upgraded to kernel 5.4.*
>
> -After doing some deep-dive work and testing, we figured out that the
> reason behind the regression was the patch "NFS: Optimise the default
> readahead size" [1], which has been merged into Linux kernels 5.4.*
> and above.
> -Our customers are using AWS EC2 instances as clients mounting an EFS
> export (EFS is the AWS managed NFSv4 service). I am sharing the
> results that we got before and after the upgrade, given that the NFS
> server (EFS) should be able to achieve between 250-300MB/sec, which
> the clients can achieve without patch [1], while getting a quarter of
> that speed (around 70MB/sec) with the mentioned patch merged, as seen
> below.
>
> ######################################################################
>
> Before the upgrade:
> # uname -r
> 4.14.225-168.357.amzn2.x86_64
> [root@ip-172-31-28-135 ec2-user]# sync; echo 3 > /proc/sys/vm/drop_caches
> [root@ip-172-31-28-135 ec2-user]# mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport fs-6700f553.efs.eu-west-1.amazonaws.com:/ efs
> [root@ip-172-31-28-135 ec2-user]# rsync --progress efs/test .
> test
> 8,589,934,592 100% 313.20MB/s 0:00:26 (xfr#1, to-chk=0/1)
>
> ######################################################################
>
> After the upgrade, using the same client & server:
> #uname -r; sync; echo 3 > /proc/sys/vm/drop_caches; ./nfs-readahead show /home/ec2-user/efs/; rsync --progress efs/test .
> 5.4.0-1.132.64.amzn2.x86_64
> /home/ec2-user/efs 0:40 /sys/class/bdi/0:40/read_ahead_kb = 128
> test
> 1,073,741,824 100% 68.61MB/s 0:00:14 (xfr#1, to-chk=0/1)
>
> -We recommend [2] that EFS users use rsize=1048576 as a mount
> option for getting the best read performance from their EFS exports,
> given that EC2-to-EFS traffic resides in the same AWS availability
> zone and hence has low latency and up to 250-300MB/sec of throughput.
> With the mentioned patch merged, however, customers can't achieve
> this throughput after the kernel upgrade, because the default NFS
> readahead has been decreased from (15*rsize) = 15MB to 128KB, so
> clients have to manually raise the read_ahead_kb parameter from 128
> to 15360 to get the same experience they were getting before the
> upgrade.
> -We know that the purpose of the mentioned patch was to decrease
> OS boot time (for netboot users) as well as application start-up
> times in congested & low-throughput networks, as mentioned in [3];
> however, it also causes a regression for high-throughput & low-
> latency workloads, especially sequential read workflows.
> -After doing further debugging, we also found that the maximum
> readahead size is constant, so there is no autotuning for this
> configuration even if the client keeps filling the readahead window.
> This means any NFS client, especially one using the maximum rsize
> mount option, will have to manually tune its maximum NFS readahead
> size after the upgrade, which in my opinion is some sort of
> regression from the older kernels' behaviour.
>
> ######################################################################
>
> After increasing the maximum NFS readahead size to 15MB, it's
> clear that the readahead window is expanded as expected, and that it
> is doubled until it reaches 15MB.
> Mar 29 11:25:18 ip-172-31-17-191 kernel: init_ra_size 256
> Mar 29 11:25:18 ip-172-31-17-191 kernel: current ra 256
> Mar 29 11:25:18 ip-172-31-17-191 kernel: max ra 3840
> Mar 29 11:25:18 ip-172-31-17-191 kernel: current ra 59
> Mar 29 11:25:18 ip-172-31-17-191 kernel: max ra 32
> Mar 29 11:25:18 ip-172-31-17-191 kernel: current ra 512
> Mar 29 11:25:18 ip-172-31-17-191 kernel: max ra 3840
> Mar 29 11:25:18 ip-172-31-17-191 kernel: current ra 1024
> Mar 29 11:25:18 ip-172-31-17-191 kernel: max ra 3840
> Mar 29 11:25:18 ip-172-31-17-191 kernel: current ra 2048
> Mar 29 11:25:18 ip-172-31-17-191 kernel: max ra 3840
>
> ######################################################################
>
> With 128KB as the maximum NFS readahead size, the readahead
> window grows until it reaches the configured maximum (128KB).
>
> Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: init_ra_size 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 40
> Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 40
> Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 64
> Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: init_ra_size 4
> Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 4
> Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 64
> Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 64
> Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 64
> Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 59
> Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
>
> -In my reproduction I used rsync, as clarified above, and it is
> always issuing read syscalls requesting 256KB in each call:
>
> 15:47:10.780658 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144) = 262144 <0.023749>
> 15:47:10.805467 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144) = 262144 <0.023739>
> 15:47:10.830272 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144) = 262144 <0.023664>
> 15:47:10.854972 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144) = 262144 <0.023814>
> 15:47:10.879837 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144) = 262144 <0.023625>
> 15:47:10.904496 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144) = 262144 <0.023645>
> 15:47:10.929180 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144) = 262144 <0.024072>
> 15:47:10.954308 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144) = 262144 <0.023669>
>
> -Looking into the readahead source code, I can see that
> readahead does some heuristics to determine whether the access
> pattern is sequential or random, and then modifies the readahead
> window (the amount of data it will prefetch) accordingly; readahead
> also can't read beyond the requested file size. This theoretically
> means that having a large maximum NFS readahead size (15MB) shouldn't
> have much impact on performance even with a random I/O pattern or
> with a data set consisting of small files; the only major impact of a
> large NFS readahead size would be some network congestion or boot-up
> delay on hosts using congested or low-throughput networks, as
> illustrated in https://bugzilla.kernel.org/show_bug.cgi?id=204939 &
> https://lore.kernel.org/linux-nfs/0b76213e-426b-0654-5b69-02bfead78031@xxxxxxxxx/T/.
> -With patch https://www.spinics.net/lists/linux-nfs/msg75018.html,
> the packet captures show the client asking for either 128KB or 256KB
> in the NFS READ calls, and it can't even reach the 1MB configured via
> the rsize mount option. This is because ondemand_readahead, which is
> responsible for moving and scaling the readahead window, has an "if"
> condition which was part of
> https://www.mail-archive.com/linux-kernel@xxxxxxxxxxxxxxx/msg1274743.html.
> That patch modified readahead to issue the maximum of the user
> request size (rsync is doing 256KB read requests) and the readahead
> max size (128KB by default), capped to the max request size on the
> device side (1MB in our case). The latter is done to avoid reading
> ahead too much if the application asks for a huge read.
> This is why, with 128KB as the readahead size and the application
> asking for 256KB, we never exceed 256KB; the patch actually intends
> to do that. It avoids limiting the requested data to the maximum
> readahead size, but we are still limited by the minimum of the amount
> of data the application is reading (256KB, as seen in the rsync
> strace output) and bdi->io_pages (256 pages = 1MB, as configured via
> the rsize mount option).
>
> -Output after adding some debugging to the kernel, showing the
> value of each variable in the mentioned "if" condition:
>
> [ 238.387788] req_size= 64 ------> 256KB rsync read requests
> [ 238.387790] io pages= 256 -----> 1MB, as supported by EFS and as configured via the rsize mount option
> [ 238.390487] max_pages before= 32 -----> 128KB readahead size, which is the default
> [ 238.393177] max_pages after= 64 ----> raised to 256KB because of the changes mentioned in [4]: "max_pages = min(req_size, bdi->io_pages);"
>
> https://elixir.bootlin.com/linux/latest/source/mm/readahead.c#L435
>
> /*
>  * A minimal readahead algorithm for trivial sequential/random reads.
>  */
> static void ondemand_readahead(struct readahead_control *ractl,
>                 struct file_ra_state *ra, bool hit_readahead_marker,
>                 unsigned long req_size)
> {
>         struct backing_dev_info *bdi = inode_to_bdi(ractl->mapping->host);
>         unsigned long max_pages = ra->ra_pages;
>         unsigned long add_pages;
>         unsigned long index = readahead_index(ractl);
>         pgoff_t prev_index;
>
>         /*
>          * If the request exceeds the readahead window, allow the read to
>          * be up to the optimal hardware IO size
>          */
>         if (req_size > max_pages && bdi->io_pages > max_pages)
>                 max_pages = min(req_size, bdi->io_pages);
>
> ######################################################################
>
> -With 128KB as the default maximum readahead size, the packet
> capture from the client side shows NFSv4 READ calls with a byte count
> moving between 128KB and 256KB.
> 73403 29 172.31.17.191 -> 172.31.28.161 NFS 394 V4 Call READ StateID: 0xcec2 Offset: 1072955392 Len: 131072
> 73404 29 172.31.17.191 -> 172.31.28.161 NFS 394 V4 Call READ StateID: 0xcec2 Offset: 1073086464 Len: 262144
> 73406 29 172.31.28.161 -> 172.31.17.191 NFS 8699 V4 Reply (Call In 73403)[Unreassembled Packet]
> 73415 29 172.31.17.191 -> 172.31.28.161 NFS 394 V4 Call READ StateID: 0xcec2 Offset: 1073348608 Len: 131072
> 73416 29 172.31.28.161 -> 172.31.17.191 NFS 8699 V4 Reply (Call In 73404)[Unreassembled Packet]
> 73428 29 172.31.17.191 -> 172.31.28.161 NFS 394 V4 Call READ StateID: 0xcec2 Offset: 1073479680 Len: 131072
> 73429 29 172.31.28.161 -> 172.31.17.191 NFS 8699 V4 Reply (Call In 73415)[Unreassembled Packet]
> 73438 29 172.31.17.191 -> 172.31.28.161 NFS 394 V4 Call READ StateID: 0xcec2 Offset: 1073610752 Len: 131072
> 73439 29 172.31.28.161 -> 172.31.17.191 NFS 8699 V4 Reply (Call In 73428)[Unreassembled Packet]
>
> -nfsstat shows that 8183 NFSv4 READ calls were required to read
> a 1GB file.
> # nfsstat
> Client rpc stats:
> calls      retrans    authrefrsh
> 8204       0          8204
>
> Client nfs v4:
> null         read         write        commit       open         open_conf
> 1      0%    8183  99%    0      0%    0      0%    0      0%    0      0%
> open_noat    open_dgrd    close        setattr      fsinfo       renew
> 1      0%    0      0%    1      0%    0      0%    2      0%    0      0%
> setclntid    confirm      lock         lockt        locku        access
> 0      0%    0      0%    0      0%    0      0%    0      0%    1      0%
> getattr      lookup       lookup_root  remove       rename       link
> 4      0%    1      0%    1      0%    0      0%    0      0%    0      0%
> symlink      create       pathconf     statfs       readlink     readdir
> 0      0%    0      0%    1      0%    0      0%    0      0%    0      0%
> server_caps  delegreturn  getacl       setacl       fs_locations rel_lkowner
> 3      0%    0      0%    0      0%    0      0%    0      0%    0      0%
> secinfo      exchange_id  create_ses   destroy_ses  sequence     get_lease_t
> 0      0%    0      0%    2      0%    1      0%    0      0%    0      0%
> reclaim_comp layoutget    getdevinfo   layoutcommit layoutreturn getdevlist
> 0      0%    1      0%    0      0%    0      0%    0      0%    0      0%
> (null)
> 1      0%
>
> ######################################################################
>
> -When using 15MB as the maximum readahead size, the client sends
> 1MB NFSv4 READ requests and hence is able to read the same 1GB file
> in 1024 NFS READ calls.
>
> #uname -r; mount -t nfs4 -o nfsvers=4.1,rsize=1052672,wsize=1048576,hard,timeo=600,retrans=2,noresvport fs-6700f553.efs.eu-west-1.amazonaws.com:/ efs; ./nfs-readahead show /home/ec2-user/efs/
> 5.3.9
> /home/ec2-user/efs 0:40 /sys/class/bdi/0:40/read_ahead_kb = 15360
> #sync; echo 3 > /proc/sys/vm/drop_caches
> #rsync --progress efs/test .
> test
> 1,073,741,824 100% 260.15MB/s 0:00:03 (xfr#1, to-chk=0/1)
> [root@ip-172-31-17-42 ec2-user]# nfsstat
> Client rpc stats:
> calls      retrans    authrefrsh
> 1043       0          1043
>
> Client nfs v4:
> null         read         write        commit       open         open_conf
> 1      0%    1024  98%    0      0%    0      0%    0      0%    0      0%
> open_noat    open_dgrd    close        setattr      fsinfo       renew
> 1      0%    0      0%    1      0%    0      0%    2      0%    0      0%
> setclntid    confirm      lock         lockt        locku        access
> 0      0%    0      0%    0      0%    0      0%    0      0%    1      0%
> getattr      lookup       lookup_root  remove       rename       link
> 2      0%    1      0%    1      0%    0      0%    0      0%    0      0%
> symlink      create       pathconf     statfs       readlink     readdir
> 0      0%    0      0%    1      0%    0      0%    0      0%    0      0%
> server_caps  delegreturn  getacl       setacl       fs_locations rel_lkowner
> 3      0%    0      0%    0      0%    0      0%    0      0%    0      0%
> secinfo      exchange_id  create_ses   destroy_ses  sequence     get_lease_t
> 0      0%    0      0%    2      0%    1      0%    0      0%    0      0%
> reclaim_comp layoutget    getdevinfo   layoutcommit layoutreturn getdevlist
> 0      0%    1      0%    0      0%    0      0%    0      0%    0      0%
> (null)
> 1      0%
>
> -The packet capture from the client side shows NFSv4 READ calls
> with a 1MB read count when the maximum NFS readahead size is 15MB.
> 2021-03-22 14:25:34.984731 9398 172.31.17.42 → 172.31.28.161 NFS 0.000375 V4 Call READ StateID: 0x3640 Offset: 94371840 Len: 1048576
> 2021-03-22 14:25:34.984805 9405 172.31.17.42 → 172.31.28.161 NFS 0.000074 V4 Call READ StateID: 0x3640 Offset: 95420416 Len: 1048576
> 2021-03-22 14:25:34.984902 9416 172.31.17.42 → 172.31.28.161 NFS 0.000097 V4 Call READ StateID: 0x3640 Offset: 96468992 Len: 1048576
> 2021-03-22 14:25:34.984941 9421 172.31.17.42 → 172.31.28.161 NFS 0.000039 V4 Call READ StateID: 0x3640 Offset: 97517568 Len: 1048576
>
> ######################################################################
>
> -I think there are two options to mitigate this behaviour, which
> I list below:
> A) Raise the default maximum NFS readahead size, because the
> current default of 128KB doesn't seem to be sufficient for high-
> throughput & low-latency workloads. I strongly believe that the NFS
> rsize mount option should be used as a variable in deciding the
> maximum NFS readahead size, which was the case before [1], while now
> it's always 128KB regardless of the rsize mount option in use. Also,
> I think clients running in high-latency & low-throughput environments
> shouldn't use 1MB as the rsize in their mount options (i.e. they
> should use a smaller rsize), because it may increase their suffering
> even with a low maximum NFS readahead size.
> B) Add some logic to readahead to provide some kind of
> autotuning (similar to TCP autotuning), where the maximum readahead
> size can dynamically increase if the client/reader is constantly
> filling up/utilizing the readahead window.
>
> Links:
> [1] https://www.spinics.net/lists/linux-nfs/msg75018.html
> [2] https://docs.aws.amazon.com/efs/latest/ug/mounting-fs-nfs-mount-settings.html
> [3] https://bugzilla.kernel.org/show_bug.cgi?id=204939
> [4] https://www.mail-archive.com/linux-kernel@xxxxxxxxxxxxxxx/msg1274743.html
>
> Thank you.
> Hazem
>
> Amazon Web Services EMEA SARL, 38 avenue John F. Kennedy, L-1855
> Luxembourg, R.C.S. Luxembourg B186284
>
> Amazon Web Services EMEA SARL, Irish Branch, One Burlington Plaza,
> Burlington Road, Dublin 4, Ireland, branch registration number 908705

--
Trond Myklebust
CTO, Hammerspace Inc
4984 El Camino Real, Suite 208
Los Altos, CA 94022

www.hammer.space