On Wed, 2021-03-31 at 12:53 +0000, Mohamed Abuelfotoh, Hazem wrote:
> Hi Trond,
>
> I am wondering if we should consider raising the default maximum NFS readahead size given the facts I mentioned in my previous e-mail.
>

We can't keep changing the default every time someone tries to measure a new workload. The change in 5.4 was also measurement based, and was due to poor performance in cases where rsize/wsize is smaller and readahead was overshooting.

> Thank you.
>
> Hazem
>
> On 29/03/2021, 17:07, "Mohamed Abuelfotoh, Hazem" <abuehaze@xxxxxxxxxx> wrote:
>
> Hello Team,
>
> -We have got multiple customers complaining about NFS read performance degradation after they upgraded to kernel 5.4.*.
>
> -After doing some deep dive and testing we figured out that the reason behind the regression was the patch "NFS: Optimise the default readahead size" [1], which has been merged into Linux kernels 5.4.* and above.
>
> -Our customers are using AWS EC2 instances as clients mounting an EFS export (EFS is the AWS-managed NFSv4 service). I am sharing the results we got before & after the upgrade. The NFS server (EFS) can sustain 250-300 MB/sec, which the clients achieve without patch [1], while they get roughly a quarter of that speed (around 70 MB/sec) with the mentioned patch merged, as seen below.
>
> ######################################################################
>
> Before the upgrade:
> # uname -r
> 4.14.225-168.357.amzn2.x86_64
> [root@ip-172-31-28-135 ec2-user]# sync; echo 3 > /proc/sys/vm/drop_caches
> [root@ip-172-31-28-135 ec2-user]# mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport fs-6700f553.efs.eu-west-1.amazonaws.com:/ efs
> [root@ip-172-31-28-135 ec2-user]# rsync --progress efs/test .
> test
>   8,589,934,592 100%  313.20MB/s    0:00:26 (xfr#1, to-chk=0/1)
>
> ######################################################################
>
> After the upgrade, using the same client & server:
> #uname -r; sync; echo 3 > /proc/sys/vm/drop_caches; ./nfs-readahead show /home/ec2-user/efs/; rsync --progress efs/test .
> 5.4.0-1.132.64.amzn2.x86_64
> /home/ec2-user/efs 0:40 /sys/class/bdi/0:40/read_ahead_kb = 128
> test
>   1,073,741,824 100%   68.61MB/s    0:00:14 (xfr#1, to-chk=0/1)
>
> -We recommend [2] that EFS users mount with rsize=1048576 to get the best read performance from their EFS exports, given that EC2-to-EFS traffic stays within the same AWS availability zone and therefore has low latency and up to 250-300 MB/sec of throughput. With the mentioned patch merged, however, customers can't reach that throughput after the kernel upgrade, because the default NFS readahead has been decreased from (15 * rsize) = 15 MB to 128 KB, so clients have to manually raise the read_ahead_kb parameter from 128 to 15360 to get the same experience they had before the upgrade (a minimal sketch of doing this per mount is included below).
>
> -We know that the purpose of the mentioned patch was to decrease OS boot time (for netboot users) and application start-up times on congested & low-throughput networks, as mentioned in [3]; however, it also causes a regression for high-throughput & low-latency workloads, especially sequential read workflows.
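>
> -For reference, here is a minimal sketch (in C, most error handling omitted) of how a client could raise read_ahead_kb for a single NFS mount. It assumes the BDI is named after the mount's anonymous device number, as in the 0:40 example above:
>
> #include <stdio.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <sys/sysmacros.h>
>
> int main(int argc, char **argv)
> {
>         struct stat st;
>         char path[128];
>         FILE *f;
>
>         if (argc != 3) {
>                 fprintf(stderr, "usage: %s <mountpoint> <read_ahead_kb>\n", argv[0]);
>                 return 1;
>         }
>         if (stat(argv[1], &st)) {
>                 perror("stat");
>                 return 1;
>         }
>         /* e.g. /sys/class/bdi/0:40/read_ahead_kb for the EFS mount above */
>         snprintf(path, sizeof(path), "/sys/class/bdi/%u:%u/read_ahead_kb",
>                  major(st.st_dev), minor(st.st_dev));
>         f = fopen(path, "w");
>         if (!f) {
>                 perror(path);
>                 return 1;
>         }
>         /* writing 15360 restores the old 15 MB readahead window */
>         fprintf(f, "%s\n", argv[2]);
>         fclose(f);
>         printf("%s = %s\n", path, argv[2]);
>         return 0;
> }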
> -After doing further debugging we also found that the maximum readahead size is constant, i.e. there is no autotuning of this setting even when the client keeps filling the readahead window. This means that any NFS client, especially one using the maximum rsize mount option, will have to tune its maximum NFS readahead size manually after the upgrade, which in my opinion is a regression from older kernels' behaviour.
>
> ######################################################################
>
> After increasing the maximum NFS readahead size to 15 MB, it is clear that the readahead window expands as expected and keeps doubling until it reaches 15 MB (3840 pages):
>
> Mar 29 11:25:18 ip-172-31-17-191 kernel: init_ra_size 256
> Mar 29 11:25:18 ip-172-31-17-191 kernel: current ra 256
> Mar 29 11:25:18 ip-172-31-17-191 kernel: max ra 3840
> Mar 29 11:25:18 ip-172-31-17-191 kernel: current ra 59
> Mar 29 11:25:18 ip-172-31-17-191 kernel: max ra 32
> Mar 29 11:25:18 ip-172-31-17-191 kernel: current ra 512
> Mar 29 11:25:18 ip-172-31-17-191 kernel: max ra 3840
> Mar 29 11:25:18 ip-172-31-17-191 kernel: current ra 1024
> Mar 29 11:25:18 ip-172-31-17-191 kernel: max ra 3840
> Mar 29 11:25:18 ip-172-31-17-191 kernel: current ra 2048
> Mar 29 11:25:18 ip-172-31-17-191 kernel: max ra 3840
>
> ######################################################################
>
> With 128 KB as the maximum NFS readahead size, the readahead window only grows until it reaches the configured maximum (128 KB = 32 pages):
>
> Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: init_ra_size 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 40
> Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 40
> Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 64
> Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: init_ra_size 4
> Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 4
> Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 64
> Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 64
> Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 64
> Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 59
> Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
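>
> -The doubling behaviour above can be reproduced with a small userspace model of the window-growth heuristic. This is only a sketch loosely modelled on get_next_ra_size() in mm/readahead.c (fast growth for small windows, then doubling, capped at the maximum), not the exact kernel code; sizes are in pages (4 KB) and the initial sizes are taken from the init_ra_size lines above:
>
> #include <stdio.h>
>
> static unsigned long next_ra_size(unsigned long cur, unsigned long max)
> {
>         if (cur < max / 16)
>                 return 4 * cur;
>         if (cur <= max / 2)
>                 return 2 * cur;
>         return max;
> }
>
> static void show_growth(unsigned long init, unsigned long max)
> {
>         unsigned long ra = init;
>
>         printf("max ra %lu:", max);
>         while (ra < max) {
>                 printf(" %lu", ra);
>                 ra = next_ra_size(ra, max);
>         }
>         printf(" %lu\n", ra);
> }
>
> int main(void)
> {
>         show_growth(256, 3840);  /* 15 MB max: prints 256 512 1024 2048 3840 */
>         show_growth(32, 32);     /* 128 KB max: window stays pinned at 32 pages */
>         return 0;
> }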
>
> -In my reproduction I used rsync, as shown above, and it always issues read() syscalls requesting 256 KB per call:
>
> 15:47:10.780658 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144) = 262144 <0.023749>
> 15:47:10.805467 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144) = 262144 <0.023739>
> 15:47:10.830272 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144) = 262144 <0.023664>
> 15:47:10.854972 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144) = 262144 <0.023814>
> 15:47:10.879837 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144) = 262144 <0.023625>
> 15:47:10.904496 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144) = 262144 <0.023645>
> 15:47:10.929180 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144) = 262144 <0.024072>
> 15:47:10.954308 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144) = 262144 <0.023669>
>
> -Looking into the readahead source code, I can see that readahead applies heuristics to determine whether the access pattern is sequential or random and adjusts the readahead window (the amount of data it will prefetch) accordingly; readahead also cannot read beyond the end of the requested file. In theory this means that a large maximum NFS readahead size (15 MB) shouldn't have much impact on performance even with a random I/O pattern or a data set consisting of small files; the main downside of a large NFS readahead size would be some network congestion or boot-up delay on hosts using congested or low-throughput networks, as illustrated in https://bugzilla.kernel.org/show_bug.cgi?id=204939 and https://lore.kernel.org/linux-nfs/0b76213e-426b-0654-5b69-02bfead78031@xxxxxxxxx/T/.
>
> -With patch https://www.spinics.net/lists/linux-nfs/msg75018.html applied, the packet captures show the client asking for either 128 KB or 256 KB in its NFS READ calls; it never even reaches the 1 MB configured via the rsize mount option. This is because ondemand_readahead(), which is responsible for moving and scaling the readahead window, has an "if" condition that was introduced by https://www.mail-archive.com/linux-kernel@xxxxxxxxxxxxxxx/msg1274743.html. That patch changed readahead to issue the maximum of the user request size (rsync is doing 256 KB read requests) and the readahead max size (128 KB by default), capped to the maximum request size on the device side (1 MB in our case); the cap exists to avoid reading ahead too much when the application asks for a huge read. This is why, with a 128 KB readahead size and an application asking for 256 KB, we never exceed 256 KB: that patch intentionally avoids limiting the request to the maximum readahead size, but we are still limited by the minimum of the amount of data the application is reading (256 KB, as seen in the rsync strace output) and bdi->io_pages (256 pages = 1 MB, as configured via the rsize mount option).
>
> -Output after adding some debugging to the kernel, showing the size of each variable in the mentioned "if" condition:
>
> [ 238.387788] req_size= 64 ------> 256 KB rsync read requests
> [ 238.387790] io pages= 256 -----> 1 MB, as supported by EFS and as configured in the rsize mount option
> [ 238.390487] max_pages before= 32 -----> 128 KB readahead size, which is the default
> [ 238.393177] max_pages after= 64 ----> raised to 256 KB because of the change mentioned in [4]: "max_pages = min(req_size, bdi->io_pages);"
>
> https://elixir.bootlin.com/linux/latest/source/mm/readahead.c#L435
>
> /*
>  * A minimal readahead algorithm for trivial sequential/random reads.
>  */
> static void ondemand_readahead(struct readahead_control *ractl,
>                 struct file_ra_state *ra, bool hit_readahead_marker,
>                 unsigned long req_size)
> {
>         struct backing_dev_info *bdi = inode_to_bdi(ractl->mapping->host);
>         unsigned long max_pages = ra->ra_pages;
>         unsigned long add_pages;
>         unsigned long index = readahead_index(ractl);
>         pgoff_t prev_index;
>
>         /*
>          * If the request exceeds the readahead window, allow the read to
>          * be up to the optimal hardware IO size
>          */
>         if (req_size > max_pages && bdi->io_pages > max_pages)
>                 max_pages = min(req_size, bdi->io_pages);
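>
> -The capping can be reproduced with a trivial userspace sketch that simply mirrors the quoted condition, using the values from the debug output above (sizes in 4 KB pages). It shows why the 128 KB default caps each readahead batch at 64 pages (256 KB), while a 15 MB window is left untouched and leaves room for full 1 MB READs:
>
> #include <stdio.h>
>
> static unsigned long cap_max_pages(unsigned long ra_pages,  /* ra->ra_pages */
>                                    unsigned long req_size,  /* application read */
>                                    unsigned long io_pages)  /* bdi->io_pages */
> {
>         unsigned long max_pages = ra_pages;
>
>         /* same condition as in ondemand_readahead() above */
>         if (req_size > max_pages && io_pages > max_pages)
>                 max_pages = req_size < io_pages ? req_size : io_pages;
>
>         return max_pages;
> }
>
> int main(void)
> {
>         /* 128 KB default: ra_pages=32, rsync req_size=64, rsize gives io_pages=256 */
>         unsigned long def = cap_max_pages(32, 64, 256);
>         /* 15 MB readahead: ra_pages=3840, the condition is false, window unchanged */
>         unsigned long big = cap_max_pages(3840, 64, 256);
>
>         printf("128 KB default: max_pages = %lu pages (%lu KB)\n", def, def * 4);
>         printf("15 MB window:   max_pages = %lu pages (%lu KB)\n", big, big * 4);
>         return 0;
> }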
>
> ######################################################################
>
> -With 128 KB as the default maximum readahead size, the packet capture from the client side shows NFSv4 READ calls whose count in bytes alternates between 128 KB and 256 KB:
>
> 73403  29  172.31.17.191 -> 172.31.28.161  NFS  394   V4 Call READ StateID: 0xcec2 Offset: 1072955392 Len: 131072
> 73404  29  172.31.17.191 -> 172.31.28.161  NFS  394   V4 Call READ StateID: 0xcec2 Offset: 1073086464 Len: 262144
> 73406  29  172.31.28.161 -> 172.31.17.191  NFS  8699  V4 Reply (Call In 73403)[Unreassembled Packet]
> 73415  29  172.31.17.191 -> 172.31.28.161  NFS  394   V4 Call READ StateID: 0xcec2 Offset: 1073348608 Len: 131072
> 73416  29  172.31.28.161 -> 172.31.17.191  NFS  8699  V4 Reply (Call In 73404)[Unreassembled Packet]
> 73428  29  172.31.17.191 -> 172.31.28.161  NFS  394   V4 Call READ StateID: 0xcec2 Offset: 1073479680 Len: 131072
> 73429  29  172.31.28.161 -> 172.31.17.191  NFS  8699  V4 Reply (Call In 73415)[Unreassembled Packet]
> 73438  29  172.31.17.191 -> 172.31.28.161  NFS  394   V4 Call READ StateID: 0xcec2 Offset: 1073610752 Len: 131072
> 73439  29  172.31.28.161 -> 172.31.17.191  NFS  8699  V4 Reply (Call In 73428)[Unreassembled Packet]
>
> -nfsstat shows that 8183 NFSv4 READ calls were needed to read the 1 GB file:
>
> # nfsstat
> Client rpc stats:
> calls      retrans    authrefrsh
> 8204       0          8204
>
> Client nfs v4:
> null         read         write        commit       open         open_conf
> 1         0% 8183     99% 0         0% 0         0% 0         0% 0         0%
> open_noat    open_dgrd    close        setattr      fsinfo       renew
> 1         0% 0         0% 1         0% 0         0% 2         0% 0         0%
> setclntid    confirm      lock         lockt        locku        access
> 0         0% 0         0% 0         0% 0         0% 0         0% 1         0%
> getattr      lookup       lookup_root  remove       rename       link
> 4         0% 1         0% 1         0% 0         0% 0         0% 0         0%
> symlink      create       pathconf     statfs       readlink     readdir
> 0         0% 0         0% 1         0% 0         0% 0         0% 0         0%
> server_caps  delegreturn  getacl       setacl       fs_locations rel_lkowner
> 3         0% 0         0% 0         0% 0         0% 0         0% 0         0%
> secinfo      exchange_id  create_ses   destroy_ses  sequence     get_lease_t
> 0         0% 0         0% 2         0% 1         0% 0         0% 0         0%
> reclaim_comp layoutget    getdevinfo   layoutcommit layoutreturn getdevlist
> 0         0% 1         0% 0         0% 0         0% 0         0% 0         0%
> (null)
> 1         0%
>
> ######################################################################
>
> -When using 15 MB as the maximum readahead size, the client sends 1 MB NFSv4 READ requests and is therefore able to read the same 1 GB file in 1024 NFS READ calls:
>
> #uname -r; mount -t nfs4 -o nfsvers=4.1,rsize=1052672,wsize=1048576,hard,timeo=600,retrans=2,noresvport fs-6700f553.efs.eu-west-1.amazonaws.com:/ efs; ./nfs-readahead show /home/ec2-user/efs/
> 5.3.9
> /home/ec2-user/efs 0:40 /sys/class/bdi/0:40/read_ahead_kb = 15360
> #sync; echo 3 > /proc/sys/vm/drop_caches
> #rsync --progress efs/test .
> test
>   1,073,741,824 100%  260.15MB/s    0:00:03 (xfr#1, to-chk=0/1)
> [root@ip-172-31-17-42 ec2-user]# nfsstat
> Client rpc stats:
> calls      retrans    authrefrsh
> 1043       0          1043
>
> Client nfs v4:
> null         read         write        commit       open         open_conf
> 1         0% 1024     98% 0         0% 0         0% 0         0% 0         0%
> open_noat    open_dgrd    close        setattr      fsinfo       renew
> 1         0% 0         0% 1         0% 0         0% 2         0% 0         0%
> setclntid    confirm      lock         lockt        locku        access
> 0         0% 0         0% 0         0% 0         0% 0         0% 1         0%
> getattr      lookup       lookup_root  remove       rename       link
> 2         0% 1         0% 1         0% 0         0% 0         0% 0         0%
> symlink      create       pathconf     statfs       readlink     readdir
> 0         0% 0         0% 1         0% 0         0% 0         0% 0         0%
> server_caps  delegreturn  getacl       setacl       fs_locations rel_lkowner
> 3         0% 0         0% 0         0% 0         0% 0         0% 0         0%
> secinfo      exchange_id  create_ses   destroy_ses  sequence     get_lease_t
> 0         0% 0         0% 2         0% 1         0% 0         0% 0         0%
> reclaim_comp layoutget    getdevinfo   layoutcommit layoutreturn getdevlist
> 0         0% 1         0% 0         0% 0         0% 0         0% 0         0%
> (null)
> 1         0%
>
> -The packet capture from the client side shows NFSv4 READ calls with a 1 MB read count when the maximum NFS readahead size is 15 MB:
>
> 2021-03-22 14:25:34.984731 9398 172.31.17.42 → 172.31.28.161 NFS 0.000375 V4 Call READ StateID: 0x3640 Offset: 94371840 Len: 1048576
> 2021-03-22 14:25:34.984805 9405 172.31.17.42 → 172.31.28.161 NFS 0.000074 V4 Call READ StateID: 0x3640 Offset: 95420416 Len: 1048576
> 2021-03-22 14:25:34.984902 9416 172.31.17.42 → 172.31.28.161 NFS 0.000097 V4 Call READ StateID: 0x3640 Offset: 96468992 Len: 1048576
> 2021-03-22 14:25:34.984941 9421 172.31.17.42 → 172.31.28.161 NFS 0.000039 V4 Call READ StateID: 0x3640 Offset: 97517568 Len: 1048576
>
> ######################################################################
>
> -I think there are two options to mitigate this behaviour:
>
> A) Raise the default maximum NFS readahead size, because the current 128 KB default doesn't seem sufficient for high-throughput & low-latency workloads. I strongly believe the NFS rsize mount option should be a factor in deciding the maximum NFS readahead size, as was the case before [1]; now it is always 128 KB regardless of the rsize mount option used (a rough sketch of the idea follows after these options). Also, clients running in high-latency & low-throughput environments probably shouldn't use a 1 MB rsize in their mount options (i.e. they should use a smaller rsize), because it may make things worse even with a low maximum NFS readahead size.
>
> B) Add some logic to readahead to provide autotuning (similar to TCP autotuning), where the maximum readahead size can dynamically increase when the client/reader is consistently filling the readahead window.
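>
> -To make option (A) concrete, here is a rough userspace sketch of the idea. The helper name and the factor below are made up for illustration (the pre-5.4 behaviour was effectively 15 * rsize, as noted above); this is not the actual kernel code nor a proposed patch:
>
> #include <stdio.h>
>
> #define PAGE_SIZE_BYTES 4096UL
> #define NFS_RA_FACTOR   15      /* pre-5.4: readahead = 15 * rsize */
>
> static unsigned long nfs_default_ra_pages(unsigned long rsize)
> {
>         return (rsize / PAGE_SIZE_BYTES) * NFS_RA_FACTOR;
> }
>
> int main(void)
> {
>         /* rsize=1048576 -> 3840 pages = 15 MB (what EFS clients got before 5.4) */
>         printf("rsize 1 MB  -> %lu pages\n", nfs_default_ra_pages(1048576));
>         /* rsize=32768   -> 120 pages = 480 KB (a low-throughput client)          */
>         printf("rsize 32 KB -> %lu pages\n", nfs_default_ra_pages(32768));
>         return 0;
> }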
>
> Links:
> [1] https://www.spinics.net/lists/linux-nfs/msg75018.html
> [2] https://docs.aws.amazon.com/efs/latest/ug/mounting-fs-nfs-mount-settings.html
> [3] https://bugzilla.kernel.org/show_bug.cgi?id=204939
> [4] https://www.mail-archive.com/linux-kernel@xxxxxxxxxxxxxxx/msg1274743.html
>
> Thank you.
>
> Hazem
>
> Amazon Web Services EMEA SARL, 38 avenue John F. Kennedy, L-1855 Luxembourg, R.C.S. Luxembourg B186284
>
> Amazon Web Services EMEA SARL, Irish Branch, One Burlington Plaza, Burlington Road, Dublin 4, Ireland, branch registration number 908705
>

-- 
Trond Myklebust
CTO, Hammerspace Inc
4984 El Camino Real, Suite 208
Los Altos, CA 94022
www.hammer.space