On Wed, 2021-03-31 at 13:31 +0000, Mohamed Abuelfotoh, Hazem wrote: > Ok I got that but based on the facts mentioned in > https://bugzilla.kernel.org/show_bug.cgi?id=204939 it looks like the > bad behaviour was mainly related to using the maximum rsize 1MB while > read ahead was 15MB based on the equation of 15*rsize while in > congested/low throughput/High latency networks the client shouldn’t > use 1MB as rsize in the first place, my point is that it's reasonable > to calculate the maximum read ahead size based on the utilized rsize > while now it's constant as 128KB regardless the configured networking > rsize, We have seen multiple customers reporting this regression after > upgrading to kernel 5.4 and I am pretty sure we will get more Linux > clients reporting this with more people moving to kernel 5.4 and later. > The readahead algorithm is designed to look at how large an area your application is trying to read, and it tries to optimise for the rsize because we tell it that is the "optimal read block size". If you are seeing a 128K window, then that is because you're hitting the heuristic case, where there is no guidance from the application, and the kernel is basically trying to keep a minimal pipeline filled just in case this turns out to be a sequential read. Your users can change that heuristic value using the per-deviceid entries in /sys/class/bdi. Please see https://www.suse.com/support/kb/doc/?id=000017019 They can also use udev with a generic rule like this (from https://access.redhat.com/solutions/407263 ): SUBSYSTEM=="bdi", ACTION=="add", PROGRAM="/bin/awk -v bdi=$kernel 'BEGIN{ret=1} {if ($4 == bdi) {ret=0}} END{exit ret}' /proc/fs/nfsfs/volumes", ATTR{read_ahead_kb}="1048576" The default setting is 128K because that is a global default setting. Yes we can change it, but that needs to be motivated by explaining what makes NFS so different from all the other filesystems that use the same setting. The fact that some users have better hardware than others isn't sufficient (and no, the default rsize setting does not take that into consideration either). > Thank you. > > Hazem > > > > Links > > [1] https://bugzilla.kernel.org/show_bug.cgi?id=204939 > > On 31/03/2021, 15:10, "Trond Myklebust" <trondmy@xxxxxxxxxxxxxxx> > wrote: > > CAUTION: This email originated from outside of the organization. Do > not click links or open attachments unless you can confirm the sender > and know the content is safe. > > > > On Wed, 2021-03-31 at 12:53 +0000, Mohamed Abuelfotoh, Hazem wrote: > > Hi Trond, > > > > I am wondering if we should consider raising the default maximum > NFS > > read ahead size given the facts I mentioned in my previous e- > mail. > > > > We can't keep changing the default every time someone tries to > measure > a new workload. > > The change in 5.4 was also measurement based, and was due to poor > performance in cases where rsize/wsize is smaller and readahead was > overshooting. > > > > > Thank you. > > > > Hazem > > > > On 29/03/2021, 17:07, "Mohamed Abuelfotoh, Hazem" < > > abuehaze@xxxxxxxxxx> wrote: > > > > > > Hello Team, > > > > -We have got multiple customers complaining about NFS read > > performance degradation after they upgraded to kernel 5.4.* > > > > -After doing some deep dive and testing we figured out that > the > > reason behind the regression was patch NFS: Optimise the default > > readahead size[1] Which has been merged to Linux kernels 5.4.* > and > > above. > > -Our customers are using AWS EC2 instances as client mounting > EFS > > export (which is AWS managed NFSV4 service), I am sharing the > results > > that we got before & after the upgrade given that the NFS > server(EFS) > > should be able to achieve between 250-300MB/sec which the clients > can > > achieve without patch[1] while getting quarter of this speed > around > > 70MB/sec with the mentioned patch merged as seen below. > > > > > > > ##################################################################### > > ##################### > > > > > > Before the upgrade: > > # uname -r > > 4.14.225-168.357.amzn2.x86_64 > > [root@ip-172-31-28-135 ec2-user]# sync; echo 3 > > > /proc/sys/vm/drop_caches > > [root@ip-172-31-28-135 ec2-user]# mount -t nfs4 -o > > > nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,nore > > svport fs-6700f553.efs.eu-west-1.amazonaws.com:/ efs > > [root@ip-172-31-28-135 ec2-user]# rsync --progress efs/test . > > test > > 8,589,934,592 100% 313.20MB/s 0:00:26 (xfr#1, to- > chk=0/1) > > > > > > > ##################################################################### > > ##################### > > > > After the upgrade using the same client & server: > > #uname -r; sync; echo 3 > /proc/sys/vm/drop_caches; ./nfs- > > readahead show /home/ec2-user/efs/;rsync --progress efs/test . > > 5.4.0-1.132.64.amzn2.x86_64 > > /home/ec2-user/efs 0:40 /sys/class/bdi/0:40/read_ahead_kb = > 128 > > test > > 1,073,741,824 100% 68.61MB/s 0:00:14 (xfr#1, to- > chk=0/1) > > > > > > -We are recommending[2] EFS users to use rsize=1048576 as > mount > > option for getting the best read performance from their EFS > exports > > given that EC2 to EFS traffic is residing in the same AWS > > availability zone hence it has low latency and up to 250- > 300MB/sec > > throughput however with the mentioned patch merged the customer > can’t > > achieve this throughput after the kernel upgrade because the > default > > NFS read ahead has been decreased from (15*rsize)=15 MB to 128KB > so > > the clients have to manually raise the manually raise the > > read_ahead_kb parameter from 128 to 15360 to get the same > experience > > they were getting before the upgrade. > > -We know that the purpose of the mentioned patch was to > decrease > > OS boot time (for netboot users) also decreasing Application > start up > > times in congested & Low throughput networks as mentioned in [3], > > however this would also cause regression for high throughput & > low > > latency workload especially sequential read workflows. > > -After doing further debugging we also found that the maximum > > read ahead size is constant so there is no Autotuning for this > > configuration even if the client is filling the read ahead window > > which means any NFS client specially ones using maximum rsize > mount > > option will have to manually tune their maximum NFS read ahead > size > > after the upgrade which in my opinion is some sort of regression > from > > older kernels behaviour. > > > > > > > ##################################################################### > > #################### > > > > After increasing the maximum NFS read ahead size to 15MB it’s > > clear that read ahead window is expanded as expected and it will > be > > doubled until it reach 15MB. > > > > Mar 29 11:25:18 ip-172-31-17-191 kernel: init_ra_size 256 > > Mar 29 11:25:18 ip-172-31-17-191 kernel: current ra 256 > > Mar 29 11:25:18 ip-172-31-17-191 kernel: max ra 3840 > > Mar 29 11:25:18 ip-172-31-17-191 kernel: current ra 59 > > Mar 29 11:25:18 ip-172-31-17-191 kernel: max ra 32 > > Mar 29 11:25:18 ip-172-31-17-191 kernel: current ra 512 > > Mar 29 11:25:18 ip-172-31-17-191 kernel: max ra 3840 > > Mar 29 11:25:18 ip-172-31-17-191 kernel: current ra 1024 > > Mar 29 11:25:18 ip-172-31-17-191 kernel: max ra 3840 > > Mar 29 11:25:18 ip-172-31-17-191 kernel: current ra 2048 > > Mar 29 11:25:18 ip-172-31-17-191 kernel: max ra 3840 > > > > > > > ##################################################################### > > #################### > > > > With 128KB as maximum NFS read ahead size, the read ahead > window > > size is increasing until it reach the configured maximum Read > ahead > > (128KB). > > > > Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32 > > Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32 > > Mar 29 11:35:37 ip-172-31-17-191 kernel: init_ra_size 32 > > Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 40 > > Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 32 > > Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32 > > Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 40 > > Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32 > > Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 64 > > Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32 > > Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32 > > Mar 29 11:35:37 ip-172-31-17-191 kernel: init_ra_size 4 > > Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 4 > > Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32 > > Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 64 > > Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 64 > > Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 64 > > Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32 > > Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 32 > > Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32 > > Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 59 > > Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32 > > Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 32 > > Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32 > > Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 32 > > Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32 > > > > -In my reproduction I used rsync as clarified above and it is > > always doing read syscalls requesting 256 KB in each call: > > 15:47:10.780658 read(3, > > > "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"... > > , 262144) = 262144 <0.023749> > > 15:47:10.805467 read(3, > > > "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"... > > , 262144) = 262144 <0.023739> > > 15:47:10.830272 read(3, > > > "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"... > > , 262144) = 262144 <0.023664> > > 15:47:10.854972 read(3, > > > "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"... > > , 262144) = 262144 <0.023814> > > 15:47:10.879837 read(3, > > > "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"... > > , 262144) = 262144 <0.023625> > > 15:47:10.904496 read(3, > > > "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"... > > , 262144) = 262144 <0.023645> > > 15:47:10.929180 read(3, > > > "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"... > > , 262144) = 262144 <0.024072> > > 15:47:10.954308 read(3, > > > "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"... > > , 262144) = 262144 <0.023669> > > > > > > -Looking into the readahead source code and I can see that > > readahead is doing some heuristics to determine if the access > pattern > > is sequential or random then it modify the read ahead > window(amount > > of data it will prefetch) accordingly, read ahead can't also read > > beyond the requested file size, this theoretically means that > having > > Large NFS Max read ahead size (15MB) shouldn't have much impact > on > > performance even with random I/O pattern or with data set > consists of > > small files, the only major impact of having large NFS read ahead > > size would be some networking congestion or bootup delay with > hosts > > using congested or low throughput networks as illustrated in > > https://bugzilla.kernel.org/show_bug.cgi?id=204939 & > > > https://lore.kernel.org/linux-nfs/0b76213e-426b-0654-5b69-02bfead78031@xxxxxxxxx/T/ > > . > > -With patch > > https://www.spinics.net/lists/linux-nfs/msg75018.html the packet > > captures are showing the client either asking for 128KB or 256 KB > in > > the NFS READ calls and it can't reach even the 1MB configured > rsize > > mount option this is because the ondemand_readahead which should > be > > responsible for moving and scaling the read ahead window has an > if > > condition which was a part of > > > https://www.mail-archive.com/linux-kernel@xxxxxxxxxxxxxxx/msg1274743.html > > , this patch actually modified read ahead to issue the maximum of > the > > user request size(rsync is doing 256KB read requests), and the > read- > > ahead max size(128KB by default), but capped to the max request > size > > on the device side(1MB in our case). The latter is done to avoid > > reading ahead too much, if the application asks for a huge read. > this > > is why with 128KB as read ahead size and application asking for > 256KB > > we never exceed 256KB because this patch actually intended to do > > that, it avoids limiting the requested data to the maximum read > ahead > > size but we are still limited by the minimum between amount of > data > > application is reading which is 256KB as sync in rsync strace > output > > & bdi->io_pages(256 pages=1MB) as configured in the rsize mount > > option. > > > > -Output after adding some debugging to the kernel showing the > > size of each variable in the mentioned "if" condition: > > > > [ 238.387788] req_size= 64 ------>256MB rsync read requests > > [ 238.387790] io pages= 256----->1MB as supported by EFS and > as > > configured in the rsize mount option. > > [ 238.390487] max_pages before= 32----->128 KB read ahead > size > > which is the default. > > [ 238.393177] max_pages after= 64---->raised to 256 KB > because > > of changes mentioned in [4] "max_pages = min(req_size, bdi- > > >io_pages);" > > > > > > > https://elixir.bootlin.com/linux/latest/source/mm/readahead.c#L435 > > > > /* > > * A minimal readahead algorithm for trivial > sequential/random > > reads. > > */ > > static void ondemand_readahead(struct readahead_control > *ractl, > > struct file_ra_state *ra, bool hit_readahead_marker, > > unsigned long req_size) > > { > > struct backing_dev_info *bdi = inode_to_bdi(ractl- > >mapping- > > >host); > > unsigned long max_pages = ra->ra_pages; > > unsigned long add_pages; > > unsigned long index = readahead_index(ractl); > > pgoff_t prev_index; > > > > /* > > * If the request exceeds the readahead window, allow the > > read to > > * be up to the optimal hardware IO size > > */ > > if (req_size > max_pages && bdi->io_pages > max_pages) > > max_pages = min(req_size, bdi->io_pages); > > > > > > > > > ##################################################################### > > ############################# > > > > -With 128KB as default maximum read ahead size the packet > capture > > from the client side is showing the NFSv4 READ calls showing > count > > in bytes moving between 128KB to 256KB. > > > > 73403 29 172.31.17.191 -> 172.31.28.161 NFS 394 V4 > Call > > READ StateID: 0xcec2 Offset: 1072955392 Len: 131072 > > 73404 29 172.31.17.191 -> 172.31.28.161 NFS 394 V4 > Call > > READ StateID: 0xcec2 Offset: 1073086464 Len: 262144 > > 73406 29 172.31.28.161 -> 172.31.17.191 NFS 8699 V4 > Reply > > (Call In 73403)[Unreassembled Packet] > > 73415 29 172.31.17.191 -> 172.31.28.161 NFS 394 V4 > Call > > READ StateID: 0xcec2 Offset: 1073348608 Len: 131072 > > 73416 29 172.31.28.161 -> 172.31.17.191 NFS 8699 V4 > Reply > > (Call In 73404)[Unreassembled Packet] > > 73428 29 172.31.17.191 -> 172.31.28.161 NFS 394 V4 > Call > > READ StateID: 0xcec2 Offset: 1073479680 Len: 131072 > > 73429 29 172.31.28.161 -> 172.31.17.191 NFS 8699 V4 > Reply > > (Call In 73415)[Unreassembled Packet] > > 73438 29 172.31.17.191 -> 172.31.28.161 NFS 394 V4 > Call > > READ StateID: 0xcec2 Offset: 1073610752 Len: 131072 > > 73439 29 172.31.28.161 -> 172.31.17.191 NFS 8699 V4 > Reply > > (Call In 73428)[Unreassembled Packet] > > > > -nfsstat is showing 8183 NFSV4 READ calls required to read > 1GB > > file. > > > > # nfsstat > > Client rpc stats: > > calls retrans authrefrsh > > 8204 0 8204 > > > > Client nfs v4: > > null read write commit open > > open_conf > > 1 0% 8183 99% 0 0% 0 0% 0 > 0% > > 0 0% > > open_noat open_dgrd close setattr fsinfo > > renew > > 1 0% 0 0% 1 0% 0 0% 2 > 0% > > 0 0% > > setclntid confirm lock lockt locku > > access > > 0 0% 0 0% 0 0% 0 0% 0 > 0% > > 1 0% > > getattr lookup lookup_root remove rename > > link > > 4 0% 1 0% 1 0% 0 0% 0 > 0% > > 0 0% > > symlink create pathconf statfs readlink > > readdir > > 0 0% 0 0% 1 0% 0 0% 0 > 0% > > 0 0% > > server_caps delegreturn getacl setacl > fs_locations > > rel_lkowner > > 3 0% 0 0% 0 0% 0 0% 0 > 0% > > 0 0% > > secinfo exchange_id create_ses destroy_ses sequence > > get_lease_t > > 0 0% 0 0% 2 0% 1 0% 0 > 0% > > 0 0% > > reclaim_comp layoutget getdevinfo layoutcommit > layoutreturn > > getdevlist > > 0 0% 1 0% 0 0% 0 0% 0 > 0% > > 0 0% > > (null) > > 1 0% > > > > > > > ##################################################################### > > ########################## > > > > -When using 15MB as maximum read ahead size, the client is > > sending 1MB NFSV4 read requests hence it’s able to read the same > 1GB > > file in 1024 NFS READ calls > > > > #uname -r; mount -t nfs4 -o > > > nfsvers=4.1,rsize=1052672,wsize=1048576,hard,timeo=600,retrans=2,nore > > svport fs-6700f553.efs.eu-west-1.amazonaws.com:/ efs; ./nfs- > readahead > > show /home/ec2-user/efs/ > > 5.3.9 > > /home/ec2-user/efs 0:40 /sys/class/bdi/0:40/read_ahead_kb = > 15360 > > #sync; echo 3 > /proc/sys/vm/drop_caches > > #rsync --progress efs/test . > > test > > 1,073,741,824 100% 260.15MB/s 0:00:03 (xfr#1, to- > chk=0/1) > > [root@ip-172-31-17-42 ec2-user]# nfsstat > > Client rpc stats: > > calls retrans authrefrsh > > 1043 0 1043 > > > > Client nfs v4: > > null read write commit open > > open_conf > > 1 0% 1024 98% 0 0% 0 0% 0 > 0% > > 0 0% > > open_noat open_dgrd close setattr fsinfo > > renew > > 1 0% 0 0% 1 0% 0 0% 2 > 0% > > 0 0% > > setclntid confirm lock lockt locku > > access > > 0 0% 0 0% 0 0% 0 0% 0 > 0% > > 1 0% > > getattr lookup lookup_root remove rename > > link > > 2 0% 1 0% 1 0% 0 0% 0 > 0% > > 0 0% > > symlink create pathconf statfs readlink > > readdir > > 0 0% 0 0% 1 0% 0 0% 0 > 0% > > 0 0% > > server_caps delegreturn getacl setacl > fs_locations > > rel_lkowner > > 3 0% 0 0% 0 0% 0 0% 0 > 0% > > 0 0% > > secinfo exchange_id create_ses destroy_ses sequence > > get_lease_t > > 0 0% 0 0% 2 0% 1 0% 0 > 0% > > 0 0% > > reclaim_comp layoutget getdevinfo layoutcommit > layoutreturn > > getdevlist > > 0 0% 1 0% 0 0% 0 0% 0 > 0% > > 0 0% > > (null) > > 1 0% > > > > -The packet capture from the client side is showing NFSv4 > READ > > calls with 1MB as read count in bytes when having 15MB as maximum > NFS > > read ahead size. > > > > 2021-03-22 14:25:34.984731 9398 172.31.17.42 → 172.31.28.161 > NFS > > 0.000375 V4 Call READ StateID: 0x3640 Offset: 94371840 Len: > 1048576 > > 2021-03-22 14:25:34.984805 9405 172.31.17.42 → 172.31.28.161 > NFS > > 0.000074 V4 Call READ StateID: 0x3640 Offset: 95420416 Len: > 1048576 > > 2021-03-22 14:25:34.984902 9416 172.31.17.42 → 172.31.28.161 > NFS > > 0.000097 V4 Call READ StateID: 0x3640 Offset: 96468992 Len: > 1048576 > > 2021-03-22 14:25:34.984941 9421 172.31.17.42 → 172.31.28.161 > NFS > > 0.000039 V4 Call READ StateID: 0x3640 Offset: 97517568 Len: > 1048576 > > > > > > > ##################################################################### > > ########################## > > > > > > -I think there are 2 options to mitigate this behaviour which > I > > am listing below: > > A)Raising the default maximum NFS read ahead size because > the > > current default 128KB doesn’t seem to be sufficient for High > > throughout & low latency workload, I strongly believe that the > NFS > > rsize mount option should be used as variable in deciding the > maximum > > NFS read ahead size which was the case before [1] while now it’s > > always 128KB regardless the utilized rsize mount option. Also I > think > > clients running in High latency & Low throughout environment > > shouldn’t use 1MB as rsize in their mount options(i.e they should > use > > smaller rsize) because it may increase their suffering even with > low > > maximum NFS read ahead size. > > B)Adding some logic to read ahead to have some kind of > Autotuning > > (Similar to TCP Autotuning) where the maximum read ahead size can > > dynamically increase in case the client/reader is constantly > filling > > up/utilizing the read ahead window size. > > > > > > Links: > > [1] https://www.spinics.net/lists/linux-nfs/msg75018.html > > [2] > > > https://docs.aws.amazon.com/efs/latest/ug/mounting-fs-nfs-mount-settings.html > > [3] https://bugzilla.kernel.org/show_bug.cgi?id=204939 > > [4] > > > https://www.mail-archive.com/linux-kernel@xxxxxxxxxxxxxxx/msg1274743.html > > > > > > Thank you. > > > > Hazem > > > > > > > > > > > > > > > > Amazon Web Services EMEA SARL, 38 avenue John F. Kennedy, L-1855 > > Luxembourg, R.C.S. Luxembourg B186284 > > > > Amazon Web Services EMEA SARL, Irish Branch, One Burlington > Plaza, > > Burlington Road, Dublin 4, Ireland, branch registration number > 908705 > > > > > > -- > Trond Myklebust > CTO, Hammerspace Inc > 4984 El Camino Real, Suite 208 > Los Altos, CA 94022 > > www.hammer.space > > > > > > Amazon Web Services EMEA SARL, 38 avenue John F. Kennedy, L-1855 > Luxembourg, R.C.S. Luxembourg B186284 > > Amazon Web Services EMEA SARL, Irish Branch, One Burlington Plaza, > Burlington Road, Dublin 4, Ireland, branch registration number 908705 > > -- Trond Myklebust Linux NFS client maintainer, Hammerspace trond.myklebust@xxxxxxxxxxxxxxx