On Wed, 2021-03-31 at 12:53 +0000, Mohamed Abuelfotoh, Hazem wrote:
> Hi Trond,
>
> I am wondering if we should consider raising the default maximum NFS readahead size given the facts I mentioned in my previous e-mail.
>

We can't keep changing the default every time someone tries to measure a new workload. The change in 5.4 was also measurement based, and was due to poor performance in cases where rsize/wsize is smaller and readahead was overshooting.

> Thank you.
>
> Hazem
>
> On 29/03/2021, 17:07, "Mohamed Abuelfotoh, Hazem" <abuehaze@xxxxxxxxxx> wrote:
>
> Hello Team,
>
> -We have got multiple customers complaining about NFS read performance degradation after they upgraded to kernel 5.4.*.
>
> -After doing some deep dive and testing we figured out that the reason behind the regression was the patch "NFS: Optimise the default readahead size" [1], which has been merged into Linux kernels 5.4.* and above.
>
> -Our customers are using AWS EC2 instances as clients mounting an EFS export (EFS is the AWS-managed NFSv4 service). I am sharing the results we got before & after the upgrade. The NFS server (EFS) can sustain 250-300 MB/sec, which the clients achieve without patch [1], while they get roughly a quarter of that speed (around 70 MB/sec) with the mentioned patch merged, as seen below.
>
> ######################################################################
>
> Before the upgrade:
> # uname -r
> 4.14.225-168.357.amzn2.x86_64
> [root@ip-172-31-28-135 ec2-user]# sync; echo 3 > /proc/sys/vm/drop_caches
> [root@ip-172-31-28-135 ec2-user]# mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport fs-6700f553.efs.eu-west-1.amazonaws.com:/ efs
> [root@ip-172-31-28-135 ec2-user]# rsync --progress efs/test .
> test
>   8,589,934,592 100%  313.20MB/s    0:00:26 (xfr#1, to-chk=0/1)
>
> ######################################################################
>
> After the upgrade, using the same client & server:
> #uname -r; sync; echo 3 > /proc/sys/vm/drop_caches; ./nfs-readahead show /home/ec2-user/efs/; rsync --progress efs/test .
> 5.4.0-1.132.64.amzn2.x86_64
> /home/ec2-user/efs 0:40 /sys/class/bdi/0:40/read_ahead_kb = 128
> test
>   1,073,741,824 100%   68.61MB/s    0:00:14 (xfr#1, to-chk=0/1)
>
> -We recommend [2] that EFS users mount with rsize=1048576 to get the best read performance from their EFS exports, given that EC2-to-EFS traffic stays within the same AWS availability zone and therefore has low latency and up to 250-300 MB/sec of throughput. With the mentioned patch merged, however, customers can't reach that throughput after the kernel upgrade, because the default NFS readahead has been decreased from (15 * rsize) = 15 MB to 128 KB, so clients have to manually raise the read_ahead_kb parameter from 128 to 15360 to get the same experience they had before the upgrade (a minimal sketch of doing this per mount is included below).
>
> -We know that the purpose of the mentioned patch was to decrease OS boot time (for netboot users) and application start-up times on congested & low-throughput networks, as mentioned in [3]; however, it also causes a regression for high-throughput & low-latency workloads, especially sequential read workflows.
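>
> -For reference, here is a minimal sketch (in C, most error handling omitted) of how a client could raise read_ahead_kb for a single NFS mount. It assumes the BDI is named after the mount's anonymous device number, as in the 0:40 example above:
>
> #include <stdio.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <sys/sysmacros.h>
>
> int main(int argc, char **argv)
> {
>         struct stat st;
>         char path[128];
>         FILE *f;
>
>         if (argc != 3) {
>                 fprintf(stderr, "usage: %s <mountpoint> <read_ahead_kb>\n", argv[0]);
>                 return 1;
>         }
>         if (stat(argv[1], &st)) {
>                 perror("stat");
>                 return 1;
>         }
>         /* e.g. /sys/class/bdi/0:40/read_ahead_kb for the EFS mount above */
>         snprintf(path, sizeof(path), "/sys/class/bdi/%u:%u/read_ahead_kb",
>                  major(st.st_dev), minor(st.st_dev));
>         f = fopen(path, "w");
>         if (!f) {
>                 perror(path);
>                 return 1;
>         }
>         /* writing 15360 restores the old 15 MB readahead window */
>         fprintf(f, "%s\n", argv[2]);
>         fclose(f);
>         printf("%s = %s\n", path, argv[2]);
>         return 0;
> }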
> -After doing further debugging we also found that the maximum readahead size is constant, i.e. there is no autotuning of this setting even when the client keeps filling the readahead window. This means that any NFS client, especially one using the maximum rsize mount option, will have to tune its maximum NFS readahead size manually after the upgrade, which in my opinion is a regression from older kernels' behaviour.
>
> ######################################################################
>
> After increasing the maximum NFS readahead size to 15 MB, it is clear that the readahead window expands as expected and keeps doubling until it reaches 15 MB (3840 pages):
>
> Mar 29 11:25:18 ip-172-31-17-191 kernel: init_ra_size 256
> Mar 29 11:25:18 ip-172-31-17-191 kernel: current ra 256
> Mar 29 11:25:18 ip-172-31-17-191 kernel: max ra 3840
> Mar 29 11:25:18 ip-172-31-17-191 kernel: current ra 59
> Mar 29 11:25:18 ip-172-31-17-191 kernel: max ra 32
> Mar 29 11:25:18 ip-172-31-17-191 kernel: current ra 512
> Mar 29 11:25:18 ip-172-31-17-191 kernel: max ra 3840
> Mar 29 11:25:18 ip-172-31-17-191 kernel: current ra 1024
> Mar 29 11:25:18 ip-172-31-17-191 kernel: max ra 3840
> Mar 29 11:25:18 ip-172-31-17-191 kernel: current ra 2048
> Mar 29 11:25:18 ip-172-31-17-191 kernel: max ra 3840
>
> ######################################################################
>
> With 128 KB as the maximum NFS readahead size, the readahead window only grows until it reaches the configured maximum (128 KB = 32 pages):
>
> Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: init_ra_size 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 40
> Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 40
> Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 64
> Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: init_ra_size 4
> Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 4
> Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 64
> Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 64
> Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 64
> Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 59
> Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 32
> Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
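>
> -The doubling behaviour above can be reproduced with a small userspace model of the window-growth heuristic. This is only a sketch loosely modelled on get_next_ra_size() in mm/readahead.c (fast growth for small windows, then doubling, capped at the maximum), not the exact kernel code; sizes are in pages (4 KB) and the initial sizes are taken from the init_ra_size lines above:
>
> #include <stdio.h>
>
> static unsigned long next_ra_size(unsigned long cur, unsigned long max)
> {
>         if (cur < max / 16)
>                 return 4 * cur;
>         if (cur <= max / 2)
>                 return 2 * cur;
>         return max;
> }
>
> static void show_growth(unsigned long init, unsigned long max)
> {
>         unsigned long ra = init;
>
>         printf("max ra %lu:", max);
>         while (ra < max) {
>                 printf(" %lu", ra);
>                 ra = next_ra_size(ra, max);
>         }
>         printf(" %lu\n", ra);
> }
>
> int main(void)
> {
>         show_growth(256, 3840);  /* 15 MB max: prints 256 512 1024 2048 3840 */
>         show_growth(32, 32);     /* 128 KB max: window stays pinned at 32 pages */
>         return 0;
> }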
>
> -In my reproduction I used rsync, as shown above, and it always issues read() syscalls requesting 256 KB per call:
>
> 15:47:10.780658 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144) = 262144 <0.023749>
> 15:47:10.805467 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144) = 262144 <0.023739>
> 15:47:10.830272 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144) = 262144 <0.023664>
> 15:47:10.854972 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144) = 262144 <0.023814>
> 15:47:10.879837 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144) = 262144 <0.023625>
> 15:47:10.904496 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144) = 262144 <0.023645>
> 15:47:10.929180 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144) = 262144 <0.024072>
> 15:47:10.954308 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144) = 262144 <0.023669>
>
> -Looking into the readahead source code, I can see that readahead applies heuristics to determine whether the access pattern is sequential or random and adjusts the readahead window (the amount of data it will prefetch) accordingly; readahead also cannot read beyond the end of the requested file. In theory this means that a large maximum NFS readahead size (15 MB) shouldn't have much impact on performance even with a random I/O pattern or a data set consisting of small files; the main downside of a large NFS readahead size would be some network congestion or boot-up delay on hosts using congested or low-throughput networks, as illustrated in https://bugzilla.kernel.org/show_bug.cgi?id=204939 and https://lore.kernel.org/linux-nfs/0b76213e-426b-0654-5b69-02bfead78031@xxxxxxxxx/T/.
>
> -With patch https://www.spinics.net/lists/linux-nfs/msg75018.html applied, the packet captures show the client asking for either 128 KB or 256 KB in its NFS READ calls; it never even reaches the 1 MB configured via the rsize mount option. This is because ondemand_readahead(), which is responsible for moving and scaling the readahead window, has an "if" condition that was introduced by https://www.mail-archive.com/linux-kernel@xxxxxxxxxxxxxxx/msg1274743.html. That patch changed readahead to issue the maximum of the user request size (rsync is doing 256 KB read requests) and the readahead max size (128 KB by default), capped to the maximum request size on the device side (1 MB in our case); the cap exists to avoid reading ahead too much when the application asks for a huge read. This is why, with a 128 KB readahead size and an application asking for 256 KB, we never exceed 256 KB: that patch intentionally avoids limiting the request to the maximum readahead size, but we are still limited by the minimum of the amount of data the application is reading (256 KB, as seen in the rsync strace output) and bdi->io_pages (256 pages = 1 MB, as configured via the rsize mount option).
>
> -Output after adding some debugging to the kernel, showing the size of each variable in the mentioned "if" condition:
>
> [ 238.387788] req_size= 64 ------> 256 KB rsync read requests
> [ 238.387790] io pages= 256 -----> 1 MB, as supported by EFS and as configured in the rsize mount option
> [ 238.390487] max_pages before= 32 -----> 128 KB readahead size, which is the default
> [ 238.393177] max_pages after= 64 ----> raised to 256 KB because of the change mentioned in [4]: "max_pages = min(req_size, bdi->io_pages);"
>
> https://elixir.bootlin.com/linux/latest/source/mm/readahead.c#L435
>
> /*
>  * A minimal readahead algorithm for trivial sequential/random reads.
>  */
> static void ondemand_readahead(struct readahead_control *ractl,
>                 struct file_ra_state *ra, bool hit_readahead_marker,
>                 unsigned long req_size)
> {
>         struct backing_dev_info *bdi = inode_to_bdi(ractl->mapping->host);
>         unsigned long max_pages = ra->ra_pages;
>         unsigned long add_pages;
>         unsigned long index = readahead_index(ractl);
>         pgoff_t prev_index;
>
>         /*
>          * If the request exceeds the readahead window, allow the read to
>          * be up to the optimal hardware IO size
>          */
>         if (req_size > max_pages && bdi->io_pages > max_pages)
>                 max_pages = min(req_size, bdi->io_pages);
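>
> -The capping can be reproduced with a trivial userspace sketch that simply mirrors the quoted condition, using the values from the debug output above (sizes in 4 KB pages). It shows why the 128 KB default caps each readahead batch at 64 pages (256 KB), while a 15 MB window is left untouched and leaves room for full 1 MB READs:
>
> #include <stdio.h>
>
> static unsigned long cap_max_pages(unsigned long ra_pages,  /* ra->ra_pages */
>                                    unsigned long req_size,  /* application read */
>                                    unsigned long io_pages)  /* bdi->io_pages */
> {
>         unsigned long max_pages = ra_pages;
>
>         /* same condition as in ondemand_readahead() above */
>         if (req_size > max_pages && io_pages > max_pages)
>                 max_pages = req_size < io_pages ? req_size : io_pages;
>
>         return max_pages;
> }
>
> int main(void)
> {
>         /* 128 KB default: ra_pages=32, rsync req_size=64, rsize gives io_pages=256 */
>         unsigned long def = cap_max_pages(32, 64, 256);
>         /* 15 MB readahead: ra_pages=3840, the condition is false, window unchanged */
>         unsigned long big = cap_max_pages(3840, 64, 256);
>
>         printf("128 KB default: max_pages = %lu pages (%lu KB)\n", def, def * 4);
>         printf("15 MB window:   max_pages = %lu pages (%lu KB)\n", big, big * 4);
>         return 0;
> }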
>
> ######################################################################
>
> -With 128 KB as the default maximum readahead size, the packet capture from the client side shows NFSv4 READ calls whose count in bytes alternates between 128 KB and 256 KB:
>
> 73403  29  172.31.17.191 -> 172.31.28.161  NFS  394   V4 Call READ StateID: 0xcec2 Offset: 1072955392 Len: 131072
> 73404  29  172.31.17.191 -> 172.31.28.161  NFS  394   V4 Call READ StateID: 0xcec2 Offset: 1073086464 Len: 262144
> 73406  29  172.31.28.161 -> 172.31.17.191  NFS  8699  V4 Reply (Call In 73403)[Unreassembled Packet]
> 73415  29  172.31.17.191 -> 172.31.28.161  NFS  394   V4 Call READ StateID: 0xcec2 Offset: 1073348608 Len: 131072
> 73416  29  172.31.28.161 -> 172.31.17.191  NFS  8699  V4 Reply (Call In 73404)[Unreassembled Packet]
> 73428  29  172.31.17.191 -> 172.31.28.161  NFS  394   V4 Call READ StateID: 0xcec2 Offset: 1073479680 Len: 131072
> 73429  29  172.31.28.161 -> 172.31.17.191  NFS  8699  V4 Reply (Call In 73415)[Unreassembled Packet]
> 73438  29  172.31.17.191 -> 172.31.28.161  NFS  394   V4 Call READ StateID: 0xcec2 Offset: 1073610752 Len: 131072
> 73439  29  172.31.28.161 -> 172.31.17.191  NFS  8699  V4 Reply (Call In 73428)[Unreassembled Packet]
>
> -nfsstat shows that 8183 NFSv4 READ calls were needed to read the 1 GB file:
>
> # nfsstat
> Client rpc stats:
> calls      retrans    authrefrsh
> 8204       0          8204
>
> Client nfs v4:
> null         read         write        commit       open         open_conf
> 1         0% 8183     99% 0         0% 0         0% 0         0% 0         0%
> open_noat    open_dgrd    close        setattr      fsinfo       renew
> 1         0% 0         0% 1         0% 0         0% 2         0% 0         0%
> setclntid    confirm      lock         lockt        locku        access
> 0         0% 0         0% 0         0% 0         0% 0         0% 1         0%
> getattr      lookup       lookup_root  remove       rename       link
> 4         0% 1         0% 1         0% 0         0% 0         0% 0         0%
> symlink      create       pathconf     statfs       readlink     readdir
> 0         0% 0         0% 1         0% 0         0% 0         0% 0         0%
> server_caps  delegreturn  getacl       setacl       fs_locations rel_lkowner
> 3         0% 0         0% 0         0% 0         0% 0         0% 0         0%
> secinfo      exchange_id  create_ses   destroy_ses  sequence     get_lease_t
> 0         0% 0         0% 2         0% 1         0% 0         0% 0         0%
> reclaim_comp layoutget    getdevinfo   layoutcommit layoutreturn getdevlist
> 0         0% 1         0% 0         0% 0         0% 0         0% 0         0%
> (null)
> 1         0%
>
> ######################################################################
>
> -When using 15 MB as the maximum readahead size, the client sends 1 MB NFSv4 READ requests and is therefore able to read the same 1 GB file in 1024 NFS READ calls:
>
> #uname -r; mount -t nfs4 -o nfsvers=4.1,rsize=1052672,wsize=1048576,hard,timeo=600,retrans=2,noresvport fs-6700f553.efs.eu-west-1.amazonaws.com:/ efs; ./nfs-readahead show /home/ec2-user/efs/
> 5.3.9
> /home/ec2-user/efs 0:40 /sys/class/bdi/0:40/read_ahead_kb = 15360
> #sync; echo 3 > /proc/sys/vm/drop_caches
> #rsync --progress efs/test .
> test
>   1,073,741,824 100%  260.15MB/s    0:00:03 (xfr#1, to-chk=0/1)
> [root@ip-172-31-17-42 ec2-user]# nfsstat
> Client rpc stats:
> calls      retrans    authrefrsh
> 1043       0          1043
>
> Client nfs v4:
> null         read         write        commit       open         open_conf
> 1         0% 1024     98% 0         0% 0         0% 0         0% 0         0%
> open_noat    open_dgrd    close        setattr      fsinfo       renew
> 1         0% 0         0% 1         0% 0         0% 2         0% 0         0%
> setclntid    confirm      lock         lockt        locku        access
> 0         0% 0         0% 0         0% 0         0% 0         0% 1         0%
> getattr      lookup       lookup_root  remove       rename       link
> 2         0% 1         0% 1         0% 0         0% 0         0% 0         0%
> symlink      create       pathconf     statfs       readlink     readdir
> 0         0% 0         0% 1         0% 0         0% 0         0% 0         0%
> server_caps  delegreturn  getacl       setacl       fs_locations rel_lkowner
> 3         0% 0         0% 0         0% 0         0% 0         0% 0         0%
> secinfo      exchange_id  create_ses   destroy_ses  sequence     get_lease_t
> 0         0% 0         0% 2         0% 1         0% 0         0% 0         0%
> reclaim_comp layoutget    getdevinfo   layoutcommit layoutreturn getdevlist
> 0         0% 1         0% 0         0% 0         0% 0         0% 0         0%
> (null)
> 1         0%
>
> -The packet capture from the client side shows NFSv4 READ calls with a 1 MB read count when the maximum NFS readahead size is 15 MB:
>
> 2021-03-22 14:25:34.984731 9398 172.31.17.42 → 172.31.28.161 NFS 0.000375 V4 Call READ StateID: 0x3640 Offset: 94371840 Len: 1048576
> 2021-03-22 14:25:34.984805 9405 172.31.17.42 → 172.31.28.161 NFS 0.000074 V4 Call READ StateID: 0x3640 Offset: 95420416 Len: 1048576
> 2021-03-22 14:25:34.984902 9416 172.31.17.42 → 172.31.28.161 NFS 0.000097 V4 Call READ StateID: 0x3640 Offset: 96468992 Len: 1048576
> 2021-03-22 14:25:34.984941 9421 172.31.17.42 → 172.31.28.161 NFS 0.000039 V4 Call READ StateID: 0x3640 Offset: 97517568 Len: 1048576
>
> ######################################################################
>
> -I think there are two options to mitigate this behaviour:
>
> A) Raise the default maximum NFS readahead size, because the current 128 KB default doesn't seem sufficient for high-throughput & low-latency workloads. I strongly believe the NFS rsize mount option should be a factor in deciding the maximum NFS readahead size, as was the case before [1]; now it is always 128 KB regardless of the rsize mount option used (a rough sketch of the idea follows after these options). Also, clients running in high-latency & low-throughput environments probably shouldn't use a 1 MB rsize in their mount options (i.e. they should use a smaller rsize), because it may make things worse even with a low maximum NFS readahead size.
>
> B) Add some logic to readahead to provide autotuning (similar to TCP autotuning), where the maximum readahead size can dynamically increase when the client/reader is consistently filling the readahead window.
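>
> -To make option (A) concrete, here is a rough userspace sketch of the idea. The helper name and the factor below are made up for illustration (the pre-5.4 behaviour was effectively 15 * rsize, as noted above); this is not the actual kernel code nor a proposed patch:
>
> #include <stdio.h>
>
> #define PAGE_SIZE_BYTES 4096UL
> #define NFS_RA_FACTOR   15      /* pre-5.4: readahead = 15 * rsize */
>
> static unsigned long nfs_default_ra_pages(unsigned long rsize)
> {
>         return (rsize / PAGE_SIZE_BYTES) * NFS_RA_FACTOR;
> }
>
> int main(void)
> {
>         /* rsize=1048576 -> 3840 pages = 15 MB (what EFS clients got before 5.4) */
>         printf("rsize 1 MB  -> %lu pages\n", nfs_default_ra_pages(1048576));
>         /* rsize=32768   -> 120 pages = 480 KB (a low-throughput client)          */
>         printf("rsize 32 KB -> %lu pages\n", nfs_default_ra_pages(32768));
>         return 0;
> }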
>
> Links:
> [1] https://www.spinics.net/lists/linux-nfs/msg75018.html
> [2] https://docs.aws.amazon.com/efs/latest/ug/mounting-fs-nfs-mount-settings.html
> [3] https://bugzilla.kernel.org/show_bug.cgi?id=204939
> [4] https://www.mail-archive.com/linux-kernel@xxxxxxxxxxxxxxx/msg1274743.html
>
> Thank you.
>
> Hazem
>
> Amazon Web Services EMEA SARL, 38 avenue John F. Kennedy, L-1855 Luxembourg, R.C.S. Luxembourg B186284
>
> Amazon Web Services EMEA SARL, Irish Branch, One Burlington Plaza, Burlington Road, Dublin 4, Ireland, branch registration number 908705
>

-- 
Trond Myklebust
CTO, Hammerspace Inc
4984 El Camino Real, Suite 208
Los Altos, CA 94022
www.hammer.space