Hello Team,

-We have multiple customers complaining about NFS read performance degradation after upgrading to kernel 5.4.*
-After some deep diving and testing we figured out that the regression was introduced by the patch "NFS: Optimise the default readahead size"[1], which has been merged into Linux kernels 5.4.* and above.
-Our customers use AWS EC2 instances as clients mounting an EFS export (the AWS managed NFSv4 service). I am sharing the results we got before & after the upgrade. The NFS server (EFS) can sustain 250-300 MB/sec, which the clients achieve without patch [1], while getting roughly a quarter of that speed, around 70 MB/sec, with the mentioned patch merged, as seen below.
##########################################################################################
Before the upgrade:
# uname -r
4.14.225-168.357.amzn2.x86_64
[root@ip-172-31-28-135 ec2-user]# sync; echo 3 > /proc/sys/vm/drop_caches
[root@ip-172-31-28-135 ec2-user]# mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport fs-6700f553.efs.eu-west-1.amazonaws.com:/ efs
[root@ip-172-31-28-135 ec2-user]# rsync --progress efs/test .
test
  8,589,934,592 100%  313.20MB/s    0:00:26 (xfr#1, to-chk=0/1)
##########################################################################################
After the upgrade, using the same client & server:
# uname -r; sync; echo 3 > /proc/sys/vm/drop_caches; ./nfs-readahead show /home/ec2-user/efs/; rsync --progress efs/test .
5.4.0-1.132.64.amzn2.x86_64
/home/ec2-user/efs 0:40
/sys/class/bdi/0:40/read_ahead_kb = 128
test
  1,073,741,824 100%   68.61MB/s    0:00:14 (xfr#1, to-chk=0/1)

-We recommend[2] that EFS users mount with rsize=1048576 to get the best read performance from their EFS exports: EC2-to-EFS traffic stays within the same AWS availability zone, so it has low latency and up to 250-300 MB/sec of throughput. With the mentioned patch merged, however, customers can't achieve this throughput after the kernel upgrade, because the default NFS readahead has been decreased from (15*rsize) = 15 MB to 128 KB, so clients have to manually raise the read_ahead_kb parameter from 128 to 15360 to get the same experience they had before the upgrade.
-We know that the purpose of the mentioned patch was to decrease OS boot time (for netboot users) and application start-up times on congested & low throughput networks, as mentioned in [3]. However, it also causes a regression for high throughput & low latency workloads, especially sequential read workflows.
-After further debugging we also found that the maximum readahead size is constant, i.e. there is no autotuning of this setting even when the client keeps filling the readahead window. This means any NFS client, especially one using the maximum rsize mount option, will have to manually tune its maximum NFS readahead size after the upgrade, which in my opinion is a regression from older kernels' behaviour.
#########################################################################################
After increasing the maximum NFS readahead size to 15 MB, the debug log shows the readahead window expanding as expected: it is doubled until it reaches 15 MB.
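That doubling is the generic page-cache readahead ramp-up. As a rough illustration, here is a minimal Python model of the window growth (my own sketch, loosely mirroring get_next_ra_size() in mm/readahead.c and assuming 4 KB pages); it reproduces the window sizes seen in the debug output:

```python
PAGE_KB = 4  # assumed page size in KB


def next_ra_pages(cur, max_pages):
    """Grow the current readahead window (in pages), capped at the maximum.

    The kernel grows very small windows 4x and larger ones 2x; this is a
    simplified model of that behaviour, not the kernel code itself.
    """
    new = cur * 4 if cur < max_pages // 16 else cur * 2
    return min(new, max_pages)


def window_progression(init_pages, max_kb):
    """Readahead window sizes (in KB) during a long sequential read."""
    max_pages = max_kb // PAGE_KB
    cur = init_pages
    sizes = [cur * PAGE_KB]
    while cur < max_pages:
        cur = next_ra_pages(cur, max_pages)
        sizes.append(cur * PAGE_KB)
    return sizes


# With read_ahead_kb = 15360 the window keeps doubling
# (256, 512, 1024, 2048, 3840 pages, as in the log):
print(window_progression(256, 15360))  # -> [1024, 2048, 4096, 8192, 15360]

# With the 128 KB default it is pinned almost immediately:
print(window_progression(32, 128))     # -> [128]
```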
Mar 29 11:25:18 ip-172-31-17-191 kernel: init_ra_size 256
Mar 29 11:25:18 ip-172-31-17-191 kernel: current ra 256
Mar 29 11:25:18 ip-172-31-17-191 kernel: max ra 3840
Mar 29 11:25:18 ip-172-31-17-191 kernel: current ra 59
Mar 29 11:25:18 ip-172-31-17-191 kernel: max ra 32
Mar 29 11:25:18 ip-172-31-17-191 kernel: current ra 512
Mar 29 11:25:18 ip-172-31-17-191 kernel: max ra 3840
Mar 29 11:25:18 ip-172-31-17-191 kernel: current ra 1024
Mar 29 11:25:18 ip-172-31-17-191 kernel: max ra 3840
Mar 29 11:25:18 ip-172-31-17-191 kernel: current ra 2048
Mar 29 11:25:18 ip-172-31-17-191 kernel: max ra 3840
#########################################################################################
With 128 KB as the maximum NFS readahead size, the readahead window grows only until it reaches the configured 128 KB maximum.
Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
Mar 29 11:35:37 ip-172-31-17-191 kernel: init_ra_size 32
Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 40
Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 32
Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 40
Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 64
Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
Mar 29 11:35:37 ip-172-31-17-191 kernel: init_ra_size 4
Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 4
Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 64
Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 64
Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 64
Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 32
Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 59
Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 32
Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
Mar 29 11:35:37 ip-172-31-17-191 kernel: current ra 32
Mar 29 11:35:37 ip-172-31-17-191 kernel: max ra 32
-In my reproduction I used rsync, as described above, and it always issues read() syscalls requesting 256 KB each:
15:47:10.780658 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144) = 262144 <0.023749>
15:47:10.805467 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144) = 262144 <0.023739>
15:47:10.830272 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144) = 262144 <0.023664>
15:47:10.854972 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144) = 262144 <0.023814>
15:47:10.879837 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144) = 262144 <0.023625>
15:47:10.904496 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144) = 262144 <0.023645>
15:47:10.929180 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144) = 262144 <0.024072>
15:47:10.954308 read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262144) = 262144 <0.023669>
-Looking into the readahead source code, I can see that readahead uses heuristics to determine whether the access pattern is sequential or random, and then adjusts the readahead window (the amount of data it will prefetch) accordingly. Readahead also can't read beyond the requested file's size. Theoretically this means that a large maximum NFS readahead size (15 MB) shouldn't have much impact on performance even with a random I/O pattern or a data set consisting of small files; the only major impact of a large NFS readahead size would be some network congestion or boot-up delay for hosts on congested or low throughput networks, as illustrated in
https://bugzilla.kernel.org/show_bug.cgi?id=204939 & https://lore.kernel.org/linux-nfs/0b76213e-426b-0654-5b69-02bfead78031@xxxxxxxxx/T/.
-With patch https://www.spinics.net/lists/linux-nfs/msg75018.html applied, the packet captures show the client asking for either 128 KB or 256 KB in its NFS READ calls; it can't even reach the configured 1 MB rsize mount option. This is because ondemand_readahead(), which is responsible for moving and scaling the readahead window, has an "if" condition introduced by https://www.mail-archive.com/linux-kernel@xxxxxxxxxxxxxxx/msg1274743.html. That patch modified readahead to issue the maximum of the user request size (rsync is doing 256 KB read requests) and the readahead max size (128 KB by default), capped by the maximum request size on the device side (1 MB in our case). The cap exists to avoid reading ahead too much when the application asks for a huge read. This is why, with a 128 KB readahead size and the application asking for 256 KB, we never exceed 256 KB: the patch intentionally avoids limiting the requested data to the maximum readahead size, but we are still limited by the minimum of the amount of data the application is reading (256 KB, as seen in the rsync strace output) and bdi->io_pages (256 pages = 1 MB, as configured by the rsize mount option).
-Output after adding some debugging to the kernel, showing the value of each variable in the mentioned "if" condition:
[  238.387788] req_size= 64          ------> 256 KB rsync read requests
[  238.387790] io pages= 256         ------> 1 MB, as supported by EFS and as configured in the rsize mount option
[  238.390487] max_pages before= 32  ------> 128 KB readahead size, which is the default
[  238.393177] max_pages after= 64   ------> raised to 256 KB because of the change mentioned in [4]: "max_pages = min(req_size, bdi->io_pages);"
https://elixir.bootlin.com/linux/latest/source/mm/readahead.c#L435
/*
 * A minimal readahead algorithm for trivial sequential/random reads.
 */
static void ondemand_readahead(struct readahead_control *ractl,
		struct file_ra_state *ra, bool hit_readahead_marker,
		unsigned long req_size)
{
	struct backing_dev_info *bdi = inode_to_bdi(ractl->mapping->host);
	unsigned long max_pages = ra->ra_pages;
	unsigned long add_pages;
	unsigned long index = readahead_index(ractl);
	pgoff_t prev_index;

	/*
	 * If the request exceeds the readahead window, allow the read to
	 * be up to the optimal hardware IO size
	 */
	if (req_size > max_pages && bdi->io_pages > max_pages)
		max_pages = min(req_size, bdi->io_pages);
##################################################################################################
-With 128 KB as the default maximum readahead size, the packet capture from the client side shows NFSv4 READ calls with a count alternating between 128 KB and 256 KB.
73403 29 172.31.17.191 -> 172.31.28.161 NFS 394 V4 Call READ StateID: 0xcec2 Offset: 1072955392 Len: 131072
73404 29 172.31.17.191 -> 172.31.28.161 NFS 394 V4 Call READ StateID: 0xcec2 Offset: 1073086464 Len: 262144
73406 29 172.31.28.161 -> 172.31.17.191 NFS 8699 V4 Reply (Call In 73403)[Unreassembled Packet]
73415 29 172.31.17.191 -> 172.31.28.161 NFS 394 V4 Call READ StateID: 0xcec2 Offset: 1073348608 Len: 131072
73416 29 172.31.28.161 -> 172.31.17.191 NFS 8699 V4 Reply (Call In 73404)[Unreassembled Packet]
73428 29 172.31.17.191 -> 172.31.28.161 NFS 394 V4 Call READ StateID: 0xcec2 Offset: 1073479680 Len: 131072
73429 29 172.31.28.161 -> 172.31.17.191 NFS 8699 V4 Reply (Call In 73415)[Unreassembled Packet]
73438 29 172.31.17.191 -> 172.31.28.161 NFS 394 V4 Call READ StateID: 0xcec2 Offset: 1073610752 Len: 131072
73439 29 172.31.28.161 -> 172.31.17.191 NFS 8699 V4 Reply (Call In 73428)[Unreassembled Packet]
-nfsstat shows that 8183 NFSv4 READ calls were required to read the 1 GB file.
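Those numbers are consistent with the clamp above. As a quick check, here is a small Python model of the condition and the READ call counts it implies (my own sketch, assuming 4 KB pages; the numeric values come from the debug output: req_size=64, io_pages=256, ra_pages=32):

```python
PAGE_KB = 4  # assumed page size in KB


def effective_max_pages(ra_pages, req_size, io_pages):
    """Model of the clamp in ondemand_readahead():
    if the request exceeds the readahead window, allow the read to be
    up to the optimal hardware IO size."""
    if req_size > ra_pages and io_pages > ra_pages:
        return min(req_size, io_pages)
    return ra_pages


# Default 128 KB window (32 pages), rsync reading 256 KB (64 pages),
# rsize=1MB (io_pages=256): the window is only lifted to 256 KB.
print(effective_max_pages(32, 64, 256) * PAGE_KB)    # -> 256 (KB)

# With read_ahead_kb=15360 (3840 pages) the window already covers the
# request, so full 1 MiB READs can be batched from the 15 MB window.
print(effective_max_pages(3840, 64, 256) * PAGE_KB)  # -> 15360 (KB)

# READ calls needed for the 1 GiB test file at each READ size:
GIB = 1024 ** 3
print(GIB // (256 * 1024))    # -> 4096 at 256 KB/READ (8183 observed,
                              #    since many READs were only 128 KB)
print(GIB // (1024 * 1024))   # -> 1024 at 1 MiB/READ, matching nfsstat
```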
# nfsstat
Client rpc stats:
calls      retrans    authrefrsh
8204       0          8204
Client nfs v4:
null         read         write        commit       open         open_conf
1         0% 8183     99% 0         0% 0         0% 0         0% 0         0%
open_noat    open_dgrd    close        setattr      fsinfo       renew
1         0% 0         0% 1         0% 0         0% 2         0% 0         0%
setclntid    confirm      lock         lockt        locku        access
0         0% 0         0% 0         0% 0         0% 0         0% 1         0%
getattr      lookup       lookup_root  remove       rename       link
4         0% 1         0% 1         0% 0         0% 0         0% 0         0%
symlink      create       pathconf     statfs       readlink     readdir
0         0% 0         0% 1         0% 0         0% 0         0% 0         0%
server_caps  delegreturn  getacl       setacl       fs_locations rel_lkowner
3         0% 0         0% 0         0% 0         0% 0         0% 0         0%
secinfo      exchange_id  create_ses   destroy_ses  sequence     get_lease_t
0         0% 0         0% 2         0% 1         0% 0         0% 0         0%
reclaim_comp layoutget    getdevinfo   layoutcommit layoutreturn getdevlist
0         0% 1         0% 0         0% 0         0% 0         0% 0         0%
(null)
1         0%
###############################################################################################
-When using 15 MB as the maximum readahead size, the client sends 1 MB NFSv4 READ requests and is therefore able to read the same 1 GB file in 1024 NFS READ calls.
# uname -r; mount -t nfs4 -o nfsvers=4.1,rsize=1052672,wsize=1048576,hard,timeo=600,retrans=2,noresvport fs-6700f553.efs.eu-west-1.amazonaws.com:/ efs; ./nfs-readahead show /home/ec2-user/efs/
5.3.9
/home/ec2-user/efs 0:40
/sys/class/bdi/0:40/read_ahead_kb = 15360
# sync; echo 3 > /proc/sys/vm/drop_caches
# rsync --progress efs/test .
test
  1,073,741,824 100%  260.15MB/s    0:00:03 (xfr#1, to-chk=0/1)
[root@ip-172-31-17-42 ec2-user]# nfsstat
Client rpc stats:
calls      retrans    authrefrsh
1043       0          1043
Client nfs v4:
null         read         write        commit       open         open_conf
1         0% 1024     98% 0         0% 0         0% 0         0% 0         0%
open_noat    open_dgrd    close        setattr      fsinfo       renew
1         0% 0         0% 1         0% 0         0% 2         0% 0         0%
setclntid    confirm      lock         lockt        locku        access
0         0% 0         0% 0         0% 0         0% 0         0% 1         0%
getattr      lookup       lookup_root  remove       rename       link
2         0% 1         0% 1         0% 0         0% 0         0% 0         0%
symlink      create       pathconf     statfs       readlink     readdir
0         0% 0         0% 1         0% 0         0% 0         0% 0         0%
server_caps  delegreturn  getacl       setacl       fs_locations rel_lkowner
3         0% 0         0% 0         0% 0         0% 0         0% 0         0%
secinfo      exchange_id  create_ses   destroy_ses  sequence     get_lease_t
0         0% 0         0% 2         0% 1         0% 0         0% 0         0%
reclaim_comp layoutget    getdevinfo   layoutcommit layoutreturn getdevlist
0         0% 1         0% 0         0% 0         0% 0         0% 0         0%
(null)
1         0%
-The packet capture from the client side shows NFSv4 READ calls with a 1 MB read count when the maximum NFS readahead size is 15 MB.
2021-03-22 14:25:34.984731 9398 172.31.17.42 → 172.31.28.161 NFS 0.000375 V4 Call READ StateID: 0x3640 Offset: 94371840 Len: 1048576
2021-03-22 14:25:34.984805 9405 172.31.17.42 → 172.31.28.161 NFS 0.000074 V4 Call READ StateID: 0x3640 Offset: 95420416 Len: 1048576
2021-03-22 14:25:34.984902 9416 172.31.17.42 → 172.31.28.161 NFS 0.000097 V4 Call READ StateID: 0x3640 Offset: 96468992 Len: 1048576
2021-03-22 14:25:34.984941 9421 172.31.17.42 → 172.31.28.161 NFS 0.000039 V4 Call READ StateID: 0x3640 Offset: 97517568 Len: 1048576
###############################################################################################
-I think there are two options to mitigate this behaviour:
A) Raise the default maximum NFS readahead size, because the current 128 KB default doesn't seem sufficient for high throughput & low latency workloads. I strongly believe the NFS rsize mount option should be used as a variable in deciding the maximum NFS readahead size, which was
the case before [1], while now it's always 128 KB regardless of the rsize mount option used. Also, clients running in high latency & low throughput environments shouldn't use 1 MB as the rsize in their mount options (i.e. they should use a smaller rsize), because it may increase their suffering even with a low maximum NFS readahead size.
B) Add some logic to readahead to provide autotuning (similar to TCP autotuning), where the maximum readahead size can dynamically increase when the client/reader is constantly filling up the readahead window.

Links:
[1] https://www.spinics.net/lists/linux-nfs/msg75018.html
[2] https://docs.aws.amazon.com/efs/latest/ug/mounting-fs-nfs-mount-settings.html
[3] https://bugzilla.kernel.org/show_bug.cgi?id=204939
[4] https://www.mail-archive.com/linux-kernel@xxxxxxxxxxxxxxx/msg1274743.html

Thank you.
Hazem

Amazon Web Services EMEA SARL, 38 avenue John F. Kennedy, L-1855 Luxembourg, R.C.S. Luxembourg B186284
Amazon Web Services EMEA SARL, Irish Branch, One Burlington Plaza, Burlington Road, Dublin 4, Ireland, branch registration number 908705