On 19/06/2024, Andrew Paniakin wrote:
> Commit 60e3318e3e900 ("cifs: use fs_context for automounts") was
> released in v6.1.54 and broke the failover when one of the servers
> inside DFS becomes unavailable. We reproduced the problem on EC2
> instances of different types. Reverting the aforementioned commit on top of
> the latest stable version v6.1.94 resolves the problem.
>
> Earliest working version is v6.2-rc1. There were two big merges of CIFS fixes:
> [1] and [2]. We would like to ask for help investigating this problem and
> whether some of those patches need to be backported. Also, is it safe to just revert
> the problematic commit until proper fixes/backports are available?
>
> We will help with testing and confirm whether a fix works, but let me also list the
> steps we used to reproduce the problem in case it helps to identify the cause:
> 1. Create Active Directory domain e.g. 'corp.fsxtest.local' in AWS Directory
> Service with:
> - three AWS FSX file systems filesystem1..filesystem3
> - three Windows servers; they have DFS installed as per
> https://learn.microsoft.com/en-us/windows-server/storage/dfs-namespaces/dfs-overview:
> - dfs-srv1: EC2AMAZ-2EGTM59
> - dfs-srv2: EC2AMAZ-1N36PRD
> - dfs-srv3: EC2AMAZ-0PAUH2U
>
> 2. Create DFS namespace e.g. 'dfs-namespace' in Windows server 2008 mode
> and three folder targets in it:
> - referral-a mapped to filesystem1.corp.local
> - referral-b mapped to filesystem2.corp.local
> - referral-c mapped to filesystem3.corp.local
> - local folders dfs-srv1..dfs-srv3 in C:\DFSRoots\dfs-namespace of every
> Windows server. This helps to quickly identify the underlying server when
> DFS is mounted.
>
> 3. Enable cifs debug logs:
> ```
> echo 'module cifs +p' > /sys/kernel/debug/dynamic_debug/control
> echo 'file fs/cifs/* +p' > /sys/kernel/debug/dynamic_debug/control
> echo 7 > /proc/fs/cifs/cifsFYI
> ```
>
> 4. Mount DFS namespace on Amazon Linux 2023 instance running any vanilla
> kernel v6.1.54+:
> ```
> dmesg -c &>/dev/null
> cd /mnt
> mount -t cifs -o cred=/mnt/creds,echo_interval=5 \
> //corp.fsxtest.local/dfs-namespace \
> ./dfs-namespace
> ```
>
> 5. List DFS root, it's also required to avoid recursive mounts that happen
> during regular 'ls' run:
> ```
> sh -c 'ls dfs-namespace'
> dfs-srv2 referral-a referral-b
> ```
>
> The DFS server is EC2AMAZ-1N36PRD, it's also listed in mount:
> ```
> [root@ip-172-31-2-82 mnt]# mount | grep dfs
> //corp.fsxtest.local/dfs-namespace on /mnt/dfs-namespace type cifs (rw,relatime,vers=3.1.1,cache=strict,username=Admin,domain=corp.fsxtest.local,uid=0,noforceuid,gid=0,noforcegid,addr=172.31.11.26,file_mode=0755,dir_mode=0755,soft,nounix,mapposix,rsize=4194304,wsize=4194304,bsize=1048576,echo_interval=5,actimeo=1,closetimeo=1)
> //EC2AMAZ-1N36PRD.corp.fsxtest.local/dfs-namespace/referral-a on /mnt/dfs-namespace/referral-a type cifs (rw,relatime,vers=3.1.1,cache=strict,username=Admin,domain=corp.fsxtest.local,uid=0,noforceuid,gid=0,noforcegid,addr=172.31.12.80,file_mode=0755,dir_mode=0755,soft,nounix,mapposix,rsize=4194304,wsize=4194304,bsize=1048576,echo_interval=5,actimeo=1,closetimeo=1)
> ```
>
> List files in first folder:
> ```
> sh -c 'ls dfs-namespace/referral-a'
> filea.txt.txt
> ```
>
> 6. Shut down DFS server-2.
> List DFS root again, server changed from dfs-srv2 to dfs-srv1 EC2AMAZ-2EGTM59:
> ```
> sh -c 'ls dfs-namespace'
> dfs-srv1 referral-a referral-b
> ```
>
> 7. Try to list files in another folder, this causes ls to fail with error:
> ```
> sh -c 'ls dfs-namespace/referral-b'
> ls: cannot access 'dfs-namespace/referral-b': No route to host
> ```
>
> Sometimes it's also 'Operation now in progress' error.
>
> mount shows the same output:
> ```
> //corp.fsxtest.local/dfs-namespace on /mnt/dfs-namespace type cifs (rw,relatime,vers=3.1.1,cache=strict,username=Admin,domain=corp.fsxtest.local,uid=0,noforceuid,gid=0,noforcegid,addr=172.31.11.26,file_mode=0755,dir_mode=0755,soft,nounix,mapposix,rsize=4194304,wsize=4194304,bsize=1048576,echo_interval=5,actimeo=1,closetimeo=1)
> //EC2AMAZ-1N36PRD.corp.fsxtest.local/dfs-namespace/referral-a on /mnt/dfs-namespace/referral-a type cifs (rw,relatime,vers=3.1.1,cache=strict,username=Admin,domain=corp.fsxtest.local,uid=0,noforceuid,gid=0,noforcegid,addr=172.31.12.80,file_mode=0755,dir_mode=0755,soft,nounix,mapposix,rsize=4194304,wsize=4194304,bsize=1048576,echo_interval=5,actimeo=1,closetimeo=1)
> ```
>
> I also attached kernel debug logs from this test.
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=851f657a86421
> [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0a924817d2ed9
>
> Reported-by: Andrei Paniakin <apanyaki@xxxxxxxxxx>
> Bisected-by: Simba Bonga <simbarb@xxxxxxxxxx>
> ---
>
> #regzbot introduced: v6.1.54..v6.2-rc1

Friendly reminder, has anyone had a chance to look into this report?
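
In case it helps whoever picks this up: the mount comparison in steps 5 to 7 comes down to which share and addr= each cifs mount is pinned to. Below is a minimal sketch for extracting that pair, assuming the `mount` output format quoted above. The function name `cifs_targets` is hypothetical and not part of any existing tool.

```shell
#!/bin/sh
# Hypothetical helper: print "<share> <addr>" for every cifs entry on stdin.
# Assumes lines shaped like the `mount` output quoted in the report:
#   //server/share on /mnt/point type cifs (...,addr=1.2.3.4,...)
cifs_targets() {
    awk '$2 == "on" && $4 == "type" && $5 == "cifs" {
        addr = "unknown"
        # The mount options are the last field, wrapped in parentheses.
        opts = $NF
        gsub(/[()]/, "", opts)
        n = split(opts, kv, ",")
        for (i = 1; i <= n; i++)
            if (kv[i] ~ /^addr=/) addr = substr(kv[i], 6)
        print $1, addr
    }'
}
```

Running `mount | cifs_targets` before and after shutting down the DFS server should make it easy to see whether any addr= changed. With the output quoted above, both runs would report the same two addresses.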