RE: xfsxyncd in 'D' state

Hi Brian,

Sorry about the top-posting; I'm not sure how to control that. Is the way I'm replying somehow causing it?

The good news is that I seem to have figured out what was going on.  I had a cron job, run every 15 minutes, that changed permissions and group ownership under a directory:
chmod -R g+rwx /data/shared/homes/bjanto/*
chmod -R g+rwx /data/shared/homes/lanastor/*
chgrp -hR ilmn /data/nextseq/*
chgrp -hR lab /data/shared/homes/*

Here /data is a directory on the mounted XFS filesystem.  The script itself completed in under a minute, so I thought everything was fine.  However, it would send the xfssyncd process into the 'D' state, and no writes were possible until xfssyncd finished whatever it was doing, which apparently took longer than 15 minutes.  As long as the cron job kept running, the drive never became writable again.

I've solved the problem by setting the setgid bit on the directories in question (so anything created in them gets the correct group), so no cron job is needed.  But is this expected behavior?  Should I change any settings on the mount?
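
Roughly speaking, the one-time setup that replaced the cron job looked something like this (a sketch, not the exact commands I ran; the directory and group names are the ones from the cron job above):

# give the trees the desired group, then set the setgid bit on every existing
# directory so newly created files and subdirectories inherit that group
chgrp -R lab /data/shared/homes
find /data/shared/homes/bjanto /data/shared/homes/lanastor -type d -exec chmod g+s {} +
chgrp -R ilmn /data/nextseq
find /data/nextseq -type d -exec chmod g+s {} +

# note: setgid only handles group inheritance; the group rwx bits on new files
# still depend on each user's umask (and the creating application's mode)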

I can definitely compress the files if need be; they were the /proc/meminfo etc. outputs requested in the FAQ.  I'm not sure at this point whether they are still required.

~josh

-----Original Message-----
From: Brian Foster [mailto:bfoster@xxxxxxxxxx] 
Sent: Thursday, September 17, 2015 3:21 PM
To: Earl, Joshua P <Joshua.Earl@xxxxxxxxxxxxx>
Cc: xfs@xxxxxxxxxxx
Subject: Re: xfsxyncd in 'D' state

On Thu, Sep 17, 2015 at 04:45:03PM +0000, Earl, Joshua P wrote:
> Anyone have any ideas on this?  Is this the right mailing list?  It looks like the email I sent with attachments didn't go through; should I copy and paste the outputs into an email?  We are pretty crippled without the use of this drive, and it took several weeks to figure out that it was this process going into uninterruptible sleep that was the cause.  I don't know what triggers it, though, and I'm not sure how to track that down.  There doesn't seem to be anything accessing the drive as far as processes go... but within about 5 minutes of a clean reboot this pops up:
> 

Were the attachments large? You could try to compress them or perhaps host them somewhere and post a link.

(Also, please try not to top-post).

> root      2216  0.0  0.0      0     0 ?        D    12:24   0:00 [xfssyncd/sdb1]
> 
> And we are dead in the water until it lets go, which is currently hours later.  When we first experienced this problem it would only take a few minutes to get back to a writable state.
> 

xfssyncd is responsible for periodic metadata writeback and related background work on older kernels. When was this problem "first experienced", as opposed to the current state? Did performance drop off slowly or rapidly?

> Any help would be greatly appreciated!
> 
> Thanks,
> ~josh
> 
> From: xfs-bounces@xxxxxxxxxxx [mailto:xfs-bounces@xxxxxxxxxxx] On 
> Behalf Of Earl, Joshua P
> Sent: Wednesday, September 16, 2015 12:50 PM
> To: xfs@xxxxxxxxxxx
> Subject: RE: xfsxyncd in 'D' state
> 
> I was also able to get the xfs_info after finally getting the drive remounted:
> 
> [root@ncb-sv-016 ~]# xfs_info /home
> meta-data=/dev/sdb1              isize=256    agcount=70, agsize=268435455 blks
>          =                       sectsz=512   attr=2, projid32bit=0
> data     =                       bsize=4096   blocks=18554637056, imaxpct=1
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0
> log      =internal               bsize=4096   blocks=521728, version=2
>          =                       sectsz=512   sunit=0 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> 

So this is a 70TB fs with what looks like mostly default settings. Note that no stripe unit/width are set, fwiw.

I don't see current fs utilization (df, df -i) or mount options reported anywhere. Can you provide that information?
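
Something along these lines would cover it (just a suggestion; adjust the path if the filesystem isn't actually mounted at /home):

df -h /home
df -i /home
grep /home /proc/mounts      # or 'mount | grep sdb1' if /proc/mounts really is missing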

> From: Earl, Joshua P
> Sent: Tuesday, September 15, 2015 1:53 PM
> To: 'xfs@xxxxxxxxxxx' <xfs@xxxxxxxxxxx<mailto:xfs@xxxxxxxxxxx>>
> Subject: xfsxyncd in 'D' state
> 
> Hello, I hope I'm writing to the correct list.  I've recently run into a problem which has me stumped.  I'm running a cluster that shares an XFS filesystem with 10 nodes via NFS.  This has been working for almost two years.  However, I've been running into trouble where, if anything tries to write to the drive at certain times, it simply hangs, and every process trying to write also hangs and goes into the 'D' state.  For example (just editing a text file with emacs):
> 

You haven't really described the workload either. If not much is going on on the server itself, what are those 10 NFS clients doing when this occurs? In general, the more information you provide about the environment and workload, the more likely it is that other folks here who are more familiar with NFS and/or hardware RAID will chime in with suggestions.

Not being an NFS expert myself, I'd probably unexport the filesystem, mount it locally and run some tests there to see what seems to induce this behavior, if anything. For example, what happens if existing files are read or directories listed? In terms of writes, does a sequential file writer have reasonable performance (dd)? Can you allocate inodes (e.g., create a temp dir somewhere and run a 'touch' loop to create new files) without any issues? You could also try to untar a tarball, run a short fio/fsstress/whatever workload, etc.
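
As a rough sketch of the kinds of local tests I mean (the mountpoint, sizes and counts below are just placeholders):

# sequential buffered write of 1GB, syncing at the end
dd if=/dev/zero of=/mnt/test/ddfile bs=1M count=1024 conv=fsync

# inode allocation: time the creation of a batch of new empty files
mkdir /mnt/test/touchdir
time for i in $(seq 1 10000); do touch /mnt/test/touchdir/file.$i; done

# metadata-heavy workload: unpack a decent-sized tarball
time tar -xf some-large-tarball.tar.gz -C /mnt/test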

If nothing seems to trigger it locally, I'd start adding the clients back to try to identify contributors.

> [root@ncb-sv-016 ~]# ps aux|grep D
> USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
> root      2216  0.0  0.0      0     0 ?        D    11:28   0:00 [xfssyncd/sdb1]
> archana   7708  0.0  0.0 249700 13352 pts/0    D    11:35   0:00 emacs things
> root     11453  0.0  0.0 103312   868 pts/1    S+   12:47   0:00 grep D
> 

What's the stack trace for the emacs process when this occurs? I suspect it would eventually get dumped to the logs as a stalled task, but /proc/<pid>/stack should show it as well.
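
For example (PIDs taken from the ps output above):

cat /proc/7708/stack     # kernel stack of the hung emacs process
cat /proc/2216/stack     # and the xfssyncd thread itself, for comparison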

> This will remain like this for hours.  I can't remount/unmount the drive (the umount command itself goes into the 'D' state).
> 
> I have no idea what's going on or how to fix it, but I'm hoping you guys might be able to point me in the right direction. Here is the info that's requested in the FAQ:
>
> - kernel version (uname -a):
>   Linux ncb-sv-016.ducom.edu 2.6.32-358.23.2.el6.x86_64 #1 SMP Wed Oct 16 18:37:12 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
> - xfsprogs version (xfs_repair -V):
>   xfs_repair version 3.1.1
> - number of CPUs:
>   16
> - contents of /proc/meminfo, /proc/mounts, /proc/partitions:
>   Attached, except for mounts (not currently in the /proc directory); I attached the fstab instead (hopefully helpful?)
> - RAID layout (hardware and/or software):
>   The RAID-6 unit is the problem one:
> Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
> ------------------------------------------------------------------------------
> u0    RAID-1    OK             -       -       -       3725.28   RiW    ON
> u1    RAID-6    OK             -       -       64K     70780.3   RiW    ON
> u2    SPARE     OK             -       -       -       3726.01   -      OFF
> 
> VPort Status         Unit Size      Type  Phy Encl-Slot    Model
> ------------------------------------------------------------------------------
> p8    OK             u0   3.63 TB   SATA  -   /c0/e0/slt0  WDC WD4000FYYZ-01UL
> p9    OK             u1   3.63 TB   SATA  -   /c0/e0/slt4  WDC WD4000FYYZ-01UL
> p10   OK             u1   3.63 TB   SATA  -   /c0/e0/slt8  WDC WD4000FYYZ-01UL
> p11   OK             u1   3.63 TB   SATA  -   /c0/e0/slt12 WDC WD4000FYYZ-01UL
> p12   OK             u1   3.63 TB   SATA  -   /c0/e0/slt16 WDC WD4000FYYZ-01UL
> p13   OK             u1   3.63 TB   SATA  -   /c0/e0/slt20 WDC WD4000FYYZ-01UL
> p14   OK             u0   3.63 TB   SATA  -   /c0/e0/slt1  WDC WD4000FYYZ-01UL
> p15   OK             u1   3.63 TB   SATA  -   /c0/e0/slt5  WDC WD4000FYYZ-01UL
> p16   OK             u1   3.63 TB   SATA  -   /c0/e0/slt9  WDC WD4000FYYZ-01UL
> p17   OK             u1   3.63 TB   SATA  -   /c0/e0/slt13 WDC WD4000FYYZ-01UL
> p18   OK             u1   3.63 TB   SATA  -   /c0/e0/slt17 WDC WD4000FYYZ-01UL
> p19   OK             u1   3.63 TB   SATA  -   /c0/e0/slt21 WDC WD4000FYYZ-01UL
> p20   OK             u1   3.63 TB   SATA  -   /c0/e0/slt2  WDC WD4000FYYZ-01UL
> p21   OK             u1   3.63 TB   SATA  -   /c0/e0/slt6  WDC WD4000FYYZ-01UL
> p22   OK             u1   3.63 TB   SATA  -   /c0/e0/slt10 WDC WD4000FYYZ-01UL
> p23   OK             u1   3.63 TB   SATA  -   /c0/e0/slt14 WDC WD4000FYYZ-01UL
> p24   OK             u1   3.63 TB   SATA  -   /c0/e0/slt18 WDC WD4000FYYZ-01UL
> p25   OK             u1   3.63 TB   SATA  -   /c0/e0/slt22 WDC WD4000FYYZ-01UL
> p26   OK             u1   3.63 TB   SATA  -   /c0/e0/slt3  WDC WD4000FYYZ-01UL
> p27   OK             u1   3.63 TB   SATA  -   /c0/e0/slt7  WDC WD4000FYYZ-01UL
> p28   OK             u1   3.63 TB   SATA  -   /c0/e0/slt11 WDC WD4000FYYZ-01UL
> p29   OK             u1   3.63 TB   SATA  -   /c0/e0/slt15 WDC WD4000FYYZ-01UL
> p30   OK             u1   3.63 TB   SATA  -   /c0/e0/slt19 WDC WD4000FYYZ-01UL
> p31   OK             u2   3.63 TB   SATA  -   /c0/e0/slt23 WDC WD4000FYYZ-01UL

That's a 21-disk RAID-6, which seems like a high spindle count for a RAID-6 stripe geometry to me. I guess it depends on the use case.
Was this always the geometry of the array or was it grown over time?

> - LVM configuration:
>   I don't *think* these are in an LVM... I could be wrong.
> - type of disks you are using:
>   Models are included above in the RAID config.
> - write cache status of drives
> - size of BBWC and mode it is running in
> - xfs_info output on the filesystem in question:
>   For the above three questions, I'm not sure how to get the cache status of the drives, or what the BBWC is. xfs_info won't currently run (I'm waiting on the drive to unmount), but I ran an xfs_check and an xfs_repair -n and no errors were shown.
> - dmesg output showing all error messages and stack traces
>
> Then you need to describe your workload that is causing the problem, and a demonstration of the bad behaviour that is occurring. If it is a performance problem, then 30s - 1 minute samples of:
>
> 1.    iostat -x -d -m 5
> [root@ncb-sv-016 ~]# iostat -x -d -m 5
> Linux 2.6.32-358.23.2.el6.x86_64 (ncb-sv-016.ducom.edu)       09/15/2015    _x86_64_      (16 CPU)
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
> sda               0.29     3.61    5.78    3.58     0.10     0.03    28.27     0.05    5.19   2.39   2.24
> sdb               1.02     8.66   31.50    3.91     0.33     0.12    26.14     5.94  167.54  27.47  97.25
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
> sda               0.00     1.60    0.00    2.00     0.00     0.01    14.40     0.01    4.30   4.30   0.86
> sdb               0.00     0.00    0.00    0.80     0.00     0.03    64.00     6.46 6332.75 1250.00 100.00
> 

It looks like very little write activity is causing severe I/O latencies (on the order of seconds) and 100% device utilization. Without some of the details noted above, it's hard to tell what's going wrong here beyond the fact that the storage just appears to be running really slowly.

Brian

> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
> sda               0.00     3.60    0.00    7.00     0.00     0.04    12.11     0.02    0.60   1.77   1.24
> sdb               0.00     0.00    0.00    1.00     0.00     0.03    64.00     6.28 6256.60 1000.00 100.00
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
> sda               0.00     0.00    0.00    0.40     0.00     0.00     8.00     0.00   42.50  12.00   0.48
> sdb               0.00     0.00    0.00    1.20     0.00     0.04    64.00     5.86 5846.33 833.33 100.00
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
> sda               0.00     0.60    0.00    0.60     0.00     0.00    16.00     0.01   12.67  12.67   0.76
> sdb               0.00     0.00    0.00    1.00     0.00     0.03    64.00     6.86 5725.20 1000.00 100.00
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
> sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> sdb               0.00     0.00    0.00    1.00     0.00     0.03    64.00     6.06 5459.00 1000.00 100.00
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
> sda               0.00     4.00    0.00    1.60     0.00     0.02    26.00     0.01    6.75   6.50   1.04
> sdb               0.00     0.00    0.00    1.00     0.00     0.03    51.20     7.05 5670.40 1000.00 100.00
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
> sda               0.00    29.40    0.00    2.60     0.00     0.12    98.46     0.01    4.54   4.08   1.06
> sdb               0.00     0.00    0.00    1.20     0.00     0.03    53.33     6.54 7428.50 833.33 100.00
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
> sda               0.00     9.60    0.00   15.80     0.00     0.10    12.86     0.57   35.82   3.37   5.32
> sdb               0.00     0.00    0.00    1.00     0.00     0.03    64.00     6.30 5889.20 1000.00 100.00
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
> sda               0.00     7.80    0.00   12.80     0.00     0.08    12.38     0.74   58.09  15.06  19.28
> sdb               0.00     0.00    0.00    1.20     0.00     0.04    64.00     6.49 6140.83 833.33 100.00
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
> sda               0.00     4.80    0.00    3.20     0.00     0.03    20.00     0.01    0.06   3.12   1.00
> sdb               0.00     0.00    0.00    0.80     0.00     0.03    64.00     5.10 6489.25 1250.00 100.00
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
> sda               0.00     0.00    0.00    0.20     0.00     0.00     8.00     0.02  152.00 103.00   2.06
> sdb               0.00     0.00    0.00    1.00     0.00     0.03    64.00     6.75 5791.00 1000.20 100.02
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
> sda               0.00    61.20    0.00   11.80     0.00     0.29    49.49     0.01    0.88   0.69   0.82
> sdb               0.00     0.00    0.00    1.40     0.00     0.04    64.00     6.37 5569.71 714.14  99.98
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
> sda               0.00     0.40    0.00    0.60     0.00     0.00    13.33     0.01   24.33  24.33   1.46
> sdb               0.00     0.00    0.00    1.60     0.00     0.05    64.00     5.77 5162.00 625.12 100.02
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
> sda               0.00     0.00    0.00    0.40     0.00     0.00     8.00     0.00    3.00   1.50   0.06
> sdb               0.00     0.00    0.00    0.80     0.00     0.03    64.00     5.08 3428.50 1250.00 100.00
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
> sda               0.00     1.60    0.00    0.80     0.00     0.01    24.00     0.01   10.75  10.75   0.86
> sdb               0.00     0.00    0.00    1.40     0.00     0.04    64.00     5.86 3932.14 714.29 100.00
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
> sda               0.00     2.00    0.00    4.80     0.00     0.03    11.33     0.01    2.21   2.08   1.00
> sdb               0.00     0.00    0.00    1.40     0.00     0.04    64.00     5.60 3992.71 714.29 100.00
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
> sda               0.00     5.00    0.00   18.20     0.00     0.09    10.20     0.02    1.13   0.03   0.06
> sdb               0.00     0.00    0.00    1.40     0.00     0.04    64.00     5.44 4208.86 714.29 100.00
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
> sda               0.00     1.40    0.00    0.60     0.00     0.01    26.67     0.02   27.00  27.00   1.62
> sdb               0.00     0.00    0.00    1.40     0.00     0.04    64.00     5.22 4325.43 714.29 100.00
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
> sda               0.00     0.60    0.00    0.40     0.00     0.00    20.00     0.01   15.50  15.50   0.62
> sdb               0.00     0.00    0.00    1.60     0.00     0.05    64.00     5.06 4022.75 625.00 100.00
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
> sda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
> sdb               0.00     0.00    0.00    0.80     0.00     0.03    64.00     5.08 3495.50 1250.00 100.00
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util
> sda               0.00     1.60    0.00    1.60     0.00     0.01    12.00     0.07   42.88  42.50   6.80
> sdb               0.00     0.00    0.00    1.40     0.00     0.04    64.00     5.82 3894.71 714.29 100.00
> 2.    vmstat 5
> [root@ncb-sv-016 ~]# vmstat 5
> procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
> r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
> 0  0      0 126125768  19456 405396    0    0    28    10  421  150  0  0 99  0  0
> 0  0      0 126125520  19464 405396    0    0     0    34 6679 13281  0  0 100  0  0
> 1  0      0 126124896  19472 405392    0    0     0    38 6718 13310  0  0 100  0  0
> 0  0      0 126125312  19472 405400    0    0     0    74 6658 13256  0  0 100  0  0
> 0  0      0 126125440  19480 405392    0    0     0    60 6664 13291  0  0 100  0  0
> 2  0      0 126125440  19480 405400    0    0     0    26 6660 13272  0  0 100  0  0
> 0  0      0 126125680  19488 405400    0    0     0    30 6659 13282  0  0 100  0  0
> 2  0      0 126125696  19496 405396    0    0     0   117 6686 13298  0  0 100  0  0
> 1  0      0 126125568  19496 405400    0    0     0    33 6661 13287  0  0 100  0  0
> 0  0      0 126125816  19504 405400    0    0     0    30 6663 13271  0  0 100  0  0
> 1  0      0 126125816  19504 405400    0    0     0    27 6659 13285  0  0 100  0  0
> 0  0      0 126125696  19512 405400    0    0     0    75 6670 13269  0  0 100  0  0
> 0  0      0 126125816  19520 405400    0    0     0    55 6671 13286  0  0 100  0  0
> 2  0      0 126125696  19528 405396    0    0     0    34 6670 13284  0  0 100  0  0
> 0  0      0 126125272  19528 405400    0    0     0    26 6700 13298  0  0 100  0  0
> 0  0      0 126125408  19536 405400    0    0     0    61 6660 13277  0  0 100  0  0
> 1  0      0 126125536  19544 405392    0    0     0    98 6677 13281  0  0 100  0  0
> can give us insight into the IO and memory utilisation of your machine at the time of the problem.
> If the filesystem is hanging, then capture the output of the dmesg command after running:
> # echo w > /proc/sysrq-trigger
> # dmesg
> will tell us all the hung processes in the machine, often pointing us directly to the cause of the hang.
> Attached
> 
> Thanks!
> 
> Josh Earl, MS
> Research Instructor
> Drexel College of Medicine
> Center for Advanced Microbial Processing (CAMP) Institute of Molecular 
> Medicine and Infectious Disease
> (215) 762-8133
> 
> ________________________________
> 
> This email and any accompanying attachments are confidential. The information is intended solely for the use of the individual to whom it is addressed. Any review, disclosure, copying, distribution, or use of this email communication by others is strictly prohibited. If you are not the intended recipient, please notify the sender immediately and delete all copies. Thank you for your cooperation.


_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs


