I was also able to get the xfs_info after finally getting the drive remounted:

[root@ncb-sv-016 ~]# xfs_info /home
meta-data=""             isize=256    agcount=70, agsize=268435455 blks
         =               sectsz=512   attr=2, projid32bit=0
data     =               bsize=4096   blocks=18554637056, imaxpct=1
         =               sunit=0      swidth=0 blks
naming   =version 2      bsize=4096   ascii-ci=0
log      =internal       bsize=4096   blocks=521728, version=2
         =               sectsz=512   sunit=0 blks, lazy-count=1
realtime =none           extsz=4096   blocks=0, rtextents=0

From: Earl, Joshua P

Hello,

I hope I’m writing to the correct list. I’ve recently run into a problem which has me stumped. I’m running a cluster which shares an xfs filesystem to 10 nodes via nfs. This has been working for almost two years. However, I’ve
been running into trouble with the drive where, if anything tries to write to it at certain times, it will simply hang, and every process trying to write will also hang and go into the ‘D’ state. For example (just editing a text file with emacs):

[root@ncb-sv-016 ~]# ps aux|grep D
USER       PID %CPU %MEM    VSZ   RSS TTY   STAT START TIME COMMAND
root      2216  0.0  0.0      0     0 ?     D    11:28 0:00 [xfssyncd/sdb1]
archana   7708  0.0  0.0 249700 13352 pts/0 D    11:35 0:00 emacs things
root     11453  0.0  0.0 103312   868 pts/1 S+   12:47 0:00 grep D

It will remain like this for hours, and I can’t remount or unmount the drive (that just sends the unmount command into the ‘D’ state as well). I have no idea what’s going on or how to fix it, but I’m hoping you guys might be able to point me in the right direction. Here is the info that’s requested in the FAQ:
* kernel version (uname -a)
Linux ncb-sv-016.ducom.edu 2.6.32-358.23.2.el6.x86_64 #1 SMP Wed Oct 16 18:37:12 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
* xfsprogs version (xfs_repair -V)
xfs_repair version 3.1.1
* number of CPUs
16
* contents of /proc/meminfo
* contents of /proc/mounts
* contents of /proc/partitions
Attached, except for mounts (not currently in the /proc directory); I attached the fstab instead (hopefully helpful?).
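In case it helps, here is a minimal sketch of how the requested /proc data could be captured into files for attaching (the output file names are just examples); /proc/mounts is normally a symlink to /proc/self/mounts, so the mount command can stand in if it really is absent:

# collect the /proc files the FAQ asks for (example file names)
cat /proc/meminfo    > meminfo.txt
cat /proc/partitions > partitions.txt
# fall back to 'mount' if /proc/mounts (i.e. /proc/self/mounts) is missing
cat /proc/self/mounts > mounts.txt 2>/dev/null || mount > mounts.txt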
* RAID layout (hardware and/or software)
The RAID-6 is the problem one:
Unit UnitType Status %RCmpl %V/I/M Stripe Size(GB) Cache AVrfy
------------------------------------------------------------------------------
u0 RAID-1 OK - - - 3725.28 RiW ON
u1 RAID-6 OK - - 64K 70780.3 RiW ON
u2 SPARE OK - - - 3726.01 - OFF
VPort Status Unit Size Type Phy Encl-Slot Model
------------------------------------------------------------------------------
p8 OK u0 3.63 TB SATA - /c0/e0/slt0 WDC WD4000FYYZ-01UL
p9 OK u1 3.63 TB SATA - /c0/e0/slt4 WDC WD4000FYYZ-01UL
p10 OK u1 3.63 TB SATA - /c0/e0/slt8 WDC WD4000FYYZ-01UL
p11 OK u1 3.63 TB SATA - /c0/e0/slt12 WDC WD4000FYYZ-01UL
p12 OK u1 3.63 TB SATA - /c0/e0/slt16 WDC WD4000FYYZ-01UL
p13 OK u1 3.63 TB SATA - /c0/e0/slt20 WDC WD4000FYYZ-01UL
p14 OK u0 3.63 TB SATA - /c0/e0/slt1 WDC WD4000FYYZ-01UL
p15 OK u1 3.63 TB SATA - /c0/e0/slt5 WDC WD4000FYYZ-01UL
p16 OK u1 3.63 TB SATA - /c0/e0/slt9 WDC WD4000FYYZ-01UL
p17 OK u1 3.63 TB SATA - /c0/e0/slt13 WDC WD4000FYYZ-01UL
p18 OK u1 3.63 TB SATA - /c0/e0/slt17 WDC WD4000FYYZ-01UL
p19 OK u1 3.63 TB SATA - /c0/e0/slt21 WDC WD4000FYYZ-01UL
p20 OK u1 3.63 TB SATA - /c0/e0/slt2 WDC WD4000FYYZ-01UL
p21 OK u1 3.63 TB SATA - /c0/e0/slt6 WDC WD4000FYYZ-01UL
p22 OK u1 3.63 TB SATA - /c0/e0/slt10 WDC WD4000FYYZ-01UL
p23 OK u1 3.63 TB SATA - /c0/e0/slt14 WDC WD4000FYYZ-01UL
p24 OK u1 3.63 TB SATA - /c0/e0/slt18 WDC WD4000FYYZ-01UL
p25 OK u1 3.63 TB SATA - /c0/e0/slt22 WDC WD4000FYYZ-01UL
p26 OK u1 3.63 TB SATA - /c0/e0/slt3 WDC WD4000FYYZ-01UL
p27 OK u1 3.63 TB SATA - /c0/e0/slt7 WDC WD4000FYYZ-01UL
p28 OK u1 3.63 TB SATA - /c0/e0/slt11 WDC WD4000FYYZ-01UL
p29 OK u1 3.63 TB SATA - /c0/e0/slt15 WDC WD4000FYYZ-01UL
p30 OK u1 3.63 TB SATA - /c0/e0/slt19 WDC WD4000FYYZ-01UL
p31 OK u2 3.63 TB SATA - /c0/e0/slt23 WDC WD4000FYYZ-01UL
* LVM configuration
I don’t *think* these are in an LVM... I could be wrong.
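For completeness, a quick sketch of how that could be confirmed (assuming the LVM2 tools are installed, and taking /dev/sdb1 from the ps output above as the device in question):

# list any LVM physical volumes, volume groups and logical volumes
pvs; vgs; lvs
# show what the device actually contains (e.g. TYPE="xfs" vs. TYPE="LVM2_member")
blkid /dev/sdb1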
* type of disks you are using
The models are included above in the RAID config.
* write cache status of drives
* size of BBWC and mode it is running in
* xfs_info output on the filesystem in question
For the above three questions: I’m not sure how to get the cache status of the drives, or what the BBWC is. xfs_info won’t currently run (I’m waiting on the drive to unmount), but I ran an xfs_check and an xfs_repair -n and no errors were shown.
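The RAID listing above looks like tw_cli output from a 3ware/LSI controller; assuming that is right, something along these lines should show the unit cache policy, the BBU (battery-backed write cache) status, and per-drive details (the controller/unit/port numbers are taken from the table above, and /dev/twa0 is only a guess at the device node):

# controller summary, including unit cache settings and the BBU line if present
tw_cli /c0 show
# full details (cache policy, state) for the problem RAID-6 unit
tw_cli /c0/u1 show all
# battery-backed cache status, if a BBU is installed
tw_cli /c0/bbu show all
# per-drive SMART/write-cache details via smartctl's 3ware pass-through
smartctl -a -d 3ware,9 /dev/twa0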
* dmesg output showing all error messages and stack traces
Then you need to describe your workload that is causing the problem, and a demonstration of the bad behaviour that is occurring. If it is a performance problem, then 30s - 1 minute samples
of:
1. iostat -x -d -m 5
[root@ncb-sv-016 ~]# iostat -x -d -m 5
Linux 2.6.32-358.23.2.el6.x86_64 (ncb-sv-016.ducom.edu) 09/15/2015 _x86_64_ (16 CPU)
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sda 0.29 3.61 5.78 3.58 0.10 0.03 28.27 0.05 5.19 2.39 2.24
sdb 1.02 8.66 31.50 3.91 0.33 0.12 26.14 5.94 167.54 27.47 97.25
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 1.60 0.00 2.00 0.00 0.01 14.40 0.01 4.30 4.30 0.86
sdb 0.00 0.00 0.00 0.80 0.00 0.03 64.00 6.46 6332.75 1250.00 100.00
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 3.60 0.00 7.00 0.00 0.04 12.11 0.02 0.60 1.77 1.24
sdb 0.00 0.00 0.00 1.00 0.00 0.03 64.00 6.28 6256.60 1000.00 100.00
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.00 0.00 0.40 0.00 0.00 8.00 0.00 42.50 12.00 0.48
sdb 0.00 0.00 0.00 1.20 0.00 0.04 64.00 5.86 5846.33 833.33 100.00
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.60 0.00 0.60 0.00 0.00 16.00 0.01 12.67 12.67 0.76
sdb 0.00 0.00 0.00 1.00 0.00 0.03 64.00 6.86 5725.20 1000.00 100.00
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 0.00 0.00 0.00 1.00 0.00 0.03 64.00 6.06 5459.00 1000.00 100.00
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 4.00 0.00 1.60 0.00 0.02 26.00 0.01 6.75 6.50 1.04
sdb 0.00 0.00 0.00 1.00 0.00 0.03 51.20 7.05 5670.40 1000.00 100.00
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 29.40 0.00 2.60 0.00 0.12 98.46 0.01 4.54 4.08 1.06
sdb 0.00 0.00 0.00 1.20 0.00 0.03 53.33 6.54 7428.50 833.33 100.00
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 9.60 0.00 15.80 0.00 0.10 12.86 0.57 35.82 3.37 5.32
sdb 0.00 0.00 0.00 1.00 0.00 0.03 64.00 6.30 5889.20 1000.00 100.00
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 7.80 0.00 12.80 0.00 0.08 12.38 0.74 58.09 15.06 19.28
sdb 0.00 0.00 0.00 1.20 0.00 0.04 64.00 6.49 6140.83 833.33 100.00
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 4.80 0.00 3.20 0.00 0.03 20.00 0.01 0.06 3.12 1.00
sdb 0.00 0.00 0.00 0.80 0.00 0.03 64.00 5.10 6489.25 1250.00 100.00
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.00 0.00 0.20 0.00 0.00 8.00 0.02 152.00 103.00 2.06
sdb 0.00 0.00 0.00 1.00 0.00 0.03 64.00 6.75 5791.00 1000.20 100.02
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 61.20 0.00 11.80 0.00 0.29 49.49 0.01 0.88 0.69 0.82
sdb 0.00 0.00 0.00 1.40 0.00 0.04 64.00 6.37 5569.71 714.14 99.98
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.40 0.00 0.60 0.00 0.00 13.33 0.01 24.33 24.33 1.46
sdb 0.00 0.00 0.00 1.60 0.00 0.05 64.00 5.77 5162.00 625.12 100.02
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.00 0.00 0.40 0.00 0.00 8.00 0.00 3.00 1.50 0.06
sdb 0.00 0.00 0.00 0.80 0.00 0.03 64.00 5.08 3428.50 1250.00 100.00
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 1.60 0.00 0.80 0.00 0.01 24.00 0.01 10.75 10.75 0.86
sdb 0.00 0.00 0.00 1.40 0.00 0.04 64.00 5.86 3932.14 714.29 100.00
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 2.00 0.00 4.80 0.00 0.03 11.33 0.01 2.21 2.08 1.00
sdb 0.00 0.00 0.00 1.40 0.00 0.04 64.00 5.60 3992.71 714.29 100.00
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 5.00 0.00 18.20 0.00 0.09 10.20 0.02 1.13 0.03 0.06
sdb 0.00 0.00 0.00 1.40 0.00 0.04 64.00 5.44 4208.86 714.29 100.00
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 1.40 0.00 0.60 0.00 0.01 26.67 0.02 27.00 27.00 1.62
sdb 0.00 0.00 0.00 1.40 0.00 0.04 64.00 5.22 4325.43 714.29 100.00
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.60 0.00 0.40 0.00 0.00 20.00 0.01 15.50 15.50 0.62
sdb 0.00 0.00 0.00 1.60 0.00 0.05 64.00 5.06 4022.75 625.00 100.00
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 0.00 0.00 0.00 0.80 0.00 0.03 64.00 5.08 3495.50 1250.00 100.00
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 1.60 0.00 1.60 0.00 0.01 12.00 0.07 42.88 42.50 6.80
sdb 0.00 0.00 0.00 1.40 0.00 0.04 64.00 5.82 3894.71 714.29 100.00
2. vmstat 5
[root@ncb-sv-016 ~]# vmstat 5
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 0 0 126125768 19456 405396 0 0 28 10 421 150 0 0 99 0 0
0 0 0 126125520 19464 405396 0 0 0 34 6679 13281 0 0 100 0 0
1 0 0 126124896 19472 405392 0 0 0 38 6718 13310 0 0 100 0 0
0 0 0 126125312 19472 405400 0 0 0 74 6658 13256 0 0 100 0 0
0 0 0 126125440 19480 405392 0 0 0 60 6664 13291 0 0 100 0 0
2 0 0 126125440 19480 405400 0 0 0 26 6660 13272 0 0 100 0 0
0 0 0 126125680 19488 405400 0 0 0 30 6659 13282 0 0 100 0 0
2 0 0 126125696 19496 405396 0 0 0 117 6686 13298 0 0 100 0 0
1 0 0 126125568 19496 405400 0 0 0 33 6661 13287 0 0 100 0 0
0 0 0 126125816 19504 405400 0 0 0 30 6663 13271 0 0 100 0 0
1 0 0 126125816 19504 405400 0 0 0 27 6659 13285 0 0 100 0 0
0 0 0 126125696 19512 405400 0 0 0 75 6670 13269 0 0 100 0 0
0 0 0 126125816 19520 405400 0 0 0 55 6671 13286 0 0 100 0 0
2 0 0 126125696 19528 405396 0 0 0 34 6670 13284 0 0 100 0 0
0 0 0 126125272 19528 405400 0 0 0 26 6700 13298 0 0 100 0 0
0 0 0 126125408 19536 405400 0 0 0 61 6660 13277 0 0 100 0 0
1 0 0 126125536 19544 405392 0 0 0 98 6677 13281 0 0 100 0 0
can give us insight into the IO and memory utilisation of your machine at the time of the problem.
If the filesystem is hanging, then capture the output of the dmesg command after running:
# echo w > /proc/sysrq-trigger
# dmesg
will tell us all the hung processes in the machine, often pointing us directly to the cause of the hang.
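A minimal sketch of that capture sequence, run as root (the output file name is just an example):

# make sure the magic-sysrq interface is enabled
echo 1 > /proc/sys/kernel/sysrq
# ask the kernel to log stack traces for all blocked (D-state) tasks
echo w > /proc/sysrq-trigger
# save the resulting kernel messages for the report
dmesg > hung-task-traces.txt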
Attached.

Thanks!

Josh Earl, MS
Research Instructor
Drexel College of Medicine
Center for Advanced Microbial Processing (CAMP)
Institute of Molecular Medicine and Infectious Disease
(215) 762-8133
_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs