Re: [XFS] Any process to a particular XFS device hung in D state forever.

Hi Brian,

Here's a gist that includes the sysrq-trigger output and the strace of one of the hanging $ls commands. It is from another problematic disk (d817) on the same server.

https://gist.github.com/HugoKuo/8eb8208bbb7a7f562a6c9a3eafa8f37f

It looks like the hanging $ls is stuck getting an extended attribute of a file on this disk. The full output can be found in the link above.

lstat("/srv/node/d864/tmp/tmpIRYFaW", {st_mode=S_IFREG|0600, st_size=0, ...}) = 0
capget(0x20080522, 0, NULL) = -1 EFAULT (Bad address)
getxattr("/srv/node/d864/tmp/tmpIRYFaW", "security.capability"
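One possible reading of the trace above, sketched in Python. It assumes (the thread does not confirm this) that $ls is a colorizing ls when writing to a terminal, so it does an lstat and a "security.capability" lookup per entry, while the redirected $ls only reads the directory entries. The function names are hypothetical and only illustrate the two access patterns; on a healthy filesystem both return immediately.

```python
import os


def list_names_only(path):
    """Read directory entries only; no per-file lstat or xattr lookup."""
    return sorted(os.listdir(path))


def list_with_capability_probe(path):
    """Per entry: lstat, then try to read the security.capability xattr.

    This mirrors the lstat/getxattr pair in the strace above. On the hung
    disk, the getxattr call is where the process would block.
    """
    getxattr = getattr(os, "getxattr", None)  # os.getxattr is Linux-only
    results = {}
    for name in os.listdir(path):
        full = os.path.join(path, name)
        os.lstat(full)
        value = None
        if getxattr is not None:
            try:
                value = getxattr(full, "security.capability",
                                 follow_symlinks=False)
            except OSError:
                # Attribute not set or xattrs unsupported: treat as absent.
                value = None
        results[name] = value
    return results
```

If this reading is right, list_names_only("/srv/node/d864/tmp") would correspond to the $ls that completes, and list_with_capability_probe on the same path to the one that hangs.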


As for the xfs_repair output at https://gist.github.com/HugoKuo/76f65bdc0b860ca6ed5e786f8c43da0e : you asked whether the node had been force rebooted. The answer is NO. I haven't rebooted this server yet. I force unmounted the device via $umount -l <dev> and then ran xfs_repair.

$ls /srv/node/d864/tmp > test.d864
$ls /srv/node/d864/tmp


Thanks // Hugo

On Tue, Apr 19, 2016 at 7:30 PM, Brian Foster <bfoster@xxxxxxxxxx> wrote:
On Tue, Apr 19, 2016 at 05:56:19PM +0800, Hugo Kuo wrote:
> Hi XFS team,
>
> We have frequently encountered a problem over the past three weeks. Our
> daemons store data on XFS partitions, with associated xattrs.
>
> The disk seems to stop responding: all processes accessing it are in D
> state and can't be killed at all.
>
>    - It happens on several disks, seemingly at random.
>    - A reboot seems to solve the problem temporarily.
>    - All disks are multipath devices.
>
>
> At first I suspected disk corruption. But smartctl doesn't show any sign
> of a bad disk, and a reboot makes the problem go away.
>
>
>    - Any process accessing this disk is blocked, even a simple $ls . Kernel log
>    <https://gist.github.com/HugoKuo/f87748786b26ea04fd9e1d86d9538293>

Looks like it's waiting on an AGF buffer. The buffer could be held by
something else, but we don't have enough information from that one
trace. Could you get all of the blocked tasks when in this state (e.g.,
"echo w > /proc/sysrq-trigger")?
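As a lighter complement to the sysrq dump, here is a minimal Python sketch that lists processes currently in D state by reading /proc/<pid>/stat. It is not a replacement for "echo w": it shows no kernel stacks and only looks at process (thread-group leader) entries, not individual threads. The function names are hypothetical.

```python
import os


def parse_stat_line(line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat.

    The line looks like "1234 (comm) D ..."; comm may itself contain
    spaces or parentheses, so split around the last ')'.
    """
    left, right = line.rsplit(")", 1)
    pid_text, comm = left.split("(", 1)
    state = right.split()[0]
    return int(pid_text), comm, state


def blocked_tasks(proc_root="/proc"):
    """List (pid, comm) for every process in uninterruptible sleep (D)."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat_line(f.read())
        except (OSError, ValueError):
            # Process exited during the scan, or the entry was unreadable.
            continue
        if state == "D":
            blocked.append((pid, comm))
    return sorted(blocked)
```

Calling blocked_tasks() on the affected node should return the hung $ls and daemon processes; the sysrq output is still what shows where in the kernel each one is waiting.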


 

>    - I tested the disk by reading blocks from it via $dd . It works fine
>    without any error in dmesg.
>    - The `xfs_repair -n` output of a problematic mount point [xfs_repair -n]
>    <https://gist.github.com/HugoKuo/76f65bdc0b860ca6ed5e786f8c43da0e> . It
>    is still processing.

I presume this was run after a forced reboot..? If so, was the
filesystem remounted first to replay the log (xfs_repair -n doesn't
detect/warn about a dirty log, iirc)? If the log was dirty, then repair
is a bit less interesting simply because some corruption is to be
expected in that scenario.

>    - Kernel : Linux node9 2.6.32-573.8.1.el6.x86_64 #1 SMP Tue Nov 10
>    18:01:38 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
>    - OS : CentOS release 6.5 (Final)
>    - XFS : xfsprogs.x86_64         3.1.1-14.el6
>
>
> There's an interesting behaviour of the $ls command.
>
> * This completes within 1 sec and writes the result to the
> test.d864 file: $ls /srv/node/d864/tmp > test.d864
> * This hangs: $ls /srv/node/d864/tmp
>

I'm not following you here. Are you missing an attachment (test.d864)?

Brian

> [image: Inline image 1]
>
> I suspect there's something wrong with imap. Is there a known bug?
>
> Thanks // Hugo



> _______________________________________________
> xfs mailing list
> xfs@xxxxxxxxxxx
> http://oss.sgi.com/mailman/listinfo/xfs

