Re: [cfarm-admins] gcc202 is occasionally returning EIO from fdatasync(2)

Anatoly Pugachev <matorola@xxxxxxxxx> · Fri, 7 May 2021 11:40:06 +0300

On Fri, May 7, 2021 at 4:22 AM Tom Lane via cfarm-admins
<cfarm-admins@xxxxxxxxxxxxxxxxxxxxx> wrote:
>
> Hi folks,
>
> I thought you ought to know about $SUBJECT.  Maybe it's some
> weird kernel glitch, but if it is reflecting real I/O errors,
> maybe that machine is about to have a disk failure.  Poking
> into its SMART logs (if any) might be useful.
>
> I got interested in this because a Postgres buildfarm instance
> that runs periodically on that machine reported a couple of
> unexplainable failures in the last few weeks [1].  I was able
> to reproduce the failure and determine that it's a fault in
> the logic that ought to report a failure from fdatasync(2).
> Looking in the core file shows that errno = 5 (EIO) is what
> was reported.  So we (PG) have some things to fix, but meanwhile
> I felt you'd better know about the possibility of a hardware
> issue.
>
>                         regards, tom lane
>
> [1] https://www.postgresql.org/message-id/CA+hUKGLhc0Nwnn9u60oYrx4MAUga+qEvj+4pBqPwrmPKDNtFmA@xxxxxxxxxxxxxx

Tom,

I just checked the /home filesystem and found no errors. But yes, I do
sporadically see kernel messages like these in the logs:

May 07 03:26:45 gcc202 kernel: sunvdc: vdc_tx_trigger() failure, err=-11
May 07 03:26:45 gcc202 kernel: blk_update_request: I/O error, dev vdiskc, sector 159273120 op 0x1:(WRITE) flags 0x4800 phys_seg 17 prio class 0
May 07 03:31:39 gcc202 kernel: dm-0: writeback error on inode 2148294407, offset 0, sector 159239256
May 07 03:31:39 gcc202 kernel: sunvdc: vdc_tx_trigger() failure, err=-11
May 07 03:31:39 gcc202 kernel: blk_update_request: I/O error, dev vdiskc, sector 157618896 op 0x1:(WRITE) flags 0x4800 phys_seg 16 prio class 0
May 07 03:35:06 gcc202 kernel: dm-0: writeback error on inode 155142134, offset 0, sector 157584576
May 07 03:35:06 gcc202 kernel: sunvdc: vdc_tx_trigger() failure, err=-11
May 07 03:35:06 gcc202 kernel: blk_update_request: I/O error, dev vdiskc, sector 657284672 op 0x1:(WRITE) flags 0x1000 phys_seg 4 prio class 0
May 07 03:35:06 gcc202 kernel: XFS (dm-0): metadata I/O error in "xfs_buf_ioend+0x2cc/0x640 [xfs]" at daddr 0x272d5640 len 32 error 5

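For readers matching these messages against the errno = 5 reported from fdatasync(2), the numeric codes in the log can be decoded with a short script. This is only a sketch: the helper name decode_errors and the regular expression are illustrative assumptions, not part of any kernel tooling, and the symbolic names come from whatever platform runs the script (on Linux, 11 is EAGAIN and 5 is EIO).

```python
import errno
import re

# Sample lines copied from the log above, with wrapped lines rejoined.
SAMPLE = """\
May 07 03:26:45 gcc202 kernel: sunvdc: vdc_tx_trigger() failure, err=-11
May 07 03:26:45 gcc202 kernel: blk_update_request: I/O error, dev vdiskc, sector 159273120 op 0x1:(WRITE) flags 0x4800 phys_seg 17 prio class 0
May 07 03:35:06 gcc202 kernel: XFS (dm-0): metadata I/O error in "xfs_buf_ioend+0x2cc/0x640 [xfs]" at daddr 0x272d5640 len 32 error 5
"""

# Matches a trailing "err=-11" or "error 5" at the end of a line.
# (Hypothetical pattern, written only for the message shapes shown above.)
ERR_RE = re.compile(r"\berr(?:or)?[ =](-?\d+)\s*$")


def decode_errors(text):
    """Return (code, name) pairs for every line that ends in an error code."""
    found = []
    for line in text.splitlines():
        match = ERR_RE.search(line)
        if match:
            # The kernel logs negative errno values; take the magnitude.
            code = abs(int(match.group(1)))
            found.append((code, errno.errorcode.get(code, "unknown")))
    return found


if __name__ == "__main__":
    for code, name in decode_errors(SAMPLE):
        print(code, name)
```

Read this way, the sunvdc driver is reporting a "try again" condition on submit, while the block layer and XFS surface the same events as I/O errors, which is the code that reaches userspace.
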
I haven't found a reproducer yet, so I can't start debugging the issue,
but I'm going to run xfstests [1] on my sparc64 test LDOM to see whether
it catches anything.
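
Separately from xfstests, a simple way to watch for the symptom from userspace is to write and fdatasync a scratch file in a loop and record any failures. The sketch below is not a known reproducer for this problem; the function name fdatasync_loop, the iteration count, and the chunk size are all assumptions chosen for illustration.

```python
import os
import tempfile


def fdatasync_loop(directory, iterations=100, chunk=1 << 20):
    """Append `chunk` bytes and call fdatasync, `iterations` times.

    Returns a list of (iteration, errno) pairs for every fdatasync failure.
    """
    failures = []
    payload = os.urandom(chunk)
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        for i in range(iterations):
            # os.write may write fewer bytes than requested; for this
            # sketch the exact amount written does not matter.
            os.write(fd, payload)
            try:
                os.fdatasync(fd)
            except OSError as exc:
                # errno 5 (EIO) is the value reported in this thread.
                failures.append((i, exc.errno))
    finally:
        os.close(fd)
        os.unlink(path)
    return failures


if __name__ == "__main__":
    # Point this at a directory on the filesystem under suspicion.
    print(fdatasync_loop(tempfile.gettempdir(), iterations=10))
```

An empty list means every fdatasync call succeeded. Since the errors in the log are sporadic, a short run finding nothing would not rule the problem out.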

By the way, no SMART disk diagnostics are available: the machine is an
LDOM (i.e. a virtual machine), and its backend storage is a ZFS volume
(the host OS is Solaris 11 on SPARC) that lives on an older Hitachi
AMS2000 (attached over FC).

Thanks for your report anyway.

[1] https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git