On Fri, May 7, 2021 at 4:22 AM Tom Lane via cfarm-admins <cfarm-admins@xxxxxxxxxxxxxxxxxxxxx> wrote:
>
> Hi folks,
>
> I thought you ought to know about $SUBJECT. Maybe it's some
> weird kernel glitch, but if it is reflecting real I/O errors,
> maybe that machine is about to have a disk failure. Poking
> into its SMART logs (if any) might be useful.
>
> I got interested in this because a Postgres buildfarm instance
> that runs periodically on that machine reported a couple of
> unexplainable failures in the last few weeks [1]. I was able
> to reproduce the failure and determine that it's a fault in
> the logic that ought to report a failure from fdatasync(2).
> Looking in the core file shows that errno = 5 (EIO) is what
> was reported. So we (PG) have some things to fix, but meanwhile
> I felt you'd better know about the possibility of a hardware
> issue.
>
> regards, tom lane
>
> [1] https://www.postgresql.org/message-id/CA+hUKGLhc0Nwnn9u60oYrx4MAUga+qEvj+4pBqPwrmPKDNtFmA@xxxxxxxxxxxxxx

Tom, I just checked the /home filesystem; there are no errors...
And yes, I do sporadically see kernel messages like these in the logs:

May 07 03:26:45 gcc202 kernel: sunvdc: vdc_tx_trigger() failure, err=-11
May 07 03:26:45 gcc202 kernel: blk_update_request: I/O error, dev vdiskc, sector 159273120 op 0x1:(WRITE) flags 0x4800 phys_seg 17 prio class 0
May 07 03:31:39 gcc202 kernel: dm-0: writeback error on inode 2148294407, offset 0, sector 159239256
May 07 03:31:39 gcc202 kernel: sunvdc: vdc_tx_trigger() failure, err=-11
May 07 03:31:39 gcc202 kernel: blk_update_request: I/O error, dev vdiskc, sector 157618896 op 0x1:(WRITE) flags 0x4800 phys_seg 16 prio class 0
May 07 03:35:06 gcc202 kernel: dm-0: writeback error on inode 155142134, offset 0, sector 157584576
May 07 03:35:06 gcc202 kernel: sunvdc: vdc_tx_trigger() failure, err=-11
May 07 03:35:06 gcc202 kernel: blk_update_request: I/O error, dev vdiskc, sector 657284672 op 0x1:(WRITE) flags 0x1000 phys_seg 4 prio class 0
May 07 03:35:06 gcc202 kernel: XFS (dm-0): metadata I/O error in "xfs_buf_ioend+0x2cc/0x640 [xfs]" at daddr 0x272d5640 len 32 error 5

I can't find a reproducer to start debugging the issue, but I'm going to run xfstests [1] on my sparc64 test LDOM to see whether it catches something...

By the way, there are no SMART disk diagnostics available, since the machine is an LDOM (read: virtual machine) and the backend storage is a ZFS volume (the OS is Solaris 11 SPARC) living on an older Hitachi AMS2000 (over FC).

Thanks for your report anyway.

1. https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git
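
In case it helps with correlating the vdc_tx_trigger() failures and the write errors in the log excerpt above, here is a rough sketch of a script that tallies them. The regular expressions are only guesses based on the handful of lines quoted here, and the names (`summarize`, `BLK_ERR`, `VDC_ERR`) are made up for illustration; other kernel versions may format these messages differently.

```python
import re
import sys
from collections import Counter

# Assumed format, taken from the blk_update_request lines quoted above.
BLK_ERR = re.compile(
    r"blk_update_request: I/O error, dev (?P<dev>\S+), "
    r"sector (?P<sector>\d+) op (?P<op>0x[0-9a-fA-F]+):\((?P<opname>\w+)\)"
)
# Assumed format, taken from the sunvdc lines quoted above.
VDC_ERR = re.compile(r"sunvdc: vdc_tx_trigger\(\) failure, err=(?P<err>-?\d+)")


def summarize(lines):
    """Return (block_errors, vdc_counts) for an iterable of log lines.

    block_errors is a list of (device, sector, operation) tuples;
    vdc_counts maps each vdc_tx_trigger() error code to its frequency.
    """
    block_errors = []
    vdc_counts = Counter()
    for line in lines:
        m = BLK_ERR.search(line)
        if m:
            block_errors.append((m["dev"], int(m["sector"]), m["opname"]))
            continue
        m = VDC_ERR.search(line)
        if m:
            vdc_counts[int(m["err"])] += 1
    return block_errors, vdc_counts


if __name__ == "__main__":
    # Reads log lines from standard input.
    errors, vdc = summarize(sys.stdin)
    for dev, sector, op in errors:
        print(f"{dev} sector {sector} {op}")
    for err, count in sorted(vdc.items()):
        print(f"vdc_tx_trigger err={err}: {count}")
```

It reads the log from standard input, so something like `journalctl -k | python3 summarize_io_errors.py` should work (the script file name is arbitrary).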