Re: [linux-cluster] multipath issue... Smells of hardware issue.

"Kristoffer Lippert" <kristoffer.lippert@xxxxxxxx> · Fri, 6 Jul 2007 09:51:24 +0200

Hi,

Thank you very much for the explanation.

The hardware should under no circumstances take 5 minutes to read a sector, not even when the command queue is very long.
I've tried copying files to and from the SAN, and I've had a little program called sys_basher working the disks continuously since last Friday (almost a week), and I have not been able to reproduce the error. Before, I could produce it within an hour by copying files.
I've only seen the error on one server, and I've changed nothing. (Well, obviously something must have changed, since the error seems to be gone.)

I get a throughput of about 120 MB/sec on the SAN using GFS1. It's fast enough for my use (which is large files for a website). Is it far below the expected throughput?

Kind regards
Kristoffer

-----Original message-----
From: linux-cluster-bounces@xxxxxxxxxx [mailto:linux-cluster-bounces@xxxxxxxxxx] On behalf of Benjamin Marzinski
Sent: 5 July 2007 22:01
To: linux clustering
Subject: Re: [linux-cluster] multipath issue... Smells of hardware issue.

On Fri, Jun 29, 2007 at 05:23:20PM +0200, Kristoffer Lippert wrote:
>    Hi,
> 
>    I have a setup with two identical RX200s3 FuSi servers talking to a SAN
>    (SX60 + extra controller), and that works fine with gfs1.
> 
>    I do however see some errors on one of the servers. They show up in my
>    message log only now and then (always under load, though I can't load it
>    and thereby force the error).
> 
>    The error says:
>    Jun 28 15:44:17 app02 multipathd: 8:16: mark as failed
>    Jun 28 15:44:17 app02 multipathd: main_disk_volume1: remaining active paths: 1
>    Jun 28 15:44:17 app02 kernel: sd 2:0:0:0: SCSI error: return code = 0x00070000
>    Jun 28 15:44:17 app02 kernel: end_request: I/O error, dev sdb, sector 705160231
>    Jun 28 15:44:17 app02 kernel: device-mapper: multipath: Failing path 8:16.
>    Jun 28 15:44:22 app02 multipathd: sdb: readsector0 checker reports path is up
>    Jun 28 15:44:22 app02 multipathd: 8:16: reinstated
>    Jun 28 15:44:22 app02 multipathd: main_disk_volume1: remaining active paths: 2
>    Jun 28 15:46:02 app02 multipathd: 8:32: mark as failed
>    Jun 28 15:46:02 app02 multipathd: main_disk_volume1: remaining active paths: 1
>    Jun 28 15:46:02 app02 kernel: sd 3:0:0:0: SCSI error: return code = 0x00070000
>    Jun 28 15:46:02 app02 kernel: end_request: I/O error, dev sdc, sector 739870727
>    Jun 28 15:46:02 app02 kernel: device-mapper: multipath: Failing path 8:32.
>    Jun 28 15:46:06 app02 multipathd: sdc: readsector0 checker reports path is up
>    Jun 28 15:46:06 app02 multipathd: 8:32: reinstated
>    Jun 28 15:46:06 app02 multipathd: main_disk_volume1: remaining active paths: 2
> 
>    To me it looks like a fiber that bounces up and down. (There is no switch
>    involved.)
> 
>    Sometimes I only get a slightly shorter version:
>    Jun 29 09:04:32 app02 kernel: sd 2:0:0:0: SCSI error: return code = 0x00070000
>    Jun 29 09:04:32 app02 kernel: end_request: I/O error, dev sdb, sector 2782490295
>    Jun 29 09:04:32 app02 kernel: device-mapper: multipath: Failing path 8:16.
>    Jun 29 09:04:32 app02 multipathd: 8:16: mark as failed
>    Jun 29 09:04:32 app02 multipathd: main_disk_volume1: remaining active paths: 1
>    Jun 29 09:04:37 app02 multipathd: sdb: readsector0 checker reports path is up
>    Jun 29 09:04:37 app02 multipathd: 8:16: reinstated
>    Jun 29 09:04:37 app02 multipathd: main_disk_volume1: remaining active paths: 2
> 
>    Any suggestions, other than to start swapping hardware?
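
For reference, the return code in the kernel lines quoted above can be split into its component bytes. What follows is only a minimal sketch in Python, assuming the conventional Linux layout of the SCSI result word (driver byte, host byte, message byte, status byte, from high to low) and the host-byte names from the kernel's SCSI headers. decode_scsi_result and HOST_BYTE_NAMES are illustrative names, not part of multipath-tools or any other tool mentioned in this thread.

```python
# Host-byte codes as named in the Linux kernel's SCSI headers.
HOST_BYTE_NAMES = {
    0x00: "DID_OK",
    0x01: "DID_NO_CONNECT",
    0x02: "DID_BUS_BUSY",
    0x03: "DID_TIME_OUT",
    0x04: "DID_BAD_TARGET",
    0x05: "DID_ABORT",
    0x06: "DID_PARITY",
    0x07: "DID_ERROR",
    0x08: "DID_RESET",
    0x09: "DID_BAD_INTR",
}


def decode_scsi_result(result):
    """Split a 32-bit SCSI result word into its four bytes (hypothetical helper)."""
    host = (result >> 16) & 0xFF
    return {
        "driver_byte": (result >> 24) & 0xFF,
        "host_byte": host,
        "msg_byte": (result >> 8) & 0xFF,
        "status_byte": result & 0xFF,
        "host_name": HOST_BYTE_NAMES.get(host, "unknown"),
    }


if __name__ == "__main__":
    # 0x00070000 is the return code from the log excerpt in this thread.
    for name, value in decode_scsi_result(0x00070000).items():
        print(name, value)
```

Read this way, 0x00070000 carries host byte 0x07 (DID_ERROR) with the other three bytes zero. That suggests the low-level driver reported a generic error rather than a command timeout (DID_TIME_OUT, 0x03), but on its own it does not say which component is at fault.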

It's possible that your SCSI device is timing out the SCSI read command from the readsector0 path checker, which is what your setup appears to be using to check the path status. This checker has its timeout set to 5 minutes, but I suppose it is possible for a read to take that long if your hardware is flaky. If you're willing to recompile the code, you can change this default by changing DEF_TIMEOUT in libcheckers/checkers.h. DEF_TIMEOUT is the SCSI command timeout in milliseconds.

Otherwise, if you are only seeing this on one server, swapping hardware seems like a reasonable thing to try.
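
One way to see how long each path actually stayed down is to pair the "mark as failed" and "reinstated" lines from multipathd. The sketch below is a rough illustration in Python, not anything shipped with multipath-tools: outage_durations, LINE_RE and SAMPLE_LOG are made-up names, the sample lines are copied from the log excerpt in this thread, and the year 2007 is an assumption taken from the mail dates, since syslog timestamps do not carry one.

```python
import re
from datetime import datetime

# Sample multipathd lines copied from the log excerpt in this thread.
SAMPLE_LOG = """\
Jun 28 15:44:17 app02 multipathd: 8:16: mark as failed
Jun 28 15:44:22 app02 multipathd: 8:16: reinstated
Jun 28 15:46:02 app02 multipathd: 8:32: mark as failed
Jun 28 15:46:06 app02 multipathd: 8:32: reinstated
Jun 29 09:04:32 app02 multipathd: 8:16: mark as failed
Jun 29 09:04:37 app02 multipathd: 8:16: reinstated
"""

# Matches "<timestamp> <host> multipathd: <major:minor>: <event>".
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ multipathd: "
    r"(?P<path>\d+:\d+): (?P<event>mark as failed|reinstated)$"
)


def outage_durations(log_text, year=2007):
    """Return (path, failed_at, seconds_down) for each fail/reinstate pair."""
    failed_at = {}
    outages = []
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # ignore kernel lines and other multipathd messages
        when = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        if m["event"] == "mark as failed":
            failed_at[m["path"]] = when
        elif m["path"] in failed_at:
            start = failed_at.pop(m["path"])
            outages.append((m["path"], start, (when - start).total_seconds()))
    return outages


if __name__ == "__main__":
    for path, start, seconds in outage_durations(SAMPLE_LOG):
        print(f"{path} failed at {start:%b %d %H:%M:%S}, back after {seconds:.0f}s")
```

On the lines quoted in this thread the paths come back after 4 to 5 seconds each time. Running the same pairing over a full message log would show whether any outage ever approaches the checker's 5-minute timeout.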

-Ben

>    Kind regards
> 
>    Kristoffer Lippert
>    System administrator
>    JP/Politiken A/S
>    Online Magasiner
> 
>    Tel. +45 8738 3032
>    Cell. +45 6062 8703

> --
> Linux-cluster mailing list
> Linux-cluster@xxxxxxxxxx
> https://www.redhat.com/mailman/listinfo/linux-cluster
