Re: Disk failed - simulation - but still healthy

We did another simulation yesterday with the benchmark running.

When we detached the drive while the benchmark was running, Ceph noticed it straight away and marked osd.6 as down.

So in the first test, when we had no I/O, it took an hour before something spotted that osd.6, or rather the hard drive behind it, no longer exists.
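
For anyone following along: the quick detection in the second test came from peer failure reports (the "3 reports from 3 peers after ... >= grace 20" line quoted below). A rough sketch for checking what grace/interval values the OSDs are actually running with, and for watching the reports arrive, assuming the default admin socket path and using a surviving OSD such as osd.7:

# heartbeat settings currently in effect on a running OSD
ceph --admin-daemon /var/run/ceph/ceph-osd.7.asok config show | grep heartbeat

# follow the cluster log; the peer failure reports show up here
ceph -w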

Regards.


On 19 February 2014 15:36, Wido den Hollander <wido@xxxxxxxx> wrote:
On 02/19/2014 02:22 PM, Thorvald Hallvardsson wrote:
Eventually, after 1 hour, it spotted it. I took the disk out at 11:06:02, so literally 1 hour later:
6       0.9                     osd.6   down    0
7       0.9                     osd.7   up      1
8       0.9                     osd.8   up      1

2014-02-19 12:06:02.802388 mon.0 [INF] osd.6 172.17.12.15:6800/1569 failed (3 reports from 3 peers after 22.338687 >= grace 20.000000)

but 1 hour is a bit... too long, isn't it?


The OSD will commit suicide if it encounters too many I/O errors, but it's not clear what exactly happened in this case.

I suggest you take a look at the logs of osd.6 to see why it stopped working.
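
For example, something along these lines usually surfaces the relevant part, assuming the default log location for osd.6:

# last I/O-error / assert / suicide messages from the OSD's own log
grep -iE 'error|abort|suicide' /var/log/ceph/ceph-osd.6.log | tail -n 50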

Wido




On 19 February 2014 11:31, Thorvald Hallvardsson
<thorvald.hallvardsson@gmail.com> wrote:

    Hi guys,

    Quick question. I have a VM with some SCSI drives which act as the
    OSDs in my test lab. I have removed one of the SCSI drives so it is
    totally gone from the system; syslog is throwing I/O errors, but the
    cluster still looks healthy.

    Can you tell me why? I'm trying to reproduce what would happen if a
    real drive failed.
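
    (For reference, one way to pull a disk out from under a running OSD
    on a Linux guest is a sysfs delete of the SCSI device. This is just
    a rough sketch, assuming the OSD's data disk is /dev/sdb; detaching
    the virtual disk at the hypervisor has the same effect:)

    # drop sdb from the running system; everything on top of it starts
    # returning EIO, which is what the XFS messages below show
    echo 1 > /sys/block/sdb/device/delete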

    # ll /dev/sd*
    brw-rw---- 1 root disk 8,  0 Feb 19 11:13 /dev/sda
    brw-rw---- 1 root disk 8,  1 Feb 17 16:45 /dev/sda1
    brw-rw---- 1 root disk 8,  2 Feb 17 16:45 /dev/sda2
    brw-rw---- 1 root disk 8,  5 Feb 17 16:45 /dev/sda5
    brw-rw---- 1 root disk 8, 32 Feb 19 11:13 /dev/sdc
    brw-rw---- 1 root disk 8, 33 Feb 17 16:45 /dev/sdc1
    brw-rw---- 1 root disk 8, 34 Feb 19 11:11 /dev/sdc2
    brw-rw---- 1 root disk 8, 48 Feb 19 11:13 /dev/sdd
    brw-rw---- 1 root disk 8, 49 Feb 17 16:45 /dev/sdd1
    brw-rw---- 1 root disk 8, 50 Feb 19 11:05 /dev/sdd2


    Feb 19 11:06:02 ceph-test-vosd-03 kernel: [586497.813485] sd 2:0:1:0: [sdb] Synchronizing SCSI cache
    Feb 19 11:06:13 ceph-test-vosd-03 kernel: [586508.197668] XFS (sdb1): metadata I/O error: block 0x39e116d3 ("xlog_iodone") error 19 numblks 64
    Feb 19 11:06:13 ceph-test-vosd-03 kernel: [586508.197815] XFS (sdb1): xfs_do_force_shutdown(0x2) called from line 1115 of file /build/buildd/linux-lts-saucy-3.11.0/fs/xfs/xfs_log.c.  Return address = 0xffffffffa01e1fe1
    Feb 19 11:06:13 ceph-test-vosd-03 kernel: [586508.197823] XFS (sdb1): Log I/O Error Detected.  Shutting down filesystem
    Feb 19 11:06:13 ceph-test-vosd-03 kernel: [586508.197880] XFS (sdb1): Please umount the filesystem and rectify the problem(s)
    Feb 19 11:06:43 ceph-test-vosd-03 kernel: [586538.306817] XFS (sdb1): xfs_log_force: error 5 returned.
    Feb 19 11:07:13 ceph-test-vosd-03 kernel: [586568.415986] XFS (sdb1): xfs_log_force: error 5 returned.
    Feb 19 11:07:43 ceph-test-vosd-03 kernel: [586598.525178] XFS (sdb1): xfs_log_force: error 5 returned.
    Feb 19 11:08:13 ceph-test-vosd-03 kernel: [586628.634356] XFS (sdb1): xfs_log_force: error 5 returned.
    Feb 19 11:08:43 ceph-test-vosd-03 kernel: [586658.743533] XFS (sdb1): xfs_log_force: error 5 returned.
    Feb 19 11:09:13 ceph-test-vosd-03 kernel: [586688.852714] XFS (sdb1): xfs_log_force: error 5 returned.
    Feb 19 11:09:43 ceph-test-vosd-03 kernel: [586718.961903] XFS (sdb1): xfs_log_force: error 5 returned.
    Feb 19 11:10:13 ceph-test-vosd-03 kernel: [586749.071076] XFS (sdb1): xfs_log_force: error 5 returned.
    Feb 19 11:10:43 ceph-test-vosd-03 kernel: [586779.180263] XFS (sdb1): xfs_log_force: error 5 returned.
    Feb 19 11:11:13 ceph-test-vosd-03 kernel: [586809.289440] XFS (sdb1): xfs_log_force: error 5 returned.
    Feb 19 11:11:44 ceph-test-vosd-03 kernel: [586839.398626] XFS (sdb1): xfs_log_force: error 5 returned.
    Feb 19 11:12:14 ceph-test-vosd-03 kernel: [586869.507804] XFS (sdb1): xfs_log_force: error 5 returned.
    Feb 19 11:12:44 ceph-test-vosd-03 kernel: [586899.616988] XFS (sdb1): xfs_log_force: error 5 returned.
    Feb 19 11:12:52 ceph-test-vosd-03 kernel: [586907.848993] end_request: I/O error, dev fd0, sector 0

    mount:
    /dev/sdb1 on /var/lib/ceph/osd/ceph-6 type xfs (rw,noatime)
    /dev/sdc1 on /var/lib/ceph/osd/ceph-7 type xfs (rw,noatime)
    /dev/sdd1 on /var/lib/ceph/osd/ceph-8 type xfs (rw,noatime)

    ll /var/lib/ceph/osd/ceph-6
    ls: cannot access /var/lib/ceph/osd/ceph-6: Input/output error

    -4      2.7             host ceph-test-vosd-03
    6       0.9                     osd.6   up      1
    7       0.9                     osd.7   up      1
    8       0.9                     osd.8   up      1

    # ceph-disk list
    /dev/fd0 other, unknown
    /dev/sda :
      /dev/sda1 other, ext2
      /dev/sda2 other
      /dev/sda5 other, LVM2_member
    /dev/sdc :
      /dev/sdc1 ceph data, active, cluster ceph, osd.7, journal /dev/sdc2
      /dev/sdc2 ceph journal, for /dev/sdc1
    /dev/sdd :
      /dev/sdd1 ceph data, active, cluster ceph, osd.8, journal /dev/sdd2
      /dev/sdd2 ceph journal, for /dev/sdd1

         cluster 1a588c94-6f5e-4b04-bc07-f5ce99b91a35
          health HEALTH_OK
          monmap e7: 3 mons at {ceph-test-mon-01=172.17.12.11:6789/0,ceph-test-mon-02=172.17.12.12:6789/0,ceph-test-mon-03=172.17.12.13:6789/0}, election epoch 50, quorum 0,1,2 ceph-test-mon-01,ceph-test-mon-02,ceph-test-mon-03
          mdsmap e4: 1/1/1 up {0=ceph-test-admin=up:active}
          osdmap e124: 9 osds: 9 up, 9 in
           pgmap v1812: 256 pgs, 13 pools, 1522 MB data, 469 objects
                 3379 MB used, 8326 GB / 8329 GB avail
                      256 active+clean

    So as you can see, osd.6 (or rather its disk) is missing, but the
    cluster is happy.
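
    (One thing worth checking, as a rough sketch assuming default
    paths: whether the ceph-osd process for osd.6 is still running and
    still answering on its admin socket, since a live process would
    keep heartbeating its peers and keep the cluster reporting
    HEALTH_OK:)

    # is the osd.6 daemon still alive?
    ps aux | grep '[c]eph-osd'
    # does it still answer on its admin socket?
    ceph --admin-daemon /var/run/ceph/ceph-osd.6.asok version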

    Thank you.

    Regards.







--
Wido den Hollander
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
