Re: Disk failed - simulation - but still healthy

We did another simulation yesterday with the benchmark running.

When we detached the drive while the benchmark was running, Ceph noticed it straight away and marked osd.6 as down.

So in the first test, when we had no I/O, it took an hour before something spotted that osd.6, or rather the hard drive behind it, no longer exists.
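
For anyone following along: the quick detection in the second test came from peer failure reports (the "3 reports from 3 peers after ... >= grace 20" line quoted below). A rough sketch for checking what grace/interval values the OSDs are actually running with, and for watching the reports arrive, assuming the default admin socket path and using a surviving OSD such as osd.7:

# heartbeat settings currently in effect on a running OSD
ceph --admin-daemon /var/run/ceph/ceph-osd.7.asok config show | grep heartbeat

# follow the cluster log; the peer failure reports show up here
ceph -w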

Regards.


On 19 February 2014 15:36, Wido den Hollander <wido@xxxxxxxx> wrote:
On 02/19/2014 02:22 PM, Thorvald Hallvardsson wrote:
Eventually, after 1 hour, it spotted it. I took the disk out at 11:06:02, so literally 1 hour later:
6       0.9                     osd.6   down    0
7       0.9                     osd.7   up      1
8       0.9                     osd.8   up      1

2014-02-19 12:06:02.802388 mon.0 [INF] osd.6 172.17.12.15:6800/1569 failed (3 reports from 3 peers after 22.338687 >= grace 20.000000)

but 1 hour is a bit... too long, isn't it?


The OSD will commit suicide if it encounters too many I/O errors, but it's not clear what exactly happened in this case.

I suggest you take a look at the logs of osd.6 to see why it stopped working.
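
For example, something along these lines usually surfaces the relevant part, assuming the default log location for osd.6:

# last I/O-error / assert / suicide messages from the OSD's own log
grep -iE 'error|abort|suicide' /var/log/ceph/ceph-osd.6.log | tail -n 50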

Wido




On 19 February 2014 11:31, Thorvald Hallvardsson
<thorvald.hallvardsson@gmail.com> wrote:

    Hi guys,

    Quick question. I have a VM with some SCSI drives which act as the
    OSDs in my test lab. I have removed one of the SCSI drives so it is
    totally gone from the system; syslog is throwing I/O errors, but the
    cluster still looks healthy.

    Can you tell me why? I'm trying to reproduce what would happen if a
    real drive failed.
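
    (For reference, one way to pull a disk out from under a running OSD
    on a Linux guest is a sysfs delete of the SCSI device. This is just
    a rough sketch, assuming the OSD's data disk is /dev/sdb; detaching
    the virtual disk at the hypervisor has the same effect:)

    # drop sdb from the running system; everything on top of it starts
    # returning EIO, which is what the XFS messages below show
    echo 1 > /sys/block/sdb/device/delete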

    # ll /dev/sd*
    brw-rw---- 1 root disk 8,  0 Feb 19 11:13 /dev/sda
    brw-rw---- 1 root disk 8,  1 Feb 17 16:45 /dev/sda1
    brw-rw---- 1 root disk 8,  2 Feb 17 16:45 /dev/sda2
    brw-rw---- 1 root disk 8,  5 Feb 17 16:45 /dev/sda5
    brw-rw---- 1 root disk 8, 32 Feb 19 11:13 /dev/sdc
    brw-rw---- 1 root disk 8, 33 Feb 17 16:45 /dev/sdc1
    brw-rw---- 1 root disk 8, 34 Feb 19 11:11 /dev/sdc2
    brw-rw---- 1 root disk 8, 48 Feb 19 11:13 /dev/sdd
    brw-rw---- 1 root disk 8, 49 Feb 17 16:45 /dev/sdd1
    brw-rw---- 1 root disk 8, 50 Feb 19 11:05 /dev/sdd2


    Feb 19 11:06:02 ceph-test-vosd-03 kernel: [586497.813485] sd 2:0:1:0: [sdb] Synchronizing SCSI cache
    Feb 19 11:06:13 ceph-test-vosd-03 kernel: [586508.197668] XFS (sdb1): metadata I/O error: block 0x39e116d3 ("xlog_iodone") error 19 numblks 64
    Feb 19 11:06:13 ceph-test-vosd-03 kernel: [586508.197815] XFS (sdb1): xfs_do_force_shutdown(0x2) called from line 1115 of file /build/buildd/linux-lts-saucy-3.11.0/fs/xfs/xfs_log.c.  Return address = 0xffffffffa01e1fe1
    Feb 19 11:06:13 ceph-test-vosd-03 kernel: [586508.197823] XFS (sdb1): Log I/O Error Detected.  Shutting down filesystem
    Feb 19 11:06:13 ceph-test-vosd-03 kernel: [586508.197880] XFS (sdb1): Please umount the filesystem and rectify the problem(s)
    Feb 19 11:06:43 ceph-test-vosd-03 kernel: [586538.306817] XFS (sdb1): xfs_log_force: error 5 returned.
    Feb 19 11:07:13 ceph-test-vosd-03 kernel: [586568.415986] XFS (sdb1): xfs_log_force: error 5 returned.
    Feb 19 11:07:43 ceph-test-vosd-03 kernel: [586598.525178] XFS (sdb1): xfs_log_force: error 5 returned.
    Feb 19 11:08:13 ceph-test-vosd-03 kernel: [586628.634356] XFS (sdb1): xfs_log_force: error 5 returned.
    Feb 19 11:08:43 ceph-test-vosd-03 kernel: [586658.743533] XFS (sdb1): xfs_log_force: error 5 returned.
    Feb 19 11:09:13 ceph-test-vosd-03 kernel: [586688.852714] XFS (sdb1): xfs_log_force: error 5 returned.
    Feb 19 11:09:43 ceph-test-vosd-03 kernel: [586718.961903] XFS (sdb1): xfs_log_force: error 5 returned.
    Feb 19 11:10:13 ceph-test-vosd-03 kernel: [586749.071076] XFS (sdb1): xfs_log_force: error 5 returned.
    Feb 19 11:10:43 ceph-test-vosd-03 kernel: [586779.180263] XFS (sdb1): xfs_log_force: error 5 returned.
    Feb 19 11:11:13 ceph-test-vosd-03 kernel: [586809.289440] XFS (sdb1): xfs_log_force: error 5 returned.
    Feb 19 11:11:44 ceph-test-vosd-03 kernel: [586839.398626] XFS (sdb1): xfs_log_force: error 5 returned.
    Feb 19 11:12:14 ceph-test-vosd-03 kernel: [586869.507804] XFS (sdb1): xfs_log_force: error 5 returned.
    Feb 19 11:12:44 ceph-test-vosd-03 kernel: [586899.616988] XFS (sdb1): xfs_log_force: error 5 returned.
    Feb 19 11:12:52 ceph-test-vosd-03 kernel: [586907.848993] end_request: I/O error, dev fd0, sector 0

    mount:
    /dev/sdb1 on /var/lib/ceph/osd/ceph-6 type xfs (rw,noatime)
    /dev/sdc1 on /var/lib/ceph/osd/ceph-7 type xfs (rw,noatime)
    /dev/sdd1 on /var/lib/ceph/osd/ceph-8 type xfs (rw,noatime)

    ll /var/lib/ceph/osd/ceph-6
    ls: cannot access /var/lib/ceph/osd/ceph-6: Input/output error

    -4      2.7             host ceph-test-vosd-03
    6       0.9                     osd.6   up      1
    7       0.9                     osd.7   up      1
    8       0.9                     osd.8   up      1

    # ceph-disk list
    /dev/fd0 other, unknown
    /dev/sda :
      /dev/sda1 other, ext2
      /dev/sda2 other
      /dev/sda5 other, LVM2_member
    /dev/sdc :
      /dev/sdc1 ceph data, active, cluster ceph, osd.7, journal /dev/sdc2
      /dev/sdc2 ceph journal, for /dev/sdc1
    /dev/sdd :
      /dev/sdd1 ceph data, active, cluster ceph, osd.8, journal /dev/sdd2
      /dev/sdd2 ceph journal, for /dev/sdd1

         cluster 1a588c94-6f5e-4b04-bc07-f5ce99b91a35
          health HEALTH_OK
          monmap e7: 3 mons at {ceph-test-mon-01=172.17.12.11:6789/0,ceph-test-mon-02=172.17.12.12:6789/0,ceph-test-mon-03=172.17.12.13:6789/0}, election epoch 50, quorum 0,1,2 ceph-test-mon-01,ceph-test-mon-02,ceph-test-mon-03
          mdsmap e4: 1/1/1 up {0=ceph-test-admin=up:active}
          osdmap e124: 9 osds: 9 up, 9 in
           pgmap v1812: 256 pgs, 13 pools, 1522 MB data, 469 objects
                 3379 MB used, 8326 GB / 8329 GB avail
                      256 active+clean

    So as you can see, osd.6 (or rather its disk) is missing, but the
    cluster is happy.
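
    (One thing worth checking, as a rough sketch assuming default
    paths: whether the ceph-osd process for osd.6 is still running and
    still answering on its admin socket, since a live process would
    keep heartbeating its peers and keep the cluster reporting
    HEALTH_OK:)

    # is the osd.6 daemon still alive?
    ps aux | grep '[c]eph-osd'
    # does it still answer on its admin socket?
    ceph --admin-daemon /var/run/ceph/ceph-osd.6.asok version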

    Thank you.

    Regards.







--
Wido den Hollander
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
