We ran another simulation yesterday with the benchmark running. This time, when we detached the drive while the benchmark was active, Ceph noticed straight away and marked osd.6 as down.

On 19 February 2014 15:36, Wido den Hollander <wido@xxxxxxxx> wrote:
On 02/19/2014 02:22 PM, Thorvald Hallvardsson wrote:
Eventually after 1 hour it spotted that. I took the disk out at 11:06:02, so literally 1 hour later:

6    0.9    osd.6    down    0
7    0.9    osd.7    up      1
8    0.9    osd.8    up      1

2014-02-19 12:06:02.802388 mon.0 [INF] osd.6 172.17.12.15:6800/1569 failed (3 reports from 3 peers after 22.338687 >= grace 20.000000)

but 1 hour is a bit ... too long, isn't it?
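As far as I can tell these are the settings that control how quickly a dead OSD gets reported and marked down. The values below are what I believe the defaults to be, so please correct me if they differ in your version:

[osd]
; how often an OSD pings its peers
osd heartbeat interval = 6
; seconds of missed pings before a peer reports the OSD as failed
osd heartbeat grace = 20

[mon]
; how many distinct peers must report the OSD before the mon marks it down
mon osd min down reporters = 1
; how many failure reports are needed in total
mon osd min down reports = 3
; mark an OSD down if it stops reporting to the mon at all for this long
mon osd report timeout = 900

The failure message above only mentions the 20 second grace and the 3 reports, so I don't see where the extra hour comes from.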
The OSD will commit suicide if it encounters too many I/O errors, but it's not clear what exactly happened in this case.
I suggest you take a look at the logs of osd.6 to see why it stopped working.
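Something along these lines should show whether it hit EIO and aborted, assuming the logs are in the default location:

# tail -n 200 /var/log/ceph/ceph-osd.6.log
# grep -iE 'error|abort|suicide' /var/log/ceph/ceph-osd.6.log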
Wido
On 19 February 2014, Thorvald Hallvardsson <thorvald.hallvardsson@gmail.com> wrote:
Hi guys,
Quick question. I have a VM with some SCSI drives which act as the
OSDs in my test lab. I have removed one of the SCSI drives so it's totally
gone from the system, and syslog is reporting I/O errors, but the cluster
still looks healthy.
Can you tell me why? I'm trying to reproduce what would happen if a
real drive failed.
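One way to pull the disk out from inside the guest, rather than detaching it at the hypervisor, would be something like the following; /dev/sdb is the device backing osd.6 in my lab, so the name will differ in yours. The first command tells the kernel to drop the SCSI device, after which you can watch whether osd.6 ever gets marked down:

# echo 1 > /sys/block/sdb/device/delete
# ceph osd tree

Anyway, this is the current state after pulling the disk: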
# ll /dev/sd*
brw-rw---- 1 root disk 8, 0 Feb 19 11:13 /dev/sda
brw-rw---- 1 root disk 8, 1 Feb 17 16:45 /dev/sda1
brw-rw---- 1 root disk 8, 2 Feb 17 16:45 /dev/sda2
brw-rw---- 1 root disk 8, 5 Feb 17 16:45 /dev/sda5
brw-rw---- 1 root disk 8, 32 Feb 19 11:13 /dev/sdc
brw-rw---- 1 root disk 8, 33 Feb 17 16:45 /dev/sdc1
brw-rw---- 1 root disk 8, 34 Feb 19 11:11 /dev/sdc2
brw-rw---- 1 root disk 8, 48 Feb 19 11:13 /dev/sdd
brw-rw---- 1 root disk 8, 49 Feb 17 16:45 /dev/sdd1
brw-rw---- 1 root disk 8, 50 Feb 19 11:05 /dev/sdd2
Feb 19 11:06:02 ceph-test-vosd-03 kernel: [586497.813485] sd 2:0:1:0: [sdb] Synchronizing SCSI cache
Feb 19 11:06:13 ceph-test-vosd-03 kernel: [586508.197668] XFS (sdb1): metadata I/O error: block 0x39e116d3 ("xlog_iodone") error 19 numblks 64
Feb 19 11:06:13 ceph-test-vosd-03 kernel: [586508.197815] XFS (sdb1): xfs_do_force_shutdown(0x2) called from line 1115 of file /build/buildd/linux-lts-saucy-3.11.0/fs/xfs/xfs_log.c. Return address = 0xffffffffa01e1fe1
Feb 19 11:06:13 ceph-test-vosd-03 kernel: [586508.197823] XFS (sdb1): Log I/O Error Detected. Shutting down filesystem
Feb 19 11:06:13 ceph-test-vosd-03 kernel: [586508.197880] XFS (sdb1): Please umount the filesystem and rectify the problem(s)
Feb 19 11:06:43 ceph-test-vosd-03 kernel: [586538.306817] XFS (sdb1): xfs_log_force: error 5 returned.
Feb 19 11:07:13 ceph-test-vosd-03 kernel: [586568.415986] XFS (sdb1): xfs_log_force: error 5 returned.
Feb 19 11:07:43 ceph-test-vosd-03 kernel: [586598.525178] XFS (sdb1): xfs_log_force: error 5 returned.
Feb 19 11:08:13 ceph-test-vosd-03 kernel: [586628.634356] XFS (sdb1): xfs_log_force: error 5 returned.
Feb 19 11:08:43 ceph-test-vosd-03 kernel: [586658.743533] XFS (sdb1): xfs_log_force: error 5 returned.
Feb 19 11:09:13 ceph-test-vosd-03 kernel: [586688.852714] XFS (sdb1): xfs_log_force: error 5 returned.
Feb 19 11:09:43 ceph-test-vosd-03 kernel: [586718.961903] XFS (sdb1): xfs_log_force: error 5 returned.
Feb 19 11:10:13 ceph-test-vosd-03 kernel: [586749.071076] XFS (sdb1): xfs_log_force: error 5 returned.
Feb 19 11:10:43 ceph-test-vosd-03 kernel: [586779.180263] XFS (sdb1): xfs_log_force: error 5 returned.
Feb 19 11:11:13 ceph-test-vosd-03 kernel: [586809.289440] XFS (sdb1): xfs_log_force: error 5 returned.
Feb 19 11:11:44 ceph-test-vosd-03 kernel: [586839.398626] XFS (sdb1): xfs_log_force: error 5 returned.
Feb 19 11:12:14 ceph-test-vosd-03 kernel: [586869.507804] XFS (sdb1): xfs_log_force: error 5 returned.
Feb 19 11:12:44 ceph-test-vosd-03 kernel: [586899.616988] XFS (sdb1): xfs_log_force: error 5 returned.
Feb 19 11:12:52 ceph-test-vosd-03 kernel: [586907.848993] end_request: I/O error, dev fd0, sector 0
mount:
/dev/sdb1 on /var/lib/ceph/osd/ceph-6 type xfs (rw,noatime)
/dev/sdc1 on /var/lib/ceph/osd/ceph-7 type xfs (rw,noatime)
/dev/sdd1 on /var/lib/ceph/osd/ceph-8 type xfs (rw,noatime)
# ll /var/lib/ceph/osd/ceph-6
ls: cannot access /var/lib/ceph/osd/ceph-6: Input/output error
-4 2.7 host ceph-test-vosd-03
6 0.9 osd.6 up 1
7 0.9 osd.7 up 1
8 0.9 osd.8 up 1
# ceph-disk list
/dev/fd0 other, unknown
/dev/sda :
/dev/sda1 other, ext2
/dev/sda2 other
/dev/sda5 other, LVM2_member
/dev/sdc :
/dev/sdc1 ceph data, active, cluster ceph, osd.7, journal /dev/sdc2
/dev/sdc2 ceph journal, for /dev/sdc1
/dev/sdd :
/dev/sdd1 ceph data, active, cluster ceph, osd.8, journal /dev/sdd2
/dev/sdd2 ceph journal, for /dev/sdd1
cluster 1a588c94-6f5e-4b04-bc07-f5ce99b91a35
health HEALTH_OK
monmap e7: 3 mons at {ceph-test-mon-01=172.17.12.11:6789/0,ceph-test-mon-02=172.17.12.12:6789/0,ceph-test-mon-03=172.17.12.13:6789/0},
election epoch 50, quorum 0,1,2 ceph-test-mon-01,ceph-test-mon-02,ceph-test-mon-03
mdsmap e4: 1/1/1 up {0=ceph-test-admin=up:active}
osdmap e124: 9 osds: 9 up, 9 in
pgmap v1812: 256 pgs, 13 pools, 1522 MB data, 469 objects
3379 MB used, 8326 GB / 8329 GB avail
256 active+clean
So as you can see, osd.6's disk is missing but the cluster is still happy.
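I could of course mark it down and out by hand with something like:

# ceph osd down osd.6
# ceph osd out osd.6

but I would expect the cluster to notice the dead disk on its own.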
Thank you.
Regards.
--
Wido den Hollander
42on B.V.
Phone: +31 (0)20 700 9902
Skype: contact42on
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com