problem with hanging cluster

Hi,
our test cluster gets stuck every time one of our OSD hosts goes down: even after the missing OSDs come back "up" and recovery reaches 100%, the cluster still does not work properly.

When Ceph hangs, there are still jobs running from other hosts, which only have the cluster mounted via RBD and the CephFS kernel driver. Five clients each run a similar job in a loop, e.g.:

dd if=/dev/zero of=/mnt/ceph/$filename bs=1M count=$random; dd if=/mnt/ceph/$filename of=/dev/null bs=512k; rm -f /mnt/ceph/$filename
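For completeness, a minimal sketch of what each client runs (the loop wrapper and the way $filename and $random are chosen are illustrative; only the dd/rm commands themselves are the actual test):

#!/bin/bash
# Illustrative reconstruction of one client's load loop.
filename="stress.$$"                   # unique name per client (assumption)
while true; do
    random=$(( (RANDOM % 1024) + 1 ))  # file size in MB, 1-1024 (assumption)
    dd if=/dev/zero of=/mnt/ceph/$filename bs=1M count=$random
    dd if=/mnt/ceph/$filename of=/dev/null bs=512k
    rm -f /mnt/ceph/$filename
done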

When the cluster is fresh after a new deploy, this test works properly, but after we fail one of our OSDs a few times, the cluster stops responding and all dd processes go into state D (uninterruptible sleep).
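The hung processes can be listed, and their kernel wait channels dumped, with something like the following (generic Linux debugging, nothing Ceph-specific; SysRq must be enabled for the second part):

# list processes in uninterruptible sleep (state D) with their wait channel
ps axo pid,stat,wchan:32,cmd | awk '$2 ~ /^D/'

# dump stack traces of all blocked tasks into the kernel log, then read them
echo w > /proc/sysrq-trigger
dmesg | tail -n 100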

We have a cluster built with 3 nodes, with journals on an SSD disk:
System: Debian 7.0, kernel 3.2.0-3-amd64

#ceph osd tree

# id    weight    type name    up/down    reweight
-1    6    root default
-3    6        rack unknownrack
-2    2            host uranos
0    1                osd.0    up    1
1    1                osd.1    up    1
-4    3            host node04
401    1                osd.401    up    1
402    1                osd.402    up    1
403    1                osd.403    up    1
-5    1            host node03
2    1                osd.2    up    1

#mount
/dev/sdb1 /var/lib/ceph/osd/ceph-401 ext4 rw,sync,noatime,user_xattr,barrier=0,data=writeback 0 0
/dev/sdc1 /var/lib/ceph/osd/ceph-402 ext4 rw,sync,noatime,user_xattr,barrier=0,data=writeback 0 0
/dev/sde1 /var/lib/ceph/osd/ceph-403 ext4 rw,sync,noatime,user_xattr,barrier=0,data=writeback 0 0
/dev/sdd1 /var/lib/ceph/journal ext4 rw,sync,noatime,user_xattr,barrier=0,data=writeback 0 0
/dev/sdd2 /var/lib/ceph/mon ext4 rw,sync,noatime,user_xattr,barrier=0,data=writeback 0 0


#ceph -s
   health HEALTH_WARN 111 pgs peering; 111 pgs stuck inactive; 45 pgs stuck unclean
   monmap e1: 1 mons at {alfa=10.32.20.46:6789/0}, election epoch 1, quorum 0 alfa
   osdmap e853: 6 osds: 6 up, 6 in
   pgmap v10870: 1152 pgs: 871 active+clean, 111 peering, 170 active+clean+scrubbing; 186 GB data, 478 GB used, 6500 GB / 6979 GB avail
   mdsmap e9: 1/1/1 up {0=alfa=up:active}
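
The stuck PGs can be examined with the standard CLI (<pgid> below is a placeholder for one of the inactive PG ids reported by ceph health detail):

ceph health detail              # list the individual stuck PGs
ceph pg dump_stuck inactive     # PGs stuck in peering / inactive
ceph pg <pgid> query            # detailed peering state of a single PG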

### /var/log/ceph.log
http://pastebin.com/z6prrnS4

### CephFS kernel client mount
10.32.20.46:6789:/ on /mnt/ceph type ceph (rw,relatime,name=admin,secret=<hidden>)
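
For reference, this mount was created with the kernel client roughly like this (secret elided, as above):

mount -t ceph 10.32.20.46:6789:/ /mnt/ceph -o name=admin,secret=<hidden>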

strace ls -al /mnt/ceph
http://pastebin.com/hqwa3sDt

### ceph.conf
[global]

    auth supported = cephx

[osd]
    osd journal size = 1000
    filestore xattr use omap = true
    osd journal = /var/lib/ceph/journal/osd.$id/journal

[mon.alfa]
    host = node04
    mon addr = 10.32.20.46:6789

[osd.401]
    host = node04

[osd.402]
    host = node04

[osd.403]
    host = node04

[osd.2]
    host = node03

[osd.0]
    host = uranos

[osd.1]
    host = uranos

[mds.alfa]
    host = node04

Any suggestions? Thanks!
--
Best,
blink
