On Thu, Jan 23, 2014 at 5:24 PM, Stuart Longland <stuartl@xxxxxxxxxx> wrote:
> Hi all,
>
> I'm in the process of setting up a storage cluster for production use.
> At the moment I have it in development and am testing the robustness of
> the cluster. One key thing I'm conscious of is single points of
> failure. Thus, I'm testing the cluster by simulating node outages (hard
> powering-off nodes).
>
> The set up at present is 3 identical nodes:
> - Ubuntu 12.04 LTS AMD64
> - Intel Core i3 3570T CPUs
> - 8GB RAM
> - Dual Gigabit Ethernet (one interface public, one cluster)
> - 60GB Intel 520S SSD
> - 2× Seagate SV35 3TB HDD for OSDs
>
> Partition structure is as follows:
>
>> root@bnedevsn1:~# parted /dev/sda print
>> Model: ATA ST3000VX000-1CU1 (scsi)
>> Disk /dev/sda: 3001GB
>> Sector size (logical/physical): 512B/4096B
>> Partition Table: gpt
>>
>> Number  Start   End     Size    File system  Name       Flags
>>  1      1049kB  3001GB  3001GB  xfs          ceph data
>>
>> root@bnedevsn1:~# parted /dev/sdb print
>> Model: ATA ST3000VX000-1CU1 (scsi)
>> Disk /dev/sdb: 3001GB
>> Sector size (logical/physical): 512B/4096B
>> Partition Table: gpt
>>
>> Number  Start   End     Size    File system  Name       Flags
>>  1      1049kB  3001GB  3001GB  xfs          ceph data
>>
>> root@bnedevsn1:~# parted /dev/sdc print
>> Model: ATA INTEL SSDSC2CW06 (scsi)
>> Disk /dev/sdc: 60.0GB
>> Sector size (logical/physical): 512B/512B
>> Partition Table: msdos
>>
>> Number  Start   End     Size    Type      File system     Flags
>>  1      1049kB  2264MB  2263MB  primary   xfs             boot
>>  2      2265MB  60.0GB  57.8GB  extended
>>  5      2265MB  22.7GB  20.5GB  logical
>>  6      22.7GB  43.2GB  20.5GB  logical
>>  7      43.2GB  60.0GB  16.8GB  logical   linux-swap(v1)
>
> sdc[56] are journals for the OSDs. sdc1 is /.
>
> Each node runs two OSD daemons and a monitor daemon. I'm not sure
> whether this is safe or not; I see the QuickStart guide seems to suggest
> such a set-up for testing, so I'm guessing it's safe.

That should be fine for a 3-node cluster, yeah. You should make sure that
your CRUSH rules are distributing data across hosts, but I believe it does
that by default now.
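If you want to confirm that, the CRUSH map is easy enough to inspect (a
quick sketch; the temp file paths here are just examples):

  # dump and decompile the current CRUSH map
  ceph osd getcrushmap -o /tmp/crushmap
  crushtool -d /tmp/crushmap -o /tmp/crushmap.txt

  # in the rule(s) your pools use, you want replicas separated by host
  # rather than by individual OSD, i.e. a step like
  #   step chooseleaf firstn 0 type host
  grep chooseleaf /tmp/crushmap.txt

If that step says "type osd" instead, two replicas of a PG can end up on
the same box, which would defeat the point of your failure testing.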
> My configuration looks like this:
>> root@sn1:~# cat /etc/ceph/ceph.conf
>> [global]
>> # Use cephx authentication
>> auth cluster required = cephx
>> auth service required = cephx
>> auth client required = cephx
>>
>> # Our filesystem UUID
>> fsid = [...]
>>
>> public network = 10.20.30.0/24
>> cluster network = 10.20.40.0/24
>>
>> # By default, keep at least 3 replicas of all pools
>> osd pool default size = 3
>>
>> # and use 200 placement groups (6 OSDs * 100 / 3 replicas)
>> osd pool default pg num = 200
>> osd pool default pgp num = 200
>> # Monitors
>> [mon.0]
>> # debug mon = 20
>> # debug paxos = 20
>> # debug auth = 20
>>
>> host = sn0
>> mon addr = 10.20.30.224:6789
>> mon data = /var/lib/ceph/mon/ceph-0
>> public addr = 10.20.30.224
>> cluster addr = 10.20.40.64
>>
>> [mon.1]
>> # debug mon = 20
>> # debug paxos = 20
>> # debug auth = 20
>>
>> host = sn1
>> mon addr = 10.20.30.225:6789
>> mon data = /var/lib/ceph/mon/ceph-1
>> public addr = 10.20.30.225
>> cluster addr = 10.20.40.65
>>
>> [mon.2]
>> # debug mon = 20
>> # debug paxos = 20
>> # debug auth = 20
>>
>> host = sn2
>> mon addr = 10.20.30.226:6789
>> mon data = /var/lib/ceph/mon/ceph-2
>> public addr = 10.20.30.226
>> cluster addr = 10.20.40.66
>>
>> # Metadata servers
>>
>> [osd.0]
>> host = sn0
>>
>> osd journal size = 20000
>>
>> osd mkfs type = xfs
>> devs = /dev/sda1
>> osd journal = /dev/sdc5
>> osd addr = 10.20.30.224:6789
>> osd data = /var/lib/ceph/osd/ceph-0
>> public addr = 10.20.30.224
>> cluster addr = 10.20.40.64
>>
>> [osd.1]
>> host = sn0
>>
>> osd journal size = 20000
>>
>> osd mkfs type = xfs
>> devs = /dev/sdb1
>> osd journal = /dev/sdc6
>> osd addr = 10.20.30.224:6789
>> osd data = /var/lib/ceph/osd/ceph-1
>> public addr = 10.20.30.224
>> cluster addr = 10.20.40.64
>>
>> [osd.2]
>> host = sn1
>>
>> osd journal size = 20000
>>
>> osd mkfs type = xfs
>> devs = /dev/sda1
>> osd journal = /dev/sdc5
>> osd addr = 10.20.30.225:6789
>> osd data = /var/lib/ceph/osd/ceph-2
>> public addr = 10.20.30.225
>> cluster addr = 10.20.40.65
>>
>> [osd.3]
>> host = sn1
>>
>> osd journal size = 20000
>>
>> osd mkfs type = xfs
>> devs = /dev/sdb1
>> osd journal = /dev/sdc6
>> osd addr = 10.20.30.225:6789
>> osd data = /var/lib/ceph/osd/ceph-3
>> public addr = 10.20.30.225
>> cluster addr = 10.20.40.65
>>
>> [osd.4]
>> host = sn2
>>
>> osd journal size = 20000
>>
>> osd mkfs type = xfs
>> devs = /dev/sda1
>> osd journal = /dev/sdc5
>> osd addr = 10.20.30.226:6789
>> osd data = /var/lib/ceph/osd/ceph-4
>> public addr = 10.20.30.226
>> cluster addr = 10.20.40.66
>>
>> [osd.5]
>> host = sn2
>>
>> osd journal size = 20000
>>
>> osd mkfs type = xfs
>> devs = /dev/sdb1
>> osd journal = /dev/sdc6
>> osd addr = 10.20.30.226:6789
>> osd data = /var/lib/ceph/osd/ceph-5
>> public addr = 10.20.30.226
>> cluster addr = 10.20.40.66
>
> My test client has a single gigabit link to the cluster, and is running
> a similar configuration (but a 240GB SSD and no HDDs) with libvirt and
> KVM. The VM has a RBD volume attached to it via virtio, and runs
> Windows Server 2008 R2.
>
> The cluster was fully up when I started the VM. I observed that when I
> downed one of the storage nodes, the I/O to the disk seemed to freeze
> up. I waited for the RBD client to time out the TCP connection to the
> downed node, but that didn't seem to happen.
>
> I could see in netstat that the connections were still "up" after a few
> minutes.

How long did you wait? By default the node will get timed out after ~30
seconds, be marked down, and then the remaining OSDs will take over all
activity for it. The client's connections might not close for several
minutes more, but that on its own isn't a problem. Did the cluster
actually detect the node as down? (You could check this by looking at the
ceph -w output or similar when running the test.)
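Something along these lines while you run the test would tell you (just a
sketch; exact output and defaults vary a bit by release):

  # in one terminal, watch cluster events live while you power the node off
  ceph -w

  # or poll from another shell; the two OSDs on the dead host should show
  # as "down" in the up/down column within the heartbeat grace period
  ceph osd tree
  ceph health detail

  # the timings involved are governed by (defaults quoted from memory):
  #   osd heartbeat grace        ~20s  before peers report the OSD down
  #   mon osd down out interval  300s  before it is marked out and
  #                                    re-replication starts

Client I/O should unblock once the OSDs are marked down and the PGs
re-peer on the surviving replicas; the five-minute down-out interval only
affects when re-replication starts, not when I/O resumes.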
If it was detected as down and the VM continued to block (modulo maybe a
little time for the client to decide its monitor was down; I forget what
the timeouts are there), that would be odd.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com