Hi all,

I'm in the process of setting up a storage cluster for production use. At the moment I have it in development and am testing the robustness of the cluster. One key thing I'm conscious of is single points of failure, so I'm testing the cluster by simulating node outages (hard powering off nodes).

The setup at present is three identical nodes:

- Ubuntu 12.04 LTS AMD64
- Intel Core i3 3570T CPUs
- 8GB RAM
- Dual Gigabit Ethernet (one interface public, one cluster)
- 60GB Intel 520S SSD
- 2× Seagate SV35 3TB HDDs for OSDs

Partition structure is as follows:

> root@bnedevsn1:~# parted /dev/sda print
> Model: ATA ST3000VX000-1CU1 (scsi)
> Disk /dev/sda: 3001GB
> Sector size (logical/physical): 512B/4096B
> Partition Table: gpt
>
> Number  Start   End     Size    File system  Name       Flags
>  1      1049kB  3001GB  3001GB  xfs          ceph data
>
> root@bnedevsn1:~# parted /dev/sdb print
> Model: ATA ST3000VX000-1CU1 (scsi)
> Disk /dev/sdb: 3001GB
> Sector size (logical/physical): 512B/4096B
> Partition Table: gpt
>
> Number  Start   End     Size    File system  Name       Flags
>  1      1049kB  3001GB  3001GB  xfs          ceph data
>
> root@bnedevsn1:~# parted /dev/sdc print
> Model: ATA INTEL SSDSC2CW06 (scsi)
> Disk /dev/sdc: 60.0GB
> Sector size (logical/physical): 512B/512B
> Partition Table: msdos
>
> Number  Start   End     Size    Type      File system     Flags
>  1      1049kB  2264MB  2263MB  primary   xfs             boot
>  2      2265MB  60.0GB  57.8GB  extended
>  5      2265MB  22.7GB  20.5GB  logical
>  6      22.7GB  43.2GB  20.5GB  logical
>  7      43.2GB  60.0GB  16.8GB  logical   linux-swap(v1)

sdc5 and sdc6 are journals for the OSDs; sdc1 is /.

Each node runs two OSD daemons and a monitor daemon. I'm not sure whether this is safe or not; the QuickStart guide seems to suggest such a setup for testing, so I'm guessing it is.

My configuration looks like this:

> root@sn1:~# cat /etc/ceph/ceph.conf
> [global]
> # Use cephx authentication
> auth cluster required = cephx
> auth service required = cephx
> auth client required = cephx
>
> # Our filesystem UUID
> fsid = [...]
>
> public network = 10.20.30.0/24
> cluster network = 10.20.40.0/24
>
> # By default, keep at least 3 replicas of all pools
> osd pool default size = 3
>
> # and use 200 placement groups (6 OSDs * 100 / 3 replicas)
> osd pool default pg num = 200
> osd pool default pgp num = 200
> # Monitors
> [mon.0]
> # debug mon = 20
> # debug paxos = 20
> # debug auth = 20
>
> host = sn0
> mon addr = 10.20.30.224:6789
> mon data = /var/lib/ceph/mon/ceph-0
> public addr = 10.20.30.224
> cluster addr = 10.20.40.64
>
> [mon.1]
> # debug mon = 20
> # debug paxos = 20
> # debug auth = 20
>
> host = sn1
> mon addr = 10.20.30.225:6789
> mon data = /var/lib/ceph/mon/ceph-1
> public addr = 10.20.30.225
> cluster addr = 10.20.40.65
>
> [mon.2]
> # debug mon = 20
> # debug paxos = 20
> # debug auth = 20
>
> host = sn2
> mon addr = 10.20.30.226:6789
> mon data = /var/lib/ceph/mon/ceph-2
> public addr = 10.20.30.226
> cluster addr = 10.20.40.66
>
> # Metadata servers
>
> [osd.0]
> host = sn0
>
> osd journal size = 20000
>
> osd mkfs type = xfs
> devs = /dev/sda1
> osd journal = /dev/sdc5
> osd addr = 10.20.30.224:6789
> osd data = /var/lib/ceph/osd/ceph-0
> public addr = 10.20.30.224
> cluster addr = 10.20.40.64
>
> [osd.1]
> host = sn0
>
> osd journal size = 20000
>
> osd mkfs type = xfs
> devs = /dev/sdb1
> osd journal = /dev/sdc6
> osd addr = 10.20.30.224:6789
> osd data = /var/lib/ceph/osd/ceph-1
> public addr = 10.20.30.224
> cluster addr = 10.20.40.64
>
> [osd.2]
> host = sn1
>
> osd journal size = 20000
>
> osd mkfs type = xfs
> devs = /dev/sda1
> osd journal = /dev/sdc5
> osd addr = 10.20.30.225:6789
> osd data = /var/lib/ceph/osd/ceph-2
> public addr = 10.20.30.225
> cluster addr = 10.20.40.65
>
> [osd.3]
> host = sn1
>
> osd journal size = 20000
>
> osd mkfs type = xfs
> devs = /dev/sdb1
> osd journal = /dev/sdc6
> osd addr = 10.20.30.225:6789
> osd data = /var/lib/ceph/osd/ceph-3
> public addr = 10.20.30.225
> cluster addr = 10.20.40.65
>
> [osd.4]
> host = sn2
>
> osd journal size = 20000
>
> osd mkfs type = xfs
> devs = /dev/sda1
> osd journal = /dev/sdc5
> osd addr = 10.20.30.226:6789
> osd data = /var/lib/ceph/osd/ceph-4
> public addr = 10.20.30.226
> cluster addr = 10.20.40.66
>
> [osd.5]
> host = sn2
>
> osd journal size = 20000
>
> osd mkfs type = xfs
> devs = /dev/sdb1
> osd journal = /dev/sdc6
> osd addr = 10.20.30.226:6789
> osd data = /var/lib/ceph/osd/ceph-5
> public addr = 10.20.30.226
> cluster addr = 10.20.40.66

My test client has a single gigabit link to the cluster and is running a similar configuration (but with a 240GB SSD and no HDDs), with libvirt and KVM. The VM has an RBD volume attached to it via virtio, and runs Windows Server 2008 R2.

The cluster was fully up when I started the VM. I observed that when I downed one of the storage nodes, the I/O to the disk seemed to freeze up. I waited for the RBD client to time out the TCP connection to the downed node, but that didn't seem to happen; I could see in netstat that the connections were still "up" after a few minutes. If I then brought that node back up, I'd see a recovery take place, the client would re-connect, and everything would start working again.

The latter fact is good; it means the data is safe. I figure that if I started another Ceph client at this point, it would connect to the two surviving nodes (and time out on the downed one) and otherwise still work. The problem is specifically when a client is already connected to a node and that node goes down: the client doesn't seem to detect that the node is gone.
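For what it's worth, the closest-looking settings I've been able to find in the documentation are sketched below. The values are only my guess at the defaults, so treat them as illustrative; and as far as I can tell they govern how quickly the monitors mark an OSD down and out (after which clients should pick up the new OSD map), rather than any per-client timeout, so I may be looking in the wrong place entirely:

    [global]
        # How often OSDs ping their peers, and how long a peer can go
        # unanswered before it is reported to the monitors as failed.
        # (Values are the defaults as I understand them.)
        osd heartbeat interval = 6
        osd heartbeat grace = 20

        # How long a "down" OSD is left before the monitors mark it "out"
        # and re-replication of its data begins.
        mon osd down out interval = 300

I haven't changed any of these yet; I'd rather understand what the client is actually waiting on first.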
Is there a configuration parameter (in libvirt or Ceph itself) that tells a client how long to wait before considering an OSD/monitor node down and moving on?

Regards,
--
Stuart Longland
Systems Engineer
VRT Systems                          http://www.vrt.com.au
38b Douglas Street                   T: +61 7 3535 9619
Milton QLD 4064                      F: +61 7 3535 9699