Hi all,

I'm in the process of setting up a storage cluster for production use. At the moment I have it in development and am testing the robustness of the cluster. One key thing I'm conscious of is single points of failure, so I'm testing the cluster by simulating node outages (hard powering off nodes).

The setup at present is three identical nodes:

- Ubuntu 12.04 LTS AMD64
- Intel Core i3 3570T CPUs
- 8GB RAM
- Dual Gigabit Ethernet (one interface public, one cluster)
- 60GB Intel 520S SSD
- 2× Seagate SV35 3TB HDDs for OSDs

Partition structure is as follows:

> root@bnedevsn1:~# parted /dev/sda print
> Model: ATA ST3000VX000-1CU1 (scsi)
> Disk /dev/sda: 3001GB
> Sector size (logical/physical): 512B/4096B
> Partition Table: gpt
>
> Number  Start   End     Size    File system  Name       Flags
>  1      1049kB  3001GB  3001GB  xfs          ceph data
>
> root@bnedevsn1:~# parted /dev/sdb print
> Model: ATA ST3000VX000-1CU1 (scsi)
> Disk /dev/sdb: 3001GB
> Sector size (logical/physical): 512B/4096B
> Partition Table: gpt
>
> Number  Start   End     Size    File system  Name       Flags
>  1      1049kB  3001GB  3001GB  xfs          ceph data
>
> root@bnedevsn1:~# parted /dev/sdc print
> Model: ATA INTEL SSDSC2CW06 (scsi)
> Disk /dev/sdc: 60.0GB
> Sector size (logical/physical): 512B/512B
> Partition Table: msdos
>
> Number  Start   End     Size    Type      File system     Flags
>  1      1049kB  2264MB  2263MB  primary   xfs             boot
>  2      2265MB  60.0GB  57.8GB  extended
>  5      2265MB  22.7GB  20.5GB  logical
>  6      22.7GB  43.2GB  20.5GB  logical
>  7      43.2GB  60.0GB  16.8GB  logical   linux-swap(v1)

sdc5 and sdc6 are journals for the OSDs; sdc1 is /.

Each node runs two OSD daemons and a monitor daemon. I'm not sure whether this is safe or not; the QuickStart guide seems to suggest such a setup for testing, so I'm guessing it is.

My configuration looks like this:

> root@sn1:~# cat /etc/ceph/ceph.conf
> [global]
> # Use cephx authentication
> auth cluster required = cephx
> auth service required = cephx
> auth client required = cephx
>
> # Our filesystem UUID
> fsid = [...]
>
> public network = 10.20.30.0/24
> cluster network = 10.20.40.0/24
>
> # By default, keep at least 3 replicas of all pools
> osd pool default size = 3
>
> # and use 200 placement groups (6 OSDs * 100 / 3 replicas)
> osd pool default pg num = 200
> osd pool default pgp num = 200
> # Monitors
> [mon.0]
> # debug mon = 20
> # debug paxos = 20
> # debug auth = 20
>
> host = sn0
> mon addr = 10.20.30.224:6789
> mon data = /var/lib/ceph/mon/ceph-0
> public addr = 10.20.30.224
> cluster addr = 10.20.40.64
>
> [mon.1]
> # debug mon = 20
> # debug paxos = 20
> # debug auth = 20
>
> host = sn1
> mon addr = 10.20.30.225:6789
> mon data = /var/lib/ceph/mon/ceph-1
> public addr = 10.20.30.225
> cluster addr = 10.20.40.65
>
> [mon.2]
> # debug mon = 20
> # debug paxos = 20
> # debug auth = 20
>
> host = sn2
> mon addr = 10.20.30.226:6789
> mon data = /var/lib/ceph/mon/ceph-2
> public addr = 10.20.30.226
> cluster addr = 10.20.40.66
>
> # Metadata servers
>
> [osd.0]
> host = sn0
>
> osd journal size = 20000
>
> osd mkfs type = xfs
> devs = /dev/sda1
> osd journal = /dev/sdc5
> osd addr = 10.20.30.224:6789
> osd data = /var/lib/ceph/osd/ceph-0
> public addr = 10.20.30.224
> cluster addr = 10.20.40.64
>
> [osd.1]
> host = sn0
>
> osd journal size = 20000
>
> osd mkfs type = xfs
> devs = /dev/sdb1
> osd journal = /dev/sdc6
> osd addr = 10.20.30.224:6789
> osd data = /var/lib/ceph/osd/ceph-1
> public addr = 10.20.30.224
> cluster addr = 10.20.40.64
>
> [osd.2]
> host = sn1
>
> osd journal size = 20000
>
> osd mkfs type = xfs
> devs = /dev/sda1
> osd journal = /dev/sdc5
> osd addr = 10.20.30.225:6789
> osd data = /var/lib/ceph/osd/ceph-2
> public addr = 10.20.30.225
> cluster addr = 10.20.40.65
>
> [osd.3]
> host = sn1
>
> osd journal size = 20000
>
> osd mkfs type = xfs
> devs = /dev/sdb1
> osd journal = /dev/sdc6
> osd addr = 10.20.30.225:6789
> osd data = /var/lib/ceph/osd/ceph-3
> public addr = 10.20.30.225
> cluster addr = 10.20.40.65
>
> [osd.4]
> host = sn2
>
> osd journal size = 20000
>
> osd mkfs type = xfs
> devs = /dev/sda1
> osd journal = /dev/sdc5
> osd addr = 10.20.30.226:6789
> osd data = /var/lib/ceph/osd/ceph-4
> public addr = 10.20.30.226
> cluster addr = 10.20.40.66
>
> [osd.5]
> host = sn2
>
> osd journal size = 20000
>
> osd mkfs type = xfs
> devs = /dev/sdb1
> osd journal = /dev/sdc6
> osd addr = 10.20.30.226:6789
> osd data = /var/lib/ceph/osd/ceph-5
> public addr = 10.20.30.226
> cluster addr = 10.20.40.66

My test client has a single gigabit link to the cluster and is running a similar configuration (but with a 240GB SSD and no HDDs), with libvirt and KVM. The VM has an RBD volume attached to it via virtio, and runs Windows Server 2008 R2.

The cluster was fully up when I started the VM. I observed that when I downed one of the storage nodes, the I/O to the disk seemed to freeze up. I waited for the RBD client to time out the TCP connection to the downed node, but that didn't seem to happen; I could see in netstat that the connections were still "up" after a few minutes. If I then brought that node back up, I'd see a recovery take place, the client would re-connect, and everything would start working again.

The latter fact is good; it means the data is safe. I figure that if I started another Ceph client at this point, it would connect to the two surviving nodes (and time out on the downed one) and otherwise still work. The problem is specifically when a client is already connected to a node and that node goes down: the client doesn't seem to detect that the node is gone.
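For what it's worth, the closest-looking settings I've been able to find in the documentation are sketched below. The values are only my guess at the defaults, so treat them as illustrative; and as far as I can tell they govern how quickly the monitors mark an OSD down and out (after which clients should pick up the new OSD map), rather than any per-client timeout, so I may be looking in the wrong place entirely:

    [global]
        # How often OSDs ping their peers, and how long a peer can go
        # unanswered before it is reported to the monitors as failed.
        # (Values are the defaults as I understand them.)
        osd heartbeat interval = 6
        osd heartbeat grace = 20

        # How long a "down" OSD is left before the monitors mark it "out"
        # and re-replication of its data begins.
        mon osd down out interval = 300

I haven't changed any of these yet; I'd rather understand what the client is actually waiting on first.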
Is there a configuration parameter (in libvirt or Ceph itself) that tells a client how long to wait before considering an OSD/monitor node down and moving on?

Regards,
--
Stuart Longland
Systems Engineer
VRT Systems                          http://www.vrt.com.au
38b Douglas Street                   T: +61 7 3535 9619
Milton QLD 4064                      F: +61 7 3535 9699