On Thu, Jan 23, 2014 at 5:24 PM, Stuart Longland <stuartl@xxxxxxxxxx> wrote:
> Hi all,
>
> I'm in the process of setting up a storage cluster for production use.
> At the moment I have it in development and am testing the robustness of
> the cluster. One key thing I'm conscious of is single points of
> failure. Thus, I'm testing the cluster by simulating node outages (hard
> powering-off nodes).
>
> The set up at present is 3 identical nodes:
> - Ubuntu 12.04 LTS AMD64
> - Intel Core i3 3570T CPUs
> - 8GB RAM
> - Dual Gigabit Ethernet (one interface public, one cluster)
> - 60GB Intel 520S SSD
> - 2× Seagate SV35 3TB HDD for OSDs
>
> Partition structure is as follows:
>
>> root@bnedevsn1:~# parted /dev/sda print
>> Model: ATA ST3000VX000-1CU1 (scsi)
>> Disk /dev/sda: 3001GB
>> Sector size (logical/physical): 512B/4096B
>> Partition Table: gpt
>>
>> Number  Start   End     Size    File system  Name       Flags
>>  1      1049kB  3001GB  3001GB  xfs          ceph data
>>
>> root@bnedevsn1:~# parted /dev/sdb print
>> Model: ATA ST3000VX000-1CU1 (scsi)
>> Disk /dev/sdb: 3001GB
>> Sector size (logical/physical): 512B/4096B
>> Partition Table: gpt
>>
>> Number  Start   End     Size    File system  Name       Flags
>>  1      1049kB  3001GB  3001GB  xfs          ceph data
>>
>> root@bnedevsn1:~# parted /dev/sdc print
>> Model: ATA INTEL SSDSC2CW06 (scsi)
>> Disk /dev/sdc: 60.0GB
>> Sector size (logical/physical): 512B/512B
>> Partition Table: msdos
>>
>> Number  Start   End     Size    Type      File system     Flags
>>  1      1049kB  2264MB  2263MB  primary   xfs             boot
>>  2      2265MB  60.0GB  57.8GB  extended
>>  5      2265MB  22.7GB  20.5GB  logical
>>  6      22.7GB  43.2GB  20.5GB  logical
>>  7      43.2GB  60.0GB  16.8GB  logical   linux-swap(v1)
>
> sdc[56] are journals for the OSDs. sdc1 is /.
>
> Each node runs two OSD daemons and a monitor daemon. I'm not sure
> whether this is safe or not; I see the QuickStart guide seems to suggest
> such a set-up for testing, so I'm guessing it's safe.

That should be fine for a 3-node cluster, yeah. You should make sure that
your CRUSH rules are distributing data across hosts, but I believe it does
that by default now.
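If you want to confirm that, the CRUSH map is easy enough to inspect (a
quick sketch; the temp file paths here are just examples):

  # dump and decompile the current CRUSH map
  ceph osd getcrushmap -o /tmp/crushmap
  crushtool -d /tmp/crushmap -o /tmp/crushmap.txt

  # in the rule(s) your pools use, you want replicas separated by host
  # rather than by individual OSD, i.e. a step like
  #   step chooseleaf firstn 0 type host
  grep chooseleaf /tmp/crushmap.txt

If that step says "type osd" instead, two replicas of a PG can end up on
the same box, which would defeat the point of your failure testing.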
> My configuration looks like this:
>> root@sn1:~# cat /etc/ceph/ceph.conf
>> [global]
>> # Use cephx authentication
>> auth cluster required = cephx
>> auth service required = cephx
>> auth client required = cephx
>>
>> # Our filesystem UUID
>> fsid = [...]
>>
>> public network = 10.20.30.0/24
>> cluster network = 10.20.40.0/24
>>
>> # By default, keep at least 3 replicas of all pools
>> osd pool default size = 3
>>
>> # and use 200 placement groups (6 OSDs * 100 / 3 replicas)
>> osd pool default pg num = 200
>> osd pool default pgp num = 200
>> # Monitors
>> [mon.0]
>> # debug mon = 20
>> # debug paxos = 20
>> # debug auth = 20
>>
>> host = sn0
>> mon addr = 10.20.30.224:6789
>> mon data = /var/lib/ceph/mon/ceph-0
>> public addr = 10.20.30.224
>> cluster addr = 10.20.40.64
>>
>> [mon.1]
>> # debug mon = 20
>> # debug paxos = 20
>> # debug auth = 20
>>
>> host = sn1
>> mon addr = 10.20.30.225:6789
>> mon data = /var/lib/ceph/mon/ceph-1
>> public addr = 10.20.30.225
>> cluster addr = 10.20.40.65
>>
>> [mon.2]
>> # debug mon = 20
>> # debug paxos = 20
>> # debug auth = 20
>>
>> host = sn2
>> mon addr = 10.20.30.226:6789
>> mon data = /var/lib/ceph/mon/ceph-2
>> public addr = 10.20.30.226
>> cluster addr = 10.20.40.66
>>
>> # Metadata servers
>>
>> [osd.0]
>> host = sn0
>>
>> osd journal size = 20000
>>
>> osd mkfs type = xfs
>> devs = /dev/sda1
>> osd journal = /dev/sdc5
>> osd addr = 10.20.30.224:6789
>> osd data = /var/lib/ceph/osd/ceph-0
>> public addr = 10.20.30.224
>> cluster addr = 10.20.40.64
>>
>> [osd.1]
>> host = sn0
>>
>> osd journal size = 20000
>>
>> osd mkfs type = xfs
>> devs = /dev/sdb1
>> osd journal = /dev/sdc6
>> osd addr = 10.20.30.224:6789
>> osd data = /var/lib/ceph/osd/ceph-1
>> public addr = 10.20.30.224
>> cluster addr = 10.20.40.64
>>
>> [osd.2]
>> host = sn1
>>
>> osd journal size = 20000
>>
>> osd mkfs type = xfs
>> devs = /dev/sda1
>> osd journal = /dev/sdc5
>> osd addr = 10.20.30.225:6789
>> osd data = /var/lib/ceph/osd/ceph-2
>> public addr = 10.20.30.225
>> cluster addr = 10.20.40.65
>>
>> [osd.3]
>> host = sn1
>>
>> osd journal size = 20000
>>
>> osd mkfs type = xfs
>> devs = /dev/sdb1
>> osd journal = /dev/sdc6
>> osd addr = 10.20.30.225:6789
>> osd data = /var/lib/ceph/osd/ceph-3
>> public addr = 10.20.30.225
>> cluster addr = 10.20.40.65
>>
>> [osd.4]
>> host = sn2
>>
>> osd journal size = 20000
>>
>> osd mkfs type = xfs
>> devs = /dev/sda1
>> osd journal = /dev/sdc5
>> osd addr = 10.20.30.226:6789
>> osd data = /var/lib/ceph/osd/ceph-4
>> public addr = 10.20.30.226
>> cluster addr = 10.20.40.66
>>
>> [osd.5]
>> host = sn2
>>
>> osd journal size = 20000
>>
>> osd mkfs type = xfs
>> devs = /dev/sdb1
>> osd journal = /dev/sdc6
>> osd addr = 10.20.30.226:6789
>> osd data = /var/lib/ceph/osd/ceph-5
>> public addr = 10.20.30.226
>> cluster addr = 10.20.40.66
>
> My test client has a single gigabit link to the cluster, and is running
> a similar configuration (but a 240GB SSD and no HDDs) with libvirt and
> KVM. The VM has a RBD volume attached to it via virtio, and runs
> Windows Server 2008 R2.
>
> The cluster was fully up when I started the VM. I observed that when I
> downed one of the storage nodes, the I/O to the disk seemed to freeze
> up. I waited for the RBD client to time out the TCP connection to the
> downed node, but that didn't seem to happen.
>
> I could see in netstat that the connections were still "up" after a few
> minutes.

How long did you wait? By default the node will get timed out after ~30
seconds, be marked down, and then the remaining OSDs will take over all
activity for it. The client's connections might not close for several
minutes more, but that on its own isn't a problem. Did the cluster
actually detect the node as down? (You could check this by looking at the
ceph -w output or similar when running the test.)
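Something along these lines while you run the test would tell you (just a
sketch; exact output and defaults vary a bit by release):

  # in one terminal, watch cluster events live while you power the node off
  ceph -w

  # or poll from another shell; the two OSDs on the dead host should show
  # as "down" in the up/down column within the heartbeat grace period
  ceph osd tree
  ceph health detail

  # the timings involved are governed by (defaults quoted from memory):
  #   osd heartbeat grace        ~20s  before peers report the OSD down
  #   mon osd down out interval  300s  before it is marked out and
  #                                    re-replication starts

Client I/O should unblock once the OSDs are marked down and the PGs
re-peer on the surviving replicas; the five-minute down-out interval only
affects when re-replication starts, not when I/O resumes.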
If it was detected as down and the VM continued to block (modulo maybe a
little time for the client to decide its monitor was down; I forget what
the timeouts are there), that would be odd.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com