Re: How to do maintenance without falling out of service?

Hi David,
What are your pools' size and min_size settings?
In your cluster, you may need to set all pools to min_size=1 before shutting down a server.
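
Something like this, for each of your pools (the pool name "rbd" below is only an example; substitute your own pool names):

    # check the current replication settings for a pool
    ceph osd pool get rbd size
    ceph osd pool get rbd min_size

    # allow writes to proceed with only one replica available during maintenance
    ceph osd pool set rbd min_size 1

Remember to set min_size back to its original value once the host is up again and the cluster has recovered.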


BR,
Luke
MYCOM-OSI
________________________________________
From: ceph-users [ceph-users-bounces@xxxxxxxxxxxxxx] on behalf of J David [j.david.lists@xxxxxxxxx]
Sent: Tuesday, January 20, 2015 12:40 AM
To: ceph-users@xxxxxxxxxxxxxx
Subject:  How to do maintenance without falling out of service?

A couple of weeks ago, we had some involuntary maintenance come up
that required us to briefly turn off one node of a three-node Ceph
cluster.

To our surprise, this resulted in write failures on the VMs backed by
that Ceph cluster, even though we set noout before the maintenance.
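
(For reference, the usual way to set and clear that flag around the
work looks like this:

    ceph osd set noout      # keep OSDs on the down host from being marked out
    # ... shut the host down, do the maintenance, bring it back ...
    ceph osd unset noout    # restore normal out-marking afterwards
)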

This cluster is for bulk storage; it has copies=1 (2 total) and very
large SATA drives.  The OSD tree looks like this:

# id  weight  type name       up/down  reweight
-1    127.1   root default
-2    18.16     host f16
0     4.54        osd.0       up       1
1     4.54        osd.1       up       1
2     4.54        osd.2       up       1
3     4.54        osd.3       up       1
-3    54.48     host f17
4     4.54        osd.4       up       1
5     4.54        osd.5       up       1
6     4.54        osd.6       up       1
7     4.54        osd.7       up       1
8     4.54        osd.8       up       1
9     4.54        osd.9       up       1
10    4.54        osd.10      up       1
11    4.54        osd.11      up       1
12    4.54        osd.12      up       1
13    4.54        osd.13      up       1
14    4.54        osd.14      up       1
15    4.54        osd.15      up       1
-4    54.48     host f18
16    4.54        osd.16      up       1
17    4.54        osd.17      up       1
18    4.54        osd.18      up       1
19    4.54        osd.19      up       1
20    4.54        osd.20      up       1
21    4.54        osd.21      up       1
22    4.54        osd.22      up       1
23    4.54        osd.23      up       1
24    4.54        osd.24      up       1
25    4.54        osd.25      up       1
26    4.54        osd.26      up       1
27    4.54        osd.27      up       1

The host that was turned off was f18.  f16 does have a handful of
OSDs, but it is mostly there to provide an odd number of monitors.
The cluster is very lightly used; here is the current status:

    cluster e9c32e63-f3eb-4c25-b172-4815ed566ec7
     health HEALTH_OK
     monmap e3: 3 mons at {f16=192.168.19.216:6789/0,f17=192.168.19.217:6789/0,f18=192.168.19.218:6789/0},
            election epoch 28, quorum 0,1,2 f16,f17,f18
     osdmap e1674: 28 osds: 28 up, 28 in
      pgmap v12965109: 1152 pgs, 3 pools, 11139 GB data, 2784 kobjects
            22314 GB used, 105 TB / 127 TB avail
                1152 active+clean
  client io 38162 B/s wr, 9 op/s

Where did we go wrong last time?  How can we do the same maintenance
to f17 (taking it offline for about 15-30 minutes) without repeating
our mistake?

As it stands, it seems like we have inadvertently created a cluster
with three single points of failure, rather than none.  That has not
been our experience with our other clusters, so we're really confused
at present.

Thanks for any advice!

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



