Re: How to do maintenance without falling out of service?

On Mon, Jan 19, 2015 at 8:40 AM, J David <j.david.lists@xxxxxxxxx> wrote:
> A couple of weeks ago, we had some involuntary maintenance come up
> that required us to briefly turn off one node of a three-node ceph
> cluster.
>
> To our surprise, this resulted in write failures on the VMs on that
> Ceph cluster, even though we had set noout before the maintenance.
>
> This cluster is for bulk storage, it has copies=1 (2 total) and very
> large SATA drives.  The OSD tree looks like this:

2 total? Do you mean that's the pool size?
Depending on how you configured things, it's possible that min_size is
also set to 2, which would be bad for your purposes (it should be set
to 1).
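
For example, to check and fix that (the pool name "rbd" below is just a
placeholder; substitute your actual pool name):

    ceph osd pool get rbd size        # total number of replicas
    ceph osd pool get rbd min_size    # if this says 2, writes block with a host down
    ceph osd pool set rbd min_size 1  # allow I/O while only one replica is available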

But without more information about what the cluster was reporting
during that time, we can't tell you more.
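
As for doing the f17 maintenance itself, the usual sequence is roughly
the sketch below (how you stop the OSD daemons depends on your init
system, so that step is only a comment):

    ceph osd set noout    # prevent the down OSDs from being marked out
    # stop the ceph-osd daemons on f17, power it down, do the maintenance
    ceph -s               # expect HEALTH_WARN / degraded PGs while f17 is down
    ceph osd unset noout  # once f17's OSDs are back up and in
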
-Greg

>
> # id   weight   type name        up/down   reweight
> -1     127.1    root default
> -2     18.16      host f16
> 0      4.54         osd.0        up        1
> 1      4.54         osd.1        up        1
> 2      4.54         osd.2        up        1
> 3      4.54         osd.3        up        1
> -3     54.48      host f17
> 4      4.54         osd.4        up        1
> 5      4.54         osd.5        up        1
> 6      4.54         osd.6        up        1
> 7      4.54         osd.7        up        1
> 8      4.54         osd.8        up        1
> 9      4.54         osd.9        up        1
> 10     4.54         osd.10       up        1
> 11     4.54         osd.11       up        1
> 12     4.54         osd.12       up        1
> 13     4.54         osd.13       up        1
> 14     4.54         osd.14       up        1
> 15     4.54         osd.15       up        1
> -4     54.48      host f18
> 16     4.54         osd.16       up        1
> 17     4.54         osd.17       up        1
> 18     4.54         osd.18       up        1
> 19     4.54         osd.19       up        1
> 20     4.54         osd.20       up        1
> 21     4.54         osd.21       up        1
> 22     4.54         osd.22       up        1
> 23     4.54         osd.23       up        1
> 24     4.54         osd.24       up        1
> 25     4.54         osd.25       up        1
> 26     4.54         osd.26       up        1
> 27     4.54         osd.27       up        1
>
> The host that was turned off was f18.  f16 does have a handful of
> OSDs, but it is mostly there to provide an odd number of monitors.
> The cluster is very lightly used, here is the current status:
>
>     cluster e9c32e63-f3eb-4c25-b172-4815ed566ec7
>      health HEALTH_OK
>      monmap e3: 3 mons at
> {f16=192.168.19.216:6789/0,f17=192.168.19.217:6789/0,f18=192.168.19.218:6789/0},
> election epoch 28, quorum 0,1,2 f16,f17,f18
>      osdmap e1674: 28 osds: 28 up, 28 in
>       pgmap v12965109: 1152 pgs, 3 pools, 11139 GB data, 2784 kobjects
>             22314 GB used, 105 TB / 127 TB avail
>                 1152 active+clean
>   client io 38162 B/s wr, 9 op/s
>
> Where did we go wrong last time?  How can we do the same maintenance
> to f17 (taking it offline for about 15-30 minutes) without repeating
> our mistake?
>
> As it stands, it seems like we have inadvertently created a cluster
> with three single points of failure, rather than none.  That has not
> been our experience with our other clusters, so we're really confused
> at present.
>
> Thanks for any advice!
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com