Re: I/O hangs with 2 node failure even if one node isn't involved in I/O

Adam Carheden <carheden@xxxxxxxx> · Tue, 21 Mar 2017 12:59:19 -0600

Let's see if I got this. 4 host cluster. size=3, min_size=2. 2 hosts
fail. Are all of the following accurate?

a. An RBD image is split into lots of objects, which will probably be
spread across all 4 hosts.

b. Some objects will have 2 of their 3 replicas on 2 of the offline OSDs.

c. Reads can continue from the single online OSD even in pgs that
happened to have two of 3 osds offline.

d. Writes hang for pgs that have 2 offline OSDs because CRUSH can't meet
the min_size=2 constraint.

e. Rebalancing does not occur because with only two hosts online there
is no way for CRUSH to meet the size=3 constraint even if it were to
rebalance.

f. I/O can be restored by setting min_size=1.

g. Alternatively, I/O can be restored by setting size=2, which would
kick off rebalancing and restore I/O as the pgs come into compliance
with the size=2 constraint.

h. If I instead have a cluster with 10 hosts, size=3 and min_size=2 and
two hosts fail, some pgs would have only 1 OSD online, but rebalancing
would start immediately since CRUSH can honor the size=3 constraint by
rebalancing. This means more nodes make for a more reliable cluster.

i. If I wanted to force CRUSH to bring I/O back online with size=3 and
min_size=2 but only 2 hosts online, I could remove the host bucket from
the crushmap. CRUSH would then rebalance, but some PGs would likely end
up with 3 OSDs all on the same host. (This is theory. I promise not to
do any such thing to a production system ;)
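
To make (b), (d) and (h) concrete, here is a rough Python sketch of the
counting argument. It is not CRUSH: it places each pg on `size` distinct
hosts chosen uniformly at random and ignores weights, rooms and individual
OSDs. The names (simulate, num_pgs, failed_hosts) are made up for this
example. It only counts how many pgs fall below min_size once two hosts
are gone.

```python
import random
from collections import Counter


def simulate(num_hosts, failed_hosts, size=3, min_size=2, num_pgs=1024, seed=0):
    """Toy model: count surviving replicas per pg after some hosts fail.

    Returns (surviving, blocked), where surviving maps
    "replicas still online" -> "number of pgs", and blocked is the
    number of pgs with fewer than min_size replicas online.
    """
    rng = random.Random(seed)
    hosts = list(range(num_hosts))
    surviving = Counter()
    for _ in range(num_pgs):
        # One replica per host, hosts picked uniformly (an assumption;
        # real CRUSH placement is weighted and deterministic).
        replicas = rng.sample(hosts, size)
        alive = sum(1 for h in replicas if h not in failed_hosts)
        surviving[alive] += 1
    blocked = sum(n for alive, n in surviving.items() if alive < min_size)
    return surviving, blocked


for n in (4, 10):
    surviving, blocked = simulate(n, failed_hosts={0, 1})
    print(f"{n} hosts: replicas online -> pgs {dict(surviving)}, "
          f"below min_size: {blocked}")
```

With 4 hosts every pg skips exactly one host, so after two failures each
pg is left with either 2 replicas or 1, never 3. With 10 hosts most pgs
do not touch the failed hosts at all, and the remaining 8 hosts still
leave room for 3 replicas on distinct hosts.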

Thanks
-- 
Adam Carheden

On 03/21/2017 11:48 AM, Wes Dillingham wrote:
> If you had set min_size to 1 you would not have seen the writes pause. A
> min_size of 1 is dangerous, though, because it means you are 1 hard disk
> failure away from losing the objects within that placement group
> entirely. A min_size of 2 is generally considered the minimum you want,
> but many people ignore that advice; some wish they hadn't.
> 
> On Tue, Mar 21, 2017 at 11:46 AM, Adam Carheden <carheden@xxxxxxxx> wrote:
> 
>     Thanks everyone for the replies. Very informative. However, should I
>     have expected writes to pause if I'd had min_size set to 1 instead of 2?
> 
>     And yes, I was under the false impression that my RBD device was a
>     single object. That explains what all those other things are on a test
>     cluster where I only created a single object!
> 
> 
>     --
>     Adam Carheden
> 
>     On 03/20/2017 08:24 PM, Wes Dillingham wrote:
>     > This is because of the min_size specification. I would bet you have it
>     > set at 2 (which is good).
>     >
>     > ceph osd pool get rbd min_size
>     >
>     > With 4 hosts and a size of 3, removing 2 of the hosts (or 2 drives, 1
>     > from each host) results in some of the objects having only 1 replica.
>     > min_size dictates that I/O freezes for those objects until min_size is
>     > achieved. http://docs.ceph.com/docs/jewel/rados/operations/pools/#set-the-number-of-object-replicas
>     >
>     > I can't tell if you're under the impression that your RBD device is a
>     > single object. It is not. It is chunked up into many objects and spread
>     > throughout the cluster, as Kjetil mentioned earlier.
>     >
>     > On Mon, Mar 20, 2017 at 8:48 PM, Kjetil Jørgensen <kjetil@xxxxxxxxxxxx> wrote:
>     >
>     >     Hi,
>     >
>     >     rbd_id.vm-100-disk-1 is only a "meta object". IIRC, its contents
>     >     will get you a "prefix", which then gets you on to
>     >     rbd_header.<prefix>; rbd_header.<prefix> contains block size,
>     >     striping, etc. The actual data-bearing objects will be named
>     >     something like rbd_data.<prefix>.%016x.
>     >
>     >     Example: vm-100-disk-1 has the prefix 86ce2ae8944a, so the first
>     >     <block size> of that image will be named
>     >     rbd_data.86ce2ae8944a.0000000000000000, the second <block size>
>     >     will be rbd_data.86ce2ae8944a.0000000000000001, and so on. Chances
>     >     are that one of these objects is mapped to a pg which has both
>     >     host3 and host4 among its replicas.
>     >
>     >     An rbd image will end up scattered across most/all osds of the pool
>     >     it's in.
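
The naming scheme described above can be sketched roughly as follows. This
is a minimal illustration, not librbd code: it assumes the default 4 MiB
object size, no custom striping, and the 16-hex-digit suffix used by
format 2 images. The function name and parameters are made up for the
example.

```python
def rbd_data_object_name(prefix, offset, object_size=4 * 1024 * 1024):
    """Name of the data object holding byte `offset` of an image.

    Assumes simple layout: object N covers bytes
    [N * object_size, (N + 1) * object_size).
    """
    object_no = offset // object_size
    return f"rbd_data.{prefix}.{object_no:016x}"


# First and second 4 MiB chunk of the image from the example above.
print(rbd_data_object_name("86ce2ae8944a", 0))
print(rbd_data_object_name("86ce2ae8944a", 4 * 1024 * 1024))
```

Each of those object names is hashed to a pg independently, which is why
a single image ends up touching most OSDs in the pool.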
>     >
>     >     Cheers,
>     >     -KJ
>     >
>     >     On Fri, Mar 17, 2017 at 12:30 PM, Adam Carheden <carheden@xxxxxxxx> wrote:
>     >
>     >         I have a 4 node cluster shown by `ceph osd tree` below. Monitors are
>     >         running on hosts 1, 2 and 3. It has a single replicated pool of size
>     >         3. I have a VM with its hard drive replicated to OSDs 11(host3),
>     >         5(host1) and 3(host2).
>     >
>     >         I can 'fail' any one host by disabling the SAN network interface and
>     >         the VM keeps running with a simple slowdown in I/O performance just as
>     >         expected. However, if I 'fail' both nodes 3 and 4, I/O hangs on the VM
>     >         (i.e. `df` never completes, etc.). The monitors on hosts 1 and 2 still
>     >         have quorum, so that shouldn't be an issue. The placement group still
>     >         has 2 of its 3 replicas online.
>     >
>     >         Why does I/O hang even though host4 isn't running a monitor and
>     >         doesn't have anything to do with my VM's hard drive?
>     >
>     >
>     >         Size?
>     >         # ceph osd pool get rbd size
>     >         size: 3
>     >
>     >         Where's rbd_id.vm-100-disk-1?
>     >         # ceph osd getmap -o /tmp/map && osdmaptool --pool 0 --test-map-object rbd_id.vm-100-disk-1 /tmp/map
>     >         got osdmap epoch 1043
>     >         osdmaptool: osdmap file '/tmp/map'
>     >          object 'rbd_id.vm-100-disk-1' -> 0.1ea -> [11,5,3]
>     >
>     >         # ceph osd tree
>     >         ID WEIGHT  TYPE NAME          UP/DOWN REWEIGHT PRIMARY-AFFINITY
>     >         -1 8.06160 root default
>     >         -7 5.50308     room A
>     >         -3 1.88754         host host1
>     >          4 0.40369             osd.4       up  1.00000          1.00000
>     >          5 0.40369             osd.5       up  1.00000          1.00000
>     >          6 0.54008             osd.6       up  1.00000          1.00000
>     >          7 0.54008             osd.7       up  1.00000          1.00000
>     >         -2 3.61554         host host2
>     >          0 0.90388             osd.0       up  1.00000          1.00000
>     >          1 0.90388             osd.1       up  1.00000          1.00000
>     >          2 0.90388             osd.2       up  1.00000          1.00000
>     >          3 0.90388             osd.3       up  1.00000          1.00000
>     >         -6 2.55852     room B
>     >         -4 1.75114         host host3
>     >          8 0.40369             osd.8       up  1.00000          1.00000
>     >          9 0.40369             osd.9       up  1.00000          1.00000
>     >         10 0.40369             osd.10      up  1.00000          1.00000
>     >         11 0.54008             osd.11      up  1.00000          1.00000
>     >         -5 0.80737         host host4
>     >         12 0.40369             osd.12      up  1.00000          1.00000
>     >         13 0.40369             osd.13      up  1.00000          1.00000
>     >
>     >
>     >         --
>     >         Adam Carheden
>     >         _______________________________________________
>     >         ceph-users mailing list
>     >         ceph-users@xxxxxxxxxxxxxx
>     >         http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>     >
>     >
>     >
>     >
>     >     --
>     >     Kjetil Joergensen <kjetil@xxxxxxxxxxxx>
>     >     SRE, Medallia Inc
>     >     Phone: +1 (650) 739-6580
>     >
>     >
>     >
>     >
>     >
>     > --
>     > Respectfully,
>     >
>     > Wes Dillingham
>     > wes_dillingham@xxxxxxxxxxx
>     > Research Computing | Infrastructure Engineer
>     > Harvard University | 38 Oxford Street, Cambridge, Ma 02138 | Room 210
>     >
> 
> 
> 
> 
> -- 
> Respectfully,
> 
> Wes Dillingham
> wes_dillingham@xxxxxxxxxxx
> Research Computing | Infrastructure Engineer
> Harvard University | 38 Oxford Street, Cambridge, Ma 02138 | Room 210
> 