If you had set min_size to 1 you would not have seen the writes pause. A min_size of 1 is dangerous, though, because it means you are one hard disk failure away from losing the objects within that placement group entirely. A min_size of 2 is generally considered the minimum you want, but many people ignore that advice, and some wish they hadn't.
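If you ever want to experiment with it, something like this would drop it and put it back (the output here is just illustrative, not from your cluster):
# ceph osd pool get rbd min_size
min_size: 2
# ceph osd pool set rbd min_size 1
# ceph osd pool set rbd min_size 2
Just don't leave it at 1 on anything you care about.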
On Tue, Mar 21, 2017 at 11:46 AM, Adam Carheden <carheden@xxxxxxxx> wrote:
Thanks everyone for the replies. Very informative. However, should I
have expected writes to pause if I'd had min_size set to 1 instead of 2?
And yes, I was under the false impression that my RBD device was a
single object. That explains what all those other things are on a test
cluster where I only created a single object!
--
Adam Carheden
On 03/20/2017 08:24 PM, Wes Dillingham wrote:
> This is because of the min_size specification. I would bet you have it
> set at 2 (which is good).
>
> ceph osd pool get rbd min_size
>
> With 4 hosts and a size of 3, removing 2 of the hosts (or 2 drives, 1
> from each host) results in some of the objects only having 1 replica.
> min_size dictates that IO freezes for those objects until min_size is
> achieved. http://docs.ceph.com/docs/jewel/rados/operations/pools/#set-the-number-of-object-replicas
>
> I can't tell if you're under the impression that your RBD device is a
> single object. It is not. It is chunked up into many objects and spread
> throughout the cluster, as Kjetil mentioned earlier.
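>
> If you're curious, you can see the chunks yourself by listing the
> pool's objects and grepping for the image's prefix (86ce2ae8944a in
> Kjetil's example below); something along these lines should work:
>
> # rados -p rbd ls | grep rbd_data.86ce2ae8944a | sort | head
>
> Each object that comes back is one <block size> chunk of the image,
> and each one is mapped to a pg (and from there to a set of OSDs)
> independently by CRUSH.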
>
> On Mon, Mar 20, 2017 at 8:48 PM, Kjetil Jørgensen <kjetil@xxxxxxxxxxxx
> <mailto:kjetil@xxxxxxxxxxxx>> wrote:
>
> Hi,
>
> rbd_id.vm-100-disk-1 is only a "meta object"; IIRC, its contents
> will get you a "prefix", which then gets you on to
> rbd_header.<prefix>. rbd_header.<prefix> contains block size,
> striping, etc. The actual data-bearing objects will be named
> something like rbd_data.<prefix>.%016x.
>
> Example - vm-100-disk-1 has the prefix 86ce2ae8944a, so the first
> <block size> of that image will be named
> rbd_data.86ce2ae8944a.0000000000000000, the second <block size> will be
> rbd_data.86ce2ae8944a.0000000000000001, and so on. Chances are that one
> of these objects is mapped to a pg which has both host3 and host4 among
> its replicas.
>
> An rbd image will end up scattered across most/all osds of the pool
> it's in.
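>
> If you want to check where a particular chunk lands, something like
> this should do it (the prefix comes out of "rbd info", and the exact
> output format varies a bit between releases):
>
> # rbd info vm-100-disk-1 | grep block_name_prefix
> # ceph osd map rbd rbd_data.86ce2ae8944a.0000000000000000
>
> The second command prints the pg and the up/acting OSD set for that
> object, so you can spot chunks that have both host3 and host4 among
> their replicas.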
>
> Cheers,
> -KJ
>
> On Fri, Mar 17, 2017 at 12:30 PM, Adam Carheden <carheden@xxxxxxxx
> <mailto:carheden@xxxxxxxx>> wrote:
>
> I have a 4 node cluster shown by `ceph osd tree` below. Monitors are
> running on hosts 1, 2 and 3. It has a single replicated pool of size
> 3. I have a VM with its hard drive replicated to OSDs 11(host3),
> 5(host1) and 3(host2).
>
> I can 'fail' any one host by disabling the SAN network interface and
> the VM keeps running with a simple slowdown in I/O performance
> just as
> expected. However, if I 'fail' both nodes 3 and 4, I/O hangs on
> the VM.
> (i.e. `df` never completes, etc.) The monitors on hosts 1 and 2
> still
> have quorum, so that shouldn't be an issue. The placement group
> still
> has 2 of its 3 replicas online.
>
> Why does I/O hang even though host4 isn't running a monitor and
> doesn't have anything to do with my VM's hard drive?
>
>
> Size?
> # ceph osd pool get rbd size
> size: 3
>
> Where's rbd_id.vm-100-disk-1?
> # ceph osd getmap -o /tmp/map && osdmaptool --pool 0
> --test-map-object
> rbd_id.vm-100-disk-1 /tmp/map
> got osdmap epoch 1043
> osdmaptool: osdmap file '/tmp/map'
> object 'rbd_id.vm-100-disk-1' -> 0.1ea -> [11,5,3]
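>
> (For what it's worth, I believe "ceph osd map rbd rbd_id.vm-100-disk-1"
> against the live cluster reports the same mapping without having to
> dump the osdmap first.)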
>
> # ceph osd tree
> ID WEIGHT  TYPE NAME           UP/DOWN REWEIGHT PRIMARY-AFFINITY
> -1 8.06160 root default
> -7 5.50308     room A
> -3 1.88754         host host1
>  4 0.40369             osd.4        up  1.00000          1.00000
>  5 0.40369             osd.5        up  1.00000          1.00000
>  6 0.54008             osd.6        up  1.00000          1.00000
>  7 0.54008             osd.7        up  1.00000          1.00000
> -2 3.61554         host host2
>  0 0.90388             osd.0        up  1.00000          1.00000
>  1 0.90388             osd.1        up  1.00000          1.00000
>  2 0.90388             osd.2        up  1.00000          1.00000
>  3 0.90388             osd.3        up  1.00000          1.00000
> -6 2.55852     room B
> -4 1.75114         host host3
>  8 0.40369             osd.8        up  1.00000          1.00000
>  9 0.40369             osd.9        up  1.00000          1.00000
> 10 0.40369             osd.10       up  1.00000          1.00000
> 11 0.54008             osd.11       up  1.00000          1.00000
> -5 0.80737         host host4
> 12 0.40369             osd.12       up  1.00000          1.00000
> 13 0.40369             osd.13       up  1.00000          1.00000
>
>
> --
> Adam Carheden
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
>
> --
> Kjetil Joergensen <kjetil@xxxxxxxxxxxx <mailto:kjetil@xxxxxxxxxxxx>>
> SRE, Medallia Inc
> Phone: +1 (650) 739-6580 <tel:(650)%20739-6580>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
>
> --
> Respectfully,
>
> Wes Dillingham
> wes_dillingham@xxxxxxxxxxx <mailto:wes_dillingham@harvard.edu>
> Research Computing | Infrastructure Engineer
> Harvard University | 38 Oxford Street, Cambridge, Ma 02138 | Room 210
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Respectfully,
Wes Dillingham
Research Computing | Infrastructure Engineer
Harvard University | 38 Oxford Street, Cambridge, Ma 02138 | Room 210