Sorry, my bad. Only my second post and I forgot the "reply all".
Thanks for the info. I'm looking at the impact of pg number on performance, just trying to learn more about how Ceph works. I didn't set pgp_num; it defaulted to 2 in my case.

Have a nice day,
Dani

> Date: Wed, 28 Aug 2013 13:04:19 -0700
> Subject: Re: Reading from replica
> From: greg@xxxxxxxxxxx
> To: daniel_pol@xxxxxxxxxxx; ceph-users@xxxxxxxxxxxxxx
>
> [ Please keep list discussions on the list. :) ]
>
> On Wed, Aug 28, 2013 at 12:54 PM, daniel pol <daniel_pol@xxxxxxxxxxx> wrote:
> > Hi !
> >
> > Any pointers to where I can find the contortions ?
>
> You don't really want to; read-from-replica isn't safe except in very
> specific circumstances.
>
> > I agree with you that we should be seeing reads from both OSDs. I'm new
> > to Ceph, so I might have done something wrong. I created a pool with
> > size=2 and pg_num 16, and used rados bench for testing with default
> > values. Here's the info about that pool:
> >
> > osd dump:
> > pool 8 'test5' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins
> > pg_num 16 pgp_num 2 last_change 28 owner 0
>
> That "pgp_num 2" is your problem: you are placing all the data as if
> there were only two shards, and since placement is pseudorandom both
> shards happened to end up with the same OSD as primary. You should set
> pgp_num to the same value as your pg_num in most cases, and you should
> probably have a lot more than 16. :) The rule of thumb is that if you
> have only one pool in use it should have roughly
> 100*[num OSDs]/[pool replication size] PGs.
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
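If I'm reading that right, the fix on my setup is just two settings. A quick sketch, assuming my pool is still named 'test5' (with 2 OSDs and size=2 the rule of thumb gives 100*2/2 = 100 PGs, which I'd round up to 128):

    ceph osd pool set test5 pg_num 128    # raise pg_num first; pgp_num may not exceed it
    ceph osd pool set test5 pgp_num 128   # then raise pgp_num to match, so the data actually respreads
    ceph osd dump | grep test5            # verify pg_num and pgp_num now agree

I'll rerun rados bench after that and see if reads finally hit both OSDs.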
> > pg dump:
> > 8.4 0 0 0 0 0 0 0 active+clean 2013-08-28 13:16:48.231407 0'0 29:15 [1,0] [1,0] 0'0 2013-08-28 13:16:22.964532 0'0 2013-08-28 13:16:22.964532
> > 8.3 0 0 0 0 0 0 0 active+clean 2013-08-28 13:16:48.231596 0'0 29:15 [1,0] [1,0] 0'0 2013-08-28 13:16:22.964239 0'0 2013-08-28 13:16:22.964239
> > 8.2 0 0 0 0 0 0 0 active+clean 2013-08-28 13:16:48.231922 0'0 29:15 [1,0] [1,0] 0'0 2013-08-28 13:16:22.963956 0'0 2013-08-28 13:16:22.963956
> > 8.1 0 0 0 0 0 0 0 active+clean 2013-08-28 13:16:48.232564 0'0 29:15 [1,0] [1,0] 0'0 2013-08-28 13:16:22.963659 0'0 2013-08-28 13:16:22.963659
> > 8.0 0 0 0 0 0 0 0 active+clean 2013-08-28 13:16:48.232604 0'0 29:15 [1,0] [1,0] 0'0 2013-08-28 13:16:22.963153 0'0 2013-08-28 13:16:22.963153
> > 8.f 0 0 0 0 0 0 0 active+clean 2013-08-28 13:16:48.233342 0'0 29:15 [1,0] [1,0] 0'0 2013-08-28 13:16:22.967841 0'0 2013-08-28 13:16:22.967841
> > 8.e 0 0 0 0 0 0 0 active+clean 2013-08-28 13:16:48.231966 0'0 29:15 [1,0] [1,0] 0'0 2013-08-28 13:16:22.967529 0'0 2013-08-28 13:16:22.967529
> > 8.d 0 0 0 0 0 0 0 active+clean 2013-08-28 13:16:48.232289 0'0 29:15 [1,0] [1,0] 0'0 2013-08-28 13:16:22.967249 0'0 2013-08-28 13:16:22.967249
> > 8.c 0 0 0 0 0 0 0 active+clean 2013-08-28 13:16:48.232694 0'0 29:15 [1,0] [1,0] 0'0 2013-08-28 13:16:22.966945 0'0 2013-08-28 13:16:22.966945
> > 8.b 0 0 0 0 0 0 0 active+clean 2013-08-28 13:16:48.233098 0'0 29:15 [1,0] [1,0] 0'0 2013-08-28 13:16:22.966641 0'0 2013-08-28 13:16:22.966641
> > 8.a 0 0 0 0 0 0 0 active+clean 2013-08-28 13:16:48.235592 0'0 29:15 [1,0] [1,0] 0'0 2013-08-28 13:16:22.966362 0'0 2013-08-28 13:16:22.966362
> > 8.9 0 0 0 0 0 0 0 active+clean 2013-08-28 13:16:48.235616 0'0 29:15 [1,0] [1,0] 0'0 2013-08-28 13:16:22.966052 0'0 2013-08-28 13:16:22.966052
> > 8.8 0 0 0 0 0 0 0 active+clean 2013-08-28 13:16:48.235950 0'0 29:15 [1,0] [1,0] 0'0 2013-08-28 13:16:22.965760 0'0 2013-08-28 13:16:22.965760
> > 8.7 0 0 0 0 0 0 0 active+clean 2013-08-28 13:16:48.231703 0'0 29:15 [1,0] [1,0] 0'0 2013-08-28 13:16:22.965458 0'0 2013-08-28 13:16:22.965458
> > 8.6 0 0 0 0 0 0 0 active+clean 2013-08-28 13:16:48.230886 0'0 29:15 [1,0] [1,0] 0'0 2013-08-28 13:16:22.965128 0'0 2013-08-28 13:16:22.965128
> > 8.5 0 0 0 0 0 0 0 active+clean 2013-08-28 13:16:48.231136 0'0 29:15 [1,0] [1,0] 0'0 2013-08-28 13:16:22.964817 0'0 2013-08-28 13:16:22.964817
> >
> > crush map:
> > # begin crush map
> >
> > # devices
> > device 0 osd.0
> > device 1 osd.1
> >
> > # types
> > type 0 osd
> > type 1 host
> > type 2 rack
> > type 3 row
> > type 4 room
> > type 5 datacenter
> > type 6 root
> >
> > # buckets
> > host DFS1 {
> >         id -2           # do not change unnecessarily
> >         # weight 0.800
> >         alg straw
> >         hash 0  # rjenkins1
> >         item osd.0 weight 0.400
> >         item osd.1 weight 0.400
> > }
> > root default {
> >         id -1           # do not change unnecessarily
> >         # weight 0.800
> >         alg straw
> >         hash 0  # rjenkins1
> >         item DFS1 weight 0.800
> > }
> >
> > # rules
> > rule data {
> >         ruleset 0
> >         type replicated
> >         min_size 1
> >         max_size 10
> >         step take default
> >         step choose firstn 0 type osd
> >         step emit
> > }
> > rule metadata {
> >         ruleset 1
> >         type replicated
> >         min_size 1
> >         max_size 10
> >         step take default
> >         step choose firstn 0 type osd
> >         step emit
> > }
> > rule rbd {
> >         ruleset 2
> >         type replicated
> >         min_size 1
> >         max_size 10
> >         step take default
> >         step choose firstn 0 type osd
> >         step emit
> > }
> > # end crush map
> >
> > Have a nice day,
> > Dani
> >
> > ________________________________
> > Date: Wed, 28 Aug 2013 12:09:41 -0700
> > Subject: Re: Reading from replica
> > From: greg@xxxxxxxxxxx
> > To: daniel_pol@xxxxxxxxxxx
> > CC: ceph-users@xxxxxxxxxxxxxx
> >
> > Read-from-replica does not happen unless you go through some contortions
> > with config and developer setups. However, each of your n OSDs should be
> > the primary for about 1/n of the data, so you should be seeing reads go
> > to both OSDs as long as you touch several objects at a time.
> > -Greg
> >
> > On Wednesday, August 28, 2013, daniel pol wrote:
> >
> > Hi !
> >
> > I need a little help understanding reads from replicas. I've read a few
> > conflicting messages, and the documentation is not very clear to me on
> > this subject (maybe I just didn't find the right doc).
> > Here's the question: with the default replication of 2 (size=2), when
> > doing reads (big sequential reads in my case), should we expect to see
> > reads going to the "primary" object AND its replicas? (similar to RAID1,
> > where you read from both sides of the mirror)
> >
> > I'm not seeing that on my setup right now. I have a pool with 16 PGs on
> > 2 OSDs, and when I do reads only one OSD gets IO.
> > If that's normal (replicas involved in IO only when the primary is down)
> > I'll make a note of it; otherwise I'll have to find out why I don't get
> > reads from replicas.
> >
> > Have a nice day,
> > Dani
> >
> > --
> > Software Engineer #42 @ http://inktank.com | http://ceph.com |
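P.S. For anyone else digging into primary placement: a small sketch for checking which OSD is primary for a given object (the object name below is just a made-up example; rados bench generates its own names):

    ceph osd map test5 some_object_name   # prints the PG and its up/acting sets; the first OSD in the acting set is the primary

With pgp_num fixed, those primaries should spread across both OSDs instead of landing on osd.1 every time, as in the pg dump above.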