On Wed, Aug 28, 2013 at 1:22 PM, daniel pol <daniel_pol@xxxxxxxxxxx> wrote:
> Sorry, my bad. Only my second post and forgot the "reply all"
>
> Thanks for the info. I'm looking at the impact of pg number on performance.
> Just trying to learn more about how Ceph works.
> I didn't set pgp_num. It came by default with 2 in my case.

Did you start the pool with 2 PGs? If not, that's...odd. You can update it
with "ceph osd pool set"
(see http://ceph.com/docs/master/rados/operations/control/).
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

>
> Have a nice day,
> Dani
>
>> Date: Wed, 28 Aug 2013 13:04:19 -0700
>> Subject: Re: Reading from replica
>> From: greg@xxxxxxxxxxx
>> To: daniel_pol@xxxxxxxxxxx; ceph-users@xxxxxxxxxxxxxx
>>
>> [ Please keep list discussions on the list. :) ]
>>
>> On Wed, Aug 28, 2013 at 12:54 PM, daniel pol <daniel_pol@xxxxxxxxxxx>
>> wrote:
>> > Hi!
>> >
>> > Any pointers to where I can find the contortions?
>>
>> You don't really want to — read-from-replica isn't safe except in very
>> specific circumstances.
>>
>> > I agree with you that we should be seeing reads from both OSDs. I'm new
>> > to Ceph, so I might have done something wrong. I created a pool with
>> > size=2 and pg_num 16. I used rados bench for testing with default
>> > values. Here's the info about that pool:
>> >
>> > osd dump:
>> > pool 8 'test5' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins
>> > pg_num 16 pgp_num 2 last_change 28 owner 0
>>
>> That "pgp_num 2" is your problem — you are placing all the data as if
>> there were only two shards, and since placement is pseudorandom both
>> shards happened to end up with the same OSD as primary. You should set
>> pgp_num to the same value as your pg_num in most cases, and you should
>> probably have a lot more than 16. :) The rule of thumb is that if you
>> have only one pool in use, it should have roughly
>> 100 * [num OSDs] / [pool replication size] PGs.
>> -Greg
>> Software Engineer #42 @ http://inktank.com | http://ceph.com
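For concreteness, the fix and a re-check might look something like the sketch
below. The pool name 'test5' and the two-OSD cluster come from the output
quoted in this thread; 128 is just the rule of thumb above rounded up to a
power of two, and exact options vary by Ceph release, so treat the values as
examples rather than a recipe.

    # ~100 * [num OSDs] / [replication size] = 100 * 2 / 2 = 100 PGs,
    # rounded up here to 128. pg_num can only be increased, and pgp_num
    # may not exceed pg_num, so raise pg_num first, then bring pgp_num
    # up to match it.
    ceph osd pool set test5 pg_num 128
    ceph osd pool set test5 pgp_num 128

    # Re-run the benchmark: write some objects (--no-cleanup keeps them
    # around for the read phase), then do the sequential-read pass. With
    # pgp_num matching pg_num, each OSD should be primary for roughly half
    # of the objects, so both OSDs should now see read IO.
    rados bench -p test5 60 write --no-cleanup
    rados bench -p test5 60 seq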
>>
>> > pg dump:
>> > 8.4 0 0 0 0 0 0 0 active+clean 2013-08-28 13:16:48.231407 0'0 29:15 [1,0] [1,0] 0'0 2013-08-28 13:16:22.964532 0'0 2013-08-28 13:16:22.964532
>> > 8.3 0 0 0 0 0 0 0 active+clean 2013-08-28 13:16:48.231596 0'0 29:15 [1,0] [1,0] 0'0 2013-08-28 13:16:22.964239 0'0 2013-08-28 13:16:22.964239
>> > 8.2 0 0 0 0 0 0 0 active+clean 2013-08-28 13:16:48.231922 0'0 29:15 [1,0] [1,0] 0'0 2013-08-28 13:16:22.963956 0'0 2013-08-28 13:16:22.963956
>> > 8.1 0 0 0 0 0 0 0 active+clean 2013-08-28 13:16:48.232564 0'0 29:15 [1,0] [1,0] 0'0 2013-08-28 13:16:22.963659 0'0 2013-08-28 13:16:22.963659
>> > 8.0 0 0 0 0 0 0 0 active+clean 2013-08-28 13:16:48.232604 0'0 29:15 [1,0] [1,0] 0'0 2013-08-28 13:16:22.963153 0'0 2013-08-28 13:16:22.963153
>> > 8.f 0 0 0 0 0 0 0 active+clean 2013-08-28 13:16:48.233342 0'0 29:15 [1,0] [1,0] 0'0 2013-08-28 13:16:22.967841 0'0 2013-08-28 13:16:22.967841
>> > 8.e 0 0 0 0 0 0 0 active+clean 2013-08-28 13:16:48.231966 0'0 29:15 [1,0] [1,0] 0'0 2013-08-28 13:16:22.967529 0'0 2013-08-28 13:16:22.967529
>> > 8.d 0 0 0 0 0 0 0 active+clean 2013-08-28 13:16:48.232289 0'0 29:15 [1,0] [1,0] 0'0 2013-08-28 13:16:22.967249 0'0 2013-08-28 13:16:22.967249
>> > 8.c 0 0 0 0 0 0 0 active+clean 2013-08-28 13:16:48.232694 0'0 29:15 [1,0] [1,0] 0'0 2013-08-28 13:16:22.966945 0'0 2013-08-28 13:16:22.966945
>> > 8.b 0 0 0 0 0 0 0 active+clean 2013-08-28 13:16:48.233098 0'0 29:15 [1,0] [1,0] 0'0 2013-08-28 13:16:22.966641 0'0 2013-08-28 13:16:22.966641
>> > 8.a 0 0 0 0 0 0 0 active+clean 2013-08-28 13:16:48.235592 0'0 29:15 [1,0] [1,0] 0'0 2013-08-28 13:16:22.966362 0'0 2013-08-28 13:16:22.966362
>> > 8.9 0 0 0 0 0 0 0 active+clean 2013-08-28 13:16:48.235616 0'0 29:15 [1,0] [1,0] 0'0 2013-08-28 13:16:22.966052 0'0 2013-08-28 13:16:22.966052
>> > 8.8 0 0 0 0 0 0 0 active+clean 2013-08-28 13:16:48.235950 0'0 29:15 [1,0] [1,0] 0'0 2013-08-28 13:16:22.965760 0'0 2013-08-28 13:16:22.965760
>> > 8.7 0 0 0 0 0 0 0 active+clean 2013-08-28 13:16:48.231703 0'0 29:15 [1,0] [1,0] 0'0 2013-08-28 13:16:22.965458 0'0 2013-08-28 13:16:22.965458
>> > 8.6 0 0 0 0 0 0 0 active+clean 2013-08-28 13:16:48.230886 0'0 29:15 [1,0] [1,0] 0'0 2013-08-28 13:16:22.965128 0'0 2013-08-28 13:16:22.965128
>> > 8.5 0 0 0 0 0 0 0 active+clean 2013-08-28 13:16:48.231136 0'0 29:15 [1,0] [1,0] 0'0 2013-08-28 13:16:22.964817 0'0 2013-08-28 13:16:22.964817
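A quick way to see what the dump above is saying is to count how many PGs
list each OSD first in their acting set (the acting primary). This is only a
shell sketch against the listing quoted here: it assumes pool id 8 and the
bracketed sets shown above, and the 'ceph pg dump' layout differs between
Ceph versions.

    # Take the first OSD of the last bracketed set on each pool-8 line
    # (the acting set), then count how often each OSD appears as primary.
    ceph pg dump 2>/dev/null | grep '^8\.' | sed 's/.*\[\([0-9]*\).*/\1/' | sort | uniq -c

On the listing above that prints "16 1": osd.1 is the acting primary for all
16 PGs, which matches only one OSD seeing read IO. Once pgp_num matches
pg_num, the count should split roughly evenly between osd.0 and osd.1.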
>> >
>> > crush map:
>> > # begin crush map
>> >
>> > # devices
>> > device 0 osd.0
>> > device 1 osd.1
>> >
>> > # types
>> > type 0 osd
>> > type 1 host
>> > type 2 rack
>> > type 3 row
>> > type 4 room
>> > type 5 datacenter
>> > type 6 root
>> >
>> > # buckets
>> > host DFS1 {
>> >         id -2           # do not change unnecessarily
>> >         # weight 0.800
>> >         alg straw
>> >         hash 0  # rjenkins1
>> >         item osd.0 weight 0.400
>> >         item osd.1 weight 0.400
>> > }
>> > root default {
>> >         id -1           # do not change unnecessarily
>> >         # weight 0.800
>> >         alg straw
>> >         hash 0  # rjenkins1
>> >         item DFS1 weight 0.800
>> > }
>> >
>> > # rules
>> > rule data {
>> >         ruleset 0
>> >         type replicated
>> >         min_size 1
>> >         max_size 10
>> >         step take default
>> >         step choose firstn 0 type osd
>> >         step emit
>> > }
>> > rule metadata {
>> >         ruleset 1
>> >         type replicated
>> >         min_size 1
>> >         max_size 10
>> >         step take default
>> >         step choose firstn 0 type osd
>> >         step emit
>> > }
>> > rule rbd {
>> >         ruleset 2
>> >         type replicated
>> >         min_size 1
>> >         max_size 10
>> >         step take default
>> >         step choose firstn 0 type osd
>> >         step emit
>> > }
>> >
>> > # end crush map
>> >
>> >
>> > Have a nice day,
>> > Dani
>> >
>> > ________________________________
>> > Date: Wed, 28 Aug 2013 12:09:41 -0700
>> > Subject: Re: Reading from replica
>> > From: greg@xxxxxxxxxxx
>> > To: daniel_pol@xxxxxxxxxxx
>> > CC: ceph-users@xxxxxxxxxxxxxx
>> >
>> > Read-from-replica does not happen unless you go through some contortions
>> > with config and developer setups. However, all n OSDs should be the
>> > primary for about 1/n of the data, so you should be seeing reads to both
>> > OSDs as long as you touch several objects at a time.
>> > -Greg
>> >
>> > On Wednesday, August 28, 2013, daniel pol wrote:
>> >
>> > Hi!
>> >
>> > I need a little help understanding reads from replicas. I've read a few
>> > conflicting messages, and the documentation is not very clear to me on
>> > this subject (maybe I didn't find the proper doc).
>> > Here's the question: with the default replication of 2 (size=2), when
>> > doing reads (big sequential reads in my case), are we expecting to see
>> > reads going to the "primary" object AND its replicas? (Similar to RAID1,
>> > where you read from both sides of the mirror.)
>> >
>> > I'm not seeing that on my setup right now. I have a pool with 16 PGs on
>> > 2 OSDs. When I do reads, only one OSD gets IO.
>> > If that's normal (the replica is involved in IO only when the primary is
>> > down) I'll take note; otherwise I'll have to find out why I don't get
>> > reads from the replicas.
>> >
>> > Have a nice day,
>> > Dani
>> >
>> >
>> >
>> > --
>> > Software Engineer #42 @ http://inktank.com | http://ceph.com

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com