Re: Reading from replica

[ Please keep list discussions on the list. :) ]

On Wed, Aug 28, 2013 at 12:54 PM, daniel pol <daniel_pol@xxxxxxxxxxx> wrote:
> Hi !
>
> Any pointers to where I can find the contortions ?

You don't really want to — read-from-replica isn't safe except in very
specific circumstances.
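
By default every read (and write) of an object is served by the primary OSD of
the object's PG, which is why only one of your two OSDs ever sees IO for any
given object. If you want to see which OSD that is, "ceph osd map" prints the
PG and acting set for an object name; the object name below is just a
placeholder and the output line is only illustrative:

    ceph osd map test5 some_object
    # osdmap e29 pool 'test5' (8) object 'some_object' -> pg 8.xxxxxxxx (8.4)
    #   -> up [1,0] acting [1,0]
    # The first OSD in the acting set (osd.1 here) is the primary and handles
    # all client reads for this object.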

> I agree with you that I should be seeing reads from both OSDs. I'm new to
> Ceph, so I might have done something wrong. I created a pool with size=2 and
> pg_num 16, and used rados bench for testing with default values. Here's the
> info about that pool:
> osd dump:
> pool 8 'test5' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins
> pg_num 16 pgp_num 2 last_change 28 owner 0

That "pgp_num 2" is your problem — you are placing all the data as if
there are only two shards, and since it's pseudorandom both shards
happened to end up with one OSD as primary. You should set that to the
same as your pg_num in most cases, and should probably have a lot more
than 16. :) The rule of thumb is that if you have only one pool in use
it should have roughly 100*[num OSDs]/[pool replication size] PGs.
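
As a rough sketch of the fix (pool name from this thread; 128 is just the
100*[num OSDs]/[replication size] = 100*2/2 = 100 rule of thumb rounded up to
the next power of two, so adjust to taste):

    # Raise the PG count, then raise pgp_num so placement actually uses the
    # new PGs; leaving pgp_num at 2 keeps the data in two placement shards.
    ceph osd pool set test5 pg_num 128
    ceph osd pool set test5 pgp_num 128

    # Verify that pg_num and pgp_num now match:
    ceph osd dump | grep test5

Splitting PGs on an existing pool moves data around, so expect a short burst
of backfill before everything returns to active+clean.
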
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

>
> pg dump:
> 8.4    0 0 0 0 0 0 0 active+clean  2013-08-28 13:16:48.231407   0'0
> 29:15   [1,0]   [1,0]   0'0     2013-08-28 13:16:22.964532      0'0
> 2013-08-28 13:16:22.964532
> 8.3    0 0 0 0 0 0 0 active+clean  2013-08-28 13:16:48.231596   0'0
> 29:15   [1,0]   [1,0]   0'0     2013-08-28 13:16:22.964239      0'0
> 2013-08-28 13:16:22.964239
> 8.2    0 0 0 0 0 0 0 active+clean  2013-08-28 13:16:48.231922   0'0
> 29:15   [1,0]   [1,0]   0'0     2013-08-28 13:16:22.963956      0'0
> 2013-08-28 13:16:22.963956
> 8.1    0 0 0 0 0 0 0 active+clean  2013-08-28 13:16:48.232564   0'0
> 29:15   [1,0]   [1,0]   0'0     2013-08-28 13:16:22.963659      0'0
> 2013-08-28 13:16:22.963659
> 8.0    0 0 0 0 0 0 0 active+clean  2013-08-28 13:16:48.232604   0'0
> 29:15   [1,0]   [1,0]   0'0     2013-08-28 13:16:22.963153      0'0
> 2013-08-28 13:16:22.963153
> 8.f    0 0 0 0 0 0 0 active+clean  2013-08-28 13:16:48.233342   0'0
> 29:15   [1,0]   [1,0]   0'0     2013-08-28 13:16:22.967841      0'0
> 2013-08-28 13:16:22.967841
> 8.e    0 0 0 0 0 0 0 active+clean  2013-08-28 13:16:48.231966   0'0
> 29:15   [1,0]   [1,0]   0'0     2013-08-28 13:16:22.967529      0'0
> 2013-08-28 13:16:22.967529
> 8.d    0 0 0 0 0 0 0 active+clean  2013-08-28 13:16:48.232289   0'0
> 29:15   [1,0]   [1,0]   0'0     2013-08-28 13:16:22.967249      0'0
> 2013-08-28 13:16:22.967249
> 8.c    0 0 0 0 0 0 0 active+clean  2013-08-28 13:16:48.232694   0'0
> 29:15   [1,0]   [1,0]   0'0     2013-08-28 13:16:22.966945      0'0
> 2013-08-28 13:16:22.966945
> 8.b    0 0 0 0 0 0 0 active+clean  2013-08-28 13:16:48.233098   0'0
> 29:15   [1,0]   [1,0]   0'0     2013-08-28 13:16:22.966641      0'0
> 2013-08-28 13:16:22.966641
> 8.a    0 0 0 0 0 0 0 active+clean  2013-08-28 13:16:48.235592   0'0
> 29:15   [1,0]   [1,0]   0'0     2013-08-28 13:16:22.966362      0'0
> 2013-08-28 13:16:22.966362
> 8.9    0 0 0 0 0 0 0 active+clean  2013-08-28 13:16:48.235616   0'0
> 29:15   [1,0]   [1,0]   0'0     2013-08-28 13:16:22.966052      0'0
> 2013-08-28 13:16:22.966052
> 8.8    0 0 0 0 0 0 0 active+clean  2013-08-28 13:16:48.235950   0'0
> 29:15   [1,0]   [1,0]   0'0     2013-08-28 13:16:22.965760      0'0
> 2013-08-28 13:16:22.965760
> 8.7    0 0 0 0 0 0 0 active+clean  2013-08-28 13:16:48.231703   0'0
> 29:15   [1,0]   [1,0]   0'0     2013-08-28 13:16:22.965458      0'0
> 2013-08-28 13:16:22.965458
> 8.6    0 0 0 0 0 0 0 active+clean  2013-08-28 13:16:48.230886   0'0
> 29:15   [1,0]   [1,0]   0'0     2013-08-28 13:16:22.965128      0'0
> 2013-08-28 13:16:22.965128
> 8.5    0 0 0 0 0 0 0 active+clean  2013-08-28 13:16:48.231136   0'0
> 29:15   [1,0]   [1,0]   0'0     2013-08-28 13:16:22.964817      0'0
> 2013-08-28 13:16:22.964817
>
> crush map:
> # begin crush map
> # devices
> device 0 osd.0
> device 1 osd.1
> # types
> type 0 osd
> type 1 host
> type 2 rack
> type 3 row
> type 4 room
> type 5 datacenter
> type 6 root
> # buckets
> host DFS1 {
>        id -2  # do not change unnecessarily
>        # weight 0.800
>        alg straw
>        hash 0 # rjenkins1
>        item osd.0 weight 0.400
>        item osd.1 weight 0.400
> }
> root default {
>        id -1  # do not change unnecessarily
>        # weight 0.800
>        alg straw
>        hash 0 # rjenkins1
>        item DFS1 weight 0.800
> }
> # rules
> rule data {
>        ruleset 0
>        type replicated
>        min_size 1
>        max_size 10
>        step take default
>        step choose firstn 0 type osd
>        step emit
> }
> rule metadata {
>        ruleset 1
>        type replicated
>        min_size 1
>        max_size 10
>        step take default
>        step choose firstn 0 type osd
>        step emit
> }
> rule rbd {
>        ruleset 2
>        type replicated
>        min_size 1
>        max_size 10
>        step take default
>        step choose firstn 0 type osd
>        step emit
> }
> # end crush map
>
>
> Have a nice day,
> Dani
>
> ________________________________
> Date: Wed, 28 Aug 2013 12:09:41 -0700
> Subject: Re:  Reading from replica
> From: greg@xxxxxxxxxxx
> To: daniel_pol@xxxxxxxxxxx
> CC: ceph-users@xxxxxxxxxxxxxx
>
>
> Read-from-replica does not happen unless you go through some contortions
> with config and developer setups. However, each of your n OSDs should be the
> primary for about 1/n of the data, so you should be seeing reads to both
> OSDs as long as you touch several objects at a time.
> -Greg
>
> On Wednesday, August 28, 2013, daniel pol wrote:
>
> Hi !
>
> I need a little help understanding reads from replicas. I've read a few
> conflicting messages and the documentation is not very clear to me on this
> subject (maybe I didn't find the proper doc).
> Here's the question: with the default replication of 2 (size=2), when doing
> reads (big sequential reads in my case), are we expecting to see reads going
> to the "primary" object AND its replicas? (Similar to RAID1, where you read
> from both sides of the mirror.)
>
> I'm not seeing that on my setup right now. I have a pool with 16 PGs on 2
> OSDs. When I do reads, only one OSD gets IO.
> If that's normal (replicas involved in IO only when the primary is down),
> I'll take note; otherwise I'll have to find out why I don't get reads from
> replicas.
>
> Have a nice day,
> Dani
>
>
>
> --
> Software Engineer #42 @ http://inktank.com | http://ceph.com
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




