Re: Some question about data placement

George,

I don't think you needed to change your default CRUSH rule at all, since all the OSDs are on the same machine.
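
If you want to sanity-check what your rule is actually doing, you can always pull the CRUSH map out of the cluster and read it. A minimal sketch (the file names below are arbitrary placeholders):

    # Export the current CRUSH map in its compiled (binary) form
    ceph osd getcrushmap -o crushmap.bin
    # Decompile it to plain text and look at the "rules" section
    crushtool -d crushmap.bin -o crushmap.txt
    # If you do decide to edit it, recompile and inject it back
    crushtool -c crushmap.txt -o crushmap.new
    ceph osd setcrushmap -i crushmap.new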

It sounds to me like you are conflating replication with striping. Ceph clients write an object to a pool; the object maps to a placement group, and that placement group is stored on a set of OSDs determined by CRUSH. A placement group ID is a combination of the pool number (not its name) and a hash of the object name. See
http://ceph.com/docs/master/rados/operations/monitoring-osd-pg/#monitoring-placement-group-states for a brief description of placement group IDs about halfway through that section.
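
For example, you can ask the cluster where a particular object in a particular pool would be placed. The pool name 'data' and object name 'foo' below are just placeholders, and the output shown is illustrative only:

    # Show the placement group and OSDs an object maps to
    ceph osd map data foo
    # Illustrative output: the pg ID is <pool number>.<hash>
    # osdmap e42 pool 'data' (0) object 'foo' -> pg 0.7f9ca064 (0.64) -> up [3,5] acting [3,5]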

When you create a pool, you can set the number of replicas with the "size" setting. The documentation suggests increasing the default number of placement groups, but "size" already defaults to 2, so you get two copies of each object without changing anything. http://ceph.com/docs/master/rados/configuration/pool-pg-config-ref/
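
As a quick sketch (the pool name and placement group count here are just examples, not recommendations):

    # Create a pool with 128 placement groups (pg_num and pgp_num)
    ceph osd pool create mypool 128 128
    # Check the current number of replicas, and set it explicitly if you like
    ceph osd pool get mypool size
    ceph osd pool set mypool size 2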

Ceph clients write an object to the primary OSD, and the primary OSD computes where the replicas should be stored and writes them to the other OSDs. http://ceph.com/docs/master/architecture/#how-ceph-scales
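
You can see which OSDs hold a given placement group, and which one is the primary (the first OSD in the acting set). The PG ID below is just an example, as is the output:

    # Show the up/acting OSD set for placement group 0.1f
    ceph pg map 0.1f
    # Illustrative output: osdmap e42 pg 0.1f (0.1f) -> up [2,6] acting [2,6]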

However, the object store DOES NOT break up the object and store it across all OSDs. The Ceph clients do have the ability to stripe data across objects. http://ceph.com/docs/master/architecture/#how-ceph-clients-stripe-data
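
You can see that striping for yourself after writing a large file to CephFS: the client breaks the file into objects named roughly <inode number>.<object number>, and each of those objects is placed by CRUSH independently. Assuming your CephFS data lives in the default 'data' pool, the output below is illustrative:

    # List the RADOS objects backing CephFS files in the 'data' pool
    rados -p data ls | head
    # Illustrative output: one file striped into several objects
    # 10000000000.00000000
    # 10000000000.00000001
    # 10000000000.00000002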




On Mon, Apr 22, 2013 at 3:43 AM, George Shuklin <shuklin@xxxxxxxxxxx> wrote:
I'm still lost in the documentation.

Let's assume I have 8 OSDs on a single server (osd.[0-7]). I use CephFS and want redundancy 2 (meaning each piece of data is on two OSDs) and the file spread across all OSDs (to get some write performance).

My expectation: 8x speed on reads, 4x speed on writes (compared to a single drive). [Setting aside some overhead.]

I'm checking the performance of random writes and reads to a single file on a mounted CephFS (fio, iodepth=32, blocksize=4k). I'm getting nice read performance (1000 IOPS = 125x8, as expected), but only 30 IOPS on writes, which is less than half of a single drive's performance.

I want to understand what I'm doing wrong.

My settings (the same for all OSDs, except for the disk name):

[osd.1]
        host = testserver
        devs = /dev/sdb
        osd mkfs type = xfs

I tried changing the CRUSH map to "step choose firstn 2 type osd" (for the 'data' rule, compared to the default), but it had no effect.

I think there is some huge mistake I'm making... I need a way to say 'no more than two copies of the data' and 'block size = 4k when striping'.

Please help.

Thanks.




--
John Wilkins
Senior Technical Writer
Inktank
john.wilkins@xxxxxxxxxxx
(415) 425-9599
http://inktank.com
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
