Thanks a lot! I've got it! You've really pulled me deep into Ceph, haha.

2011/5/10 Gregory Farnum <gregf@xxxxxxxxxxxxxxx>:
> On Mon, May 9, 2011 at 12:01 AM, Simon Tian <aixt2006@xxxxxxxxx> wrote:
>> 2011/5/9 Gregory Farnum <gregf@xxxxxxxxxxxxxxx>:
>>> On Sun, May 8, 2011 at 8:04 PM, Simon Tian <aixt2006@xxxxxxxxx> wrote:
>>>> For primary-copy replication, I think that when the replication size
>>>> is 3, 4, or even more, the write speed should also be close to the
>>>> speed with 2 replicas, because the 2nd, 3rd, 4th, ... replicas are
>>>> written in parallel. But the speed I measured with 3 or 4 replicas is
>>>> not close to the speed with 2; in fact, it drops roughly linearly.
>>> You're hitting your network limits there. With primary copy, the
>>> primary needs to send the data out to each of the replicas, which caps
>>> the write speed at (network bandwidth) / (num replicas). Presumably
>>> you're using a gigabit network (or at least your nodes have gigabit
>>> connections):
>>> 1 replica:  ~125 MB/s (really a bit less due to protocol overhead)
>>> 2 replicas: ~62 MB/s
>>> 3 replicas: ~40 MB/s
>>> 4 replicas: ~31 MB/s
>>> etc.
>>> Of course, you can also be limited by the speed of your disks (don't
>>> forget to take journaling into account), and your situation is further
>>> complicated by having multiple daemons per physical node. But I
>>> suspect you get the idea. :)
>>
>> Yes, you're quite right! The client throughput with different
>> replication sizes will be limited by the network bandwidth of the
>> primary copy.
>>
>> I have some other questions:
>> 1. If I write or read a sparse file randomly, will the performance
>> drop much?
> That depends on how large your random IOs are, how much of the file is
> cached in memory on the OSDs, etc. In general, random IO does not look
> much different from sequential IO to the OSDs -- since the OSDs store
> files in 4 MB blocks, any large file read involves retrieving random
> 4 MB blocks from the OSDs anyway. On the client you might see a bigger
> difference, though -- there is a limited amount of prefetching going on
> client-side, and it works much better with sequential reads than with
> random ones.
>
> But behavior under different workloads is an area that still needs
> more study and refinement.
>
>> 2. Is an RBD image a sparse file?
> Yes! As with files in the POSIX-compatible Ceph layer, RBD images are
> stored in blocks (4 MB by default) on the OSDs. Only those chunks with
> data actually exist, and depending on your options and the backing
> filesystem, only the piece of the chunk with data is actually stored.
>
>> 3. As the attachment shows, read throughput increases as the I/O size
>> increases. What does this I/O size mean? Is there any relationship
>> between I/O size and object size? In the latest Ceph, what does the
>> read throughput of the different filesystems look like with different
>> I/O sizes?
> Is that one of the illustrations from Sage's thesis?
> In general, larger IOs will have higher throughput for many of the same
> reasons that larger IOs have higher throughput on hard drives: the OSD
> still needs to retrieve the data from disk, and a larger IO size
> minimizes the impact of the seek latency there. With very large IOs,
> the client can dispatch multiple read requests at once, allowing the
> seek latency on the OSDs to happen simultaneously rather than
> sequentially.
> You can obviously do IOs of any size without regard for the size of
> the object; the client layers handle all the necessary translation.
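Just to check that I follow the translation part, here is a minimal sketch of how I picture a byte offset in a file or RBD image being mapped onto objects. It assumes the default 4 MB object size and no custom striping layout, so it is only an illustration of the idea, not what the client code literally does:

    OBJECT_SIZE = 4 * 1024 * 1024  # default 4 MB objects (assuming no custom striping)

    def locate(offset, length):
        """Split a byte range into (object index, offset in object, length) pieces."""
        pieces = []
        while length > 0:
            obj = offset // OBJECT_SIZE         # which 4 MB object holds this byte
            off = offset % OBJECT_SIZE          # where the data starts inside it
            n = min(length, OBJECT_SIZE - off)  # stop at the object boundary
            pieces.append((obj, off, n))
            offset += n
            length -= n
        return pieces

    # A 1 MB write at offset 7 MB touches only object 1; objects that are
    # never written simply never exist, which is what makes an image sparse.
    print(locate(7 * 1024 * 1024, 1024 * 1024))    # -> [(1, 3145728, 1048576)]

If that picture is right, it also explains the answer to my question 2: a sparse RBD image only ever creates the objects that some write has actually touched.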
> In all versions of Ceph, you can expect higher throughput with larger
> IO sizes. I'm not sure if that's what you mean?
> -Greg
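To double-check the IO-size point, I made a rough back-of-the-envelope model for myself. The 8 ms seek time and 100 MB/s streaming rate below are made-up disk numbers, not anything measured on Ceph, so only the shape of the curve matters:

    SEEK_S = 0.008         # assumed per-IO seek time on the OSD's disk (made up)
    STREAM_MBPS = 100.0    # assumed sequential disk bandwidth, MB/s (made up)

    def throughput(io_size_mb):
        """Effective MB/s when every IO pays one seek plus the transfer time."""
        return io_size_mb / (SEEK_S + io_size_mb / STREAM_MBPS)

    for size_mb in (0.064, 0.256, 1, 4, 16):
        print("%6.3f MB IOs -> %5.1f MB/s" % (size_mb, throughput(size_mb)))

    # 64 KB IOs get only ~7 MB/s, while 4 MB IOs already reach ~83 MB/s and
    # 16 MB IOs ~95 MB/s: the per-IO seek cost gets amortized away, which
    # is the effect you describe.

So once the IOs are a few megabytes, the seek cost stops mattering much, which fits the throughput increase in the figure. Thanks again!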