Re: replication write speed

On Mon, May 9, 2011 at 12:01 AM, Simon Tian <aixt2006@xxxxxxxxx> wrote:
> 2011/5/9 Gregory Farnum <gregf@xxxxxxxxxxxxxxx>:
>> On Sun, May 8, 2011 at 8:04 PM, Simon Tian <aixt2006@xxxxxxxxx> wrote:
>>> For primary copy, I think that when the replication size is 3, 4, or
>>> even more, the write speed should also be close to that of 2
>>> replicas, because the 2nd, 3rd, 4th, ... replicas are written in
>>> parallel. But the speed I got for 3 and 4 replicas is not close to
>>> the speed for 2; in fact, it drops roughly linearly.
>> You're hitting your network limits there. With primary copy, the
>> primary needs to send the data out to each of the replicas, which caps
>> the write speed at (network bandwidth) / (num replicas). Presumably
>> you're using a gigabit network (or at least your nodes have gigabit
>> connections):
>> 1 replica: ~125MB/s (really a bit less due to protocol overhead)
>> 2 replicas: ~62MB/s
>> 3 replicas: ~40MB/s
>> 4 replicas: ~31MB/s
>> etc.
>> Of course, you can also be limited by the speed of your disks (don't
>> forget to take journaling into account); and your situation is further
>> complicated by having multiple daemons per physical node. But I
>> suspect you get the idea. :)
>
>
>  Yes, you are quite right!  The client throughput at different
> replication sizes is limited by the network bandwidth of the primary OSD.
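
As a back-of-the-envelope sketch of that cap (the only assumption beyond
the figures quoted above is treating the primary's gigabit link as
carrying the data roughly once per replica):

    # Rough model: with primary-copy replication the primary's link has to
    # move the data about once per replica, so client write throughput is
    # capped near link_speed / num_replicas.
    LINK_MB_PER_S = 125.0  # ~1Gb/s, before protocol overhead

    def max_write_throughput(num_replicas):
        return LINK_MB_PER_S / num_replicas

    for n in (1, 2, 3, 4):
        print("%d replica(s): ~%.0f MB/s" % (n, max_write_throughput(n)))
    # -> ~125, ~62, ~42, ~31 MB/s, close to the numbers quoted above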
>
>
> I have some other questions:
> 1.  If I have to read or write a sparse file randomly, will the
> performance drop much?
That depends on how large your random IOs are, how much of the file is
cached in memory on the OSDs, etc. In general, random IO does not look
much different from sequential IO to the OSDs -- since the OSDs store
file data in 4MB blocks, any large file read involves retrieving
effectively random 4MB blocks from the OSDs anyway. On the client you
might see a bigger difference, though -- there is a limited amount of
prefetching going on client-side, and it works much better with
sequential reads than with random ones.
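
For example (a simplified sketch, not the actual striping code; the 4MB
default is the only number taken from above):

    OBJECT_SIZE = 4 * 1024 * 1024  # default 4MB object/stripe size

    def object_for_offset(offset):
        # Which 4MB object holds this byte, and where inside that object.
        return offset // OBJECT_SIZE, offset % OBJECT_SIZE

    # A read at a "random" 1.5GB offset and a "sequential" read at 4MB
    # both boil down to fetching part of one 4MB object on some OSD:
    print(object_for_offset(1536 * 1024 * 1024))  # (384, 0)
    print(object_for_offset(4 * 1024 * 1024))     # (1, 0)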

But behavior under different workloads is an area that still needs
more study and refinement.

> 2. Is a rbd image sparse file?
Yes! As with files in the POSIX-compatible Ceph layer, RBD images are
stored in blocks (4MB by default) on the OSDs. Only those chunks with
data actually exist, and depending on your options and the backing
filesystem, only the piece of the chunk with data is actually stored.
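
A small illustration of what that sparseness means (hypothetical write
pattern and sizes, not the real RBD code):

    CHUNK = 4 * 1024 * 1024  # default 4MB backing objects

    def touched_chunks(writes):
        # writes: iterable of (offset, length) byte ranges actually written.
        chunks = set()
        for off, length in writes:
            chunks.update(range(off // CHUNK, (off + length - 1) // CHUNK + 1))
        return sorted(chunks)

    # A 10GB image with only two 4KB writes touches just two objects;
    # the other ~2558 possible objects are never created on the OSDs.
    print(touched_chunks([(0, 4096), (6 * 1024**3, 4096)]))  # [0, 1536]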

> 3. As the attachment shows, the read throughput increases as the I/O
> size increases.
>   What does this I/O size mean? Is there any relationship between I/O
> size and object size?
>   In the latest Ceph, what will the read throughput of different file
> systems look like with different I/O sizes?
Is that one of the illustrations from Sage's thesis?
In general larger IOs will have higher throughput for many of the same
reasons that larger IOs have higher throughput on hard drives: the OSD
still needs to retrieve the data from off-disk, and a larger IO size
will minimize the impact of the seek latency there. With very large
IOs, the client can dispatch multiple read requests at once, allowing
the seek latency on the OSDs to happen simultaneously rather than
sequentially.
You can obviously do IOs of any size without regard for the size of
the object; the client layers handle all the necessary translation.
In all versions of Ceph, you can expect higher throughput with larger
IO sizes. I'm not sure if that's what you mean?
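
To make the seek-amortization point concrete, here is a crude model (the
8ms seek and 100MB/s disk rate are assumed numbers, not Ceph measurements):

    SEEK_S = 0.008          # ~8ms average seek + rotational latency
    DISK_MB_PER_S = 100.0   # sustained platter transfer rate

    def throughput_mb_s(io_size_mb):
        # Each request pays one seek plus the transfer time for its data.
        return io_size_mb / (SEEK_S + io_size_mb / DISK_MB_PER_S)

    for size_mb in (0.004, 0.064, 1.0, 4.0, 16.0):
        print("%6.3f MB IOs -> ~%5.1f MB/s" % (size_mb, throughput_mb_s(size_mb)))
    # 4KB IOs crawl at ~0.5 MB/s, while 4MB IOs approach the disk's raw rate.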
-Greg

