Thanks a lot! I've got it! You've really pulled me deep into Ceph, haha.

2011/5/10 Gregory Farnum <gregf@xxxxxxxxxxxxxxx>:
> On Mon, May 9, 2011 at 12:01 AM, Simon Tian <aixt2006@xxxxxxxxx> wrote:
>> 2011/5/9 Gregory Farnum <gregf@xxxxxxxxxxxxxxx>:
>>> On Sun, May 8, 2011 at 8:04 PM, Simon Tian <aixt2006@xxxxxxxxx> wrote:
>>>> For primary-copy replication, I think that when the replication size
>>>> is 3, 4, or even more, the write speed should also be close to the
>>>> speed with 2 replicas, because the 2nd, 3rd, 4th, ... replicas are
>>>> written in parallel. But the speed I measured with 3 or 4 replicas is
>>>> not close to the speed with 2; in fact, it drops roughly linearly.
>>> You're hitting your network limits there. With primary copy, the
>>> primary needs to send the data out to each of the replicas, which caps
>>> the write speed at (network bandwidth) / (num replicas). Presumably
>>> you're using a gigabit network (or at least your nodes have gigabit
>>> connections):
>>> 1 replica:  ~125 MB/s (really a bit less due to protocol overhead)
>>> 2 replicas: ~62 MB/s
>>> 3 replicas: ~40 MB/s
>>> 4 replicas: ~31 MB/s
>>> etc.
>>> Of course, you can also be limited by the speed of your disks (don't
>>> forget to take journaling into account), and your situation is further
>>> complicated by having multiple daemons per physical node. But I
>>> suspect you get the idea. :)
>>
>> Yes, you're quite right! The client throughput with different
>> replication sizes will be limited by the network bandwidth of the
>> primary copy.
>>
>> I have some other questions:
>> 1. If I write or read a sparse file randomly, will the performance
>> drop much?
> That depends on how large your random IOs are, how much of the file is
> cached in memory on the OSDs, etc. In general, random IO does not look
> much different from sequential IO to the OSDs -- since the OSDs store
> files in 4 MB blocks, any large file read involves retrieving random
> 4 MB blocks from the OSDs anyway. On the client you might see a bigger
> difference, though -- there is a limited amount of prefetching going on
> client-side, and it works much better with sequential reads than with
> random ones.
>
> But behavior under different workloads is an area that still needs
> more study and refinement.
>
>> 2. Is an RBD image a sparse file?
> Yes! As with files in the POSIX-compatible Ceph layer, RBD images are
> stored in blocks (4 MB by default) on the OSDs. Only those chunks with
> data actually exist, and depending on your options and the backing
> filesystem, only the piece of the chunk with data is actually stored.
>
>> 3. As the attachment shows, read throughput increases as the I/O size
>> increases. What does this I/O size mean? Is there any relationship
>> between I/O size and object size? In the latest Ceph, what does the
>> read throughput of the different filesystems look like with different
>> I/O sizes?
> Is that one of the illustrations from Sage's thesis?
> In general, larger IOs will have higher throughput for many of the same
> reasons that larger IOs have higher throughput on hard drives: the OSD
> still needs to retrieve the data from disk, and a larger IO size
> minimizes the impact of the seek latency there. With very large IOs,
> the client can dispatch multiple read requests at once, allowing the
> seek latency on the OSDs to happen simultaneously rather than
> sequentially.
> You can obviously do IOs of any size without regard for the size of
> the object; the client layers handle all the necessary translation.
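Just to check that I follow the translation part, here is a minimal sketch of how I picture a byte offset in a file or RBD image being mapped onto objects. It assumes the default 4 MB object size and no custom striping layout, so it is only an illustration of the idea, not what the client code literally does:

    OBJECT_SIZE = 4 * 1024 * 1024  # default 4 MB objects (assuming no custom striping)

    def locate(offset, length):
        """Split a byte range into (object index, offset in object, length) pieces."""
        pieces = []
        while length > 0:
            obj = offset // OBJECT_SIZE         # which 4 MB object holds this byte
            off = offset % OBJECT_SIZE          # where the data starts inside it
            n = min(length, OBJECT_SIZE - off)  # stop at the object boundary
            pieces.append((obj, off, n))
            offset += n
            length -= n
        return pieces

    # A 1 MB write at offset 7 MB touches only object 1; objects that are
    # never written simply never exist, which is what makes an image sparse.
    print(locate(7 * 1024 * 1024, 1024 * 1024))    # -> [(1, 3145728, 1048576)]

If that picture is right, it also explains the answer to my question 2: a sparse RBD image only ever creates the objects that some write has actually touched.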
> In all versions of Ceph, you can expect higher throughput with larger
> IO sizes. I'm not sure if that's what you mean?
> -Greg
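To double-check the IO-size point, I made a rough back-of-the-envelope model for myself. The 8 ms seek time and 100 MB/s streaming rate below are made-up disk numbers, not anything measured on Ceph, so only the shape of the curve matters:

    SEEK_S = 0.008         # assumed per-IO seek time on the OSD's disk (made up)
    STREAM_MBPS = 100.0    # assumed sequential disk bandwidth, MB/s (made up)

    def throughput(io_size_mb):
        """Effective MB/s when every IO pays one seek plus the transfer time."""
        return io_size_mb / (SEEK_S + io_size_mb / STREAM_MBPS)

    for size_mb in (0.064, 0.256, 1, 4, 16):
        print("%6.3f MB IOs -> %5.1f MB/s" % (size_mb, throughput(size_mb)))

    # 64 KB IOs get only ~7 MB/s, while 4 MB IOs already reach ~83 MB/s and
    # 16 MB IOs ~95 MB/s: the per-IO seek cost gets amortized away, which
    # is the effect you describe.

So once the IOs are a few megabytes, the seek cost stops mattering much, which fits the throughput increase in the figure. Thanks again!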