On Thu, Nov 24, 2011 at 11:32 AM, Xiaofei Du <xiaofei.du008@xxxxxxxxx> wrote:
>> Unfortunately, that is the problem. Don't forget that Ceph is writing
>> everything twice, and it's synchronous replication. 50% of your writes
>> have to wait on those 20MB/s nodes to finish saving the data to disk.
>> (I believe HDFS uses async replication.) Plus, because it's journaling,
>> all data gets written twice on each OSD. Reads go much faster because
>> they only hit the slow OSDs for 25% of operations, and are often
>> served out of memory.
>>
>> We haven't yet gotten around to working out ways of automatically
>> handling this situation, but there are a few things you can do:
>> 1) Remove the bad nodes from your cluster. Your performance should go
>> up by a lot.
>> 2) Change the weight of the slow nodes so they get a lower percentage
>> of writes compared to the other nodes. (The default is 1.)
>> http://ceph.newdream.net/wiki/Monitor_commands#reweight
>
> It is not 50%. It is still only 25% of the writes that need to wait for
> the 2 slow nodes, no matter how many replicas it has to write or how
> much journaling it does. And even though two nodes are slower than the
> others, it should still scale up to some extent. But I didn't see any
> scale-up; many times it was even slower, I guess due to some resource
> contention. (HDFS uses sync replication too; it uses a pipeline for
> writing the data.) I tested using large files (15G), so they are not
> served out of memory for reading.

By default, each write goes to a primary, and the primary then
replicates that write to a replica. The write is not considered
complete until it is data safe on every active replica. (You can
increase or decrease the number of replicas, but I haven't seen
anywhere that you did.) So 1/4 of your writes will have a slow node as
a primary, and 3/4 will have a fast node as a primary. Then 1/7 of your
slow-primary writes will also have a slow replica, and 6/7 will have a
fast replica (but you are still bound by the slow one either way). And
5/7 of your fast-primary writes will have a fast replica, but 2/7 will
have a slow replica. So you end up with 1/4 (slow primary) plus
3/4 * 2/7 = 6/28 (slow replica) of your writes being bounded by the
write speed on a slow node: 7/28 + 6/28 = 13/28, or about 46% of the
writes.

Like I said, journaling also hurts your performance here, since it
requires that every write go to disk twice. This is necessary to
provide the data consistency guarantees that Ceph provides, but it
doesn't do us any favors in performance (especially on old disks). You
have two nodes that write at 20MB/s and six that write at 45MB/s.
Journaling cuts each of those in half: two nodes at 10MB/s, six at
23MB/s. That's an aggregate 158MB/s of write bandwidth, but each write
goes to two OSDs, so you only get half of that, or 79MB/s of write
bandwidth. You are getting about the bandwidth I would expect out of a
slow and unbalanced cluster. (There's a rough sketch of this arithmetic
at the bottom of this mail.)

>> When you try and ls a directory, the MetaData Server needs to get
>> current statistics on every file in the directory. If there are other
>> nodes currently writing to the directory, that involves revoking
>> "capabilities." Generally this should be pretty fast but is sometimes
>> still noticeable.
>>
>> If it's more than a second or so, we probably broke things so that
>> nodes are trying to flush out all their buffered writes before
>> releasing caps. (We've done this before, although I thought it was
>> fixed in current code.) What versions are you running?
>
> ls is much longer than 1 second. I used v0.34 and v0.37. Both have
> this issue.
Are you using ceph-fuse or the kernel client? And if it's the kernel
client, what version?
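
For what it's worth, below is a quick back-of-the-envelope Python
script that reproduces the arithmetic above. It assumes 8 OSDs (2 slow
at 20MB/s, 6 fast at 45MB/s), 2x replication, and a replica chosen
uniformly at random from the remaining OSDs; that is only a rough
stand-in for real CRUSH placement, so treat it as a sanity check of
the numbers rather than a model of your cluster.

from fractions import Fraction

slow, fast = 2, 6
total = slow + fast

# Fraction of writes bounded by a slow node: either the primary is
# slow, or the primary is fast but the replica is slow.
p_slow_primary = Fraction(slow, total)
p_fast_primary_slow_replica = Fraction(fast, total) * Fraction(slow, total - 1)
p_slow_bound = p_slow_primary + p_fast_primary_slow_replica
print(f"writes bounded by a slow node: {p_slow_bound} = {float(p_slow_bound):.0%}")
# -> writes bounded by a slow node: 13/28 = 46%

# Aggregate write bandwidth: journaling halves each disk, and 2x
# replication halves the usable total again.
slow_mb, fast_mb = 20, 45
aggregate = (slow * slow_mb + fast * fast_mb) / 2  # journal + data on each OSD
usable = aggregate / 2                             # every write goes to two OSDs
print(f"aggregate after journaling: {aggregate:.0f} MB/s, usable: {usable:.0f} MB/s")
# -> aggregate after journaling: 155 MB/s, usable: 78 MB/s
# (the mail rounds 22.5MB/s up to 23MB/s per fast disk, giving 158 and 79)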