On Fri, 15 Jun 2012, Mark Nelson wrote:
> On 06/15/2012 12:56 AM, Stefan Priebe - Profihost AG wrote:
> > Hello list,
> >
> > I still don't understand why the speed of rados bench depends so
> > heavily on the number of threads.
> >
> > Right now I get around 100MB/s per thread. So 1 thread is 100MB/s,
> > 4 threads 400MB/s, and 16 threads results in about 1100MB/s.
> >
> > So 1100MB/s is great, but I still don't get why 1 thread gets "only"
> > 100MB/s.

The one other thing worth mentioning here is that "thread" is really a
misnomer.  rados bench actually dispatches its IO asynchronously from a
single thread, and the -t option really controls the number of IOs in
flight.  That is more or less what you get if you have N threads each
doing one synchronous IO at a time, which is why the option is named the
way it is.

sage

> >
> > Total time run:          30.037374
> > Total writes made:       8326
> > Write size:              4194304
> > Bandwidth (MB/sec):      1108.752
> >
> > Stddev Bandwidth:        47.5612
> > Max bandwidth (MB/sec):  1152
> > Min bandwidth (MB/sec):  948
> > Average Latency:         0.0577107
> > Stddev Latency:          0.020784
> > Max latency:             0.382413
> > Min latency:             0.026057
> >
> > Stefan
>
> Hi Stefan,
>
> Let me preface this by saying that I haven't specifically read through
> the rados bench code.  Having said that, the basic idea here is that
> you have a pipeline where a request is sent from the client to an OSD.
> If you specify "-t 1", the client will only send a single request at a
> time, which means that the entire process is serial and you are
> entirely latency bound.  Now think about what happens when the client
> sends a request.  Before the client gets an acknowledgement, the
> request must:
>
> 1) Go through client-side processing.
> 2) Travel over the IP network to the destination OSD.
> 3) Go through all of the queue-processing code on the OSD.
> 4a) Write the data to the journal (or the faster of the journal/data
>     disk when using btrfs; note that journal writes may stall if the
>     data disk is too slow and the journal has gotten sufficiently
>     ahead of it).
> 4b) Complete replication to the other OSDs, based on the pool's
>     replication level and the placement group the data lands in
>     (basically steps 1, 2, 3, 4a, and 5 all over again, with the OSD
>     acting as the client).
> 5) Send the ack back to the client over the IP network.
>
> If only one request is in flight at a time, most of the hardware sits
> idle while that request makes its way through the pipeline.  If there
> are multiple concurrent requests, the OSD(s) can better utilize all of
> the hardware (i.e., some requests can be coming in over the network
> while others are being written to disk and still others are being
> replicated).
>
> You can probably imagine that once you have multiple OSDs on multiple
> nodes, having concurrent requests in flight helps you even more.
>
> Mark
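A quick sanity check using only the numbers posted above: steady-state
throughput is roughly (IOs in flight) x (write size) / (average latency),
so

  16 in flight * 4 MB per write (4194304 bytes) / 0.0577 s  ~=  1109 MB/s

which matches the reported 1108.752 MB/s almost exactly. Run backwards,
100 MB/s at -t 1 implies a per-write round trip of roughly
4 MB / 100 MB/s ~= 40 ms; a single "thread" is simply waiting out that
pipeline latency on every write, exactly as Mark describes.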
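To make the "IOs in flight, not threads" model concrete, here is a
minimal sketch of the same dispatch pattern using the librados C API:
one thread keeps a fixed number of async writes outstanding and only
waits when every slot is full. This is not the actual rados bench
implementation, just an illustration of the dispatch model; it assumes
a reachable cluster, the default ceph.conf search path, and an existing
pool named "data", and it omits error handling for brevity.

  /* Build (roughly): cc sketch.c -lrados */
  #include <stdio.h>
  #include <stdlib.h>
  #include <rados/librados.h>

  #define IN_FLIGHT 16            /* the "-t" value */
  #define OBJ_SIZE  (4 << 20)     /* 4194304 bytes, the default write size */
  #define TOTAL_OPS 128

  int main(void)
  {
      rados_t cluster;
      rados_ioctx_t ioctx;
      rados_completion_t comps[IN_FLIGHT];
      char *buf = calloc(1, OBJ_SIZE);
      char oid[64];
      int i;

      rados_create(&cluster, NULL);
      rados_conf_read_file(cluster, NULL);   /* default config search path */
      rados_connect(cluster);
      rados_ioctx_create(cluster, "data", &ioctx);

      for (i = 0; i < TOTAL_OPS; i++) {
          int slot = i % IN_FLIGHT;
          /* Once all slots are full, wait for the oldest write before
           * reusing its slot -- this caps the queue depth at IN_FLIGHT. */
          if (i >= IN_FLIGHT) {
              rados_aio_wait_for_complete(comps[slot]);
              rados_aio_release(comps[slot]);
          }
          snprintf(oid, sizeof(oid), "bench_obj_%d", i);
          rados_aio_create_completion(NULL, NULL, NULL, &comps[slot]);
          /* Dispatch and return immediately; no extra threads needed. */
          rados_aio_write(ioctx, oid, comps[slot], buf, OBJ_SIZE, 0);
      }
      /* Drain whatever is still in flight. */
      for (i = 0; i < IN_FLIGHT && i < TOTAL_OPS; i++) {
          rados_aio_wait_for_complete(comps[i]);
          rados_aio_release(comps[i]);
      }

      rados_ioctx_destroy(ioctx);
      rados_shutdown(cluster);
      free(buf);
      return 0;
  }

With IN_FLIGHT set to 1, the loop degenerates into fully synchronous
writes, which is exactly the latency-bound "-t 1" case discussed above.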