Hello,

in a nutshell, I can confirm the write amplification; see inline.

On Mon, 20 Oct 2014 10:43:51 -0500 Mark Nelson wrote:

> On 10/20/2014 09:28 AM, Mark Wu wrote:
> >
> > 2014-10-20 21:04 GMT+08:00 Mark Nelson <mark.nelson@xxxxxxxxxxx
> > <mailto:mark.nelson@xxxxxxxxxxx>>:
> >
> > > On 10/20/2014 06:27 AM, Mark Wu wrote:
> > >
> > > > Test result Update:
> > > >
> > > > Number of Hosts   Maximum single volume IOPS   Maximum aggregated IOPS   SSD Disk IOPS   SSD Disk Utilization
> > > > 7                 14k                           45k                       9800+           90%
> > > > 8                 21k                           50k                       9800+           90%
> > > > 9                 30k                           56k                       9800+           90%
> > > > 10                40k                           54k                       8200+           70%
> > > >
> > > > Note: the disk average request size is about 20 sectors, not the
> > > > same as the client side (4k).
> > > >
> > > > I have two questions about the result:
> > > >
> > > > 1. No matter how many nodes the cluster has, the backend write
> > > > throughput is always almost 8 times that of the client side. Is
> > > > this normal behavior in Ceph, or is it caused by some wrong
> > > > configuration in my setup?
> > >
> > > Are you counting journal writes and replication into this? Also
> > > note that journal writes will be slightly larger and padded to a 4K
> > > boundary for each write due to header information. I suspect for
> > > coalesced journal writes we may be able to pack the headers
> > > together to reduce this overhead.
> >
> > Yes, the journal writes and replication are counted into the backend
> > writes. Each SSD has two partitions: the raw one is used for the
> > journal and the one formatted as XFS is used for OSD data. The
> > replica setting is 2.
> > So considering the journal writes and replication, I expect the
> > writes on the backend to be 4 times the client side. From the
> > perspective of disk utilization it's good, because it's already
> > close to the physical limitation.
> > But the overhead is too big. Is it possible to try your idea without
> > modifying code? If yes, I am glad to give it a try.
>
> Sadly it will require code changes and is something we've only briefly
> talked about. So it is surprising that you would see 8x writes with 2x
> replication and on-disk journals imho. In the past one of the things
> I've done is add up all of the totals for the entire test, both on the
> client side and on the server side, just to make sure that the numbers
> are right. At least in past testing things properly added up, at least
> on our test rig.

Using rbd bench with the default (4MB) block size it adds up pretty
well, even better than expected, on my temporary test machine (single
storage server, no replication, 8 DC S3700 100GB for OSDs, journal on a
partition of the same device, IB 4xQDR link to the client, Ceph 0.80.7).

During that test I see about 1.5GB/s disk I/O with atop, which matches
the result beautifully, as in:

  Total time run:         30.205574
  Total writes made:      5393
  Write size:             4194304
  Bandwidth (MB/sec):     714.173

That makes about 21GB written, or about 2.7GB per OSD. According to the
SSD SMART values they wrote about 4.6GB each during that test, so better
than the 2x amplification expected (data plus journal on the same
device, no replication).

As a side note, while the 8 3.2GHz cores get moderately busy (50%ish)
during this test, the limiting factor is the SSDs at their 200MB/s
maximum write speed.

Now if we run that same test with a block size of 4KB, one gets:

  Total time run:         30.033196
  Total writes made:      126944
  Write size:             4096
  Bandwidth (MB/sec):     16.511

This makes about 508MB written, or roughly 64MB per OSD. According to
the SMART values of the SSDs they wrote 768MB each, or in other words 6
times more than one would have expected with a write amplification of 2.
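To make that arithmetic easy to redo after longer runs, here is a rough
back-of-the-envelope sketch (my own helper, not part of Ceph or rbd);
the inputs are simply the rbd bench totals above plus the per-SSD
host-write deltas read from SMART before and after each run:

  #!/usr/bin/env python
  # Rough write-amplification check for the two runs above.
  # A sketch only, not part of Ceph or rbd.

  MiB = 2.0 ** 20
  GiB = 2.0 ** 30

  def amplification(total_writes, write_size, num_osds, smart_bytes_per_ssd,
                    expected=2.0):
      # expected=2.0: journal and data share the same SSD and there is
      # no replication on this single-node test box.
      client_bytes = total_writes * write_size      # data sent by the client
      per_osd = client_bytes / float(num_osds)      # client data per OSD
      observed = smart_bytes_per_ssd / per_osd      # what each SSD actually wrote
      return observed, observed / expected

  # 4MB blocks: 5393 writes of 4194304 bytes, roughly 4.6GB per SSD (SMART delta)
  print("4MB run: %.1fx observed, %.1fx of the expected 2x" %
        amplification(5393, 4194304, 8, 4.6 * GiB))
  # 4KB blocks: 126944 writes of 4096 bytes, roughly 768MB per SSD (SMART delta)
  print("4KB run: %.1fx observed, %.1fx of the expected 2x" %
        amplification(126944, 4096, 8, 768 * MiB))

For the 4KB case that comes out at roughly 12x on the SSDs, i.e. about
6 times the 2x one would expect, which is exactly the discrepancy above.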
Not only does this obviously point at another vast reservoir for
improvements, it also means that your SSDs will be whittled down to
size at a rate that's both totally unexpected and unacceptable.

And the side note here, unsurprisingly, is that during this test all 8
cores go into meltdown while the SSDs are bored (about 20% utilization),
for a whopping 4000 IOPS.

Here's hoping for Giant and Hammer...

Christian

> > The following data is captured in the 9-host test. Roughly, the
> > aggregated backend write throughput is 10000 * 22 * 512 * 2 * 9 =
> > 1980MB/s.
> > The client side is 56k * 4k = 224MB/s.
> >
> > Filesystem:  rBlk_nor/s  wBlk_nor/s  rBlk_dir/s  wBlk_dir/s  rBlk_svr/s  wBlk_svr/s  ops/s  rops/s  wops/s
> >
> > Device:  rrqm/s  wrqm/s  r/s   w/s       rsec/s  wsec/s     avgrq-sz  avgqu-sz  await  svctm  %util
> > sda      0.00    0.33    0.00  1.33      0.00    10.67      8.00      0.00      0.00   0.00   0.00
> > sdb      0.00    6.00    0.00  10219.67  0.00    223561.67  21.88     4.08      0.40   0.09   89.43
> > sdc      0.00    6.00    0.00  9750.67   0.00    220286.67  22.59     2.47      0.25   0.09   89.83
> > dm-0     0.00    0.00    0.00  0.00      0.00    0.00       0.00      0.00      0.00   0.00   0.00
> > dm-1     0.00    0.00    0.00  1.33      0.00    10.67      8.00      0.00      0.00   0.00   0.00
> >
> > Filesystem:  rBlk_nor/s  wBlk_nor/s  rBlk_dir/s  wBlk_dir/s  rBlk_svr/s  wBlk_svr/s  ops/s  rops/s  wops/s
> >
> > Device:  rrqm/s  wrqm/s  r/s   w/s       rsec/s  wsec/s     avgrq-sz  avgqu-sz  await  svctm  %util
> > sda      0.00    0.00    0.00  1.00      0.00    26.67      26.67     0.00      0.00   0.00   0.00
> > sdb      0.00    6.33    0.00  10389.00  0.00    224668.67  21.63     3.78      0.36   0.09   89.23
> > sdc      0.00    4.33    0.00  10106.67  0.00    217986.00  21.57     3.83      0.38   0.09   91.10
> > dm-0     0.00    0.00    0.00  0.00      0.00    0.00       0.00      0.00      0.00   0.00   0.00
> > dm-1     0.00    0.00    0.00  1.00      0.00    26.67      26.67     0.00      0.00   0.00   0.00
> >
> > > > 2. For the scalability issue (10 hosts perform worse than 9
> > > > hosts), is there any tuning suggestion to improve it?
> > >
> > > Can you post exactly the test you are running and on how many
> > > hosts/volumes? That would help us debug.
> >
> > In the test, we run vdbench with the following parameters on one host:
> >
> > sd=sd1,lun=/dev/rbd2,threads=128
> > sd=sd2,lun=/dev/rbd0,threads=128
> > sd=sd3,lun=/dev/rbd1,threads=128
> > sd=sd4,lun=/dev/rbd3,threads=128
> > wd=wd1,sd=sd1,xfersize=4k,rdpct=0,openflags=o_direct
> > wd=wd2,sd=sd2,xfersize=4k,rdpct=0,openflags=o_direct
> > wd=wd3,sd=sd3,xfersize=4k,rdpct=0,openflags=o_direct
> > wd=wd4,sd=sd4,xfersize=4k,rdpct=0,openflags=o_direct
> > rd=run1,wd=wd*,iorate=100000,elapsed=500,interval=1
>
> Ok, I don't know a ton about vdbench. Is there any reason you are
> limiting the iorate to 100000? You might try running the test on
> multiple clients and seeing if that makes any difference. If you feel
> like it, it might be worth also running similar tests with something
> like fio just to verify that the same behaviour is present.
>
> > Thanks!
> > Mark

-- 
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Fusion Communications
http://www.gol.com/
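P.S.: If anybody wants to redo the backend-versus-client arithmetic from
the iostat sample quoted above, a quick sketch like the following does
it; the wsec/s values, the two SSDs per host, the 9 hosts and the 56k
IOPS of 4KB client writes are simply the figures from that quoted
9-host test, so treat it as an illustration rather than a tool:

  #!/usr/bin/env python
  # Turn the quoted iostat sample into aggregate backend write bandwidth
  # and compare it with the client-side bandwidth (sketch only).

  SECTOR = 512                             # iostat wsec/s is in 512-byte sectors

  # wsec/s for the two journal/data SSDs (sdb, sdc) in the first sample above
  wsec_per_host = [223561.67, 220286.67]
  hosts = 9

  backend_bytes = sum(wsec_per_host) * SECTOR * hosts   # all OSD devices
  client_bytes = 56000 * 4096                           # 56k IOPS of 4KB writes

  print("backend: %.0f MB/s" % (backend_bytes / 1e6))
  print("client:  %.0f MB/s" % (client_bytes / 1e6))
  print("ratio:   %.1fx" % (backend_bytes / client_bytes))

With those inputs it reports roughly 2GB/s of backend writes against
roughly 0.23GB/s from the client, i.e. the 8-9x ratio being discussed
in the thread.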