On Fri, 10 Apr 2015, Mark Nelson wrote:
> Notice for instance a comparison of random 512k writes between filestore,
> newstore with no overlay, and newstore with 8m overlay:
>
> http://nhm.ceph.com/newstore/20150409/filestore/RBD_00524288_randwrite.png
> http://nhm.ceph.com/newstore/20150409/8m_overlay/RBD_00524288_randwrite.png
> http://nhm.ceph.com/newstore/20150409/no_overlay/RBD_00524288_randwrite.png
>
> The client rbd throughput as reported by fio is:
>
> filestore:            20.44 MB/s
> newstore+no_overlay:   4.35 MB/s
> newstore+8m_overlay:   3.86 MB/s
>
> But notice that in the graphs we see very different behaviors on disk.
>
> Filestore does a lot of reads and writes to a couple of specific portions
> of the device and has peaks/valleys when data gets written out in bulk.
> I would have expected to see more sequential-looking writes during the
> peaks due to journal writes, and no reads to that portion of the disk,
> but it seems murkier to me than that.
>
> http://nhm.ceph.com/newstore/20150409/filestore/RBD_00524288_randwrite_OSD0.mpg
>
> newstore+no_overlay does a kind of flurry of random IO and looks like it's
> somewhat seek-bound. It's very consistent, but actual write performance is
> low compared to what blktrace reports as the data hitting the disk.
> Something is happening toward the beginning of the drive too.
>
> http://nhm.ceph.com/newstore/20150409/no_overlay/RBD_00524288_randwrite_OSD0.mpg

Yeah, it looks like a bunch of write amplification... the disk bandwidth
used is really high. I think we need to look at what rocksdb is doing here.
A couple of things:

- Make the log bigger, if we can, so that short-lived WAL keys don't get
  amplified. We'd rather eat memory than rewrite them in an sst, since the
  number of them in flight is pretty well bounded.

- The rocksdb log as it stands isn't ever going to perform as well as the
  FileJournal currently does. The FileJournal uses a fixed-size device or
  file that's preallocated with no 'size' associated with it, so that when
  there is a write we only have to push down the data blocks (one seek),
  and on replay we can identify valid records with a seq # and checksum.
  Rocksdb's log is a .log file that grows and gets fsync(2)'d, which means
  that the data blocks have to hit the disk *and* the inode (size) needs to
  get updated for the commit to happen. We could improve this by doing a
  fallocate and turning the file into a circular buffer. I'm not sure XFS
  will let us fallocate a fresh file of 0's and avoid a second seek, though,
  because it'll still need to flip the extent bits when the data blocks are
  written... or we could prefill the file with 0's before using it. :/

sage

> newstore+8m overlay is interesting. Lots of data gets written out to the
> disk in seemingly large chunks, but the actual throughput as reported by
> the client is very slow. I assume there's tons of write amplification
> happening as rocksdb moves the 512k objects around into different levels.
>
> http://nhm.ceph.com/newstore/20150409/8m_overlay/RBD_00524288_randwrite_OSD0.mpg
>
> Mark
>
> On 04/10/2015 02:41 PM, Mark Nelson wrote:
> > Seekwatcher movies and graphs finally finished generating for all of
> > the tests:
> >
> > http://nhm.ceph.com/newstore/20150409/
> >
> > Mark
> >
> > On 04/10/2015 10:53 AM, Mark Nelson wrote:
> > > Test results are attached for different overlay settings at various
> > > IO sizes for writes and random writes. Basically it looks like
> > > increasing the overlay size changes the curve. So far we're still not
> > > doing as well as the filestore (co-located journal), though.
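To make the FileJournal model Sage describes above concrete (a preallocated,
fixed-size log written as a circular buffer, so that a commit only has to
push down the data blocks), here is a minimal C++/POSIX sketch. It is not
Ceph's FileJournal: the file name, record size, and loop are invented for
illustration, and a real journal record would also carry the seq # and
checksum Sage mentions so that replay can identify valid entries.

  // Minimal sketch of a preallocated, circular-buffer log (not Ceph's
  // FileJournal).  Because the file size never changes after fallocate(2),
  // fdatasync(2) only has to push down the data blocks; as noted above,
  // XFS may still have to convert unwritten extents on first write.
  #ifndef _GNU_SOURCE
  #define _GNU_SOURCE            // fallocate(2) is Linux-specific
  #endif
  #include <fcntl.h>
  #include <unistd.h>
  #include <cstdio>
  #include <cstring>

  int main() {
    const off_t log_size = 1 << 30;               // 1 GB, illustrative
    int fd = open("journal.bin", O_RDWR | O_CREAT, 0644);
    if (fd < 0 || fallocate(fd, 0, 0, log_size) != 0) {
      std::perror("preallocate");
      return 1;
    }
    char rec[4096];
    std::memset(rec, 0xab, sizeof(rec));          // stand-in for a journal record
    off_t pos = 0;
    for (int i = 0; i < 8; ++i) {
      if (pos + (off_t)sizeof(rec) > log_size)
        pos = 0;                                  // wrap: treat the file as a ring
      if (pwrite(fd, rec, sizeof(rec), pos) != (ssize_t)sizeof(rec))
        return 1;
      fdatasync(fd);                              // commit: no inode size update needed
      pos += sizeof(rec);
    }
    close(fd);
    return 0;
  }

A real implementation would also need to detect wraparound and torn writes
on replay, which is exactly what the seq # and checksum are for.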
> > >
> > > I imagine the WAL probably does play a big part here.
> > >
> > > Mark
> > >
> > > On 04/10/2015 10:28 AM, Sage Weil wrote:
> > > > On Fri, 10 Apr 2015, Ning Yao wrote:
> > > > > The KV store introduces too much write amplification; maybe we
> > > > > need a self-implemented WAL?
> > > >
> > > > What we really want is to hint to the kv store that these keys (or
> > > > this key range) are short-lived and should never get compacted.
> > > > And/or, we just need to make sure the wal is sufficiently large so
> > > > that in practice that never happens to those keys.
> > > >
> > > > Putting them outside the kv store means an additional seek/sync for
> > > > disks, which defeats most of the purpose. Maybe it makes sense for
> > > > flash... but the above avoids the problem in either case.
> > > >
> > > > I think we should target rocksdb for our initial tuning attempts.
> > > > So far all I've done is play a bit with the file size
> > > > (1mb -> 4mb -> 8mb), but my ad hoc tests didn't show much difference.
> > > >
> > > > sage
> > > >
> > > > > Regards
> > > > > Ning Yao
> > > > >
> > > > > 2015-04-10 14:11 GMT+08:00 Duan, Jiangang <jiangang.duan@xxxxxxxxx>:
> > > > > > IMHO, the newstore performance depends very much on KV store
> > > > > > performance due to the WAL, so picking the right KV store, or
> > > > > > tuning it, will be the first step.
> > > > > >
> > > > > > -jiangang
> > > > > >
> > > > > > -----Original Message-----
> > > > > > From: ceph-devel-owner@xxxxxxxxxxxxxxx
> > > > > > [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Mark Nelson
> > > > > > Sent: Friday, April 10, 2015 1:01 AM
> > > > > > To: Sage Weil
> > > > > > Cc: ceph-devel
> > > > > > Subject: Re: Initial newstore vs filestore results
> > > > > >
> > > > > > On 04/08/2015 10:19 PM, Mark Nelson wrote:
> > > > > > > On 04/07/2015 09:58 PM, Sage Weil wrote:
> > > > > > > > What would be very interesting would be to see the 4KB
> > > > > > > > performance with the defaults (newstore overlay max = 32) vs
> > > > > > > > overlays disabled (newstore overlay max = 0) and see if/how
> > > > > > > > much it is helping.
> > > > > > >
> > > > > > > And here we go. 1 OSD, 1X replication. 16GB RBD volume.
> > > > > > >
> > > > > > > 4MB (MB/s)         write     read    randw    randr
> > > > > > > default overlay    36.13   106.61    34.49    92.69
> > > > > > > no overlay         36.29   105.61    34.49    93.55
> > > > > > >
> > > > > > > 128KB (MB/s)       write     read    randw    randr
> > > > > > > default overlay     1.71    97.90     1.65    25.79
> > > > > > > no overlay          1.72    97.80     1.66    25.78
> > > > > > >
> > > > > > > 4KB (MB/s)         write     read    randw    randr
> > > > > > > default overlay     0.40    61.88     1.29     1.11
> > > > > > > no overlay          0.05    61.26     0.05     1.10
> > > > > >
> > > > > > Update this morning. Also ran filestore tests for comparison.
> > > > > > Next we'll look at how tweaking the overlay for different IO
> > > > > > sizes affects things; i.e., the overlay threshold is 64k right
> > > > > > now, and it appears that 128K write IOs, for instance, are quite
> > > > > > a bit worse with newstore currently than with filestore. Sage
> > > > > > also just committed changes that will allow overlay writes during
> > > > > > append/create, which may help improve small IO write performance
> > > > > > as well in some cases.
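To restate the overlay behavior discussed above in code form (writes below a
size threshold, up to a per-object count, are staged as key/value data in
rocksdb instead of being written straight to the object's file), here is a
purely conceptual sketch. It is not NewStore's implementation and the names
are invented; only the 64k threshold and the max of 32 come from the numbers
quoted in this thread.

  // Conceptual sketch only, not NewStore's actual code; names are invented.
  #include <cstddef>

  struct OverlayPolicy {
    std::size_t max_overlay_length = 64 * 1024;  // "the overlay threshold is 64k right now"
    int max_overlays = 32;                       // "newstore overlay max = 32"
  };

  // Small writes get staged in the kv store ("overlay"); larger writes, or
  // objects that already carry too many overlays, go to the object file.
  bool use_overlay(const OverlayPolicy& p, std::size_t write_len,
                   int existing_overlays) {
    return write_len <= p.max_overlay_length &&
           existing_overlays < p.max_overlays;
  }

  int main() {
    OverlayPolicy p;
    bool small_io = use_overlay(p, 4 * 1024, 0);    // 4K write: overlay
    bool large_io = use_overlay(p, 512 * 1024, 0);  // 512K write: object file
    return (small_io && !large_io) ? 0 : 1;
  }

Raising the threshold (as in the 8m overlay runs) pushes large writes into
rocksdb, which is consistent with the write amplification Mark observes as
those objects get compacted through the levels.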
> > > > > >
> > > > > > 4MB (MB/s)         write     read    randw    randr
> > > > > > default overlay    36.13   106.61    34.49    92.69
> > > > > > no overlay         36.29   105.61    34.49    93.55
> > > > > > filestore          36.17    84.59    34.11    79.85
> > > > > >
> > > > > > 128KB (MB/s)       write     read    randw    randr
> > > > > > default overlay     1.71    97.90     1.65    25.79
> > > > > > no overlay          1.72    97.80     1.66    25.78
> > > > > > filestore          27.15    79.91     8.77    19.00
> > > > > >
> > > > > > 4KB (MB/s)         write     read    randw    randr
> > > > > > default overlay     0.40    61.88     1.29     1.11
> > > > > > no overlay          0.05    61.26     0.05     1.10
> > > > > > filestore           4.14    56.30     0.42     0.76
> > > > > >
> > > > > > Seekwatcher movies and graphs available here:
> > > > > >
> > > > > > http://nhm.ceph.com/newstore/20150408/
> > > > > >
> > > > > > Note for instance the very interesting blktrace patterns for 4K
> > > > > > random writes on the OSD in each case:
> > > > > >
> > > > > > http://nhm.ceph.com/newstore/20150408/filestore/RBD_00004096_randwrite.png
> > > > > > http://nhm.ceph.com/newstore/20150408/default_overlay/RBD_00004096_randwrite.png
> > > > > > http://nhm.ceph.com/newstore/20150408/no_overlay/RBD_00004096_randwrite.png
> > > > > >
> > > > > > Mark
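Finally, tying the results back to Sage's tuning suggestions earlier in the
thread (make the log bigger so short-lived WAL keys don't get amplified, and
target rocksdb for the initial tuning), the relevant knobs in the stock
rocksdb C++ API look roughly like the sketch below. The option names are
real rocksdb options, but the values and the database path are illustrative
only; they are not the settings used in any of the tests reported here, and
it is an assumption that the "file size" being tuned maps onto one of these
options.

  // Sketch against the stock rocksdb C++ API, not Ceph's newstore config:
  // enlarge the memtables and the WAL budget so that short-lived keys tend
  // to die in memory instead of being rewritten into sst files.
  #include <rocksdb/db.h>
  #include <rocksdb/options.h>

  int main() {
    rocksdb::Options opt;
    opt.create_if_missing = true;
    opt.write_buffer_size = 32 * 1024 * 1024;        // bigger memtables
    opt.max_write_buffer_number = 4;
    opt.min_write_buffer_number_to_merge = 2;        // merge before flushing
    opt.max_total_wal_size = 256 * 1024 * 1024;      // keep more in the log
    opt.target_file_size_base = 8 * 1024 * 1024;     // candidate "file size" knob (assumption)
    rocksdb::DB* db = nullptr;
    rocksdb::Status s = rocksdb::DB::Open(opt, "/tmp/newstore-kv-example", &db);
    if (!s.ok()) {
      return 1;
    }
    delete db;                                       // close the DB
    return 0;
  }

Whether knobs like these are enough, or whether the log needs to become a
preallocated circular buffer as discussed above, is exactly the open
question in this thread.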