You can try Universal Compaction: https://github.com/facebook/rocksdb/wiki/Universal-Compaction

-----Original Message-----
From: Sage Weil [mailto:sage@xxxxxxxxxxxx]
Sent: Saturday, April 11, 2015 7:24 AM
To: Mark Nelson
Cc: Ning Yao; Duan, Jiangang; ceph-devel
Subject: Re: Initial newstore vs filestore results

On Fri, 10 Apr 2015, Mark Nelson wrote:
> Notice for instance a comparison of random 512k writes between filestore,
> newstore with no overlay, and newstore with 8m overlay:
>
> http://nhm.ceph.com/newstore/20150409/filestore/RBD_00524288_randwrite.png
> http://nhm.ceph.com/newstore/20150409/8m_overlay/RBD_00524288_randwrite.png
> http://nhm.ceph.com/newstore/20150409/no_overlay/RBD_00524288_randwrite.png
>
> The client rbd throughput as reported by fio is:
>
>   filestore:            20.44 MB/s
>   newstore+no_overlay:   4.35 MB/s
>   newstore+8m_overlay:   3.86 MB/s
>
> But notice that in the graphs we see very different behaviors on disk.
>
> Filestore does a lot of reads and writes to a couple of specific portions
> of the device, with peaks and valleys when data gets written out in bulk.
> I would have expected to see more sequential-looking writes during the
> peaks due to journal writes, and no reads to that portion of the disk,
> but it seems murkier to me than that.
>
> http://nhm.ceph.com/newstore/20150409/filestore/RBD_00524288_randwrite_OSD0.mpg
>
> newstore+no_overlay does a flurry of random IO and looks like it's
> somewhat seek bound.  It's very consistent, but the actual write
> performance is low compared to what blktrace reports as data hitting the
> disk.  Something is happening toward the beginning of the drive too.
>
> http://nhm.ceph.com/newstore/20150409/no_overlay/RBD_00524288_randwrite_OSD0.mpg

Yeah, looks like a bunch of write amplification... the disk bandwidth used
is really high.  I think we need to look at what rocksdb is doing here.  A
couple of things:

- Make the log bigger, if we can, so that short-lived WAL keys don't get
  amplified.  We'd rather eat memory than rewrite them in an sst, since the
  number of them in flight is pretty well bounded.

- The rocksdb log as it stands isn't ever going to perform as well as the
  FileJournal currently does.  The FileJournal uses a fixed-size device or
  file that's preallocated with no 'size' associated with it, so when there
  is a write we only have to push down the data blocks (one seek), and on
  replay we can identify valid records with a seq # and checksum.  Rocksdb's
  log is a .log file that grows and gets fsync(2)'d, which means the data
  blocks have to hit the disk *and* the inode (size) needs to get updated
  for the commit to happen.  We could improve this by doing a fallocate and
  turning it into a circular buffer.  I'm not sure XFS will let us fallocate
  a fresh file of 0's and avoid a second seek, though, because it'll still
  need to flip the extent bits when the data blocks are written... or we
  could prefill the file with 0's before using it. :/

sage
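
[To make the circular-buffer idea above concrete, here is a minimal
standalone sketch.  It is not Ceph's FileJournal or anything in newstore;
the class and field names are made up for illustration.  The file is sized
once up front, so an append only dirties data blocks, fdatasync() has no
inode size update to push, and replay can walk records forward by sequence
number and checksum.]

#include <fcntl.h>
#include <unistd.h>
#include <zlib.h>      // crc32() for record checksums
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

struct LogRecord {
  uint64_t seq;   // monotonically increasing; replay stops at the first gap
  uint32_t len;   // payload length
  uint32_t crc;   // crc32 over the payload; a bad crc also ends replay
};

class CircularLog {
 public:
  bool create(const std::string& path, off_t size) {
    fd_ = ::open(path.c_str(), O_RDWR | O_CREAT, 0644);
    if (fd_ < 0) return false;
    size_ = size;
    // Preallocate the whole file so its size never changes afterwards.
    return ::posix_fallocate(fd_, 0, size) == 0;
  }

  bool append(uint64_t seq, const void* data, uint32_t len) {
    LogRecord h{seq, len,
                static_cast<uint32_t>(::crc32(0, static_cast<const Bytef*>(data), len))};
    std::vector<char> buf(sizeof(h) + len);
    ::memcpy(buf.data(), &h, sizeof(h));
    ::memcpy(buf.data() + sizeof(h), data, len);
    if (pos_ + static_cast<off_t>(buf.size()) > size_)
      pos_ = 0;                                   // wrap around
    if (::pwrite(fd_, buf.data(), buf.size(), pos_) !=
        static_cast<ssize_t>(buf.size()))
      return false;
    pos_ += buf.size();
    // Only the data blocks need to reach the disk; the file's size and
    // allocation are already settled, so this is the single seek.
    return ::fdatasync(fd_) == 0;
  }

 private:
  int fd_ = -1;
  off_t size_ = 0;
  off_t pos_ = 0;
};

(This glosses over stale records from a previous lap around the buffer,
which a real journal would fence with a header.  The XFS caveat above still
applies: posix_fallocate() typically leaves unwritten extents, so the first
write to each block flips extent bits anyway; prefilling with real zeros
avoids that metadata update at the cost of a one-time full-file write.)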
> newstore+8m overlay is interesting.  Lots of data gets written out to the
> disk in seemingly large chunks, but the actual throughput as reported by
> the client is very slow.  I assume there's tons of write amplification
> happening as rocksdb moves the 512k objects around into different levels.
>
> http://nhm.ceph.com/newstore/20150409/8m_overlay/RBD_00524288_randwrite_OSD0.mpg
>
> Mark
>
> On 04/10/2015 02:41 PM, Mark Nelson wrote:
> > Seekwatcher movies and graphs finally finished generating for all of the
> > tests:
> >
> > http://nhm.ceph.com/newstore/20150409/
> >
> > Mark
> >
> > On 04/10/2015 10:53 AM, Mark Nelson wrote:
> > > Test results attached for different overlay settings at various IO
> > > sizes for writes and random writes.  Basically it looks like
> > > increasing the overlay size changes the curve.  So far we're still not
> > > doing as well as filestore (co-located journal), though.
> > >
> > > I imagine the WAL probably plays a big part here.
> > >
> > > Mark
> > >
> > > On 04/10/2015 10:28 AM, Sage Weil wrote:
> > > > On Fri, 10 Apr 2015, Ning Yao wrote:
> > > > > The KV store introduces too much write amplification; maybe we
> > > > > need a self-implemented WAL?
> > > >
> > > > What we really want is to hint to the kv store that these keys (or
> > > > this key range) are short-lived and should never get compacted.
> > > > And/or, we need to make sure the wal is sufficiently large so that
> > > > in practice that never happens to those keys.
> > > >
> > > > Putting them outside the kv store means an additional seek/sync for
> > > > disks, which defeats most of the purpose.  Maybe it makes sense for
> > > > flash... but the above avoids the problem in either case.
> > > >
> > > > I think we should target rocksdb for our initial tuning attempts.
> > > > So far all I've done is play a bit with the file size (1mb -> 4mb ->
> > > > 8mb), but my ad hoc tests didn't see much difference.
> > > >
> > > > sage
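
[For concreteness, a minimal sketch of the kind of rocksdb tuning being
discussed in this thread: larger and more numerous write buffers so that
short-lived WAL-backed keys can die in memory instead of being rewritten
into an sst, a bigger WAL budget, and the universal compaction style
suggested at the top of the thread.  The fields are standard
rocksdb::Options members, but the values and the database path are purely
illustrative, not what newstore actually uses.]

#include <iostream>
#include <rocksdb/db.h>
#include <rocksdb/options.h>

int main() {
  rocksdb::Options opts;
  opts.create_if_missing = true;

  // Bigger / more memtables: short-lived keys have a better chance of being
  // overwritten or deleted before they ever reach level 0.
  opts.write_buffer_size = 32 * 1024 * 1024;   // per-memtable size
  opts.max_write_buffer_number = 4;
  opts.min_write_buffer_number_to_merge = 2;

  // Let more WAL accumulate before flushes are forced to free it.
  opts.max_total_wal_size = 256 * 1024 * 1024;

  // Universal compaction: fewer rewrites of the same data (less write
  // amplification) at the cost of extra space amplification.
  opts.compaction_style = rocksdb::kCompactionStyleUniversal;

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(opts, "/tmp/newstore-kv-sketch", &db);
  if (!s.ok()) {
    std::cerr << "open failed: " << s.ToString() << std::endl;
    return 1;
  }
  delete db;
  return 0;
}

(In a real deployment these would presumably be wired through the OSD's
configuration rather than a standalone program; the program form just keeps
the sketch self-contained.)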
> > > > > Regards
> > > > > Ning Yao
> > > > >
> > > > > 2015-04-10 14:11 GMT+08:00 Duan, Jiangang <jiangang.duan@xxxxxxxxx>:
> > > > > > IMHO, the newstore performance depends so much on KV store
> > > > > > performance, due to the WAL, that picking the right KV store, or
> > > > > > tuning it, will be the first step.
> > > > > >
> > > > > > -jiangang
> > > > > >
> > > > > > -----Original Message-----
> > > > > > From: ceph-devel-owner@xxxxxxxxxxxxxxx
> > > > > > [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Mark Nelson
> > > > > > Sent: Friday, April 10, 2015 1:01 AM
> > > > > > To: Sage Weil
> > > > > > Cc: ceph-devel
> > > > > > Subject: Re: Initial newstore vs filestore results
> > > > > >
> > > > > > On 04/08/2015 10:19 PM, Mark Nelson wrote:
> > > > > > > On 04/07/2015 09:58 PM, Sage Weil wrote:
> > > > > > > > What would be very interesting would be to see the 4KB
> > > > > > > > performance with the defaults (newstore overlay max = 32) vs
> > > > > > > > overlays disabled (newstore overlay max = 0) and see if/how
> > > > > > > > much it is helping.
> > > > > > >
> > > > > > > And here we go.  1 OSD, 1X replication.  16GB RBD volume.
> > > > > > >
> > > > > > > 4MB                write    read     randw    randr
> > > > > > > default overlay    36.13    106.61   34.49    92.69
> > > > > > > no overlay         36.29    105.61   34.49    93.55
> > > > > > >
> > > > > > > 128KB              write    read     randw    randr
> > > > > > > default overlay     1.71     97.90    1.65    25.79
> > > > > > > no overlay          1.72     97.80    1.66    25.78
> > > > > > >
> > > > > > > 4KB                write    read     randw    randr
> > > > > > > default overlay     0.40     61.88    1.29     1.11
> > > > > > > no overlay          0.05     61.26    0.05     1.10
> > > > > >
> > > > > > Update this morning: I also ran filestore tests for comparison.
> > > > > > Next we'll look at how tweaking the overlay for different IO
> > > > > > sizes affects things; the overlay threshold is 64k right now,
> > > > > > and it appears that 128K write IOs, for instance, are quite a
> > > > > > bit worse with newstore currently than with filestore.  Sage
> > > > > > also just committed changes that will allow overlay writes
> > > > > > during append/create, which may help improve small IO write
> > > > > > performance as well in some cases.
> > > > > >
> > > > > > 4MB                write    read     randw    randr
> > > > > > default overlay    36.13    106.61   34.49    92.69
> > > > > > no overlay         36.29    105.61   34.49    93.55
> > > > > > filestore          36.17     84.59   34.11    79.85
> > > > > >
> > > > > > 128KB              write    read     randw    randr
> > > > > > default overlay     1.71     97.90    1.65    25.79
> > > > > > no overlay          1.72     97.80    1.66    25.78
> > > > > > filestore          27.15     79.91    8.77    19.00
> > > > > >
> > > > > > 4KB                write    read     randw    randr
> > > > > > default overlay     0.40     61.88    1.29     1.11
> > > > > > no overlay          0.05     61.26    0.05     1.10
> > > > > > filestore           4.14     56.30    0.42     0.76
> > > > > >
> > > > > > Seekwatcher movies and graphs available here:
> > > > > >
> > > > > > http://nhm.ceph.com/newstore/20150408/
> > > > > >
> > > > > > Note for instance the very interesting blktrace patterns for 4K
> > > > > > random writes on the OSD in each case:
> > > > > >
> > > > > > http://nhm.ceph.com/newstore/20150408/filestore/RBD_00004096_randwrite.png
> > > > > > http://nhm.ceph.com/newstore/20150408/default_overlay/RBD_00004096_randwrite.png
> > > > > > http://nhm.ceph.com/newstore/20150408/no_overlay/RBD_00004096_randwrite.png
> > > > > >
> > > > > > Mark
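
[Finally, to tie the overlay knobs in the tables above together: a
deliberately tiny sketch of the decision they appear to control, based only
on what is stated in this thread (a 64k size threshold and the
"newstore overlay max = 32" budget).  This is not newstore code; the names,
and the assumption that the budget is tracked per object, are illustrative.]

#include <cstdint>

// Small writes at or below the threshold are staged in the kv store as
// "overlay" data (alongside the WAL) rather than written directly to the
// object; larger IOs, e.g. the 128K writes above, take the slower path.
constexpr uint64_t kOverlayThresholdBytes = 64 * 1024;  // "64k right now"
constexpr unsigned kOverlayMax            = 32;         // "newstore overlay max = 32"

bool use_overlay(uint64_t write_len, unsigned overlays_already_staged) {
  return write_len <= kOverlayThresholdBytes &&
         overlays_already_staged < kOverlayMax;
}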