On Thu, 30 Apr 2015, Mark Nelson wrote:
> On 04/30/2015 09:02 AM, Chen, Xiaoxi wrote:
> > I am not sure I really understand the osd code, but from the osd log,
> > in the sequential small write case, only one inflight op is happening?
> >
> > And Mark, did you pre-allocate the rbd before doing the sequential
> > test?  I believe you did, so both seq and random are in WAL mode.
>
> Yes, the RBD image is pre-allocated.  Maybe Sage can chime in regarding
> the one inflight op.  I'm not sure why that would happen. :/

How are you generating the client workload?

FWIW, the sequential tests I'm doing are small sequential appends, not
writes to a preallocated object; that's slightly harder because we also
have to update the file size on each write:

./ceph_smalliobench --duration 6000 --io-size 4096 --write-ratio 1
--disable-detailed-ops=1 --pool rbd --use-prefix fooa --do-not-init=1
--num-concurrent-ops 16 --sequential

sage

> Mark
>
> > ---- Mark Nelson wrote ----
> >
> > On 04/29/2015 11:38 AM, Sage Weil wrote:
> > > On Wed, 29 Apr 2015, Chen, Xiaoxi wrote:
> > > > Hi Mark,
> > > > Really good test :)  I have only played a bit on SSD; the parallel
> > > > WAL threads really help, but we still have a long way to go,
> > > > especially in the all-SSD case.  I tried this
> > > > https://github.com/facebook/rocksdb/blob/master/util/env_posix.cc#L1515
> > > > by hacking rocksdb, but the performance difference is negligible.
> > >
> > > It gave me a 25% bump when rocksdb is on a spinning disk, so I went
> > > ahead and committed the change to the branch.  Probably not
> > > noticeable on the SSD, though it can't hurt.
> > >
> > > > The rocksdb digest speed should be the problem, I believe.  I
> > > > planned to prove this by skipping all db transactions, but failed
> > > > after hitting another deadlock bug in newstore.
> > >
> > > Will look at that next!
> > >
> > > > Below are a few more comments.
> > > > > Sage has been furiously working away at fixing bugs in newstore
> > > > > and improving performance.  Specifically, we've been focused on
> > > > > write performance, as newstore was previously lagging filestore
> > > > > by quite a bit.  A lot of work has gone into implementing libaio
> > > > > behind the scenes, and as a result performance on spinning disks
> > > > > with an SSD WAL (and SSD-backed rocksdb) has improved pretty
> > > > > dramatically.  It's now often beating filestore:
> > > >
> > > > SSD DB is still better than SSD WAL with request sizes > 128KB;
> > > > this indicates some WAL entries are actually being written to
> > > > level 0... Hmm, could we add newstore_wal_max_ops/bytes to cap the
> > > > total WAL size (how much data is in the WAL but not yet applied to
> > > > the backend FS)?  I suspect this would improve performance by
> > > > preventing some IO with high write-amplification cost and latency.
> > > >
> > > > > http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
> > > > >
> > > > > On the other hand, sequential writes are slower than random
> > > > > writes when the OSD, DB, and WAL are all on the same device, be
> > > > > it a spinning disk or an SSD.
> > > >
> > > > I think sequential writes being slower than random is by design in
> > > > newstore: for every object we can only have one WAL, which means
> > > > no concurrent IO if req_size * QD < 4MB.  Not sure what QD you had
> > > > in the test?  I suspect 64, since there is a boost in seq write
> > > > performance with request sizes > 64KB (64KB * 64 = 4MB).
> > > >
> > > > In this case the IO pattern will be: 1 write to the DB WAL -> sync
> > > > -> 1 write to the FS -> sync.  We do everything synchronously,
> > > > which is essentially expensive.
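For illustration, that fully synchronous round trip would look roughly
like the sketch below.  This is a minimal sketch of the pattern being
described, not the actual newstore code; wal_write_serialized, the raw
fd, and the key handling are invented for the example.

#include <string>
#include <unistd.h>
#include "rocksdb/db.h"

// One synchronous WAL round trip per IO: commit the kv record, then
// apply the data to the backing file and sync it, before the next
// write to the same object can begin.
void wal_write_serialized(rocksdb::DB* db, int fd, off_t off,
                          const std::string& key, const std::string& data) {
  rocksdb::WriteOptions wo;
  wo.sync = true;                               // commit point: sync the db WAL
  db->Put(wo, key, data);                       // 1. WAL record into the kv store
  ::pwrite(fd, data.data(), data.size(), off);  // 2. apply to the backing file
  ::fdatasync(fd);                              // 3. sync the file
  db->Delete(wo, key);                          // 4. retire the WAL record
}

If writes to one object really do serialize on those two syncs, a
sequential 4K stream becomes latency-bound while random writes can still
spread across many objects.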
> > > The number of syncs is the same for appends vs. WAL... in both
> > > cases we fdatasync the file and the db commit, but with WAL the fs
> > > sync comes after the commit point instead of before (and we don't
> > > double-write the data).  Appends should still be pipelined (many in
> > > flight for the same object)... and the db syncs will be batched in
> > > both cases (submit_transaction for each io, and a single thread
> > > doing the submit_transaction_sync in a loop).
> > >
> > > If that's not the case then it's an accident?
> > >
> > > sage
> >
> > So I ran some more tests last night on 2c914df7 to see if any of the
> > new changes made much difference for spinning-disk small sequential
> > writes, and the short answer is no.  Since overlay now works again, I
> > also ran tests with overlay enabled; this may have helped marginally
> > (and had mixed results for random writes, so I may need to tweak the
> > default).
> >
> > After this I got to thinking about how much better the WAL-on-SSD
> > results were, and I wanted to confirm that this issue is WAL related,
> > so I tried setting DisableWAL.  This resulted in about a 90x increase
> > in sequential write performance, but only a 2x increase in random
> > write performance.  What's more, if you look at the last graph in the
> > pdf linked below, you can see that sequential 4K writes with the WAL
> > enabled are significantly slower than 4K random writes, but sequential
> > 4K writes with the WAL disabled are significantly faster.
> >
> > http://nhm.ceph.com/newstore/Newstore_DisableWAL.pdf
> >
> > So I guess now I wonder what is happening differently in each case.
> > I'll probably sit down and start looking through the blktrace data
> > and try to get more statistics out of rocksdb for each case.  It
> > would be useful if we could tie the rocksdb stats call into an asok
> > command:
> >
> > DB::GetProperty("rocksdb.stats", &stats)
> >
> > Mark
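For reference, that property can be pulled with the stock rocksdb API;
below is a minimal sketch of just the stats call, with the asok wiring
left out and dump_rocksdb_stats invented for the example.

#include <iostream>
#include <string>
#include "rocksdb/db.h"

// Fetch rocksdb's human-readable stats block (per-level compaction
// stats, write stalls, WAL activity); this is the string an asok
// hook would hand back.
void dump_rocksdb_stats(rocksdb::DB* db) {
  std::string stats;
  if (db->GetProperty("rocksdb.stats", &stats))
    std::cout << stats << std::endl;
}

Polled before and after a run, the deltas would show how much WAL and
compaction traffic each workload actually generates.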