Re: Initial newstore vs filestore results

On Wed, Apr 8, 2015 at 12:49 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> On Wed, 8 Apr 2015, Haomai Wang wrote:
>> On Wed, Apr 8, 2015 at 10:58 AM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>> > On Tue, 7 Apr 2015, Mark Nelson wrote:
>> >> On 04/07/2015 02:16 PM, Mark Nelson wrote:
>> >> > On 04/07/2015 09:57 AM, Mark Nelson wrote:
>> >> > > Hi Guys,
>> >> > >
>> >> > > I ran some quick tests on Sage's newstore branch.  So far given that
>> >> > > this is a prototype, things are looking pretty good imho.  The 4MB
>> >> > > object rados bench read/write and small read performance looks
>> >> > > especially good.  Keep in mind that this is not using the SSD journals
>> >> > > in any way, so 640MB/s sequential writes is actually really good
>> >> > > compared to filestore without SSD journals.
>> >> > >
>> >> > > Small write performance appears to be fairly bad, especially in the RBD
>> >> > > case where it's small writes to larger objects.  I'm going to sit down
>> >> > > and see if I can figure out what's going on.  It's bad enough that I
>> >> > > suspect there's just something odd going on.
>> >> > >
>> >> > > Mark
>> >> >
>> >> > Seekwatcher/blktrace graphs of a 4 OSD cluster using newstore for those
>> >> > interested:
>> >> >
>> >> > http://nhm.ceph.com/newstore/
>> >> >
>> >> > Interestingly small object write/read performance with 4 OSDs was about
>> >> > 1/3-1/4 the speed of the same cluster with 36 OSDs.
>> >> >
>> >> > Note: Thanks Dan for fixing the directory column width!
>> >> >
>> >> > Mark
>> >>
>> >> New fio/librbd results using Sage's latest code that attempts to keep small
>> >> overwrite extents in the db.  This is 4 OSD so not directly comparable to the
>> >> 36 OSD tests above, but does include seekwatcher graphs.  Results in MB/s:
>> >>
>> >>       write   read    randw   randr
>> >> 4MB   57.9    319.6   55.2    285.9
>> >> 128KB 2.5     230.6   2.4     125.4
>> >> 4KB   0.46    55.65   1.11    3.56
>> >
>> > It would be very interesting to see the 4KB performance with the
>> > defaults (newstore overlay max = 32) vs overlays disabled
>> > (newstore overlay max = 0), to see if and how much it is helping.
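
(For reference, the comparison boils down to flipping one option; a
minimal ceph.conf fragment for the two runs might look like the
following, assuming the option lives under [osd] and keeps its current
spelling:)

  [osd]
  # defaults: small overwrites kept as overlay extents in the kv db
  newstore overlay max = 32

  # comparison run: overlays disabled
  #newstore overlay max = 0
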
>> >
>> > The latest branch also has open-by-handle.  It's on by default (newstore
>> > open by handle = true).  I think for most workloads it won't be very
>> > noticeable... I think there are two questions we need to answer though:
>> >
>> > 1) Does it have any impact on a creation workload (say, 4kb objects).  It
>> > shouldn't, but we should confirm.
>> >
>> > 2) Does it impact small object random reads with a cold cache.  I think to
>> > see the effect we'll probably need to pile a ton of objects into the
>> > store, drop caches, and then do random reads.  In the best case the
>> > effect will be small, but hopefully noticeable: we should go from
>> > a directory lookup (1+ seeks) + inode lookup (1+ seek) + data
>> > read, to inode lookup (1+ seek) + data read.  So, 3 -> 2 seeks best case?
>> > I'm not really sure what XFS is doing under the covers here..
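
(For anyone who hasn't used it: the open-by-handle path boils down to
the two Linux syscalls below.  This is a minimal sketch, not
NewStore's actual code; the handle would be grabbed once at
create/open time, stashed e.g. alongside the onode, and reused for
cold-cache reads to skip the directory walk.  Note that
open_by_handle_at() needs CAP_DAC_READ_SEARCH.)

  #ifndef _GNU_SOURCE
  #define _GNU_SOURCE
  #endif
  #include <fcntl.h>
  #include <cstdlib>

  // Create/first-open time: turn a path into an opaque inode handle.
  // The bytes in fh->f_handle are what would get persisted (e.g. in
  // the onode) so later opens can skip the directory lookup.
  struct file_handle *get_handle(const char *path, int *mount_id) {
    struct file_handle *fh =
        (struct file_handle *)malloc(sizeof(*fh) + MAX_HANDLE_SZ);
    fh->handle_bytes = MAX_HANDLE_SZ;
    if (name_to_handle_at(AT_FDCWD, path, fh, mount_id, 0) < 0) {
      free(fh);
      return nullptr;
    }
    return fh;
  }

  // Cold-cache read time: open straight from the handle.  mount_fd is
  // any open fd on the same filesystem (the store's root dir, say).
  int open_handle(int mount_fd, struct file_handle *fh) {
    return open_by_handle_at(mount_fd, fh, O_RDONLY);
  }
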
>>
>> WOW, this is a really cool implementation, well beyond what I had in
>> mind from the blueprint. The handle, overlay_map and data_map look
>> very flexible and should make small IO cheaper in theory. Right now
>> we only have one element in data_map and I'm not sure what your goal
>> is for its future use. I have a rough idea that it could expand the
>> role of NewStore and reduce the local filesystem to just a block
>> space allocator: let NewStore own a sort of FTL (File Translation
>> Layer), and many cool features could be added on top. What's your
>> idea for data_map?
>
> Exactly, that is one option.  The other is that we'd treat the data_map
> similar to overlay_map with a fixed or max extent size so that a large
> partial overwrite will mostly go to a new file instead of doing the
> slow WAL.
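
(Purely as an illustration of the fixed/max-extent idea, not the
branch's actual data structures: chopping a partial overwrite at
extent boundaries would look something like the hypothetical helper
below; any fully covered extent could go to a fresh file, and only the
ragged edges would still need the WAL/overlay path.)

  #include <algorithm>
  #include <cstdint>
  #include <vector>

  struct ExtentWrite {
    uint64_t extent_off;  // start of the containing fixed-size extent
    uint64_t off, len;    // part of the write that lands in this extent
    bool full;            // whole extent covered -> write a new file;
                          // otherwise fall back to the WAL/overlay path
  };

  // Hypothetical helper: split a write at [off, off+len) into
  // per-extent pieces given a fixed max extent size.
  std::vector<ExtentWrite> split_write(uint64_t off, uint64_t len,
                                       uint64_t extent_size) {
    std::vector<ExtentWrite> out;
    uint64_t end = off + len;
    while (off < end) {
      uint64_t eoff = off - (off % extent_size);
      uint64_t elen = std::min(end, eoff + extent_size) - off;
      out.push_back({eoff, off, elen,
                     off == eoff && elen == extent_size});
      off += elen;
    }
    return out;
  }
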
>
>> My main concern is still the WAL that runs after fsync and kv
>> committing. The fsync part is probably fine, because we mostly won't
>> hit this case with rbd. But submitting a sync kv transaction isn't a
>> low-latency job, I think; maybe we could run the WAL in parallel
>> with kv committing? (Yes, I really do care about the latency of a
>> single op :-) )
>
> The WAL has to come after kv commit.  But the fsync after the WAL
> completion sucks, especially since we are always dispatching a single
> fsync at a time, so it's kind of worst-case seek behavior.  We could
> throw these into another parallel fsync queue so that the fs can batch
> them up, but I'm not sure we will get enough parallelism.  What would
> really be nice is a batch fsync syscall, but in lieu of that maybe we
> wait until we have a bunch of fsyncs pending and then throw them at the
> kernel together in a bunch of threads?  Not sure.  These aren't
> normally time sensitive unless a read comes along (which is pretty
> rare), but they have to be done for correctness.
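
(A rough userspace approximation of a "batch fsync", just as a sketch
of the idea above and not the branch code: completed WAL fds get
queued, and once a batch has piled up a flusher fans them out to
threads so the fs has a chance to merge the flushes.)

  #include <unistd.h>
  #include <condition_variable>
  #include <mutex>
  #include <thread>
  #include <vector>

  class FsyncBatcher {
    std::mutex lock;
    std::condition_variable cond;
    std::vector<int> pending;
    static constexpr size_t batch_size = 16;  // how many to accumulate

  public:
    // Called when an op's WAL work is done but the fsync is still owed.
    void queue(int fd) {
      std::lock_guard<std::mutex> l(lock);
      pending.push_back(fd);
      if (pending.size() >= batch_size)
        cond.notify_one();
    }

    // Flusher thread: grab whatever has accumulated and fsync the fds
    // in parallel, then the caller can ack the corresponding ops.
    // (A real version would also flush on a timeout so stragglers
    // don't wait forever, and would use a fixed thread pool.)
    void flush_batch() {
      std::vector<int> fds;
      {
        std::unique_lock<std::mutex> l(lock);
        cond.wait(l, [&] { return pending.size() >= batch_size; });
        fds.swap(pending);
      }
      std::vector<std::thread> workers;
      for (int fd : fds)
        workers.emplace_back([fd] { ::fsync(fd); });
      for (auto &w : workers)
        w.join();
    }
  };
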
>
>> Then there is the actual rados write op, which adds setattr and
>> omap_setkeys ops. The current NewStore looks like it plays badly
>> with setattr: it always re-encodes all the xattrs (and other
>> not-so-tiny fields) and writes them out again (is this true?), even
>> though it can batch multiple transactions' onode writes over a short
>> time.
>
> Yeah, this could be optimized so that we only unpack and repack the
> bufferlist, or do a single walk through the buffer to do the updates
> (similar to what TMAP used to do).
>
>> NewStore also puts much more load on the KeyValueDB than FileStore
>> does, so maybe we need to reconsider that workload. FileStore uses
>> leveldb mainly for writes, which leveldb handles well, but now the
>> overlay key reads and onode reads will become a main latency source
>> in the normal IO path, I think. The default kv dbs like leveldb and
>> rocksdb both perform poorly on random read workloads, so that may
>> become a problem. Looking at another kv db may be an option.
>
> I'm defaulting to rocksdb for now.  We should try LMDB at some point...
>

This might be a bit tangential to the ongoing effort, but I think the
idea ties together solutions to a couple of these problems at once.

You could build a store that uses LMDB directly on the partition
(block device)... and in my mind that's interesting because:
- You get a durable data store without the write amplification of a
WAL or LSM-tree. LMDB does this by using a COW B-tree.
- You can batch "fsyncs". This would require some logic to merge
multiple unrelated Ceph OSD ops into a single LMDB transaction, but I
think it's doable (see the sketch below).
- Theoretically you avoid a bunch of the overhead of having a B-tree
(database) on top of a B-tree (filesystem).
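
A rough sketch of the batching point using the LMDB C API (hypothetical
key/op layout, error handling trimmed): several queued, unrelated OSD
ops get folded into one write transaction, and the single
mdb_txn_commit() acts as the batched fsync for all of them.

  #include <lmdb.h>
  #include <string>
  #include <vector>

  struct PendingOp {           // hypothetical: one queued OSD update,
    std::string key, value;    // already flattened to kv pairs
  };

  // Fold a batch of unrelated ops into a single LMDB write txn; the
  // one commit lands them all durably via the COW B-tree, no WAL.
  int commit_batch(MDB_env *env, MDB_dbi dbi,
                   const std::vector<PendingOp> &ops) {
    MDB_txn *txn;
    int rc = mdb_txn_begin(env, nullptr, 0, &txn);
    if (rc) return rc;
    for (const auto &op : ops) {
      MDB_val k{op.key.size(), (void *)op.key.data()};
      MDB_val v{op.value.size(), (void *)op.value.data()};
      rc = mdb_put(txn, dbi, &k, &v, 0);
      if (rc) { mdb_txn_abort(txn); return rc; }
    }
    return mdb_txn_commit(txn);  // ack every op in the batch on success
  }
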


-- 
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: milosz@xxxxxxxxx