Hello,
What IO size are you testing? Bluestore will only defer writes under
32kb in size by default. Unless you are writing sequentially, only a
limited amount of buffering via SSD is going to help; you will
eventually hit the limits of the disk. Could you share some more
details, as I'm interested in this topic as well.
I'm testing 4kb random writes, mostly with iodepth=1 (a single-threaded
latency test). This is the main case which is expected to be sped up by
the SSD journal, and also the worst case for SDSes :).
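For reference, a minimal sketch of such a run with fio (the target path
is just a placeholder for whatever RBD device or test file you point it
at; --fsync=1 makes every write durable so the journal is actually in
the write path):

  fio --name=randwrite-lat --ioengine=libaio --direct=1 --fsync=1 \
      --rw=randwrite --bs=4k --iodepth=1 --runtime=60 --time_based \
      --filename=/dev/rbd0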
Interesting, will have to investigate this further!!! I wish there were
more details around this technology from HGST.
It's simple to test yourself - a similar thing is currently common in SMR
drives. Pick a random cheap 2.5" 1TB Seagate SMR HDD and test it with fio
using one of the `sync` or `fsync` options and iodepth=32 - you'll see it
handle more than 1000 random 4Kb write iops. It only keeps that up until
its buffer is full, of course. When I tested one of these drives I found
that the buffer was 8 GB. After writing 8 GB the performance drops to
~30-50 iops, and when the drive is idle it starts to flush the buffer.
This flush takes a lot of time if the buffer is full (several hours).
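A sketch of that kind of run (the device path is a placeholder, and the
size is chosen to overrun the 8 GB media cache; --fsync=1 forces the
drive to actually commit each write):

  fio --name=smr-test --ioengine=libaio --direct=1 --fsync=1 \
      --rw=randwrite --bs=4k --iodepth=32 --size=16G \
      --filename=/dev/sdX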
The difference between the 2.5" SMR Seagates and the HGSTs is that the
HGSTs only enable the "media cache" when the volatile cache is disabled
(which was a real surprise to me), while the SMRs keep it enabled all
the time.
But the thing that really confused me was that Bluestore's random write
performance - even single-threaded write performance (the latency test) -
changed when I altered the parameter of the DATA device (not the
journal)! WHY was it affected? Based on common sense and Bluestore's
documentation, the commit time of a random deferred write when the system
is not under load (and with iodepth=1 it isn't) should only depend on the
WAL device performance. But it is also affected by the data device, which
suggests there is some problem in Bluestore's implementation.
At the same time, deferred writes do slightly help performance when you
don't have an SSD. But the difference we are talking about is on the
order of tens of iops (30 vs 40), so it's not noticeable in the SSD
era :).
What size IOs are you testing with? I see a difference going from around
50 IOPS up to over a thousand for a single-threaded 4kb sequential test.
4Kb random writes. The numbers of 30-40 iops are from small HDD-only
clusters (one with 12 OSDs on 3 hosts, one with 4 OSDs on ONE host -
"scrap-ceph", the home version :)). I've tried to play with
prefer_deferred_size_hdd there and discovered that it had very little
impact on random 4kb iodepth=128 iops, which I find slightly
counter-intuitive, because the expectation is that deferred writes
should increase random iops.
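For completeness, the option in question is
bluestore_prefer_deferred_size_hdd; a sketch of changing it, assuming a
Ceph version with the centralized config (131072 is just an example
value, and the OSDs may need a restart to pick it up):

  ceph config set osd bluestore_prefer_deferred_size_hdd 131072
  # or, on older versions, in ceph.conf under [osd]:
  # bluestore_prefer_deferred_size_hdd = 131072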
Careful here: Bluestore will only move the next level of its DB to the
flash device if that entire level fits there. These cutoffs are around
3GB, 30GB and 300GB by default, so any size in between will not be used.
In your example, a 20GB flash partition means that a large amount of the
RocksDB data will end up on the spinning disk (slow_used_bytes).
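If it helps, one way to check for such spillover is the bluefs section
of the OSD's perf counters, e.g. (osd.0 is a placeholder):

  ceph daemon osd.0 perf dump | grep -E 'db_used_bytes|slow_used_bytes'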
Thanks, I didn't know that... I rechecked - all my 8TB OSDs with 20GB
partitions have migrated their DBs back to the slow devices. Previously I
had moved them to SSDs with a rebased version of Igor Fedotov's
ceph-bluestool... oops :) ceph-bluestore-tool. Although I still don't
understand where the number 3 comes from? Ceph's default
bluestore_rocksdb_options states there are 4*256MB memtables, which is
1GB, not 3...
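For reference, the migration I mentioned looks roughly like this with a
ceph-bluestore-tool build that includes the bluefs-bdev-migrate command
(the OSD id and paths are placeholders; the OSD must be stopped first):

  systemctl stop ceph-osd@3
  ceph-bluestore-tool bluefs-bdev-migrate \
      --path /var/lib/ceph/osd/ceph-3 \
      --devs-source /var/lib/ceph/osd/ceph-3/block \
      --dev-target /var/lib/ceph/osd/ceph-3/block.db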
--
With best regards,
Vitaliy Filippov
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com