Re: Bluestore HDD Cluster Advice

Hello,

What IO size are you testing? Bluestore will only defer writes under 32kb in size by default. Unless you are writing sequentially, only a limited amount of buffering via SSD is going to help; you will eventually hit the limits of the disk. Could you share some more details, as I'm interested in this topic as well.

I'm testing 4kb random writes, mostly with iodepth=1 (a single-thread latency test). This is the main case that is expected to be sped up by the SSD journal, and also the worst case for SDSes :).
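
For reference, the kind of test I mean looks roughly like this (a sketch only - the pool and image names are placeholders, adjust them to your setup):

  fio --ioengine=rbd --clientname=admin --pool=rbd --rbdname=testimg \
      --name=lat-test --rw=randwrite --bs=4k --iodepth=1 --direct=1 \
      --runtime=60 --time_based

With iodepth=1 the average completion latency is essentially the per-write commit time, so 1000 / latency-in-ms gives the single-thread iops.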

Interesting, I will have to investigate this further! I wish there were more details around this technology from HGST.

It's simple to test yourself - a similar thing is currently common in SMR drives. Pick a random cheap 2.5" 1TB Seagate SMR HDD and test it with fio with one of the `sync` or `fsync` options and iodepth=32 - you'll see it handles more than 1000 random 4kb write iops. Of course, it only handles that much until its buffer is full. When I tested one of these I found that the buffer was 8 GB. After writing 8 GB the performance drops to ~30-50 iops, and when the drive goes idle it starts to flush the buffer. This flush takes a lot of time if the buffer is full (several hours).
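
If someone wants to reproduce this, a minimal sketch (destructive to the data on the drive; the device name is a placeholder):

  fio --ioengine=libaio --direct=1 --sync=1 --rw=randwrite --bs=4k --iodepth=32 \
      --runtime=60 --time_based --name=smr-test --filename=/dev/sdX

Watch the iops over time: they stay in the thousands while the media cache has room and collapse to tens once it fills up.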

The difference between the 2.5" SMR Seagates and the HGSTs is that the HGSTs only enable their "media cache" when the volatile write cache is disabled (which was a real surprise to me), while the SMRs keep it enabled all the time.
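
The volatile cache is the one you toggle with hdparm, for example (device name is a placeholder, and not every drive honours the setting):

  hdparm -W 0 /dev/sdX    # disable the volatile write cache
  hdparm -W 1 /dev/sdX    # enable it again

So to see the media-cache behaviour on an HGST you would disable the volatile cache first and then rerun the write test.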

But the thing that really confused me was that Bluestore's random write performance - even single-threaded write performance (the latency test) - changed when I altered a parameter of the DATA device (not the journal)! WHY was it affected? Based on common sense and Bluestore's documentation, the commit time of random deferred writes when the system is not under load (and with iodepth=1 it isn't) should depend only on the WAL device's performance. But it is also affected by the data device, which suggests there is some problem in Bluestore's implementation.
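
If anyone wants to check what the deferred path is doing on their OSDs, a rough sketch via the admin socket (the osd id is a placeholder, and I'm quoting the counter names from memory, so treat them as an assumption):

  ceph daemon osd.0 config get bluestore_prefer_deferred_size_hdd
  ceph daemon osd.0 perf dump | grep deferred_write

The deferred_write_ops / deferred_write_bytes counters should grow during a 4kb random write test if the writes really take the deferred path.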

At the same time, deferred writes do help performance slightly when you
don't have an SSD. But the difference we're talking about is tens of iops (30
vs 40), so it's not noticeable in the SSD era :).

What size IOs are you testing with? I see a difference going from around 50 IOPS up to over a thousand for a single-threaded 4kb sequential test.

4kb random writes. The numbers of 30-40 iops are from small HDD-only clusters (one with 12 OSDs on 3 hosts, one with 4 OSDs on ONE host - "scrap-ceph", the home version :)). I've tried to play with prefer_deferred_size_hdd there and discovered that it had very little impact on random 4kb iodepth=128 iops, which I think is slightly counter-intuitive, because the expectation is that deferred writes should increase random iops.
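
For anyone who wants to try the same thing, one way to change it is roughly (the value is just an example):

  ceph tell osd.\* injectargs '--bluestore_prefer_deferred_size_hdd=131072'

or, on releases with the central config database, ceph config set osd bluestore_prefer_deferred_size_hdd 131072 - restarting the OSDs afterwards, since I'm not sure the value is picked up at runtime.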

Careful here, Bluestore will only migrate the next level of its DB if it can fit the entire DB on the flash device. These cutoffs are around 3GB, 30GB and 300GB by default, so anything in between will not be used. In your example, a 20GB flash partition means that a large amount of RocksDB will end up on the spinning disk (slow_used_bytes).
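
You can check whether this has already happened to an OSD via the admin socket (the osd id is a placeholder; as far as I remember the counters sit under the bluefs section):

  ceph daemon osd.0 perf dump bluefs | grep used_bytes

A non-zero slow_used_bytes means part of the DB has spilled over onto the spinning disk.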

Thanks, I didn't know that... I rechecked - all my 8TB OSDs with 20GB partitions have migrated their DBs back to the slow devices again. Previously I had moved them to SSDs with a rebased version of Igor Fedotov's ceph-bluestool ... oops :) ceph-bluestore-tool. Although I still don't understand where the number 3 comes from. Ceph's default bluestore_rocksdb_options says there are 4*256MB memtables, which is 1GB, not 3...
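
For anyone wanting to do the same move, the invocation looks roughly like this (paths are placeholders, and the exact syntax of the rebased tool I used may differ from the stock one):

  systemctl stop ceph-osd@0
  ceph-bluestore-tool bluefs-bdev-migrate --path /var/lib/ceph/osd/ceph-0 \
      --devs-source /var/lib/ceph/osd/ceph-0/block \
      --dev-target /var/lib/ceph/osd/ceph-0/block.db
  systemctl start ceph-osd@0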

--
With best regards,
  Vitaliy Filippov


