On 06/14/2016 06:17 AM, Sage Weil wrote:
On Tue, 14 Jun 2016, Igor Fedotov wrote:
These results are for compression = none and write block size limited to
4K.
I've been thinking more about this and I'm wondering if we should revisit
the choice to use a min_alloc_size of 4K on flash. If it's 4K, then a 4K
write means
- 4K write (to newly allocated block)
- bdev flush
- kv commit (4k-ish?)
- bdev flush
AFAIK these flushes should happen async under the hood (ie almost free)
on devices with proper power loss protection.
which puts a 2 write lower bound on latency. If we have min_alloc_size of
8K or 16K, then a 4K write is
- kv commit (4K + 4k-ish)
- bdev flush
- [async] 4k write
Given what I've seen about how rocksdb behaves (even on ramdisk), I
think this is actually going to be worse than above in a lot of cases.
I could be wrong though. For SSDs that don't have PLP this might be
significantly faster.
Fewer bdev flushes, and only marginally more writes to the device. I
guess the question is whether write-amp is really that important for a
4k workload?
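
To put rough numbers on the two paths listed above, here is a toy model. It is only a sketch, not BlueStore code: the function names are made up for illustration, and KV_OVERHEAD simply takes the "4k-ish" kv commit size at face value as 4096 bytes.

```python
# Toy model of the two 4K write paths described above. Not BlueStore code.
KV_OVERHEAD = 4096  # assumption: the "4k-ish" kv commit metadata, in bytes


def small_write_cost(min_alloc_size, write_size=4096):
    """Count synchronous steps and device bytes for one small write."""
    if write_size >= min_alloc_size:
        # Direct path: data write to a new block, flush, kv commit, flush.
        return {
            "sync_writes": 2,
            "flushes": 2,
            "sync_bytes": write_size + KV_OVERHEAD,
            "async_bytes": 0,
        }
    # Deferred path: data rides in the kv commit, one flush,
    # then the data is written to its final location asynchronously.
    return {
        "sync_writes": 1,
        "flushes": 1,
        "sync_bytes": write_size + KV_OVERHEAD,
        "async_bytes": write_size,
    }


def write_amp(cost, write_size=4096):
    """Total device bytes divided by user bytes."""
    return (cost["sync_bytes"] + cost["async_bytes"]) / write_size


for mas in (4096, 8192, 16384):
    c = small_write_cost(mas)
    print(mas, c["sync_writes"], c["flushes"], write_amp(c))
```

Under these assumptions the 4K min_alloc_size case has two synchronous writes and two flushes with a write-amp of 2.0, while the 8K and 16K cases have one of each with a write-amp of 3.0. The model says nothing about how long a kv commit really takes, which is the part being questioned above.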
The upside of a larger min_alloc_size is the worst case metadata (onode)
size is 1/2 or 1/4. The sequential read cost of a previously
random-written object will also be better (fewer IOs).
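
The 1/2 and 1/4 figures follow from counting allocation units. A quick sketch, assuming a 4 MiB object (an illustrative size, not taken from this thread) and assuming the worst case where every min_alloc_size unit ends up as its own extent:

```python
OBJECT_SIZE = 4 * 1024 * 1024  # assumption: a 4 MiB object


def worst_case_extents(min_alloc_size, object_size=OBJECT_SIZE):
    """Extents needed if every min_alloc_size unit is allocated separately."""
    return object_size // min_alloc_size


base = worst_case_extents(4096)
for mas in (4096, 8192, 16384):
    n = worst_case_extents(mas)
    # n / base is the extent count relative to the 4K case.
    print(mas, n, n / base)
```

This gives 1024, 512 and 256 extents for 4K, 8K and 16K. If onode size scales roughly with extent count, that is the 1/2 and 1/4, and the same counts bound the number of IOs for a sequential read of a fully fragmented object.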
There is probably a case where 4k min_alloc_size is the right choice but
it feels like we're optimizing for write-amp to the detriment of other
more important things. For example, even after we improve the onode
encoding, it may be that the larger metadata results in more write-amp
than the WAL for the 4k writes does.
sage
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html