Hi Folks,
The basic idea behind the WAL is that every DB write transaction is
first written into an in-memory buffer and to a region on disk.
RocksDB is typically set up to have multiple WAL buffers, and when one or
more fills up, it will start flushing that data to L0 while new writes
go to the next buffer. If RocksDB can't flush data fast
enough, it will throttle write throughput down so that hopefully you
don't fill all of the buffers up and stall before a flush completes.
The combined total size/number of buffers governs both how much disk
space you need for the WAL and how much RAM is needed to store incoming
IO that hasn't finished flushing into the DB. There are various
tradeoffs when adjusting the size, number, and behavior of the WAL
buffers. On one hand, small buffers favor frequent, quick flush events
and hopefully keep both overall memory usage and the CPU overhead of key
comparisons low. On the other hand, large WAL buffers give you more
runway: you can absorb longer L0 compaction events, and you can
potentially avoid writing pglog entries to L0 entirely if a tombstone
lands in the same WAL buffer as the initial write. We've seen evidence
that write amplification is (sometimes much) lower with bigger WAL
buffers, and we think this is a big part of the reason why.
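To put rough numbers on it (this is a simplified worst-case sketch using
the default settings listed below; it ignores per-key/arena overhead and
assumes a single column family), the footprint is just the buffer size
times the buffer count:

  worst_case_wal_footprint = write_buffer_size * max_write_buffer_number
                           = 256MB * 4
                           = 1GB of RAM and WAL disk space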
Right now our default WAL settings for RocksDB are:
max_write_buffer_number=4
min_write_buffer_number_to_merge=1
write_buffer_size=268435456
which means we will store up to 4 256MB buffers and start flushing as
soon as 1 fills up. An alternate strategy could be to use something like
16 64MB buffers and set min_write_buffer_number_to_merge to something
like 4. Potentially that might provide slightly more fine-grained
control and may also be advantageous with a larger number of column
families, but we haven't seen evidence yet that splitting the buffers
into more, smaller segments definitely improves things.
Probably the bigger take-away is that you can't simply make the WAL huge
to give yourself extra runway for writes unless you are also willing to
eat the RAM cost of storing all of that data in memory as well. That's
one of the reasons why we regularly tell people that 1-2GB is enough for
the WAL. With a target OSD memory of 4GB, (up to) 1GB for the WAL is
already pushing it. Luckily, in most cases it doesn't actually use the
full 1GB. RocksDB will throttle before you get to that point, so in
reality the WAL is probably using more like 0-512MB of disk/RAM, with
2-3 extra buffers of capacity in case things get hairy.
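For anyone who wants to experiment with this, these options get passed
to RocksDB through the OSD's rocksdb options string (for bluestore
that's bluestore_rocksdb_options in ceph.conf). The lines below are just
a sketch of the two strategies discussed above, not a recommendation,
and keep in mind that setting bluestore_rocksdb_options replaces the
whole shipped default string (which also carries compression and other
settings), so in practice you'd start from the existing default and only
change these three values (the "..." stands in for the rest of the
defaults):

# default-style strategy: 4 x 256MB buffers, flush as soon as 1 fills up
bluestore_rocksdb_options = ...,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,write_buffer_size=268435456

# alternate strategy: 16 x 64MB buffers, merge 4 before flushing
bluestore_rocksdb_options = ...,max_write_buffer_number=16,min_write_buffer_number_to_merge=4,write_buffer_size=67108864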
Mark
On 8/15/19 1:59 AM, Janne Johansson wrote:
On Thu 15 Aug 2019 at 00:16, Anthony D'Atri
<aad@xxxxxxxxxxxxxx> wrote:
Good points in both posts, but I think there’s still some unclarity.
...
We’ve seen good explanations on the list of why only specific DB
sizes, say 30GB, are actually used _for the DB_.
If the WAL goes along with the DB, shouldn’t we also explicitly
determine an appropriate size N for the WAL, and make the
partition (30+N) GB?
If so, how do we derive N? Or is it a constant?
Filestore was so much simpler, 10GB set+forget for the journal.
Not that I miss XFS, mind you.
But we got a simple handwaving best-effort guesstimate that went "WAL
1GB is fine, yes," so there you have an N you can use for the
30+N or 60+N sizings.
Can't see how that N needs more science than the filestore N=10G you
showed. Not that I think journal=10G was wrong or anything.
--
May the most significant bit of your life be positive.
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com