On 11/29/18 1:33 AM, lin yunfan wrote:
Hi Mark,
Thank you for your detailed explanation about how the osd uses memory.
There are some more details about osd memory usage that I can't fully understand.
1. Would the memory consumption increase with a bigger disk when the pg
number remains the same? Do more objects use more memory?
There was an assertion a while back that OSD memory requirements sort of
grew linearly with the number of objects, but I never really understood
the reasoning for it to be honest. There's certainly some level of
inefficiency when dealing with lots of objects, but it's not
linear. It was true that previously during recovery the length of the
pglog could grow in an unbounded way, but afaik Neha recently reworked it:
https://github.com/ceph/ceph/pull/23098
I suppose some of the topology information might become a larger portion
of the memory consumption if you have a very large cluster though.
2. Why does the osd use much more memory during recovery?
I understand that when pgs are clean the osd uses osd_min_pg_log_entries
and when pgs are not clean it uses osd_max_pg_log_entries. Is this the
main reason why the osd uses more memory during recovery?
I don't think it will typically hang out around the
osd_min_pg_log_entries value during regular operation. Probably best to
get Josh or Neha's feedback though. See the note above about the max
length (it used to be unbounded during recovery but isn't anymore). I
suspect that there's probably some additional overhead during recovery
in other places too, and I wouldn't be surprised if it throws tcmalloc
for a bit of a loop. TCMalloc works hard to combat memory fragmentation
but I've noticed that when we start doing a bunch of different looking
memory allocations it can make things less efficient for a while. The
autotuner can really make tcmalloc unhappy if it's super aggressive at
shifting memory around quickly.
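If you want to see what tcmalloc itself thinks is going on, the heap stats
are exposed through the admin socket. A quick sketch (osd.0 is just a
placeholder for whichever OSD you're looking at):

  # ask tcmalloc for its view of the heap (bytes in use vs. freelists)
  ceph tell osd.0 heap stats
  # if a lot of memory is sitting in the freelists, hand it back to the OS
  ceph tell osd.0 heap release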
3. Does an EC pool use more memory than a replicated pool?
That's sort of a complicated question. Probably, but there are a lot of
factors that could affect it. Probably best to just test. One of the
reasons I wrote the autotuner is so that we could try to smooth this
sort of thing out and just let users target how much memory they want
the OSD to consume.
4. How would each bluestore cache setting affect the performance? I
understand this is a vague question but any guideline would be
appreciated.
It very much is complicated. :)
In a nutshell: Bluestore has two caches it keeps itself, the onode
cache and the buffer cache. RocksDB can also maintain multiple
caches, but for now we'll just focus on the block cache. There are also
the WAL buffers (memtables) and the pg log.
Onode cache is for metadata and buffer cache is for data. The buffer
cache is what gives you fast reads from memory when you've recently read
(or possibly written if you enable bluestore_default_buffered_write)
data. The onode cache is far more interesting imho as onode data is
checked several times prior to issuing a write. If bluestore can read
this in its raw unencoded form straight from memory it is much faster.
This is especially true when you've got really fast storage devices like
NVMe drives.
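If you want to poke at these caches directly (ie with the autotuner turned
off), the knobs look roughly like this; the values are only examples, and
you should double check the names against your build:

  [osd]
  # with bluestore_cache_autotune = true (the default in recent master,
  # iirc) these are managed for you based on osd_memory_target instead
  bluestore_cache_autotune = false
  bluestore_cache_size_ssd = 3221225472     # total bluestore cache (3GB)
  bluestore_cache_meta_ratio = 0.4          # share for the onode (metadata) cache
  bluestore_cache_kv_ratio = 0.4            # share for the rocksdb block cache (below)
  # the remainder goes to the buffer (data) cache
  bluestore_default_buffered_write = true   # also cache recently written data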
Next, there's the rocksdb block cache. It stores a variety of
information, including a varint encoded form of the onodes, omap data,
rocksdb indexes, and rocksdb bloomfilters. We also use it for some
bluestore disk layout information, but that's relatively small
comparatively.
So onodes and omap data are pretty straightforward. If an onode isn't in
the bluestore onode cache it's possible it could be here depending on
the size of each cache. Frankly there doesn't seem to be much advantage
to having the hierarchical caching scheme here. It doesn't appear we
typically save a ton of space with the varint encoding and the
encode/decode step slows things down enough that I'm not sure it's worth
it. On the OMAP side, this is the only cache we currently have, so it's
important to make sure that if there's a lot of OMAP data being used
that this cache is large enough to provide some benefit.
Ok, so what happens if rocksdb data isn't in the block cache? RocksDB
is hierarchical in nature and data can be stored at any point (and
sometimes multiple points) in the tree. Without bloomfilters, you
potentially could have to perform reads at every single level to look
for it. Bloomfilters give you a very high probability of being able to
avoid reads when data isn't in a given level, ie it's really important
to make sure that bloomfilters stay in cache. That's the basis for why
the autotuner design implements the concept of priorities for every
cache. We really want to keep bloomfilters in rocksdb cache, and
generally we want to keep onodes in the onode cache, but other things
are more fluid based on the workload.
Next we have the rocksdb WAL buffers, and this sort of ties directly
into the pglog. Whenever you write into rocksdb it first appends the
write to an on-disk log and an in-memory buffer (memtable). Eventually
it will compact this into L0 and later may compact it into higher
levels. Typical data like an onode will follow this pattern and all is
good. The PG log however writes short lived pg log entries. The write
takes place, and then typically a delete will happen a short while later
to tell rocksdb to get rid of the previous write. That is done by
writing something called a tombstone. If the initial write and the
tombstone happen to land in the same WAL buffer, rocksdb can cancel them
out when it performs the compaction, so neither of them gets written to
L0 (or ties up resources in bloomfilter/index creation, additional
compactions, etc). That's an argument for having very large WAL
buffers, and indeed we see that large WAL buffers do improve our write
performance and write-amplification in practice. The trade-off is that
large WAL buffers mean longer compactions (but less compacted data
overall!) and extra CPU overhead iterating through the memtable when
adding data to it. So the gist of it is that the current pg log design
has a really big impact on how RocksDB is being tuned and affects
RocksDB performance in a variety of ways. That can really impact other
things, like how quickly RBD writes can be performed if onodes aren't in
the bluestore onode cache, or how fast RGW bucket indexes can be
retrieved if RocksDB is already busy.
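For reference, the WAL buffer size and count come from the rocksdb option
string we pass in. Something along these lines (roughly what I recall the
shipped default being in master; note that setting it replaces the whole
default string, so start from whatever your build ships with):

  [osd]
  # write_buffer_size is the size of each memtable, max_write_buffer_number
  # is how many can pile up before new writes stall
  bluestore_rocksdb_options = compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,write_buffer_size=268435456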
The number of pglog entries may have an effect as well. The shorter the
length of the log, potentially the shorter the amount of time between a
memtable insert and a tombstone in RocksDB. You might be able to get
away with a smaller buffer with a shorter log. On the other hand, a
short log means you have a shorter window for log-based recovery, which
can potentially mean a bigger recovery workload when things go wrong.
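The log length itself is bounded by these two options (the values below are
the defaults as I remember them, so double check against your build):

  [osd]
  osd_min_pg_log_entries = 3000    # per-PG log length when the PG is clean
  osd_max_pg_log_entries = 10000   # upper bound while degraded/recovering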
There are some other corner cases that play into all of this too, but
that's sort of the high level view. The idea behind the autotuner is to
make the OSD take care of figuring out how to balance all of this and
let the user just say "The OSD should use 4GB of RAM" and figure out the
best way to keep it there and use the memory.
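In config terms that just means setting something like this (4GB here is
only an example target):

  [osd]
  # the autotuner grows/shrinks the caches to keep the OSD near this RSS
  osd_memory_target = 4294967296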
Mark
Thanks
YunFan
Mark Nelson <mark.a.nelson@xxxxxxxxx> wrote on Thu, Nov 29, 2018 at 5:44 AM:
On 11/28/18 3:06 PM, Erwan Velu wrote:
Hi,
The ceph-nano/container project offers users almost-"master" code
in containers.
We do consume the latest packages to test the future release of Ceph.
Obviously that's not "supported" but interesting for testing.
In this example, the container has been built with Ceph
14.0.1-1022.gc881d63 rpms on a CentOS container.
A user reported an OOM in one of his testing containers
(https://github.com/ceph/cn/issues/94).
Yes, the container is limited to 512MB of RAM in Ceph-nano; that isn't
much, but I wondered to what extent that value was the root cause of this OOM.
I booted the same container twice and decided to monitor the memory
usage of this container in two contexts:
- an idle context as a reference
- a constant rados workload to see the impact of IOs on this issue
The Rados load was a simple loop of adding the ceph tree as objects,
removing them, and restarting the loop.
I plotted the result (you can download it at http://pubz.free.fr/leak.png)
It appears that:
- an idle cluster reached the 504 MB memory limit in 782 minutes. That
means a 231 MB memory increase in 782 minutes; 17.7 MB/hour
- a working cluster reached the 500 MB memory limit (when the OOM
killed ceph-mon) in 692 minutes. That means a 229 MB memory
increase in 692 minutes; 19.85 MB/hour
That really looks like we leak a lot of memory in this version, and as
the container is very limited in memory, it ends up in an OOM state and dies.
Has any of you seen something similar?
My next step is to monitor every ceph process during that time to see
which process is growing too fast.
Erwan,
Hi Erwan,
Out of curiosity did you look at the mempool stats at all? It's pretty
likely you'll run out of memory with 512MB given our current defaults
and the memory autotuner won't be able to keep up (it will do its best,
but can't work miracles).
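The mempool stats are available through the admin socket; something like
(osd.0 being a placeholder):

  # per-pool item counts and bytes; osd_pglog and the bluestore_cache_*
  # pools are the interesting ones here
  ceph daemon osd.0 dump_mempools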
To start out, the number of pglog entries is going to grow as time goes
on. I don't know how many PGs you have, but figure that
osd_min_pg_log_entries = 3000 and if you have say 100 PGs / OSD:
3000*100 = 300,000 * <size of pg log entry> bytes of memory usage.
Now figure that each pg log entry is going to store the name of the
object, so if you have objects with short names that could be small, say:
100B * 300,000 =~ 30MB + overhead
But if you had objects with large names:
1KB * 300,000 =~ 300MB + overhead
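As a quick sanity check of that arithmetic, with the same example numbers:

  # 3000 entries/PG * 100 PGs, with ~100B vs ~1KB object names
  echo $((3000 * 100 * 100 / 1048576))MB    # ~28MB + overhead
  echo $((3000 * 100 * 1024 / 1048576))MB   # ~292MB + overhead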
That's not the only thing that will grow. By default in master the
minimum autotuned sizes of the bluestore onode cache, buffer cache, and
rocksdb block cache are set to 128MB each. That doesn't mean they will
all necessarily be full at any given time, but it's quite likely that
over time you'll have some data in each of those.
Other potential users of memory are the rocksdb WAL buffers (ie
memtables). By default we currently will fill up a 256MB WAL buffer and
once it's full start compacting it into L0 while new writes will go to
up to 3 additional 256MB buffers before any new writes are stalled. In
practice it doesn't look like we often have all 4 full and the autotuner
will adjust the caches down if filling them would exceed the OSD memory
target, but it's another factor to consider.
In practice if I set the osd_memory_target to 1GB the autotuner will
make the OSD RSS memory bounce between like 800MB-1.3GB and constantly
wrestle to keep things under control. It would probably do a bit better
with a lower osd_min_pg_log_entries value (500? 1000?), lower minimum
cache sizes (maybe 64MB) and a lower WAL buffer size/count (say 2 64MB
buffers). That might be enough to support 1GB OSDs, but I'm not sure it
would be enough to keep the OSD consistently under 512MB.
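If someone wants to experiment along those lines, a rough sketch of what I
mean (all of these are guesses to try, not recommendations; I'm using
osd_memory_cache_min as the closest knob for the minimum cache sizes, and
the rocksdb option string replaces the whole default, so merge it with what
your build ships):

  [osd]
  osd_memory_target = 1073741824     # 1GB target for the autotuner
  osd_memory_cache_min = 67108864    # let the caches shrink further (64MB)
  osd_min_pg_log_entries = 1000      # shorter pg log
  # 2 x 64MB WAL buffers instead of 4 x 256MB
  bluestore_rocksdb_options = compression=kNoCompression,max_write_buffer_number=2,min_write_buffer_number_to_merge=1,write_buffer_size=67108864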
In any event, when I've tested OSDs with that little memory there's been
fairly dramatic performance impacts in a variety of ways depending on
what you change. In practice the minimum amount of memory we can
reasonably work with right now is probably around 1.5-2GB, and we do a
lot better with 3-4GB+.
Mark