On 11/29/18 1:33 AM, lin yunfan wrote:
Hi Mark,
Thank you for your detailed explanation about how the osd uses memory.
There are some more details about osd memory usage that I can't fully understand.
1. Would the memory consumption increase with a bigger disk when the pg
number remains the same? Do more objects use more memory?
There was an assertion a while back that OSD memory requirements sort of
grew linearly with the number of objects, but I never really understood
the reasoning for it to be honest. There's certainly some level of
inefficiency when dealing with lots of objects, but it's not
linear. It was true that previously during recovery the length of the
pglog could grow in an unbounded way, but afaik Neha recently reworked it:
https://github.com/ceph/ceph/pull/23098
I suppose some of the topology information might become a larger portion
of the memory consumption if you have a very large cluster though.
2. Why does the osd use much more memory during recovery?
I understand that when pgs are clean the osd uses osd_min_pg_log_entries
and when pgs are not clean it uses osd_max_pg_log_entries. Is this the
main reason why the osd uses more memory during recovery?
I don't think it will typically hang out around the
osd_min_pg_log_entries value during regular operation. Probably best to
get Josh or Neha's feedback though. See the note above about the max
length (it used to be unbounded during recovery but isn't anymore). I
suspect that there's probably some additional overhead during recovery
in other places too, and I wouldn't be surprised if it throws tcmalloc
for a bit of a loop. TCMalloc works hard to combat memory fragmentation
but I've noticed that when we start doing a bunch of different looking
memory allocations it can make things less efficient for a while. The
autotuner can really make tcmalloc unhappy if it's super aggressive at
shifting memory around quickly.
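If you want to see what tcmalloc itself thinks is going on, the heap stats
are exposed through the admin socket. A quick sketch (osd.0 is just a
placeholder for whichever OSD you're looking at):

  # ask tcmalloc for its view of the heap (bytes in use vs. freelists)
  ceph tell osd.0 heap stats
  # if a lot of memory is sitting in the freelists, hand it back to the OS
  ceph tell osd.0 heap release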
3. Does an EC pool use more memory than a replicated pool?
That's sort of a complicated question. Probably, but there are a lot of
factors that could affect it. Probably best to just test. One of the
reasons I wrote the autotuner is so that we could try to smooth this
sort of thing out and just let users target how much memory they want
the OSD to consume.
4. How would each bluestore cache setting affect the performance? I
understand this is a vague question but any guideline would be
appreciated.
It very much is complicated. :)
In a nutshell: Bluestore has two caches it keeps itself, the onode
cache and the buffer cache. RocksDB can also maintain multiple
caches, but for now we'll just focus on the block cache. There are also
the WAL buffers (memtables) and the pg log.
Onode cache is for metadata and buffer cache is for data. The buffer
cache is what gives you fast reads from memory when you've recently read
(or possibly written if you enable bluestore_default_buffered_write)
data. The onode cache is far more interesting imho as onode data is
checked several times prior to issuing a write. If bluestore can read
this in its raw unencoded form straight from memory it is much faster.
This is especially true when you've got really fast storage devices like
NVMe drives.
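If you want to poke at these caches directly (ie with the autotuner turned
off), the knobs look roughly like this; the values are only examples, and
you should double check the names against your build:

  [osd]
  # with bluestore_cache_autotune = true (the default in recent master,
  # iirc) these are managed for you based on osd_memory_target instead
  bluestore_cache_autotune = false
  bluestore_cache_size_ssd = 3221225472     # total bluestore cache (3GB)
  bluestore_cache_meta_ratio = 0.4          # share for the onode (metadata) cache
  bluestore_cache_kv_ratio = 0.4            # share for the rocksdb block cache (below)
  # the remainder goes to the buffer (data) cache
  bluestore_default_buffered_write = true   # also cache recently written data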
Next, there's the rocksdb block cache. It stores a variety of
information, including a varint encoded form of the onodes, omap data,
rocksdb indexes, and rocksdb bloomfilters. We also use it for some
bluestore disk layout information, but that's relatively small
comparatively.
So onodes and omap data are pretty straightforward. If an onode isn't in
the bluestore onode cache it's possible it could be here depending on
the size of each cache. Frankly there doesn't seem to be much advantage
to having the hierarchical caching scheme here. It doesn't appear we
typically save a ton of space with the varint encoding and the
encode/decode step slows things down enough that I'm not sure it's worth
it. On the OMAP side, this is the only cache we currently have, so it's
important to make sure that if there's a lot of OMAP data being used
that this cache is large enough to provide some benefit.
Ok, so what happens if rocksdb data isn't in the block cache? RocksDB
is hierarchical in nature and data can be stored at any point (and
sometimes multiple points) in the tree. Without bloomfilters, you
potentially could have to perform reads at every single level to look
for it. Bloomfilters give you a very high probability of being able to
avoid reads when data isn't in a given level, ie it's really important
to make sure that bloomfilters stay in cache. That's the basis for why
the autotuner design implements the concept of priorities for every
cache. We really want to keep bloomfilters in rocksdb cache, and
generally we want to keep onodes in the onode cache, but other things
are more fluid based on the workload.
Next we have the rocksdb WAL buffers, and this sort of ties directly
into the pglog. Whenever you write into rocksdb it first appends the
write to an on-disk log and an in-memory buffer (memtable). Eventually
it will compact this into L0 and later may compact it into higher
levels. Typical data like an onode will follow this pattern and all is
good. The PG log however writes short lived pg log entries. The write
takes place, and then typically a delete will happen a short while later
to tell rocksdb to get rid of the previous write. That is done by
writing something called a tombstone. If the initial write and the
tombstone happen to land in the same WAL buffer, rocksdb can cancel them
out when it performs the compaction, so neither of them gets written to
L0 (or ties up resources in bloomfilter/index creation, additional
compactions, etc). That's an argument for having very large WAL
buffers, and indeed we see that large WAL buffers do improve our write
performance and write-amplification in practice. The trade-off is that
large WAL buffers mean longer compactions (but less compacted data
overall!) and extra CPU overhead iterating through the memtable when
adding data to it. So the gist of it is that the current pg log design
has a really big impact on how RocksDB is being tuned and affects
RocksDB performance in a variety of ways. That can really impact other
things, like how quickly RBD writes can be performed if onodes aren't in
the bluestore onode cache, or how fast RGW bucket indexes can be
retrieved if RocksDB is already busy.
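For reference, the WAL buffer size and count come from the rocksdb option
string we pass in. Something along these lines (roughly what I recall the
shipped default being in master; note that setting it replaces the whole
default string, so start from whatever your build ships with):

  [osd]
  # write_buffer_size is the size of each memtable, max_write_buffer_number
  # is how many can pile up before new writes stall
  bluestore_rocksdb_options = compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,write_buffer_size=268435456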
The number of pglog entries may have an effect as well. The shorter the
length of the log, potentially the shorter the amount of time between a
memtable insert and a tombstone in RocksDB. You might be able to get
away with a smaller buffer with a shorter log. On the other hand, a
short log means you have a shorter window for log-based recovery, which
can potentially mean a bigger recovery workload when things go wrong.
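The log length itself is bounded by these two options (the values below are
the defaults as I remember them, so double check against your build):

  [osd]
  osd_min_pg_log_entries = 3000    # per-PG log length when the PG is clean
  osd_max_pg_log_entries = 10000   # upper bound while degraded/recovering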
There are some other corner cases that play into all of this too, but
that's sort of the high level view. The idea behind the autotuner is to
make the OSD take care of figuring out how to balance all of this and
let the user just say "The OSD should use 4GB of RAM" and figure out the
best way to keep it there and use the memory.
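In config terms that just means setting something like this (4GB here is
only an example target):

  [osd]
  # the autotuner grows/shrinks the caches to keep the OSD near this RSS
  osd_memory_target = 4294967296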
Mark
Thanks
YunFan
Mark Nelson <mark.a.nelson@xxxxxxxxx> wrote on Thu, Nov 29, 2018 at 5:44 AM:
On 11/28/18 3:06 PM, Erwan Velu wrote:
Hi,
The ceph-nano/container project offers users almost-"master" code
in containers.
We do consume the latest packages to test the future release of Ceph.
Obviously that's not "supported" but interesting for testing.
In this example, the container has been built with Ceph
14.0.1-1022.gc881d63 rpms on a CentOS container.
A user reported an OOM in one of his testing containers
(https://github.com/ceph/cn/issues/94).
Yes, the container is limited to 512MB of RAM in Ceph-nano; that isn't
much, but I wondered to what extent that value was the root cause of this OOM.
I booted the same container twice and decided to monitor the memory
usage of this container in two contexts:
- an idle context as a reference
- a constant rados workload to see the impact of IOs on this issue
The Rados load was a simple loop of adding the ceph tree as objects,
removing them, and restarting the loop.
I plotted the result (you can download it at http://pubz.free.fr/leak.png)
It appears that:
- an idle cluster reached the 504 MB memory limit in 782 minutes. That
means a 231 MB memory increase in 782 minutes; 17.7 MB/hour
- a working cluster reached the 500 MB memory limit (when the OOM
killed ceph-mon) in 692 minutes. That means a 229 MB memory
increase in 692 minutes; 19.85 MB/hour
That really looks like we leak a lot of memory in this version, and as
the container is very limited in memory, it ends up in an OOM state and dies.
Has any of you seen something similar?
My next step is to monitor every ceph process during that time to see
which process is growing too fast.
Erwan,
Hi Erwan,
Out of curiosity did you look at the mempool stats at all? It's pretty
likely you'll run out of memory with 512MB given our current defaults
and the memory autotuner won't be able to keep up (it will do its best,
but can't work miracles).
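The mempool stats are available through the admin socket; something like
(osd.0 being a placeholder):

  # per-pool item counts and bytes; osd_pglog and the bluestore_cache_*
  # pools are the interesting ones here
  ceph daemon osd.0 dump_mempools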
To start out, the number of pglog entries is going to grow as time goes
on. I don't know how many PGs you have, but figure that
osd_min_pg_log_entries = 3000 and if you have say 100 PGs / OSD:
3000*100 = 300,000 * <size of pg log entry> bytes of memory usage.
Now figure that each pg log entry is going to store the name of the
object, so if you have objects with short names that could be small, say:
100B * 300,000 =~ 30MB + overhead
But if you had objects with large names:
1KB * 300,000 =~ 300MB + overhead
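As a quick sanity check of that arithmetic, with the same example numbers:

  # 3000 entries/PG * 100 PGs, with ~100B vs ~1KB object names
  echo $((3000 * 100 * 100 / 1048576))MB    # ~28MB + overhead
  echo $((3000 * 100 * 1024 / 1048576))MB   # ~292MB + overhead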
That's not the only thing that will grow. By default in master the
minimum autotuned sizes of the bluestore onode cache, buffer cache, and
rocksdb block cache are set to 128MB each. That doesn't mean they will
all necessarily be full at any given time, but it's quite likely that
over time you'll have some data in each of those.
Other potential users of memory are the rocksdb WAL buffers (ie
memtables). By default we currently will fill up a 256MB WAL buffer and
once it's full start compacting it into L0 while new writes will go to
up to 3 additional 256MB buffers before any new writes are stalled. In
practice it doesn't look like we often have all 4 full and the autotuner
will adjust the caches down if filling them would exceed the OSD memory
target, but it's another factor to consider.
In practice if I set the osd_memory_target to 1GB the autotuner will
make the OSD RSS memory bounce between like 800MB-1.3GB and constantly
wrestle to keep things under control. It would probably do a bit better
with a lower osd_min_pg_log_entries value (500? 1000?), lower minimum
cache sizes (maybe 64MB) and a lower WAL buffer size/count (say 2 64MB
buffers). That might be enough to support 1GB OSDs, but I'm not sure it
would be enough to keep the OSD consistently under 512MB.
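If someone wants to experiment along those lines, a rough sketch of what I
mean (all of these are guesses to try, not recommendations; I'm using
osd_memory_cache_min as the closest knob for the minimum cache sizes, and
the rocksdb option string replaces the whole default, so merge it with what
your build ships):

  [osd]
  osd_memory_target = 1073741824     # 1GB target for the autotuner
  osd_memory_cache_min = 67108864    # let the caches shrink further (64MB)
  osd_min_pg_log_entries = 1000      # shorter pg log
  # 2 x 64MB WAL buffers instead of 4 x 256MB
  bluestore_rocksdb_options = compression=kNoCompression,max_write_buffer_number=2,min_write_buffer_number_to_merge=1,write_buffer_size=67108864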
In any event, when I've tested OSDs with that little memory there's been
fairly dramatic performance impacts in a variety of ways depending on
what you change. In practice the minimum amount of memory we can
reasonably work with right now is probably around 1.5-2GB, and we do a
lot better with 3-4GB+.
Mark