Re: Possible memory leak in Ceph 14.0.1-1022.gc881d63

On 11/29/18 1:33 AM, lin yunfan wrote:
Hi Mark,

Thank you for your detailed explanation of how the OSD uses memory.

There are some more details about OSD memory usage that I can't fully understand.

1. Would the memory consumption increase with a bigger disk when the PG
number remains the same? Do more objects use more memory?

There was an assertion a while back that OSD memory requirements grew more or less linearly with the number of objects, but I never really understood the reasoning for it, to be honest. There's certainly some level of inefficiency when dealing with lots of objects, but it's not linear. It was true that previously, during recovery, the length of the pg log could grow in an unbounded way, but afaik Neha recently reworked that:

https://github.com/ceph/ceph/pull/23098

I suppose some of the topology information might become a larger portion of the memory consumption if you have a very large cluster though.


2. Why does the OSD use much more memory during recovery?
I understand that when PGs are clean the OSD uses osd_min_pg_log_entries
and when PGs are not clean it uses osd_max_pg_log_entries. Is this the
main reason why the OSD uses more memory during recovery?

I don't think it will typically hang out around the osd_min_pg_log_entries value during regular operation, but it's probably best to get Josh's or Neha's feedback on that. See the note above about the max length (it used to be unbounded during recovery but isn't anymore). I suspect there's some additional overhead during recovery in other places too, and I wouldn't be surprised if it throws tcmalloc for a bit of a loop. TCMalloc works hard to combat memory fragmentation, but I've noticed that when we start doing a bunch of different-looking memory allocations it can make things less efficient for a while. The autotuner can really make tcmalloc unhappy if it's super aggressive about shifting memory around quickly.



3. Does an EC pool use more memory than a replicated pool?

That's sort of a complicated question. Probably, but there are a lot of factors that could affect it, so it's probably best to just test. One of the reasons I wrote the autotuner is so that we could try to smooth this sort of thing out and just let users target how much memory they want the OSD to consume.


4. How would each bluestore cache setting affect performance? I
understand this is a vague question, but any guideline would be
appreciated.

It very much is complicated. :)

In a nutshell: Bluestore has two caches it keeps itself, the onode cache and the buffer cache. RocksDB also maintains (or can maintain) multiple caches, but for now we'll just focus on the block cache. There are also the WAL buffers (memtables), and the pg log.

Onode cache is for metadata and buffer cache is for data. The buffer cache is what gives you fast reads from memory when you've recently read (or possibly written, if you enable bluestore_default_buffered_write) data. The onode cache is far more interesting imho, as onode data is checked several times prior to issuing a write. If bluestore can read this in its raw unencoded form straight from memory it is much faster. This is especially true when you've got really fast storage devices like NVMe drives.

Next, there's the rocksdb block cache. It stores a variety of information, including a varint-encoded form of the onodes, omap data, rocksdb indexes, and rocksdb bloomfilters. We also use it for some bluestore disk layout information, but that's comparatively small.

So onodes and omap data are pretty straightforward. If an onode isn't in the bluestore onode cache, it's possible it could be here, depending on the size of each cache. Frankly there doesn't seem to be much advantage to having the hierarchical caching scheme here: it doesn't appear we typically save a ton of space with the varint encoding, and the encode/decode step slows things down enough that I'm not sure it's worth it. On the OMAP side, this is the only cache we currently have, so if a lot of OMAP data is being used it's important to make sure this cache is large enough to provide some benefit.

Ok, so what happens if rocksdb data isn't in the block cache? RocksDB is hierarchical in nature and data can be stored at any point (and sometimes multiple points) in the tree. Without bloomfilters, you potentially could have to perform reads at every single level to look for it. Bloomfilters give you a very high probability of being able to avoid reads when data isn't in a given level, i.e. it's really important to make sure that bloomfilters stay in cache. That's the basis for why the autotuner design implements the concept of priorities for every cache: we really want to keep bloomfilters in the rocksdb cache, and generally we want to keep onodes in the onode cache, but other things are more fluid based on the workload.
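
As a really rough illustration of what a bloomfilter buys you (a toy sketch in Python, nothing like RocksDB's actual implementation; the sizes, hash counts, and key names below are made up):

import hashlib

class ToyBloomFilter:
    """Tiny bloom filter: answers 'definitely not here' or 'maybe here'."""
    def __init__(self, nbits=8192, nhashes=4):
        self.nbits = nbits
        self.nhashes = nhashes
        self.bits = bytearray(nbits // 8)

    def _positions(self, key):
        for i in range(self.nhashes):
            digest = hashlib.sha1(b"%d:" % i + key.encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.nbits

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        # False means this level can be skipped entirely (no read);
        # True means we still have to go do the (possibly wasted) read.
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

bf = ToyBloomFilter()
bf.add("rbd_data.12ab.0000000000000001")
print(bf.might_contain("rbd_data.12ab.0000000000000001"))  # True
print(bf.might_contain("some_object_in_another_level"))    # almost always False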

Next we have the rocksdb WAL buffers, and this sort of ties directly into the pg log. Whenever you write into rocksdb it first appends the write to an on-disk log and an in-memory buffer (memtable). Eventually it will compact this into L0, and later may compact it into higher levels. Typical data like an onode will follow this pattern and all is good. The PG log, however, writes short-lived pg log entries. The write takes place, and then typically a delete will happen a short while later to tell rocksdb to get rid of the previous write. That is done by writing something called a tombstone. If the initial write and the tombstone happen to land in the same WAL buffer, rocksdb can cancel them out when it performs the compaction and neither of them gets written to L0 (or ties up resources in bloomfilter/index creation, additional compactions, etc).

That's an argument for having very large WAL buffers, and indeed we see that large WAL buffers do improve our write performance and write-amplification in practice. The trade-off is that large WAL buffers mean longer compactions (but less compacted data overall!) and extra CPU overhead iterating through the memtable when adding data to it. So the gist of it is that the current pg log design has a really big impact on how RocksDB is being tuned and affects RocksDB performance in a variety of ways. That can really impact other things, like how quickly RBD writes can be performed if onodes aren't in the bluestore onode cache, or how fast RGW bucket indexes can be retrieved if RocksDB is already busy.
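
To make that write/tombstone pattern concrete, here's a toy sketch (not RocksDB code; the key names are made up, and for simplicity it assumes the key never existed in lower levels, which is true for a freshly appended pg log entry):

TOMBSTONE = object()  # stand-in for a RocksDB delete marker

class ToyMemtable:
    """A dict standing in for one RocksDB memtable (WAL buffer)."""
    def __init__(self):
        self.entries = {}

    def put(self, key, value):
        self.entries[key] = value

    def delete(self, key):
        self.entries[key] = TOMBSTONE  # a delete is just another write

    def flush(self):
        # What actually has to be written out to L0: a key that was
        # inserted and then deleted within this same memtable (the typical
        # short-lived pg log entry) contributes nothing.  If the delete had
        # landed in the *next* memtable instead, both the value and the
        # tombstone would reach disk and need to be compacted away later.
        return {k: v for k, v in self.entries.items() if v is not TOMBSTONE}

mt = ToyMemtable()
mt.put("pglog_entry_1234", b"...log entry...")      # pg log append
mt.delete("pglog_entry_1234")                       # trimmed a moment later
mt.put("onode_rbd_data.12ab", b"...onode bytes...") # longer-lived metadata
print(mt.flush())  # only the onode survives; the pg log entry never hits L0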

The number of pg log entries may have an effect as well. The shorter the log, the shorter the potential window between a memtable insert and its tombstone in RocksDB, so you might be able to get away with a smaller buffer with a shorter log. On the other hand, a short log means you have a shorter window for log-based recovery, which can potentially mean a bigger recovery workload when things go wrong.

There are some other corner cases that play into all of this too, but that's sort of the high level view. The idea behind the autotuner is to make the OSD take care of figuring out how to balance all of this and let the user just say "the OSD should use 4GB of RAM", then figure out the best way to keep it there and use the memory.

Mark


Thanks

YunFan
Mark Nelson <mark.a.nelson@xxxxxxxxx> wrote on Thu, Nov 29, 2018 at 5:44 AM:

On 11/28/18 3:06 PM, Erwan Velu wrote:
Hi,


The ceph-nano/container project offers users a way to run almost-"master"
code in containers.

We consume the latest packages to test the future release of Ceph.
Obviously that's not "supported", but it's interesting for testing.

In this example, the container was built with Ceph
14.0.1-1022.gc881d63 rpms on a CentOS base.

A user reported an OOM in one of his testing containers
(https://github.com/ceph/cn/issues/94).

Yes, the container is limited to 512MB of RAM in ceph-nano. That isn't
much, but I wondered to what extent that value was the root cause of this OOM.


I booted the same container twice and decided to monitor the memory
usage of this container in two contexts:

- an idle context as a reference

- a constant rados workload to see the impact of IOs on this issue


The RADOS load was a simple loop: add the ceph tree as objects,
remove them, and restart the loop.

I plotted the result (you can download it at http://pubz.free.fr/leak.png)


It appears that:

- the idle cluster reached the 504 MB memory limit in 782 minutes. That
means a 231 MB memory increase in 782 minutes, i.e. 17.7 MB/hour.

- the working cluster reached the 500 MB memory limit (when the OOM
killed ceph-mon) in 692 minutes. That means a 229 MB memory
increase in 692 minutes, i.e. 19.85 MB/hour.


It really looks like we leak a lot of memory in this version and, as
the container is very limited in memory, it gets put into an OOM state
and dies.

Have any of you seen something similar?


My next step is to monitor every Ceph process during that time to see
which process is growing too fast.


Erwan,

Hi Erwan,


Out of curiosity, did you look at the mempool stats at all?  It's pretty
likely you'll run out of memory with 512MB given our current defaults,
and the memory autotuner won't be able to keep up (it will do its best,
but can't work miracles).
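
For reference, this is roughly what I mean by checking the mempool stats
(a sketch; it assumes "ceph daemon osd.<id> dump_mempools" is reachable
from where the script runs, and it hedges on the exact JSON layout since
that may differ between releases):

import json
import subprocess

def dump_mempools(osd_id=0):
    # Query the OSD admin socket for its mempool accounting.
    out = subprocess.check_output(
        ["ceph", "daemon", "osd.%d" % osd_id, "dump_mempools"])
    data = json.loads(out)
    # The pools may sit at the top level or be nested under
    # "mempool" -> "by_pool" depending on the release; handle both.
    return data.get("mempool", {}).get("by_pool", data)

pools = dump_mempools(0)
rows = [(n, s) for n, s in pools.items()
        if isinstance(s, dict) and "bytes" in s]
for name, stats in sorted(rows, key=lambda r: r[1]["bytes"], reverse=True):
    print("%-32s %10s items %12s bytes"
          % (name, stats.get("items", "?"), stats["bytes"]))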

To start out, the number of pg log entries is going to grow as time goes
on.  I don't know how many PGs you have, but figure that
osd_min_pg_log_entries = 3000 and you have, say, 100 PGs / OSD:

3000 * 100 = 300,000 entries * <size of pg log entry> bytes of memory usage.

Now figure that each pg log entry is going to store the name of the
object, so if you have objects with short names that could be small, say:

100B * 300,000 =~ 30MB + overhead

But if you had objects with large names:

1KB * 300,000 =~ 300MB + overhead
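
Or, as a quick back-of-the-envelope script (this only counts the
object-name portion; real pg log entries carry additional per-entry
overhead that I'm not trying to estimate here):

def pg_log_name_bytes(pgs_per_osd, entries_per_pg, avg_name_bytes):
    # Rough estimate of just the object-name portion of pg log memory
    # for one OSD; real entries add per-entry overhead on top of this.
    return pgs_per_osd * entries_per_pg * avg_name_bytes

for avg_name_bytes, label in ((100, "short names"), (1024, "long names")):
    mb = pg_log_name_bytes(100, 3000, avg_name_bytes) / (1024 * 1024)
    print("%-11s ~%.0f MB + overhead" % (label, mb))
# short names ~29 MB + overhead
# long names  ~293 MB + overhead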

That's not the only thing that will grow.  By default in master the
minimum autotuned size of the bluestore onode cache, buffer cache, and
rocksdb block cache is set to 128MB each.  That doesn't mean they will
all necessarily be full at any given time, but it's quite likely that
over time you'll have some data in each of those.

Another potential user of memory is the rocksdb WAL buffers (ie
memtables).  By default we currently will fill up a 256MB WAL buffer and,
once it's full, start compacting it into L0 while new writes go to up
to 3 additional 256MB buffers before any new writes are stalled.  In
practice it doesn't look like we often have all 4 full, and the autotuner
will adjust the caches down if filling them would exceed the OSD memory
target, but it's another factor to consider.
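
Adding those pieces up (a rough sketch using the defaults mentioned
above; it assumes the caches sit at their autotune minimums and a couple
of WAL buffers are in flight, and it ignores plenty of other allocations
like osdmaps, messenger buffers, and heap fragmentation):

MB = 1024 * 1024

def rough_osd_floor(min_cache_mb=128, n_caches=3,
                    wal_buffer_mb=256, wal_buffers_in_flight=2,
                    pg_log_mb=30):
    # Back-of-the-envelope floor from the pieces discussed in this mail:
    # onode + buffer + rocksdb block cache minimums, some rocksdb
    # memtables, and the (short object name) pg log estimate from above.
    return (n_caches * min_cache_mb
            + wal_buffers_in_flight * wal_buffer_mb
            + pg_log_mb) * MB

print("~%d MB" % (rough_osd_floor() / MB))  # ~926 MB, well past a 512MB limit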

In practice, if I set the osd_memory_target to 1GB the autotuner will
make the OSD RSS memory bounce between something like 800MB-1.3GB and
constantly wrestle to keep things under control.  It would probably do a
bit better with a lower osd_min_pg_log_entries value (500? 1000?), lower
minimum cache sizes (maybe 64MB), and a lower WAL buffer size/count (say
2x 64MB buffers).  That might be enough to support 1GB OSDs, but I'm not
sure it would be enough to keep the OSD consistently under 512MB.

In any event, when I've tested OSDs with that little memory there have
been fairly dramatic performance impacts in a variety of ways, depending
on what you change.  In practice the minimum amount of memory we can
reasonably work with right now is probably around 1.5-2GB, and we do a
lot better with 3-4GB+.

Mark







