See http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/029780.html for the original thread.
Here is a trivial reproducer not using any aio or dynamically allocated memory to store the objects read.
It simply reads 20,000 1MB large objects sequentially: when run, instead of using a roughly fixed amount of memory, the RSS increases up to 400 MB.
<code>
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <rados/librados.h>
#define SIZE 1000000
void readobj(rados_ioctx_t* io, char objname[], unsigned long size) {
static char data[SIZE];
unsigned long bytes_read;
int retval;
rados_read_op_t read_op = rados_create_read_op();
rados_read_op_read(read_op, 0, size, data, &bytes_read, &retval);
retval = rados_read_op_operate(read_op, *io, objname, 0);
assert(retval == 0);
rados_release_read_op(read_op);
}
int main() {
rados_t cluster;
rados_create(&cluster, "test");
rados_conf_read_file(cluster, "/etc/ceph/ceph.conf");
rados_connect(cluster);
rados_ioctx_t io;
rados_ioctx_create(cluster, "test", &io);
char smallobj_name[16];
int i;
long long total_bytes_read = 0;
for (i = 0; i < 20000; i++) {
sprintf(smallobj_name, "1mobj_%d", i);
readobj(&io, smallobj_name, SIZE);
readobj(&io, smallobj_name, 1);
total_bytes_read += SIZE;
printf("Read %s for total %lld\n", smallobj_name, total_bytes_read);
}
return 0;
}
</code>
Running it under valgrind with --show-reachable=yes shows where the problem is:
==10253== 423,755,776 bytes in 424 blocks are still reachable in loss record 882 of 882
==10253== at 0x4C2FEA6: memalign (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==10253== by 0x4C2FFB1: posix_memalign (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==10253== by 0x57A3C06: ceph::buffer::create_aligned_in_mempool(unsigned int, unsigned int, int) (in /usr/lib/ceph/libceph-common.so.0)
==10253== by 0x5984B49: AsyncConnection::process() (in /usr/lib/ceph/libceph-common.so.0)
==10253== by 0x5994C97: EventCenter::process_events(int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*) (in /usr/lib/ceph/libceph-common.so.0)
==10253== by 0x5999817: ??? (in /usr/lib/ceph/libceph-common.so.0)
==10253== by 0xE960C7F: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
==10253== by 0xE6926B9: start_thread (pthread_create.c:333)
==10253== by 0x52A041C: clone (clone.S:109)
We can see that the memory is allocated in AsyncConnection::process, precisely here: https://github.com/ceph/ceph/blob/master/src/msg/async/AsyncConnection.cc#L614
From my understanding - I have never read ceph code before, so my understanding might be flawed - this code is allocating a buffer to store the data contained in the message received from a ceph node and adding it to a buffers list `data_buf`.
Note that this list of buffers is per AsyncConnection, that's important.
Unfortunately, looking through the file shows that `data_buf` only gets cleared here: https://github.com/ceph/ceph/blob/master/src/msg/async/AsyncConnection.cc#L477
when it transitions to the STATE_OPEN_MESSAGE_HEADER state, IIUC only occurs when it starts receiving a new CEPH_MSGR_TAG_MSG message.
Which means that basically the following happens - the below is for a single AsyncConnection:
repeat:
0. wait for message
1. parse message header
2. realize it's an CEPH_MSGR_TAG_MSG message (not a heartbeat, etc): transition to STATE_OPEN_MESSAGE_HEADER
3. clear all buffers (https://github.com/ceph/ceph/blob/master/src/msg/async/AsyncConnection.cc#L477)
4. start parsing message, allocate memory buffer (https://github.com/ceph/ceph/blob/master/src/msg/async/AsyncConnection.cc#L614)
...
n. done processing message
Which means that the memory allocated when processing a message - at step 4 - is only released when a subsequent CEPH_MSGR_TAG_MSG message is received.
For a single connection that's not too bad - memory is just freed later than it could, but with a large OSD pool, it adds up.
Our cluster is made of over 40 ceph OSDs, which results in around 400 connections. Which means that if reading for example 1MB objects, this delayed freeing can use up to 400MB (1MB per connection), which is exactly what we're seeing.
It would also explains:
- why reading a small subset of objects over and over again doesn't result in a memory increase: because they are distributed over few ceph OSDs, hence use few connections
- the shape of the memory consumption over time: if it were a leak (even with reachable objects), it would grow linearly, whereas we expect it to grow slower and slower as we've connected all ceph OSDs in the pool
- why interspersing small reads helps: if one does a large read follows by a small read over the same connection, the large read buffer will be cleared before the small read buffer is allocated
The followingg is enough to keep the memory usage under control:
<diff>
--- a.c 2018-09-14 12:56:13.770350595 +0100
+++ b.c 2018-09-14 12:55:32.921856548 +0100
@@ -34,7 +34,6 @@
for (i = 0; i < 20000; i++) {
sprintf(smallobj_name, "1mobj_%d", i);
readobj(&io, smallobj_name, SIZE);
+ readobj(&io, smallobj_name, 1);
total_bytes_read += SIZE;
printf("Read %s for total %lld\n", smallobj_name, total_bytes_read);
</diff>
Reading just 1 byte of the same object after having read the object will make the AsyncConnection clear the buffer allocated for processing the large read result, and allocate a tiny buffer for the 1 byte read result.
TLDR: It seems that the buffers allocated while processing a message in AsyncConnection are not cleared when the message has been processed, but only when a subsequent message is received, which yields to a significant memory usage if there are many connections in use (objects stored on many different OSDs).
_______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com