[Adding ceph-devel] On Fri, Sep 14, 2018 at 5:22 AM, Charles-François Natali <cf.natali@xxxxxxxxx> wrote: > See > http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-September/029780.html > for the original thread. > > Here is a trivial reproducer not using any aio or dynamically allocated > memory to store the objects read. > It simply reads 20,000 1MB large objects sequentially: when run, instead of > using a roughly fixed amount of memory, the RSS increases up to 400 MB. > > <code> > #include <assert.h> > #include <stdio.h> > #include <stdlib.h> > #include <string.h> > #include <rados/librados.h> > > #define SIZE 1000000 > > void readobj(rados_ioctx_t* io, char objname[], unsigned long size) { > static char data[SIZE]; > unsigned long bytes_read; > int retval; > > rados_read_op_t read_op = rados_create_read_op(); > rados_read_op_read(read_op, 0, size, data, &bytes_read, &retval); > retval = rados_read_op_operate(read_op, *io, objname, 0); > assert(retval == 0); > rados_release_read_op(read_op); > } > > int main() { > rados_t cluster; > rados_create(&cluster, "test"); > rados_conf_read_file(cluster, "/etc/ceph/ceph.conf"); > rados_connect(cluster); > > rados_ioctx_t io; > rados_ioctx_create(cluster, "test", &io); > > char smallobj_name[16]; > int i; > long long total_bytes_read = 0; > > for (i = 0; i < 20000; i++) { > sprintf(smallobj_name, "1mobj_%d", i); > readobj(&io, smallobj_name, SIZE); > readobj(&io, smallobj_name, 1); > > total_bytes_read += SIZE; > printf("Read %s for total %lld\n", smallobj_name, total_bytes_read); > } > > return 0; > } > </code> > > > Running it under valgrind with --show-reachable=yes shows where the problem > is: > > ==10253== 423,755,776 bytes in 424 blocks are still reachable in loss record > 882 of 882 > ==10253== at 0x4C2FEA6: memalign (in > /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so) > ==10253== by 0x4C2FFB1: posix_memalign (in > /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so) > ==10253== by 0x57A3C06: ceph::buffer::create_aligned_in_mempool(unsigned > int, unsigned int, int) (in /usr/lib/ceph/libceph-common.so.0) > ==10253== by 0x5984B49: AsyncConnection::process() (in > /usr/lib/ceph/libceph-common.so.0) > ==10253== by 0x5994C97: EventCenter::process_events(int, > std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*) (in > /usr/lib/ceph/libceph-common.so.0) > ==10253== by 0x5999817: ??? (in /usr/lib/ceph/libceph-common.so.0) > ==10253== by 0xE960C7F: ??? (in > /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21) > ==10253== by 0xE6926B9: start_thread (pthread_create.c:333) > ==10253== by 0x52A041C: clone (clone.S:109) > > We can see that the memory is allocated in AsyncConnection::process, > precisely here: > https://github.com/ceph/ceph/blob/master/src/msg/async/AsyncConnection.cc#L614 > > From my understanding - I have never read ceph code before, so my > understanding might be flawed - this code is allocating a buffer to store > the data contained in the message received from a ceph node and adding it to > a buffers list `data_buf`. > Note that this list of buffers is per AsyncConnection, that's important. > > Unfortunately, looking through the file shows that `data_buf` only gets > cleared here: > https://github.com/ceph/ceph/blob/master/src/msg/async/AsyncConnection.cc#L477 > when it transitions to the STATE_OPEN_MESSAGE_HEADER state, IIUC only occurs > when it starts receiving a new CEPH_MSGR_TAG_MSG message. > > Which means that basically the following happens - the below is for a single > AsyncConnection: > > repeat: > 0. wait for message > 1. parse message header > 2. realize it's an CEPH_MSGR_TAG_MSG message (not a heartbeat, etc): > transition to STATE_OPEN_MESSAGE_HEADER > 3. clear all buffers > (https://github.com/ceph/ceph/blob/master/src/msg/async/AsyncConnection.cc#L477) > 4. start parsing message, allocate memory buffer > (https://github.com/ceph/ceph/blob/master/src/msg/async/AsyncConnection.cc#L614) > ... > n. done processing message > > Which means that the memory allocated when processing a message - at step 4 > - is only released when a subsequent CEPH_MSGR_TAG_MSG message is received. > For a single connection that's not too bad - memory is just freed later than > it could, but with a large OSD pool, it adds up. > Our cluster is made of over 40 ceph OSDs, which results in around 400 > connections. Which means that if reading for example 1MB objects, this > delayed freeing can use up to 400MB (1MB per connection), which is exactly > what we're seeing. > > It would also explains: > - why reading a small subset of objects over and over again doesn't result > in a memory increase: because they are distributed over few ceph OSDs, hence > use few connections > - the shape of the memory consumption over time: if it were a leak (even > with reachable objects), it would grow linearly, whereas we expect it to > grow slower and slower as we've connected all ceph OSDs in the pool > - why interspersing small reads helps: if one does a large read follows by a > small read over the same connection, the large read buffer will be cleared > before the small read buffer is allocated > > The followingg is enough to keep the memory usage under control: > <diff> > --- a.c 2018-09-14 12:56:13.770350595 +0100 > +++ b.c 2018-09-14 12:55:32.921856548 +0100 > @@ -34,7 +34,6 @@ > for (i = 0; i < 20000; i++) { > sprintf(smallobj_name, "1mobj_%d", i); > readobj(&io, smallobj_name, SIZE); > + readobj(&io, smallobj_name, 1); > > total_bytes_read += SIZE; > printf("Read %s for total %lld\n", smallobj_name, > total_bytes_read); > </diff> > > Reading just 1 byte of the same object after having read the object will > make the AsyncConnection clear the buffer allocated for processing the large > read result, and allocate a tiny buffer for the 1 byte read result. > > > TLDR: It seems that the buffers allocated while processing a message in > AsyncConnection are not cleared when the message has been processed, but > only when a subsequent message is received, which yields to a significant > memory usage if there are many connections in use (objects stored on many > different OSDs). Nice sleuthing! This does look correct to me. I've created a tracker issue (http://tracker.ceph.com/issues/35987) and pull request (https://github.com/ceph/ceph/pull/24111). If you wanted to try testing that branch (or cherry-picking the patch, if you're building it yourself) with your librados client to see if that resolves the issue, it'd be helpful! The packages should be available shortly from the dev branch repository (see http://docs.ceph.com/docs/master/install/get-packages/#add-ceph-development). -Greg > > > _______________________________________________ > ceph-users mailing list > ceph-users@xxxxxxxxxxxxxx > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com >