Hi all,
We're reading from a Ceph Luminous pool using the librados asynchronous
I/O API. We're seeing some concerning memory usage patterns when we
read many objects in sequence.
The expected behaviour is that our memory usage stabilises at a small
amount, since we're just fetching objects and ignoring their data.
What we instead find is that the memory usage of our program grows
linearly with the amount of data read for an interval of time, and
then continues to grow at a much slower but still consistent pace.
This memory is not freed until program termination. My guess is that
this is an issue with Ceph's memory allocator.
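One way to test that guess (a minimal sketch, assuming librados here is
linked against gperftools tcmalloc, as in the upstream Ceph packages;
this is not something we've run yet) would be to ask the allocator to
return its cached free pages to the OS and see whether RSS drops:

/* Sketch, assuming the process allocator is gperftools tcmalloc.
 * Link with -ltcmalloc (or rely on librados pulling it in). */
#include <stdio.h>
#include <gperftools/malloc_extension_c.h>

static void release_heap_to_os(void) {
    char stats[8192];
    MallocExtension_GetStats(stats, (int)sizeof(stats)); /* before */
    printf("%s\n", stats);
    MallocExtension_ReleaseFreeMemory(); /* return free pages to the OS */
    MallocExtension_GetStats(stats, (int)sizeof(stats)); /* after */
    printf("%s\n", stats);
}

If RSS shrinks substantially after the release call, the growth would
point at allocator caching rather than memory still held by librados.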
To demonstrate, we create 20000 objects at each of three sizes: 10KB,
100KB, and 1MB:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <rados/librados.h>

int main() {
    rados_t cluster;
    rados_create(&cluster, "test");
    rados_conf_read_file(cluster, "/etc/ceph/ceph.conf");
    rados_connect(cluster);

    rados_ioctx_t io;
    rados_ioctx_create(cluster, "test", &io);

    /* 1MB of filler; each object is written from a prefix of this
     * buffer. Error checking is omitted for brevity. */
    char data[1000000];
    memset(data, 'a', 1000000);

    char smallobj_name[16], mediumobj_name[16], largeobj_name[16];
    int i;
    for (i = 0; i < 20000; i++) {
        sprintf(smallobj_name, "10kobj_%d", i);
        rados_write(io, smallobj_name, data, 10000, 0);
        sprintf(mediumobj_name, "100kobj_%d", i);
        rados_write(io, mediumobj_name, data, 100000, 0);
        sprintf(largeobj_name, "1mobj_%d", i);
        rados_write(io, largeobj_name, data, 1000000, 0);
        printf("wrote %s of size 10000, %s of size 100000, %s of size 1000000\n",
               smallobj_name, mediumobj_name, largeobj_name);
    }

    rados_ioctx_destroy(io);
    rados_shutdown(cluster);
    return 0;
}
$ gcc create.c -lrados -o create
$ ./create
wrote 10kobj_0 of size 10000, 100kobj_0 of size 100000, 1mobj_0 of
size 1000000
wrote 10kobj_1 of size 10000, 100kobj_1 of size 100000, 1mobj_1 of
size 1000000
[...]
wrote 10kobj_19998 of size 10000, 100kobj_19998 of size 100000,
1mobj_19998 of size 1000000
wrote 10kobj_19999 of size 10000, 100kobj_19999 of size 100000,
1mobj_19999 of size 1000000
Now we read each of these objects with the async API, into the same
buffer, starting with just the 10KB objects:
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <rados/librados.h>

void readobj(rados_ioctx_t* io, char objname[]);

int main() {
    rados_t cluster;
    rados_create(&cluster, "test");
    rados_conf_read_file(cluster, "/etc/ceph/ceph.conf");
    rados_connect(cluster);

    rados_ioctx_t io;
    rados_ioctx_create(cluster, "test", &io);

    char smallobj_name[16];
    int i, total_bytes_read = 0;
    for (i = 0; i < 20000; i++) {
        sprintf(smallobj_name, "10kobj_%d", i);
        readobj(&io, smallobj_name);
        total_bytes_read += 10000;
        printf("Read %s for total %d\n", smallobj_name, total_bytes_read);
    }

    /* Hold the process open so its memory usage can be inspected. */
    getchar();
    return 0;
}

void readobj(rados_ioctx_t* io, char objname[]) {
    char data[1000000];
    unsigned long bytes_read;
    rados_completion_t completion;
    int retval;

    /* Read the first 10000 bytes of the object asynchronously,
     * then block until the operation completes. */
    rados_read_op_t read_op = rados_create_read_op();
    rados_read_op_read(read_op, 0, 10000, data, &bytes_read, &retval);
    retval = rados_aio_create_completion(NULL, NULL, NULL, &completion);
    assert(retval == 0);
    retval = rados_aio_read_op_operate(read_op, *io, completion, objname, 0);
    assert(retval == 0);
    rados_aio_wait_for_complete(completion);
    rados_aio_get_return_value(completion);
}
$ gcc read.c -lrados -o read_small -Wall -g && ./read_small
Read 10kobj_0 for total 10000
Read 10kobj_1 for total 20000
[...]
Read 10kobj_19998 for total 199990000
Read 10kobj_19999 for total 200000000
We read 200MB in total. A graph of the program's resident set size is
attached as mem-graph-10k.png, with seconds on the x axis and KB on the
y axis. You can see that memory usage increases throughout, which is
itself unexpected, since that memory should be freed over time and we
should only hold 10KB of object data in memory at any moment. The rate
of growth decreases and eventually stabilises, and by the end we've
used 60MB of RAM.
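(For reference, RSS over time can be sampled with standard tools while
read_small runs; a sketch along these lines, though the attached graph
may have been collected differently:)

$ pid=$(pgrep -x read_small)
$ while kill -0 $pid 2>/dev/null; do ps -o rss= -p $pid; sleep 1; done > rss.log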
We repeat this experiment for the 100KB and 1MB objects and find that
after all reads they use 140MB and 500MB of RAM respectively, and
memory usage would presumably continue to grow if there were more
objects. This is orders of magnitude more memory than I would expect
these programs to use.
* We do not see this behaviour with the synchronous API: memory
  usage remains stable at just a few MB. (A sketch of the
  synchronous variant follows this list.)
* For some reason this doesn't happen, or doesn't happen as
  severely, if we intersperse large reads with much smaller
  reads. In that case, memory usage seems to stabilise at a
  reasonable level.
* Valgrind only reports a trivial amount of unreachable memory.
* Memory usage doesn't increase in this manner if we repeatedly read
the same object over and over again. It hovers around 20MB.
* In other experiments we've done, with different object data and
distributions of object sizes, we've seen memory usage grow even
larger in proportion to the amount of data read.
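For reference, here is a minimal sketch of the synchronous variant
mentioned in the first bullet, written as a drop-in replacement for
readobj() in read.c above (our actual synchronous test may differ in
details):

void readobj_sync(rados_ioctx_t* io, char objname[]) {
    char data[1000000];
    /* Blocking read of the first 10000 bytes of the object; returns
     * the number of bytes read, or a negative error code. No
     * completion object is involved. */
    int retval = rados_read(*io, objname, data, 10000, 0);
    assert(retval >= 0);
}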
We maintain long-running (order of weeks) services that read objects
from Ceph and send them elsewhere. Over time, the memory usage of some
of these services has grown to more than 6GB, which is unreasonable.
--
Regards,
Dan G