Re: RADOS async client memory usage explodes when reading several objects in sequence

Yep, those completions are maintaining bufferlist references IIRC, so they’re definitely holding the memory buffers in place!
On Wed, Sep 12, 2018 at 7:04 AM Casey Bodley <cbodley@xxxxxxxxxx> wrote:


On 09/12/2018 05:29 AM, Daniel Goldbach wrote:
> Hi all,
>
> We're reading from a Ceph Luminous pool using the librados asynchronous
> I/O API. We're seeing some concerning memory usage patterns when we
> read many objects in sequence.
>
> The expected behaviour is that our memory usage stabilises at a small
> amount, since we're just fetching objects and ignoring their data.
> What we instead find is that the memory usage of our program grows
> linearly with the amount of data read for an interval of time, and
> then continues to grow at a much slower but still consistent pace.
> This memory is not freed until program termination. My guess is that
> this is an issue with Ceph's memory allocator.
>
> To demonstrate, we create 20000 objects each of size 10KB, 100KB, and
> 1MB:
>
>     #include <stdio.h>
>     #include <stdlib.h>
>     #include <string.h>
>     #include <rados/librados.h>
>
>     int main() {
>         rados_t cluster;
>         rados_create(&cluster, "test");
>         rados_conf_read_file(cluster, "/etc/ceph/ceph.conf");
>         rados_connect(cluster);
>
>         rados_ioctx_t io;
>         rados_ioctx_create(cluster, "test", &io);
>
>         char data[1000000];
>         memset(data, 'a', 1000000);
>
>         char smallobj_name[16], mediumobj_name[16], largeobj_name[16];
>         int i;
>         for (i = 0; i < 20000; i++) {
>             sprintf(smallobj_name, "10kobj_%d", i);
>             rados_write(io, smallobj_name, data, 10000, 0);
>
>             sprintf(mediumobj_name, "100kobj_%d", i);
>             rados_write(io, mediumobj_name, data, 100000, 0);
>
>             sprintf(largeobj_name, "1mobj_%d", i);
>             rados_write(io, largeobj_name, data, 1000000, 0);
>
>             printf("wrote %s of size 10000, %s of size 100000, %s of size 1000000\n",
>                    smallobj_name, mediumobj_name, largeobj_name);
>         }
>
>         return 0;
>     }
>
>     $ gcc create.c -lrados -o create
>     $ ./create
>     wrote 10kobj_0 of size 10000, 100kobj_0 of size 100000, 1mobj_0 of size 1000000
>     wrote 10kobj_1 of size 10000, 100kobj_1 of size 100000, 1mobj_1 of size 1000000
>     [...]
>     wrote 10kobj_19998 of size 10000, 100kobj_19998 of size 100000, 1mobj_19998 of size 1000000
>     wrote 10kobj_19999 of size 10000, 100kobj_19999 of size 100000, 1mobj_19999 of size 1000000
>
> Now we read each of these objects with the async API, into the same
> buffer, starting with just the 10KB objects:
>
>     #include <assert.h>
>     #include <stdio.h>
>     #include <stdlib.h>
>     #include <string.h>
>     #include <rados/librados.h>
>
>     void readobj(rados_ioctx_t* io, char objname[]);
>
>     int main() {
>         rados_t cluster;
>         rados_create(&cluster, "test");
>         rados_conf_read_file(cluster, "/etc/ceph/ceph.conf");
>         rados_connect(cluster);
>
>         rados_ioctx_t io;
>         rados_ioctx_create(cluster, "test", &io);
>
>         char smallobj_name[16];
>         int i, total_bytes_read = 0;
>
>         for (i = 0; i < 20000; i++) {
>             sprintf(smallobj_name, "10kobj_%d", i);
>             readobj(&io, smallobj_name);
>
>             total_bytes_read += 10000;
>             printf("Read %s for total %d\n", smallobj_name, total_bytes_read);
>         }
>
>         getchar();
>         return 0;
>     }
>
>     void readobj(rados_ioctx_t* io, char objname[]) {
>         char data[1000000];
>         unsigned long bytes_read;
>         rados_completion_t completion;
>         int retval;
>
>         rados_read_op_t read_op = rados_create_read_op();
>         rados_read_op_read(read_op, 0, 10000, data, &bytes_read, &retval);
>         retval = rados_aio_create_completion(NULL, NULL, NULL, &completion);
>         assert(retval == 0);
>
>         retval = rados_aio_read_op_operate(read_op, *io, completion,
>                                            objname, 0);
>         assert(retval == 0);
>
>         rados_aio_wait_for_complete(completion);
>         rados_aio_get_return_value(completion);
>     }
>
>     $ gcc read.c -lrados -o read_small -Wall -g && ./read_small
>     Read 10kobj_0 for total 10000
>     Read 10kobj_1 for total 20000
>     [...]
>     Read 10kobj_19998 for total 199990000
>     Read 10kobj_19999 for total 200000000
>
> We read 200MB in total. A graph of the program's resident set size is
> attached as mem-graph-10k.png, with seconds on the x axis and KB on
> the y axis. You can see that memory usage increases throughout, which
> is itself unexpected, since that memory should be freed over time and
> we should only hold 10KB of object data in memory at a time. The rate
> of growth decreases and eventually stabilises, but by the end we've
> used 60MB of RAM.
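>
> (For anyone reproducing this: the RSS samples can be collected with
> something along the lines of
>
>     $ while sleep 1; do ps -o rss= -p $(pidof read_small); done
>
> which prints the resident set size in KB once per second.)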
>
> We repeat this experiment for the 100KB and 1MB objects and find that
> after all reads they use 140MB and 500MB of RAM respectively, and
> memory usage presumably would continue to grow if there were more
> objects. This is orders of magnitude more memory than I would expect
> these programs to use.
>
>   * We do not get this behaviour with the synchronous API; memory
>     usage remains stable at just a few MB (see the sketch after this
>     list).
>   * We've found that for some reason, this doesn't happen (or doesn't
>     happen as severely) if we intersperse large reads with much
>     smaller reads. In this case, the memory usage seems to stabilise
>     at a reasonable number.
>   * Valgrind only reports a trivial amount of unreachable memory.
>   * Memory usage doesn't increase in this manner if we repeatedly read
>     the same object over and over again. It hovers around 20MB.
>   * In other experiments we've done, with different object data and
>     distributions of object sizes, we've seen memory usage grow even
>     larger in proportion to the amount of data read.
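>
> For comparison, a sketch of the synchronous variant mentioned in the
> first bullet above (illustrative, not the exact code we ran):
>
>     void readobj_sync(rados_ioctx_t io, const char *objname) {
>         char data[1000000];
>
>         /* rados_read blocks until the data has been copied into the
>          * buffer; no completion object is involved, so nothing
>          * outlives the call. */
>         int retval = rados_read(io, objname, data, 10000, 0);
>         assert(retval >= 0);
>     }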
>
> We maintain long-running (order of weeks) services that read objects
> from Ceph and send them elsewhere. Over time, the memory usage of some
> of these services has grown to more than 6GB, which is unreasonable.
>
> --
> Regards,
> Dan G
>

It looks like the async example is missing calls to rados_aio_release()
to clean up the completions. I'm not sure that would account for all of
the memory growth, but that's where I would start. Past that, running
the client under valgrind massif should help with further investigation.
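
For reference, here is the readobj() from the original post with that
cleanup added (a sketch; the two release calls at the end are the only
change):

    void readobj(rados_ioctx_t* io, char objname[]) {
        char data[1000000];
        unsigned long bytes_read;
        rados_completion_t completion;
        int retval;

        rados_read_op_t read_op = rados_create_read_op();
        rados_read_op_read(read_op, 0, 10000, data, &bytes_read, &retval);
        retval = rados_aio_create_completion(NULL, NULL, NULL, &completion);
        assert(retval == 0);

        retval = rados_aio_read_op_operate(read_op, *io, completion,
                                           objname, 0);
        assert(retval == 0);

        rados_aio_wait_for_complete(completion);
        rados_aio_get_return_value(completion);

        /* Free the read op and the completion once the result has been
         * consumed; until released, the completion keeps referencing
         * the result buffers. */
        rados_release_read_op(read_op);
        rados_aio_release(completion);
    }

A massif run would then look something like this (the output file name
includes the pid):

    $ valgrind --tool=massif ./read_small
    $ ms_print massif.out.<pid>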

Casey
