On Wed, Sep 12, 2018 at 7:04 AM Casey Bodley <cbodley@xxxxxxxxxx> wrote:
On 09/12/2018 05:29 AM, Daniel Goldbach wrote:
> Hi all,
>
> We're reading from a Ceph Luminous pool using the librados asynchronous
> I/O API. We're seeing some concerning memory usage patterns when we
> read many objects in sequence.
>
> The expected behaviour is that our memory usage stabilises at a small
> amount, since we're just fetching objects and ignoring their data.
> What we instead find is that the memory usage of our program grows
> linearly with the amount of data read for an interval of time, and
> then continues to grow at a much slower but still consistent pace.
> This memory is not freed until program termination. My guess is that
> this is an issue with Ceph's memory allocator.
>
> To demonstrate, we create 20000 objects each of size 10KB, another 20000
> of size 100KB, and another 20000 of size 1MB:
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <rados/librados.h>
>
> int main() {
>     rados_t cluster;
>     rados_create(&cluster, "test");
>     rados_conf_read_file(cluster, "/etc/ceph/ceph.conf");
>     rados_connect(cluster);
>
>     rados_ioctx_t io;
>     rados_ioctx_create(cluster, "test", &io);
>
>     char data[1000000];
>     memset(data, 'a', 1000000);
>
>     char smallobj_name[16], mediumobj_name[16], largeobj_name[16];
>     int i;
>     for (i = 0; i < 20000; i++) {
>         sprintf(smallobj_name, "10kobj_%d", i);
>         rados_write(io, smallobj_name, data, 10000, 0);
>
>         sprintf(mediumobj_name, "100kobj_%d", i);
>         rados_write(io, mediumobj_name, data, 100000, 0);
>
>         sprintf(largeobj_name, "1mobj_%d", i);
>         rados_write(io, largeobj_name, data, 1000000, 0);
>
>         printf("wrote %s of size 10000, %s of size 100000, %s of size 1000000\n",
>                smallobj_name, mediumobj_name, largeobj_name);
>     }
>
>     return 0;
> }
>
> $ gcc create.c -lrados -o create
> $ ./create
> wrote 10kobj_0 of size 10000, 100kobj_0 of size 100000, 1mobj_0 of
> size 1000000
> wrote 10kobj_1 of size 10000, 100kobj_1 of size 100000, 1mobj_1 of
> size 1000000
> [...]
> wrote 10kobj_19998 of size 10000, 100kobj_19998 of size 100000,
> 1mobj_19998 of size 1000000
> wrote 10kobj_19999 of size 10000, 100kobj_19999 of size 100000,
> 1mobj_19999 of size 1000000
>
> Now we read each of these objects with the async API, into the same
> buffer. First we read just the 10KB objects:
>
> #include <assert.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <rados/librados.h>
>
> void readobj(rados_ioctx_t* io, char objname[]);
>
> int main() {
>     rados_t cluster;
>     rados_create(&cluster, "test");
>     rados_conf_read_file(cluster, "/etc/ceph/ceph.conf");
>     rados_connect(cluster);
>
>     rados_ioctx_t io;
>     rados_ioctx_create(cluster, "test", &io);
>
>     char smallobj_name[16];
>     int i, total_bytes_read = 0;
>
>     for (i = 0; i < 20000; i++) {
>         sprintf(smallobj_name, "10kobj_%d", i);
>         readobj(&io, smallobj_name);
>
>         total_bytes_read += 10000;
>         printf("Read %s for total %d\n", smallobj_name, total_bytes_read);
>     }
>
>     getchar();
>     return 0;
> }
>
> void readobj(rados_ioctx_t* io, char objname[]) {
>     char data[1000000];
>     unsigned long bytes_read;
>     rados_completion_t completion;
>     int retval;
>
>     rados_read_op_t read_op = rados_create_read_op();
>     rados_read_op_read(read_op, 0, 10000, data, &bytes_read, &retval);
>     retval = rados_aio_create_completion(NULL, NULL, NULL, &completion);
>     assert(retval == 0);
>
>     retval = rados_aio_read_op_operate(read_op, *io, completion, objname, 0);
>     assert(retval == 0);
>
>     rados_aio_wait_for_complete(completion);
>     rados_aio_get_return_value(completion);
> }
>
> $ gcc read.c -lrados -o read_small -Wall -g && ./read_small
> Read 10kobj_0 for total 10000
> Read 10kobj_1 for total 20000
> [...]
> Read 10kobj_19998 for total 199990000
> Read 10kobj_19999 for total 200000000
>
> We read 200MB. A graph of the resident set size of the program is
> attached as mem-graph-10k.png, with seconds on the x axis and KB on the
> y axis. You can see that the memory usage increases throughout, which
> itself is unexpected since that memory should be freed over time and
> we should only hold 10KB of object data in memory at a time. The rate
> of growth decreases and eventually stabilises, and by the end we've
> used 60MB of RAM.
>
> We repeat this experiment for the 100KB and 1MB objects and find that
> after all reads they use 140MB and 500MB of RAM respectively, and memory
> usage would presumably continue to grow if there were more objects. This
> is orders of magnitude more memory than I would expect these programs
> to use.
>
> * We do not get this behaviour with the synchronous API, and the
>   memory usage remains stable at just a few MB.
> * We've found that for some reason, this doesn't happen (or doesn't
>   happen as severely) if we intersperse large reads with much
>   smaller reads. In this case, the memory usage seems to stabilise
>   at a reasonable number.
> * Valgrind only reports a trivial amount of unreachable memory.
> * Memory usage doesn't increase in this manner if we repeatedly read
>   the same object over and over again. It hovers around 20MB.
> * In other experiments we've done, with different object data and
>   distributions of object sizes, we've seen memory usage grow even
>   larger in proportion to the amount of data read.
>
> We maintain long-running (order of weeks) services that read objects
> from Ceph and send them elsewhere. Over time, the memory usage of some
> of these services has grown to more than 6GB, which is unreasonable.
>
> --
> Regards,
> Dan G
>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
It looks like the async example is missing calls to rados_aio_release()
to clean up the completions. I'm not sure that would account for all of
the memory growth, but that's where I would start. Past that, running
the client under valgrind massif should help with further investigation.
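
For reference, a rough, untested sketch of readobj() from the example
above with that cleanup added; note that the rados_read_op_t returned by
rados_create_read_op() also needs a matching rados_release_read_op():

void readobj(rados_ioctx_t* io, char objname[]) {
    char data[1000000];
    unsigned long bytes_read;
    rados_completion_t completion;
    int retval;

    rados_read_op_t read_op = rados_create_read_op();
    rados_read_op_read(read_op, 0, 10000, data, &bytes_read, &retval);
    retval = rados_aio_create_completion(NULL, NULL, NULL, &completion);
    assert(retval == 0);

    retval = rados_aio_read_op_operate(read_op, *io, completion, objname, 0);
    assert(retval == 0);

    rados_aio_wait_for_complete(completion);
    rados_aio_get_return_value(completion);

    /* release the completion and the read op once we're done with them */
    rados_aio_release(completion);
    rados_release_read_op(read_op);
}

If memory still grows after that, running it under massif (e.g.
valgrind --tool=massif ./read_small, then ms_print on the output file)
will show which call paths the remaining allocations are coming from.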
Casey
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com