Hi all,
We're reading from a Ceph Luminous pool using the librados asynchronous
I/O API. We're seeing some concerning memory usage patterns when we
read many objects in sequence.
The expected behaviour is that our memory usage stabilises at a small
amount, since we're just fetching objects and ignoring their data.
What we instead find is that the memory usage of our program grows
linearly with the amount of data read for an interval of time, and
then continues to grow at a much slower but still consistent pace.
This memory is not freed until program termination. My guess is that
this is an issue with Ceph's memory allocator.
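One way to test that guess (a minimal sketch, assuming librados here is
linked against gperftools tcmalloc, as in the upstream Ceph packages;
this is not something we've run yet) would be to ask the allocator to
return its cached free pages to the OS and see whether RSS drops:

/* Sketch, assuming the process allocator is gperftools tcmalloc.
 * Link with -ltcmalloc (or rely on librados pulling it in). */
#include <stdio.h>
#include <gperftools/malloc_extension_c.h>

static void release_heap_to_os(void) {
    char stats[8192];
    MallocExtension_GetStats(stats, (int)sizeof(stats)); /* before */
    printf("%s\n", stats);
    MallocExtension_ReleaseFreeMemory(); /* return free pages to the OS */
    MallocExtension_GetStats(stats, (int)sizeof(stats)); /* after */
    printf("%s\n", stats);
}

If RSS shrinks substantially after the release call, the growth would
point at allocator caching rather than memory still held by librados.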
To demonstrate, we create 20000 objects at each of three sizes: 10KB,
100KB, and 1MB:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <rados/librados.h>

int main() {
    rados_t cluster;
    rados_create(&cluster, "test");
    rados_conf_read_file(cluster, "/etc/ceph/ceph.conf");
    rados_connect(cluster);

    rados_ioctx_t io;
    rados_ioctx_create(cluster, "test", &io);

    /* 1MB of filler; each object is written from a prefix of this
     * buffer. Error checking is omitted for brevity. */
    char data[1000000];
    memset(data, 'a', 1000000);

    char smallobj_name[16], mediumobj_name[16], largeobj_name[16];
    int i;
    for (i = 0; i < 20000; i++) {
        sprintf(smallobj_name, "10kobj_%d", i);
        rados_write(io, smallobj_name, data, 10000, 0);
        sprintf(mediumobj_name, "100kobj_%d", i);
        rados_write(io, mediumobj_name, data, 100000, 0);
        sprintf(largeobj_name, "1mobj_%d", i);
        rados_write(io, largeobj_name, data, 1000000, 0);
        printf("wrote %s of size 10000, %s of size 100000, %s of size 1000000\n",
               smallobj_name, mediumobj_name, largeobj_name);
    }

    rados_ioctx_destroy(io);
    rados_shutdown(cluster);
    return 0;
}
$ gcc create.c -lrados -o create
$ ./create
wrote 10kobj_0 of size 10000, 100kobj_0 of size 100000, 1mobj_0 of
size 1000000
wrote 10kobj_1 of size 10000, 100kobj_1 of size 100000, 1mobj_1 of
size 1000000
[...]
wrote 10kobj_19998 of size 10000, 100kobj_19998 of size 100000,
1mobj_19998 of size 1000000
wrote 10kobj_19999 of size 10000, 100kobj_19999 of size 100000,
1mobj_19999 of size 1000000
Now we read each of these objects with the async API, into the same
buffer, starting with just the 10KB objects:
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <rados/librados.h>

void readobj(rados_ioctx_t* io, char objname[]);

int main() {
    rados_t cluster;
    rados_create(&cluster, "test");
    rados_conf_read_file(cluster, "/etc/ceph/ceph.conf");
    rados_connect(cluster);

    rados_ioctx_t io;
    rados_ioctx_create(cluster, "test", &io);

    char smallobj_name[16];
    int i, total_bytes_read = 0;
    for (i = 0; i < 20000; i++) {
        sprintf(smallobj_name, "10kobj_%d", i);
        readobj(&io, smallobj_name);
        total_bytes_read += 10000;
        printf("Read %s for total %d\n", smallobj_name, total_bytes_read);
    }

    /* Hold the process open so its memory usage can be inspected. */
    getchar();
    return 0;
}

void readobj(rados_ioctx_t* io, char objname[]) {
    char data[1000000];
    unsigned long bytes_read;
    rados_completion_t completion;
    int retval;

    /* Read the first 10000 bytes of the object asynchronously,
     * then block until the operation completes. */
    rados_read_op_t read_op = rados_create_read_op();
    rados_read_op_read(read_op, 0, 10000, data, &bytes_read, &retval);
    retval = rados_aio_create_completion(NULL, NULL, NULL, &completion);
    assert(retval == 0);
    retval = rados_aio_read_op_operate(read_op, *io, completion, objname, 0);
    assert(retval == 0);
    rados_aio_wait_for_complete(completion);
    rados_aio_get_return_value(completion);
}
$ gcc read.c -lrados -o read_small -Wall -g && ./read_small
Read 10kobj_0 for total 10000
Read 10kobj_1 for total 20000
[...]
Read 10kobj_19998 for total 199990000
Read 10kobj_19999 for total 200000000
We read 200MB in total. A graph of the program's resident set size is
attached as mem-graph-10k.png, with seconds on the x axis and KB on the
y axis. You can see that memory usage increases throughout, which is
itself unexpected, since that memory should be freed over time and we
should only hold 10KB of object data in memory at any moment. The rate
of growth decreases and eventually stabilises, and by the end we've
used 60MB of RAM.
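(For reference, RSS over time can be sampled with standard tools while
read_small runs; a sketch along these lines, though the attached graph
may have been collected differently:)

$ pid=$(pgrep -x read_small)
$ while kill -0 $pid 2>/dev/null; do ps -o rss= -p $pid; sleep 1; done > rss.log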
We repeat this experiment for the 100KB and 1MB objects and find that
after all reads they use 140MB and 500MB of RAM respectively, and
memory usage would presumably continue to grow if there were more
objects. This is orders of magnitude more memory than I would expect
these programs to use.
* We do not see this behaviour with the synchronous API: memory
  usage remains stable at just a few MB. (A sketch of the
  synchronous variant follows this list.)
* For some reason this doesn't happen, or doesn't happen as
  severely, if we intersperse large reads with much smaller
  reads. In that case, memory usage seems to stabilise at a
  reasonable level.
* Valgrind only reports a trivial amount of unreachable memory.
* Memory usage doesn't increase in this manner if we repeatedly read
the same object over and over again. It hovers around 20MB.
* In other experiments we've done, with different object data and
distributions of object sizes, we've seen memory usage grow even
larger in proportion to the amount of data read.
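For reference, here is a minimal sketch of the synchronous variant
mentioned in the first bullet, written as a drop-in replacement for
readobj() in read.c above (our actual synchronous test may differ in
details):

void readobj_sync(rados_ioctx_t* io, char objname[]) {
    char data[1000000];
    /* Blocking read of the first 10000 bytes of the object; returns
     * the number of bytes read, or a negative error code. No
     * completion object is involved. */
    int retval = rados_read(*io, objname, data, 10000, 0);
    assert(retval >= 0);
}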
We maintain long-running (order of weeks) services that read objects
from Ceph and send them elsewhere. Over time, the memory usage of some
of these services has grown to more than 6GB, which is unreasonable.
--
Regards,
Dan G