Re: RADOS async client memory usage explodes when reading several objects in sequence

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



One of my colleagues believes he's tracked down the source of the missing deallocations in the librados code. I'll let him reply himself with his findings. For now, we've found a workaround: when an object is reread, the memory allocated for it seems to be freed and a new block is allocated for the size of the new read. So you can effectively "free" the memory allocated for an object by rereading one byte from it.

In my example, if after each read operation you execute a new read operation that simply reads one byte from the same object, the memory usage stabilises at a handful of MB, as expected.

On Thu, Sep 13, 2018 at 6:47 PM Daniel Goldbach <dan.goldbach@xxxxxxxxx> wrote:
Yes I understand that. If you look at the example, the data buffer is stack allocated and hence its memory is freed when the stack frame for readobj is destroyed. Additionally, no leak occurs if I comment out the rados_read_op_operate line. This is a problem with librados, not with my example.



On Thu, Sep 13, 2018 at 6:22 PM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
On Thu, Sep 13, 2018 at 6:35 AM Daniel Goldbach <dan.goldbach@xxxxxxxxx> wrote:
I'm sure I'm not forgetting to free any buffers. I'm not even allocating any heap memory in the example above.

On further investigation, the same issue *does* happen with the synchronous read operation API. I erroneously said that the issue doesn't happen with the synchronous API when what I meant to say was that the issue doesn't happen when calling the rados_read() function directly to read from an object.

Using the synchronous read operation API, the following four commands are sufficient to trigger the 'leak':

    rados_read_op_t read_op = rados_create_read_op();
    rados_read_op_read(read_op, 0, SIZE, data, &bytes_read, &retval);
    retval = rados_read_op_operate(read_op, *io, objname, 0);
    assert(retval == 0);
    rados_release_read_op(read_op);

Curiously, the presence of the final line "rados_release_read_op(read_op);" does not seem to affect the memory usage of the program at all! It can be removed and the program's memory profile looks the same. It's as if the release mechanism is totally broken.

I agree that the source is most likely mismanagement of C_bl_to_buf. rados_release_read_op should call Objecter.h:~ObjectOperation() on the read op, which in turn should delete the the C_bl_to_buf whose pointer was pushed onto the ObjectOperation's out_handler vector in ObjectOperation::read(..). Something around that bit of code must be broken.

Okay, but you get that *you* are responsible for cleaning up the "data" buffer, right? librados isn't going to delete that for you automatically as it has no way of knowing when it should do so.
 



On Wed, Sep 12, 2018 at 5:35 PM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
That code bit is just "we have an incoming message with data", which is what we'd expect, but means it's not very helpful for tracking down the source of any leaks.

My guess is still very much that somehow there are deallocations missing here. Internally, the synchronous API is wrapping the async one so it'd be subject to the same bugs. (Although perhaps not terrible malloc behavior, but if valgrind reports the same memory usage as the RSS being reported then I don't think it can be malloc behavior.)

Are you actually freeing the buffers you provide when you're done with them? I guess there could be something with the "C_bl_to_buf" structure getting managed wrong as well since that looks to be unique to this code path, but it wouldn't depend on the size of the objects since it's just 4 pointers/ints.
-Greg

On Wed, Sep 12, 2018 at 8:43 AM Daniel Goldbach <dan.goldbach@xxxxxxxxx> wrote:
The issue continues even when I do rados_aio_release(completion) at the end of the readobj(..) definition in the example. Also, in our production code we call rados_aio_release for each completion and we still see the issue there. The release command doesn't guarantee instant release, so could it be that the release operations are getting queued up but never executed?

Valgrind massif shows that the relevant allocations are all happening in the bit of code in the following stack trace:

--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
 62    166,854,775       82,129,696       81,808,615       321,081            0
 63    168,025,321       83,155,872       82,834,072       321,800            0
99.61% (82,834,072B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->93.75% (77,955,072B) 0x579AC05: ceph::buffer::create_aligned_in_mempool(unsigned int, unsigned int, int) (in /usr/lib/ceph/libceph-common.so.0)
| ->93.75% (77,955,072B) 0x597BB48: AsyncConnection::process() (in /usr/lib/ceph/libceph-common.so.0)
| | ->93.75% (77,955,072B) 0x598BC96: EventCenter::process_events(int, std::chrono::duration<unsigned long, std::ratio<1l, 1000000000l> >*) (in /usr/lib/ceph/libceph-common.so.0)
| |   ->93.75% (77,955,072B) 0x5990816: ??? (in /usr/lib/ceph/libceph-common.so.0)
| |     ->93.75% (77,955,072B) 0xE957C7E: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
| |       ->93.75% (77,955,072B) 0xE6896B8: start_thread (pthread_create.c:333)
| |         ->93.75% (77,955,072B) 0x529741B: clone (clone.S:109)
| |
| ->00.00% (0B) in 1+ places, all below ms_print's threshold (01.00%)
|
->05.87% (4,879,000B) in 387 places, all below massif's threshold (1.00%)



On Wed, Sep 12, 2018 at 4:05 PM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
Yep, those completions are maintaining bufferlist references IIRC, so they’re definitely holding the memory buffers in place!
On Wed, Sep 12, 2018 at 7:04 AM Casey Bodley <cbodley@xxxxxxxxxx> wrote:


On 09/12/2018 05:29 AM, Daniel Goldbach wrote:
> Hi all,
>
> We're reading from a Ceph Luminous pool using the librados asychronous
> I/O API. We're seeing some concerning memory usage patterns when we
> read many objects in sequence.
>
> The expected behaviour is that our memory usage stabilises at a small
> amount, since we're just fetching objects and ignoring their data.
> What we instead find is that the memory usage of our program grows
> linearly with the amount of data read for an interval of time, and
> then continues to grow at a much slower but still consistent pace.
> This memory is not freed until program termination. My guess is that
> this is an issue with Ceph's memory allocator.
>
> To demonstrate, we create 20000 objects of size 10KB, and of size
> 100KB, and of size 1MB:
>
>     #include <stdio.h>
>     #include <stdlib.h>
>     #include <string.h>
>     #include <rados/librados.h>
>
>     int main() {
> rados_t cluster;
> rados_create(&cluster, "test");
> rados_conf_read_file(cluster, "/etc/ceph/ceph.conf");
> rados_connect(cluster);
>
> rados_ioctx_t io;
> rados_ioctx_create(cluster, "test", &io);
>
>         char data[1000000];
> memset(data, 'a', 1000000);
>
>         char smallobj_name[16], mediumobj_name[16], largeobj_name[16];
>         int i;
>         for (i = 0; i < 20000; i++) {
> sprintf(smallobj_name, "10kobj_%d", i);
> rados_write(io, smallobj_name, data, 10000, 0);
>
> sprintf(mediumobj_name, "100kobj_%d", i);
> rados_write(io, mediumobj_name, data, 100000, 0);
>
> sprintf(largeobj_name, "1mobj_%d", i);
> rados_write(io, largeobj_name, data, 1000000, 0);
>
> printf("wrote %s of size 10000, %s of size 100000, %s of size 1000000\n",
>       smallobj_name, mediumobj_name, largeobj_name);
>         }
>
> return 0;
>     }
>
>     $ gcc create.c -lrados -o create
>     $ ./create
>     wrote 10kobj_0 of size 10000, 100kobj_0 of size 100000, 1mobj_0 of
> size 1000000
>     wrote 10kobj_1 of size 10000, 100kobj_1 of size 100000, 1mobj_1 of
> size 1000000
>     [...]
>     wrote 10kobj_19998 of size 10000, 100kobj_19998 of size 100000,
> 1mobj_19998 of size 1000000
>     wrote 10kobj_19999 of size 10000, 100kobj_19999 of size 100000,
> 1mobj_19999 of size 1000000
>
> Now we read each of these objects with the async API, into the same
> buffer. First we read just the the 10KB objects first:
>
>     #include <assert.h>
>     #include <stdio.h>
>     #include <stdlib.h>
>     #include <string.h>
>     #include <rados/librados.h>
>
>     void readobj(rados_ioctx_t* io, char objname[]);
>
>     int main() {
>         rados_t cluster;
> rados_create(&cluster, "test");
> rados_conf_read_file(cluster, "/etc/ceph/ceph.conf");
> rados_connect(cluster);
>
> rados_ioctx_t io;
> rados_ioctx_create(cluster, "test", &io);
>
>         char smallobj_name[16];
>         int i, total_bytes_read = 0;
>
>         for (i = 0; i < 20000; i++) {
> sprintf(smallobj_name, "10kobj_%d", i);
> readobj(&io, smallobj_name);
>
> total_bytes_read += 10000;
> printf("Read %s for total %d\n", smallobj_name, total_bytes_read);
>         }
>
> getchar();
>         return 0;
>     }
>
>     void readobj(rados_ioctx_t* io, char objname[]) {
>         char data[1000000];
>         unsigned long bytes_read;
> rados_completion_t completion;
>         int retval;
>
> rados_read_op_t read_op = rados_create_read_op();
> rados_read_op_read(read_op, 0, 10000, data, &bytes_read, &retval);
>         retval = rados_aio_create_completion(NULL, NULL, NULL,
> &completion);
> assert(retval == 0);
>
>         retval = rados_aio_read_op_operate(read_op, *io, completion,
> objname, 0);
> assert(retval == 0);
>
> rados_aio_wait_for_complete(completion);
> rados_aio_get_return_value(completion);
>     }
>
>     $ gcc read.c -lrados -o read_small -Wall -g && ./read_small
>     Read 10kobj_0 for total 10000
>     Read 10kobj_1 for total 20000
>     [...]
>     Read 10kobj_19998 for total 199990000
>     Read 10kobj_19999 for total 200000000
>
> We read 200MB. A graph of the resident set size of the program is
> attached as mem-graph-10k.png, with seconds on x axis and KB on the y
> axis. You can see that the memory usage increases throughout, which
> itself is unexpected since that memory should be freed over time and
> we should only hold 10KB of object data in memory at a time. The rate
> of growth decreases and eventually stabilises, and by the end we've
> used 60MB of RAM.
>
> We repeat this experiment for the 100KB and 1MB objects and find that
> after all reads they use 140MB and 500MB of RAM, and memory usage
> presumably would continue to grow if there were more objects. This is
> orders of magnitude more memory than what I would expect these
> programs to use.
>
>   * We do not get this behaviour with the synchronous API, and the
>     memory usage remains stable at just a few MB.
>   * We've found that for some reason, this doesn't happen (or doesn't
>     happen as severely) if we intersperse large reads with much
>     smaller reads. In this case, the memory usage seems to stabilise
>     at a reasonable number.
>   * Valgrind only reports a trivial amount of unreachable memory.
>   * Memory usage doesn't increase in this manner if we repeatedly read
>     the same object over and over again. It hovers around 20MB.
>   * In other experiments we've done, with different object data and
>     distributions of object sizes, we've seen memory usage grow even
>     larger in proportion to the amount of data read.
>
> We maintain a long-running (order of weeks) services that read objects
> from Ceph and send them elsewhere. Over time, the memory usage of some
> of these services have grown to more than 6GB, which is unreasonable.
>
> --
> Regards,
> Dan G
>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

It looks like the async example is missing calls to rados_aio_release()
to clean up the completions. I'm not sure that would account for all of
the memory growth, but that's where I would start. Past that, running
the client under valgrind massif should help with further investigation.

Casey
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


--
Regards,
Dan Goldbach


--
Regards,
Dan Goldbach


--
Regards,
Dan Goldbach


--
Regards,
Dan Goldbach
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

[Index of Archives]     [Information on CEPH]     [Linux Filesystem Development]     [Ceph Development]     [Ceph Large]     [Ceph Dev]     [Linux USB Development]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [xfs]


  Powered by Linux