On Wed, Jan 25, 2023 at 5:57 PM Stefan Hajnoczi <stefanha@xxxxxxxxxx> wrote:
>
> Hi,
> What sort of memory usage is expected under heavy I/O to an rbd block
> device with O_DIRECT?
>
> For example:
> - Page cache: none (O_DIRECT)
> - Socket snd/rcv buffers: yes

Hi Stefan,

There is a socket open to each OSD (object storage daemon).  A Ceph
cluster may have tens, hundreds or even thousands of OSDs (although the
latter is rare -- usually folks end up with several smaller clusters
instead of a single large cluster).  Under heavy random I/O and given
a big enough RBD image, it's reasonable to assume that most if not all
OSDs would be involved and therefore their sessions would be active.

A thing to note is that, by default, OSD sessions are shared between
RBD devices.  So as long as all RBD images that are mapped on a node
belong to the same cluster, the same set of sockets is used.  Idle OSD
sockets get closed after 60 seconds of inactivity.

> - Internal rbd buffers?
>
> I am trying to understand how similar Linux rbd block devices behave
> compared to local block device memory consumption (like NVMe PCI).

RBD doesn't do any internal buffering.  Data is read from/written to
the wire directly to/from BIO pages.  The only exception to that is the
"secure" mode -- built-in encryption for the Ceph on-the-wire protocol.
In that case the data is buffered, partly because RBD obviously can't
mess with plaintext data in the BIO and partly because the Linux kernel
crypto API isn't flexible enough.

There is some memory overhead associated with each I/O (mostly OSD
request metadata encoding).  It's certainly larger than in the NVMe PCI
case.  I don't have the exact number, but it should be less than 4K per
I/O in almost all cases.  This memory comes out of private SLAB caches
and would be reclaimable had we set SLAB_RECLAIM_ACCOUNT on them.

Thanks,

                Ilya
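
P.S. For reference, a minimal sketch of what creating a slab cache with
SLAB_RECLAIM_ACCOUNT looks like, so that its pages are accounted as
reclaimable.  The cache name and object size here are hypothetical and
not the actual RBD/libceph caches:

#include <linux/module.h>
#include <linux/slab.h>

/* Hypothetical cache for per-I/O request metadata. */
static struct kmem_cache *example_req_cache;

static int __init example_init(void)
{
	/*
	 * SLAB_RECLAIM_ACCOUNT makes the cache's pages count towards
	 * reclaimable slab memory (SReclaimable in /proc/meminfo).
	 */
	example_req_cache = kmem_cache_create("example_osd_req",
					      512,	/* hypothetical object size */
					      0,	/* default alignment */
					      SLAB_RECLAIM_ACCOUNT,
					      NULL);	/* no constructor */
	if (!example_req_cache)
		return -ENOMEM;
	return 0;
}

static void __exit example_exit(void)
{
	kmem_cache_destroy(example_req_cache);
}

module_init(example_init);
module_exit(example_exit);
MODULE_LICENSE("GPL");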