Re: rbd kernel block driver memory usage


 



In the case of the object map, which the driver loads, it takes 2 bits per 4 MB of image size; a 16 TB image requires 1 MB of memory.
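
A quick back-of-the-envelope sketch of that math in Python (just the
arithmetic, assuming the default 4 MB object size; not tied to any rbd
API):

    # object map: 2 bits per object, default object size 4 MiB
    def object_map_bytes(image_bytes, object_size=4 * 2**20):
        num_objects = image_bytes // object_size
        return num_objects * 2 // 8               # bits -> bytes

    print(object_map_bytes(16 * 2**40) // 2**20)  # 16 TB image -> 1 MiB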

I was trying to get a sense of whether to look deeper into the rbd driver in an OOM kill scenario.

If you are looking into OOM, maybe look into lowering queue_depth, which you can specify when you map the image. Technically it belongs to the block layer queue rather than the rbd driver itself. If you write with a 4 MB block size and your queue_depth is 1000, you need 4 GB of memory for in-flight data for that single image; if you have many images, it can add up.
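
To put numbers on that (a rough sketch of the same arithmetic, not tied
to any rbd API; the values are the ones from the example above):

    # in-flight data per mapped image: roughly queue_depth * write size
    def inflight_bytes(queue_depth, block_size):
        return queue_depth * block_size

    gib = inflight_bytes(1000, 4 * 2**20) / 2**30
    print(round(gib, 2))   # ~3.91 GiB for one image at queue_depth=1000

Multiply by the number of mapped images for a worst-case bound on the
in-flight data alone.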

/maged


On 26/01/2023 16:36, Stefan Hajnoczi wrote:
On Thu, Jan 26, 2023 at 02:48:27PM +0100, Ilya Dryomov wrote:
On Wed, Jan 25, 2023 at 5:57 PM Stefan Hajnoczi <stefanha@xxxxxxxxxx> wrote:
Hi,
What sort of memory usage is expected under heavy I/O to an rbd block
device with O_DIRECT?

For example:
- Page cache: none (O_DIRECT)
- Socket snd/rcv buffers: yes

Hi Stefan,

There is a socket open to each OSD (object storage daemon).  A Ceph
cluster may have tens, hundreds or even thousands of OSDs (although the
latter is rare -- usually folks end up with several smaller clusters
instead of a single large cluster).  Under heavy random I/O and given
a big enough RBD image, it's reasonable to assume that most if not all
OSDs would be involved and therefore their sessions would be active.

A thing to note is that, by default, OSD sessions are shared between
RBD devices.  So as long as all RBD images that are mapped on a node
belong to the same cluster, the same set of sockets would be used.

Idle OSD sockets get closed after 60 seconds of inactivity.


- Internal rbd buffers?

I am trying to understand how Linux rbd block devices behave compared
to local block devices (like NVMe PCI) in terms of memory consumption.

RBD doesn't do any internal buffering.  Data is read from/written to
the wire directly to/from BIO pages.  The only exception to that is the
"secure" mode -- built-in encryption for Ceph on-the-wire protocol.  In
that case the data is buffered, partly because RBD obviously can't mess
with plaintext data in the BIO and partly because the Linux kernel
crypto API isn't flexible enough.

There is some memory overhead associated with each I/O (OSD request
metadata encoding, mostly).  It's surely larger than in the NVMe PCI
case.  I don't have the exact number but it should be less than 4K per
I/O in almost all cases.  This memory is coming out of private SLAB
caches and could be reclaimable had we set SLAB_RECLAIM_ACCOUNT on
them.

Thanks, this information is very useful. I was trying to get a sense of
whether to look deeper into the rbd driver in an OOM kill scenario.

Stefan



