Hi Greg,

Much appreciated for the reply. The image is also available at:
https://tracker.ceph.com/attachments/download/5808/Bytes_per_op.png

How the graph is generated: we back the CephFS metadata pool with Azure
ultra SSD disks. For each disk, Azure reports per-minute averages of the
read/write IOPS (operations per second) and the read/write throughput
(bytes per second). We divide the write throughput by the write IOPS -
this gives the average write bytes per operation, which is what we plot
in the graph above. We observe that this value climbs to around 300 KB,
whereas after restarting the MDS it stays around 32 KB for some time and
then starts increasing again. The read bytes per operation remain much
smaller throughout.

The issue is that once we are in the "high" regime, a workload that does,
for example, 1000 IOPS needs 300 MB/s of throughput instead of the
30 MB/s we observe after a restart. The high throughput often pushes us
into the VM-level limits in Azure, after which the queue depth explodes
and operations begin stalling.

We will do the dump and report it as well once we have it.

Thanks again for any ideas on this.

Kind regards,

Andras

On Thu, Dec 9, 2021, 15:07 Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> Andras,
>
> Unfortunately your attachment didn't come through the list. (It might
> work if you embed it inline? Not sure.) I don't know if anybody's
> looked too hard at this before, and without the image I don't know
> exactly what metric you're using to say something's 320KB in size. Can
> you explain more?
>
> It might help if you dump the objecter_requests from the MDS and share
> those — it'll display what objects are being written to with what
> sizes.
> -Greg
>
>
> On Wed, Dec 8, 2021 at 9:00 AM Andras Sali <sali.andrew@xxxxxxxxx> wrote:
> >
> > Hi All,
> >
> > We have been observing that if we let our MDS run for some time, the
> > bandwidth usage of the disks in the metadata pool starts increasing
> > significantly (whilst IOPS stay about constant), even though the number
> > of clients, the workloads and everything else stay the same.
> >
> > However, after restarting the MDS, the issue goes away for some time and
> > the same workloads require 1/10th of the metadata disk bandwidth whilst
> > doing the same IOPS.
> >
> > We run our CephFS cluster in a cloud environment where disk throughput /
> > bandwidth capacity is quite expensive to increase, and we are hitting
> > bandwidth / throughput limits even though we still have a lot of IOPS
> > capacity left.
> >
> > We suspect that the journaling of the MDS somehow becomes more extensive
> > (i.e. larger journal updates for each operation), but we couldn't really
> > pin down which parameter might affect this.
> >
> > I attach a plot of how the bytes per operation (throughput in MB/s
> > divided by IOPS) evolves over time: when we restart the MDS it drops to
> > around 32 KB (even though the min block size for the metadata pool OSDs
> > is 4 KB in our settings) and then increases over time to around 300 KB.
> >
> > Any ideas on how to "fix" this and get a significantly lower bandwidth
> > usage would be really, really appreciated!
> >
> > Thank you and kind regards,
> >
> > Andras
> > _______________________________________________
> > ceph-users mailing list -- ceph-users@xxxxxxx
> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
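
P.S. For reference, a minimal sketch of the bytes-per-op calculation
described at the top of this message. The function and variable names are
ours, not Azure's actual metric IDs, so treat it as an illustration of the
arithmetic rather than a ready-made monitoring script:

    # Sketch of the bytes-per-operation metric we plot: for each one-minute
    # sample, Azure gives the average write throughput (bytes/sec) and the
    # average write IOPS for a metadata-pool disk; dividing the two gives
    # the average size of a single write.

    def write_bytes_per_op(write_bytes_per_sec: float,
                           write_ops_per_sec: float) -> float:
        """Average bytes written per operation for one sample."""
        if write_ops_per_sec == 0:
            return 0.0
        return write_bytes_per_sec / write_ops_per_sec

    # Shortly after an MDS restart: ~30 MB/s at 1000 write IOPS -> ~30 KB/op.
    print(write_bytes_per_op(30e6, 1000))    # 30000.0

    # In the "high" regime: ~300 MB/s for the same 1000 IOPS -> ~300 KB/op.
    print(write_bytes_per_op(300e6, 1000))   # 300000.0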
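
And a rough sketch of how we plan to collect the objecter_requests dump
Greg asked for, assuming the admin socket command of that name is
reachable on the node running the MDS ("mds.<name>" below is a placeholder
for our actual daemon id). The JSON layout can vary by release, so we just
pretty-print it and will eyeball the per-op object names and sizes before
sharing:

    import json
    import subprocess

    MDS_DAEMON = "mds.<name>"  # placeholder, replace with the real MDS id

    def dump_objecter_requests(daemon: str) -> dict:
        # Query the MDS admin socket for its in-flight objecter requests.
        out = subprocess.check_output(
            ["ceph", "daemon", daemon, "objecter_requests"])
        return json.loads(out)

    if __name__ == "__main__":
        print(json.dumps(dump_objecter_requests(MDS_DAEMON), indent=2))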