Is rook/CSI still not using efficient rbd object maps? It could be that you issued a new benchmark while Ceph was still busy (inefficiently) removing the old rbd images. This is quite a stretch, but it could be worth exploring.

On Mon, Jul 25, 2022, 21:42 Mark Nelson <mnelson@xxxxxxxxxx> wrote:

> I don't think so if this is just plain old RBD. RBD shouldn't require a bunch of RocksDB iterator seeks in the read/write hot path, and writes should pretty quickly clear out tombstones as part of the memtable flush and compaction process, even in the slow case. Maybe in some kind of pathologically bad read-only corner case with no onode cache, but it would be bad for more reasons than what's happening in that tracker ticket imho (even reading onodes from the RocksDB block cache is significantly slower than BlueStore's onode cache).
>
> If RBD mirror (or snapshots) are involved, that could be a different story though. I believe that to deal with deletes in that case we have to go through iteration/deletion loops that have the same root issue as what's going on in the tracker ticket, and it can end up impacting client IO. Gabi and Paul are testing/reworking how the snapmapper works, and I've started a sort of catch-all PR for improving our RocksDB tunings/glue here:
>
> https://github.com/ceph/ceph/pull/47221
>
> Mark
>
> On 7/25/22 12:48, Frank Schilder wrote:
> > Could it be related to this performance death trap: https://tracker.ceph.com/issues/55324 ?
> >
> > =================
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > ________________________________________
> > From: Mark Nelson <mnelson@xxxxxxxxxx>
> > Sent: 25 July 2022 18:50
> > To: ceph-users@xxxxxxx
> > Subject: Re: weird performance issue on ceph
> >
> > Hi Zoltan,
> >
> > We have a very similar setup with one of our upstream community performance test clusters: 60 4TB PM983 drives spread across 10 nodes. We get similar numbers to what you are initially seeing (scaled down to 60 drives), though with somewhat lower random read IOPS (we tend to max out at around 2M with 60 drives on this HW). I haven't seen any issues with Quincy like what you are describing, but on this cluster most of the tests have been on bare metal. One issue we have noticed with the PM983 drives is that they may be more susceptible to non-optimal write patterns causing slowdowns than other NVMe drives in the lab. We actually had to issue a last-minute PR for Quincy to change the disk allocation behavior to deal with it. See:
> >
> > https://github.com/ceph/ceph/pull/45771
> > https://github.com/ceph/ceph/pull/45884
> >
> > I don't *think* this is the issue you are hitting, since the fix in #45884 should have taken care of it, but it might be something to keep in the back of your mind. Otherwise, the fact that you are seeing such a dramatic difference across both small and large read/write benchmarks makes me think there is something else going on. Is there any chance that some other bottleneck is being imposed when the pods and volumes are deleted and recreated? It might be worth looking at memory and CPU usage of the OSDs in all of the cases, and at the RocksDB flushing/compaction stats from the OSD logs. Also do a quick check with collectl/iostat/sar during the slow case to make sure none of the drives are showing high latency or built-up IOs in the device queues.
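> >
> > For example, something along these lines (an illustrative, untested sketch; substitute your own OSD IDs, device names and log paths, and in a Rook deployment run the "ceph daemon" commands from inside the OSD containers where the admin sockets live):
> >
> > # per-device latency and queue depth while the slow benchmark is running
> > iostat -x 1
> > sar -d 1
> >
> > # CPU and memory usage of the OSD processes on one node
> > top -b -n 1 -p "$(pgrep -d, ceph-osd)"
> >
> > # per-OSD internals via the admin socket (perf counters and memory pools)
> > ceph daemon osd.0 perf dump
> > ceph daemon osd.0 dump_mempools
> >
> > # RocksDB flush/compaction activity in the OSD log
> > grep -iE 'compaction|flush' /var/log/ceph/ceph-osd.0.log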
> >
> > If you want to go deeper down the rabbit hole, you can try running my wallclock profiler against one of your OSDs in the fast/slow cases, but you'll have to make sure it has access to debug symbols:
> >
> > https://github.com/markhpc/uwpmp.git
> >
> > Run it like:
> >
> > ./uwpmp -n 10000 -p <pid of ceph-osd> -b libdw > output.txt
> >
> > If the libdw backend is having problems, you can use -b libdwarf instead, but it's much slower and takes longer to collect as many samples (you might want to do -n 1000 instead).
> >
> > Mark
> >
> > On 7/25/22 11:17, Zoltan Langi wrote:
> >> Hi people, we have an interesting issue here and I would like to ask if anyone has seen anything like this before.
> >>
> >> First, our system:
> >>
> >> The Ceph version is 17.2.1, but we have also seen the same behaviour on 16.2.9.
> >>
> >> Our kernel version is 5.13.0-51 and our NVMe disks are Samsung PM983.
> >>
> >> In our deployment we have 12 nodes and 72 disks in total, with 2 OSDs per disk, which makes 144 OSDs in total.
> >>
> >> The deployment was done by ceph-rook with default values: 6 CPU cores and 4 GB of memory allocated to each OSD.
> >>
> >> The issue we are experiencing: we create, for example, 100 volumes via ceph-csi and attach them to Kubernetes pods via rbd. We are talking about 100 volumes in total, 2 GB each. We run fio performance tests (read, write, mixed) on them, so the volumes are being used heavily. Ceph delivers good performance, no problems at all.
> >>
> >> Performance we get, for example: read IOPS 3371027, write IOPS 727714, read BW 79.9 GB/s, write BW 31.2 GB/s.
> >>
> >> After the tests are complete, these volumes just sit there doing nothing for a longer period of time, for example 48 hours. After that, we clean up the pods, clean up the volumes and delete them.
> >>
> >> We then recreate the volumes and pods once more with the same spec (100 pods, 2 GB each) and run the same tests again. We don't even get half the performance we measured before leaving the pods sitting there doing nothing for 2 days.
> >>
> >> Performance we get after deleting the volumes, recreating them and rerunning the tests: read IOPS 1716239, write IOPS 370631, read BW 37.8 GB/s, write BW 7.47 GB/s.
> >>
> >> We can clearly see that it's a big performance loss.
> >>
> >> If we clean up the Ceph deployment, wipe the disks completely and redeploy, the cluster once again delivers great performance.
> >>
> >> We haven't seen such behaviour with Ceph version 14.x.
> >>
> >> Has anyone seen such a thing? Thanks in advance!
> >>
> >> Zoltan
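
Coming back to the two guesses at the top of this mail, a quick way to check them would be something like this (an untested sketch; <pool> and <image> are placeholders for whatever names ceph-csi generated):

# does a CSI-provisioned image actually have object-map enabled?
rbd info <pool>/<image> | grep features

# are old images still sitting in the RBD trash waiting to be purged?
rbd trash ls <pool>

# is the cluster still doing background work while the new benchmark runs?
ceph -s
ceph osd pool stats <pool>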