AFAIK CSI is just some Go code that maps an RBD image; it does it just as you would from the command line. Then again, they really do not understand CSI there, and are just developing a Kubernetes 'driver'.

> > Is rook/CSI still not using efficient rbd object maps?
>
> It could be that you issued a new benchmark while Ceph was busy
> (inefficiently) removing the old RBD images. This is quite a stretch, but
> it could be worth exploring.
>
> On Mon, Jul 25, 2022, 21:42 Mark Nelson <mnelson@xxxxxxxxxx> wrote:
>
> > I don't think so if this is just plain old RBD. RBD shouldn't require
> > a bunch of RocksDB iterator seeks in the read/write hot path, and writes
> > should pretty quickly clear out tombstones as part of the memtable flush
> > and compaction process, even in the slow case. Maybe in some kind of
> > pathologically bad read-only corner case with no onode cache, but it
> > would be bad for more reasons than what's happening in that tracker
> > ticket IMHO (even reading onodes from the RocksDB block cache is
> > significantly slower than BlueStore's onode cache).
> >
> > If RBD mirror (or snapshots) is involved, that could be a different
> > story though. I believe to deal with deletes in that case we have to go
> > through iteration/deletion loops that have the same root issue as what's
> > going on in the tracker ticket, and it can end up impacting client IO.
> > Gabi and Paul are testing/reworking how the snapmapper works, and I've
> > started a sort of catch-all PR for improving our RocksDB tunings/glue
> > here:
> >
> > https://github.com/ceph/ceph/pull/47221
> >
> > Mark
> >
> > On 7/25/22 12:48, Frank Schilder wrote:
> > > Could it be related to this performance death trap:
> > > https://tracker.ceph.com/issues/55324 ?
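On the object-map question above: a quick way to check whether the csi-created images actually have object-map enabled is something like the following. The pool and image names are placeholders; adjust them to whatever ceph-csi created in your cluster.

```shell
# Hypothetical names - ceph-csi images usually look like csi-vol-<uuid>.
POOL=replicapool
IMG=csi-vol-0001

# List the image's enabled features; look for object-map (and fast-diff).
rbd info "$POOL/$IMG" | grep features

# object-map can be enabled on a live image (it requires exclusive-lock),
# after which the map needs to be rebuilt once:
rbd feature enable "$POOL/$IMG" object-map fast-diff
rbd object-map rebuild "$POOL/$IMG"
```

Without object-map, operations like `rbd rm` have to probe every backing object, which is exactly the kind of inefficient mass delete speculated about above.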
> > > =================
> > > Frank Schilder
> > > AIT Risø Campus
> > > Bygning 109, rum S14
> > >
> > > ________________________________________
> > > From: Mark Nelson <mnelson@xxxxxxxxxx>
> > > Sent: 25 July 2022 18:50
> > > To: ceph-users@xxxxxxx
> > > Subject: Re: weird performance issue on ceph
> > >
> > > Hi Zoltan,
> > >
> > > We have a very similar setup with one of our upstream community
> > > performance test clusters: 60 4TB PM983 drives spread across 10 nodes.
> > > We get similar numbers to what you are initially seeing (scaled down to
> > > 60 drives), though with somewhat lower random read IOPS (we tend to max
> > > out at around 2M with 60 drives on this HW). I haven't seen any issues
> > > with Quincy like what you are describing, but on this cluster most of
> > > the tests have been on bare metal. One issue we have noticed with the
> > > PM983 drives is that they may be more susceptible to non-optimal write
> > > patterns causing slowdowns vs other NVMe drives in the lab. We actually
> > > had to issue a last-minute PR for Quincy to change the disk allocation
> > > behavior to deal with it. See:
> > >
> > > https://github.com/ceph/ceph/pull/45771
> > > https://github.com/ceph/ceph/pull/45884
> > >
> > > I don't *think* this is the issue you are hitting, since the fix in
> > > #45884 should have taken care of it, but it might be something to keep
> > > in the back of your mind. Otherwise, the fact that you are seeing such
> > > a dramatic difference across both small and large read/write benchmarks
> > > makes me think there is something else going on. Is there any chance
> > > that some other bottleneck is being imposed when the pods and volumes
> > > are deleted and recreated? Might be worth looking at memory and CPU
> > > usage of the OSDs in all of the cases, and RocksDB flushing/compaction
> > > stats from the OSD logs.
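The OSD memory/CPU and RocksDB compaction checks suggested above could be gathered along these lines. This is only a sketch: `osd.0`, the log path, and running the `ceph` CLI from a node or toolbox that can reach the cluster are all assumptions about the setup.

```shell
# Cluster-wide utilization overview.
ceph osd df tree

# RocksDB perf counters for one OSD (compactions, flushes, iterator stats);
# the same command works via 'ceph daemon osd.0 ...' on the OSD host/pod.
ceph tell osd.0 perf dump rocksdb

# Memory pool usage (onode cache, buffer cache, etc.) for the same OSD.
ceph tell osd.0 dump_mempools

# Recent compaction/flush activity straight from the OSD log.
grep -iE 'compaction|flush' /var/log/ceph/ceph-osd.0.log | tail -20
```

Comparing these between the fast and slow runs should show whether the OSDs are memory-starved or stuck in compaction in the slow case.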
> > > Also a quick check with collectl/iostat/sar during the slow case to
> > > make sure none of the drives are showing latency and built-up IOs in
> > > the device queues.
> > >
> > > If you want to go deeper down the rabbit hole, you can try running my
> > > wallclock profiler against one of your OSDs in the fast/slow cases, but
> > > you'll have to make sure it has access to debug symbols:
> > >
> > > https://github.com/markhpc/uwpmp.git
> > >
> > > Run it like:
> > >
> > > ./uwpmp -n 10000 -p <pid of ceph-osd> -b libdw > output.txt
> > >
> > > If the libdw backend is having problems, you can use -b libdwarf
> > > instead, but it's much slower and takes longer to collect as many
> > > samples (you might want to do -n 1000 instead).
> > >
> > > Mark
> > >
> > > On 7/25/22 11:17, Zoltan Langi wrote:
> > >> Hi people, we got an interesting issue here and I would like to ask
> > >> if anyone has seen anything like this before.
> > >>
> > >> First, our system:
> > >>
> > >> The Ceph version is 17.2.1, but we have also seen the same behaviour
> > >> on 16.2.9.
> > >>
> > >> Our kernel version is 5.13.0-51 and our NVMe disks are Samsung PM983.
> > >>
> > >> In our deployment we have 12 nodes and 72 disks in total; 2 OSDs per
> > >> disk makes 144 OSDs in total.
> > >>
> > >> The deployment was done by ceph-rook with default values: 6 CPU cores
> > >> and 4GB of memory allocated to each OSD.
> > >>
> > >> The issue we are experiencing: we create, for example, 100 volumes
> > >> via ceph-csi and attach them to Kubernetes pods via rbd. We are
> > >> talking about 100 volumes in total, 2GB each. We run fio performance
> > >> tests (read, write, mixed) on them, so the volumes are being used
> > >> heavily. Ceph delivers good performance, no problems at all.
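A fio invocation of the kind described above might look like the following. The mount point, job size, block size, and read/write mix are assumptions for illustration, not the exact job Zoltan ran.

```shell
# Hypothetical mixed random workload against an rbd-backed volume
# mounted at /data inside the pod (adjust path/size to your setup).
fio --name=randrw --directory=/data --size=1900m \
    --rw=randrw --rwmixread=70 --bs=4k --iodepth=32 \
    --ioengine=libaio --direct=1 --numjobs=4 --group_reporting \
    --runtime=60 --time_based
```

Running the identical job file in the "fresh" and "recreated" states is what makes the before/after numbers below comparable.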
> > >> Performance we get, for example: read IOPS 3371027, write IOPS
> > >> 727714, read BW 79.9 GB/s, write BW 31.2 GB/s.
> > >>
> > >> After the tests are complete, these volumes just sit there doing
> > >> nothing for a longer period of time, for example 48 hours. After
> > >> that, we clean the pods up, clean the volumes up and delete them.
> > >>
> > >> We recreate the volumes and pods once more with the same spec (2GB
> > >> each, 100 pods), then run the same tests once again. We don't even
> > >> get half the performance that we measured before leaving the pods
> > >> sitting there doing nothing for 2 days.
> > >>
> > >> Performance we get after deleting the volumes, recreating them and
> > >> rerunning the tests: read IOPS 1716239, write IOPS 370631, read BW
> > >> 37.8 GB/s, write BW 7.47 GB/s.
> > >>
> > >> We can clearly see that it's a big performance loss.
> > >>
> > >> If we clean up the Ceph deployment, wipe the disks completely and
> > >> redeploy, the cluster once again delivers great performance.
> > >>
> > >> We haven't seen such behaviour with Ceph version 14.x.
> > >>
> > >> Has anyone seen such a thing? Thanks in advance!
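For what it's worth, the two sets of numbers quoted above work out to roughly half the IOPS gone and about three quarters of the write bandwidth gone:

```shell
# Percentage drop between the two benchmark runs quoted in this thread.
awk 'BEGIN {
  printf "read IOPS drop:  %.0f%%\n", (1 - 1716239/3371027) * 100
  printf "write IOPS drop: %.0f%%\n", (1 - 370631/727714)  * 100
  printf "read BW drop:    %.0f%%\n", (1 - 37.8/79.9)      * 100
  printf "write BW drop:   %.0f%%\n", (1 - 7.47/31.2)      * 100
}'
# -> read IOPS drop:  49%
#    write IOPS drop: 49%
#    read BW drop:    53%
#    write BW drop:   76%
```

The near-identical ~49% drop on both read and write IOPS, versus a much larger write-bandwidth drop, is consistent with a cluster-wide bottleneck rather than a few slow drives.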
> > >> Zoltan
> > >> _______________________________________________
> > >> ceph-users mailing list -- ceph-users@xxxxxxx
> > >> To unsubscribe send an email to ceph-users-leave@xxxxxxx