Have you tested putting the block.db and WAL for each OSD on a faster SSD/NVMe device or partition? I have a somewhat smaller environment, but I was able to take a 2 TB SSD, split it into 4 partitions, and use them as the DB and WAL for 4 drives. By default, if you move the block.db to a different device, the WAL moves there too, but you can also put the block, block.db, and WAL on three separate devices. The generic method is outlined here: https://docs.ceph.com/en/quincy/rados/configuration/bluestore-config-ref/

One thing I found I had to do was run the ceph-volume commands from inside "cephadm shell", since I am running Ceph in containers. That also required copying the Ceph keyring from the host's /var/lib/ceph/bootstrap-osd to the same path inside the container (using scp <user@host>:/var/lib/ceph/bootstrap-osd/ceph.keyring /var/lib/ceph/bootstrap-osd).

-Rob

From: quaglio@xxxxxxxxxx <quaglio@xxxxxxxxxx>
Sent: Wednesday, November 23, 2022 12:28 PM
To: gfarnum@xxxxxxxxxx; dcsysengineer@xxxxxxxxx
Cc: ceph-users@xxxxxxx
Subject: Re: CephFS performance

Hi Gregory,

Thanks for your reply!

We are evaluating possibilities to increase storage performance. I understand that Ceph has better data-resiliency capabilities; this has been one of my main arguments for keeping this tool as our storage, mainly for failure events (disk or even whole-machine crashes). In the case of BeeGFS, if there is a problem on any machine, the whole cluster becomes inconsistent (at this point in my tests, I am not working with that).

One of the options under consideration is precisely to put the metadata on SSD (as you said). Another is to put an entire filesystem on SSD (a scratch area for the HPC workloads), or even a cache tier. With that in mind, my manager is weighing the costs of keeping Ceph. However, in all tests CephFS performance was inferior to BeeGFS, and I am running out of arguments to keep the Ceph storage solution here where I work.
The benchmark tests I did were as follows:
1-) Ceph with data and metadata on HDD
2-) Ceph with data on HDD and metadata on SSD
3-) Ceph with data and metadata on SSD
4-) The nowsync and fscache mount parameters
5-) Ceph with cache tier enabled (cache on SSD)
6-) Ceph with OS adjustments and configuration optimizations in the OSDs, MDS, MONs, and clients, plus more buffering in the network interfaces

On the larger cluster configuration, we have 50G interfaces on each of the 6 disk servers (each with 22 disks).

Thank you,
Rafael.

________________________________
From: "Gregory Farnum" <gfarnum@xxxxxxxxxx>
Sent: 2022/11/22 14:49:12
To: dcsysengineer@xxxxxxxxx
Cc: quaglio@xxxxxxxxxx, ceph-users@xxxxxxx
Subject: Re: Re: CephFS performance

In addition to not having resiliency by default, my recollection is that BeeGFS also doesn't guarantee metadata durability in the event of a crash or hardware failure like CephFS does. There's not really a way for us to catch up to their "in-memory metadata IOPS" with our "on-disk metadata IOPS". :( If that kind of cached performance is your main concern, CephFS is probably not going to make you happy.

That said, if you've been happy using CephFS with hard drives and gigabit ethernet, it will be much faster if you store the metadata on SSD and can increase the size of the MDS cache in memory. More specific tuning options than that would depend on your workload.
-Greg

On Tue, Nov 22, 2022 at 7:28 AM David C <dcsysengineer@xxxxxxxxx> wrote:
>
> My understanding is BeeGFS doesn't offer data redundancy by default,
> you have to configure mirroring. You've not said how your Ceph cluster
> is configured, but my guess is you have the recommended 3x replication
> - I wouldn't be surprised if BeeGFS was much faster than Ceph in this
> case.
I'd be interested to see your results after ensuring equivalent
> data redundancy between the platforms.
>
> On Thu, Oct 20, 2022 at 9:02 PM quaglio@xxxxxxxxxx <quaglio@xxxxxxxxxx> wrote:
> >
> > Hello everyone,
> > I have some considerations and doubts to ask...
> >
> > I work at an HPC center, and my doubts stem from performance in this environment. All the clusters here were suffering from NFS performance problems and from the single point of failure it has.
> >
> > At that time, we decided to evaluate some of the available SDS options, and the one chosen was Ceph (first for its resilience and later for its performance).
> > I deployed CephFS on a small cluster: 6 nodes, 1 HDD per machine, and a 1 Gbps connection.
> > The performance was as good as a large NFS we have on another cluster (while spending much less). In addition, I was able to evaluate all the resiliency benefits that Ceph offers (such as taking down an OSD, MDS, MON, or MGR server) and watch the objects/services settle on other nodes. All this in a way that users did not even notice.
> >
> > Given this, a new storage cluster was acquired last year with 6 machines and 22 disks (HDDs) per machine. The need was for the amount of available GBs; the amount of IOPS was not so important at that time.
> >
> > Right from the start, I had a lot of work optimizing the performance of the cluster (the main deficiency was metadata access/write performance). The problem was not in job execution, but in the users' perception of slowness when executing interactive commands (my perception was that the Ceph metadata was slow).
> > There were a few months of high load in which storage was the bottleneck of the environment.
> >
> > After a lot of research in the documentation, I made several optimizations to the available parameters, and currently CephFS is able to reach around 10k IOPS (using size=2).
> >
> > Anyway, my boss asked for other solutions to be evaluated to verify the performance issue.
> > First of all, it was suggested to put the metadata on SSD disks for a higher amount of IOPS.
> > In addition, a test environment was set up, and the solution that made the most difference in performance was BeeGFS.
> >
> > In some situations, BeeGFS is many times faster than Ceph in the same tests and under the same hardware conditions. This happens in both throughput (BW) and IOPS.
> >
> > We tested it using io500 as follows:
> > 1-) A single process
> > 2-) 8 processes (4 processes on each of 2 different machines)
> > 3-) 16 processes (8 processes on each of 2 different machines)
> >
> > I ran tests configuring CephFS to use:
> > * HDD only (for both data and metadata)
> > * Metadata on SSD
> > * The Linux FSCache features
> > * Some optimizations (increasing MDS memory, client memory, in-flight parameters, etc.)
> > * A cache tier with SSD
> >
> > Even so, the benchmark scores were lower than those of BeeGFS installed without any optimization. This difference becomes even more evident as the number of simultaneous accesses increases.
> >
> > The two best CephFS results were with metadata on SSD and with a cache tier on SSD.
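For reference, the two changes Greg suggested (metadata pool on SSD, larger MDS cache) can be sketched as below. This is a sketch only, under the assumption that the OSDs carry correct device classes (hdd/ssd) and that the metadata pool is named "cephfs_metadata" — the rule name is hypothetical:

```shell
# Create a CRUSH rule restricted to the ssd device class and move the
# CephFS metadata pool onto it (pool/rule names are assumptions):
ceph osd crush rule create-replicated ssd-only default host ssd
ceph osd pool set cephfs_metadata crush_rule ssd-only

# Enlarge the MDS in-memory cache (example value: 16 GiB):
ceph config set mds mds_cache_memory_limit 17179869184
```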
> >
> > Here are Ceph's results compared to BeeGFS:
> >
> > Bandwidth test (bw in GB/s):
> >
> > ==================================================
> > |fs             |bw       |process |
> > ==================================================
> > |beegfs-metassd |0.078933 |01      |
> > |beegfs-metassd |0.051855 |08      |
> > |beegfs-metassd |0.039459 |16      |
> > ==================================================
> > |cephmetassd    |0.022489 |01      |
> > |cephmetassd    |0.009789 |08      |
> > |cephmetassd    |0.002957 |16      |
> > ==================================================
> > |cephcache      |0.023966 |01      |
> > |cephcache      |0.021131 |08      |
> > |cephcache      |0.007782 |16      |
> > ==================================================
> >
> > IOPS test:
> >
> > ==================================================
> > |fs             |iops     |process |
> > ==================================================
> > |beegfs-metassd |0.740658 |01      |
> > |beegfs-metassd |3.508879 |08      |
> > |beegfs-metassd |6.514768 |16      |
> > ==================================================
> > |cephmetassd    |1.224963 |01      |
> > |cephmetassd    |3.762794 |08      |
> > |cephmetassd    |3.188686 |16      |
> > ==================================================
> > |cephcache      |1.829107 |01      |
> > |cephcache      |3.257963 |08      |
> > |cephcache      |3.524081 |16      |
> > ==================================================
> >
> > I imagine that if I test with 32 processes, BeeGFS will be even better.
> >
> > Do you have any recommendations for things I can apply to Ceph without reducing resilience?
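From the bandwidth table above, the gap can be quantified directly, using only the reported numbers for beegfs-metassd versus cephmetassd at 1, 8, and 16 processes:

```shell
# Ratio of beegfs-metassd to cephmetassd bandwidth, per process count,
# computed from the values in the table above.
awk 'BEGIN {
  printf "01 procs: %.1fx\n", 0.078933 / 0.022489
  printf "08 procs: %.1fx\n", 0.051855 / 0.009789
  printf "16 procs: %.1fx\n", 0.039459 / 0.002957
}'
# Prints 3.5x, 5.3x, and 13.3x - the gap widens as concurrency grows.
```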
> >
> > Rafael.
> > _______________________________________________
> > ceph-users mailing list -- ceph-users@xxxxxxx
> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
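The block.db/WAL split described at the top of this thread can be sketched roughly as below. Device names and partition sizes are hypothetical, and on a containerized deployment the ceph-volume steps run inside "cephadm shell":

```shell
# Sketch only, with hypothetical device names: carve a 2 TB NVMe/SSD into
# 4 partitions, each to serve as the block.db (the WAL co-locates there
# by default) for one of 4 HDD-backed OSDs.
sgdisk -n 1:0:+480G -n 2:0:+480G -n 3:0:+480G -n 4:0:+480G /dev/nvme0n1

# For each HDD, create the OSD with its DB on one NVMe partition:
ceph-volume lvm create --bluestore --data /dev/sda --block.db /dev/nvme0n1p1
ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p2
ceph-volume lvm create --bluestore --data /dev/sdc --block.db /dev/nvme0n1p3
ceph-volume lvm create --bluestore --data /dev/sdd --block.db /dev/nvme0n1p4
```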