Re: How can I use not-replicated pool (replication 1 or raid-0)

mhnx <morphinwithyou@xxxxxxxxx> · Wed, 10 May 2023 15:29:55 +0300

I'm talking about BlueStore DB+WAL caching. It's good to know that cache
tiering is deprecated now; I should check why.

That's not possible because I don't have enough free slots in the servers. I'm
considering buying NVMe drives in PCIe card form.
Right now I'm trying to speed up the rep-2 pool for millions of small files
between 10K and 700K in size.
With compression enabled, write speed drops by 5% but delete speed increases
by 30%.
Do you have any tuning advice for me?

Best regards,

On Tue, 9 May 2023 at 11:02, Frank Schilder <frans@xxxxxx> wrote:
>
> When you say cache device, do you mean a Ceph cache pool as a tier to a rep-2 pool? If so, you might want to reconsider: cache pools are deprecated and will be removed from Ceph at some point.
>
> If you have funds to buy new drives, you can just as well deploy BeeGFS (or something else) on them. It is no problem to run Ceph and BeeGFS on the same hosts. The disks should not be shared, but that's all. This might still be a simpler config than introducing a cache tier just to cover up for rep-2 overhead.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: mhnx <morphinwithyou@xxxxxxxxx>
> Sent: Friday, May 5, 2023 9:26 PM
> To: Frank Schilder
> Cc: Janne Johansson; Ceph Users
> Subject: Re:  Re: How can I use not-replicated pool (replication 1 or raid-0)
>
> Hello Frank.
>
> >If your only tool is a hammer ...
> >Sometimes it's worth looking around.
>
> You are absolutely right! But I have limitations because my customer
> is a startup and they want to build a hybrid system on their current
> hardware for all their needs. That's why I'm spending time looking for a
> workaround. They use CephFS in their software, and I moved them
> onto this path from NFS. At the beginning they only wanted a
> rep-2 pool for their important data, and Ceph was an absolutely great
> fit. The system is running smoothly now, but they also want to move
> the [garbage data] onto the same system. As I told you, the data flow
> there is different, and the current hardware (non-PLP SATA SSDs without
> BlueStore cache) cannot deliver the required speed with replication 2.
> They are happy with the replication 1 speed, but I'm not, because when any
> network link, disk, or node goes down, the cluster will be suspended due to
> rep-1.
>
> I have now advised at least adding low-latency PCIe NVMe drives as a cache
> device so that we can insist on a rep-2 pool. PLP low-latency NVMe drives
> will solve the write latency, but I still need to solve the deletion
> speed too. With the random write-delete test I was actually trying to show
> the difference in delete speed. You are right: /dev/urandom needs CPU
> power, which adds latency, so it should not be used for write speed tests.
>
> Currently I'm developing an automation script to fix
> any problem with the replication 1 pool.
> It is what it is.
>
> Best regards.
>
>
>
>
> On Wed, 3 May 2023 at 11:50, Frank Schilder <frans@xxxxxx> wrote:
>
>
> >
> > Hi mhnx.
> >
> > > I also agree with you, Ceph is not designed for this kind of use case
> > > but I tried to continue what I know.
> > If your only tool is a hammer ...
> > Sometimes it's worth looking around.
> >
> > While your tests show that a rep-1 pool is faster than a rep-2 pool, the values are not exactly impressive. Two things are relevant here. The first is that Ceph is a high-latency system: its software stack is quite heavyweight, and even for a rep-1 pool it does a lot to ensure data integrity. BeeGFS is a lightweight, low-latency system that skips a lot of that magic, which makes it well suited to performance-critical tasks but less so to long-term archival applications.
> >
> > The second is that the device /dev/urandom is actually very slow (and even unpredictable on some systems, it might wait for more entropy to be created). Your times are almost certainly affected by that. If you want to have comparable and close to native storage performance, create the files you want to write to storage first in RAM and then copy from RAM to storage. Using random data is a good idea to bypass potential built-in accelerations for special data, like all-zeros. However, exclude the random number generator from the benchmark and generate the data first before timing its use.
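
The advice above (generate the data first, then time only the storage operations) can be sketched roughly as follows. This is a minimal illustration, not a full benchmark: the function names and the `benchdir` directory are made up for the example, a local temporary directory stands in for the CephFS mount point, and no fsync is issued, so page-cache effects are not excluded.

```python
import os
import shutil
import tempfile
import time


def make_payloads(count, size):
    """Generate all random payloads in RAM before any timing starts."""
    return [os.urandom(size) for _ in range(count)]


def bench_create_delete(target_dir, payloads):
    """Time file creation and deletion separately; returns (create_s, delete_s)."""
    workdir = os.path.join(target_dir, "benchdir")  # hypothetical directory name
    os.makedirs(workdir)

    # Timed section 1: only the writes, the random data already exists in RAM.
    start = time.perf_counter()
    for i, data in enumerate(payloads):
        with open(os.path.join(workdir, f"randfile{i:05d}"), "wb") as fh:
            fh.write(data)
    create_s = time.perf_counter() - start

    # Timed section 2: recursive delete, comparable to `rm -rf`.
    start = time.perf_counter()
    shutil.rmtree(workdir)
    delete_s = time.perf_counter() - start
    return create_s, delete_s


if __name__ == "__main__":
    # Assumption: replace the temporary directory with the real mount point.
    with tempfile.TemporaryDirectory() as mount_point:
        payloads = make_payloads(count=1000, size=1024)
        create_s, delete_s = bench_create_delete(mount_point, payloads)
        print(f"create: {create_s:.3f}s  delete: {delete_s:.3f}s")
```

Holding 99999 payloads of 1 KiB in memory needs roughly 100 MB of RAM, so the file count from the original test is feasible with this approach.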
> >
> > Best regards,
> > =================
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > ________________________________________
> > From: mhnx <morphinwithyou@xxxxxxxxx>
> > Sent: Tuesday, May 2, 2023 5:25 PM
> > To: Frank Schilder
> > Cc: Janne Johansson; Ceph Users
> > Subject: Re:  Re: How can I use not-replicated pool (replication 1 or raid-0)
> >
> > Thank you for the explanation Frank.
> >
> > I also agree with you, Ceph is not designed for this kind of use case
> > but I tried to continue what I know.
> > My idea was exactly what you described, I was trying to automate
> > cleaning or recreating on any failure.
> >
> > As you can see below, rep1 pool is very fast:
> > - Create: time for i in {00001..99999}; do head -c 1K </dev/urandom >randfile$i; done
> > replication 2 : 31m59.917s
> > replication 1 : 7m6.046s
> > --------------------------------
> > - Delete: time rm -rf testdir/
> > replication 2 : 11m56.994s
> > replication 1 : 0m40.756s
> > -------------------------------------
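
For reference, the timings above can be turned into files per second and a rep-1 versus rep-2 ratio. This is only arithmetic on the numbers quoted in the mail; it assumes the delete test removed the same 99999 files that the create loop wrote, which the mail does not state explicitly.

```python
import re


def parse_time(text):
    """Convert a bash `time` value such as '31m59.917s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text)
    if match is None:
        raise ValueError(f"unrecognised time format: {text!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


FILES = 99999  # {00001..99999} in the create loop

# Values copied from the test results quoted above.
results = {
    "create": {"rep2": "31m59.917s", "rep1": "7m6.046s"},
    "delete": {"rep2": "11m56.994s", "rep1": "0m40.756s"},
}

for op, times in results.items():
    rep2 = parse_time(times["rep2"])
    rep1 = parse_time(times["rep1"])
    print(f"{op}: rep2 {FILES / rep2:.0f} files/s, "
          f"rep1 {FILES / rep1:.0f} files/s, "
          f"rep1 is {rep2 / rep1:.1f}x faster")
```

With these inputs rep-1 comes out about 4.5x faster for create and about 17.6x faster for delete.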
> >
> > I have started learning DRBD, and I will also check out BeeGFS. Thanks for the advice.
> >
> > Regards.
> >
> > On Mon, 1 May 2023 at 10:27, Frank Schilder <frans@xxxxxx> wrote:
> > >
> > > I think you misunderstood Janne's reply. The main statement is at the end: Ceph is not designed for an "I don't care about data" use case. If you need speed for temporary data where you can tolerate data loss, go for something simpler. For example, we use BeeGFS with great success as a burst buffer for an HPC cluster. It is very lightweight and will pull out all the performance your drives can offer. In case of disaster it is easy to clean up. BeeGFS does not care about lost data; such data simply becomes inaccessible while everything else moves on. It will not try to self-heal either. It doesn't even scrub data, so users do not compete with admin I/O.
> > >
> > > It's pretty much your use case. We clean it up every 6-8 weeks, and if something breaks we just redeploy the whole thing from scratch. Performance is great and it's a very simple and economical system to administer. No need for the whole Ceph daemon engine with its large RAM requirements and extra admin daemons.
> > >
> > > Use Ceph for data you want to survive a nuclear blast. Don't use it for things it's not made for and then complain.
> > >
> > > Best regards,
> > > =================
> > > Frank Schilder
> > > AIT Risø Campus
> > > Bygning 109, rum S14
> > >
> > > ________________________________________
> > > From: mhnx <morphinwithyou@xxxxxxxxx>
> > > Sent: Saturday, April 29, 2023 5:48 AM
> > > To: Janne Johansson
> > > Cc: Ceph Users
> > > Subject:  Re: How can I use not-replicated pool (replication 1 or raid-0)
> > >
> > > Hello Janne, thank you for your response.
> > >
> > > I understand your advice, and rest assured that I have designed more than
> > > enough EC pools and I know the mess. EC is not an option because I need SPEED.
> > >
> > > Let me first describe my hardware so that we are on the same page.
> > > Server: R620
> > > Cpu: 2 x Xeon E5-2630 v2 @ 2.60GHz
> > > Ram: 128GB - DDR3
> > > Disk1: 20x Samsung SSD 860 2TB
> > > Disk2: 10x Samsung SSD 870 2TB
> > >
> > > My SSDs do not have PLP. Because of that, every Ceph write also
> > > waits for TRIM. I want to know how much latency we are talking about
> > > because I'm thinking of adding a PLP NVMe drive as WAL+DB cache to gain some
> > > speed.
> > > As you can see, I'm even trying to gain something from every TRIM command.
> > > Currently I'm testing a replication 2 pool, and even this speed is not
> > > enough for my use case.
> > > Now I'm trying to boost the deletion speed because I'm writing and
> > > deleting files all the time and this never ends.
> > > I'm writing this mail because replication 1 will reduce the deletion
> > > time, but I'm still trying to tune some MDS+OSD parameters to increase
> > > delete speed.
> > >
> > > Any help or ideas would be great. Thanks.
> > > Regards.
> > >
> > >
> > >
> > > On Wed, 12 Apr 2023 at 10:10, Janne Johansson <icepic.dz@xxxxxxxxx> wrote:
> > > >
> > > > On Mon, 10 Apr 2023 at 22:31, mhnx <morphinwithyou@xxxxxxxxx> wrote:
> > > > > Hello.
> > > > > I have a 10 node cluster. I want to create a non-replicated pool
> > > > > (replication 1) and I want to ask some questions about it:
> > > > >
> > > > > Let me tell you my use case:
> > > > > - I don't care about losing data,
> > > > > - All of my data is JUNK and these junk files are usually between 1KB and 32MB.
> > > > > - These files will be deleted in 5 days.
> > > > > - Writable space and I/O speed are more important.
> > > > > - I have high Write/Read/Delete operations, minimum 200GB a day.
> > > >
> > > > That is "only" about 18 Mbit/s (roughly 2.3 MB/s), which should easily be
> > > > doable even with repl=2,3,4 or EC. This of course depends on speed of
> > > > drives, network, cpus and all that, but in itself it doesn't seem too hard
> > > > to achieve in terms of average speeds. We have EC8+3 rgw backed by some
> > > > 12-14 OSD hosts with hdd and nvme (for wal+db) that can ingest over 1GB/s
> > > > if you parallelize the rgw streams, so 18 Mbit/s seems totally doable with
> > > > 10 decent machines. Even with replication.
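
As a quick check of the average rate implied by 200 GB a day, here is the conversion spelled out. It assumes decimal units (1 GB = 1000 MB) and traffic spread evenly over 24 hours; real peaks will of course be higher than the average.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_rate(gb_per_day):
    """Return (MB/s, Mbit/s) for a daily volume given in decimal gigabytes."""
    mb_per_s = gb_per_day * 1000 / SECONDS_PER_DAY
    return mb_per_s, mb_per_s * 8


mb_s, mbit_s = average_rate(200)
print(f"{mb_s:.1f} MB/s = {mbit_s:.1f} Mbit/s")
```

For 200 GB a day this gives about 2.3 MB/s, which is about 18.5 Mbit/s.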
> > > >
> > > > > I'm afraid that, in any failure, I won't be able to access the whole
> > > > > cluster. Losing data is okay but I have to ignore missing files,
> > > >
> > > > Even with repl=1, in case of a failure, the cluster will still aim at
> > > > fixing itself rather than ignoring currently lost data and moving on,
> > > > so any solution that involves "forgetting" about lost data would need
> > > > a ceph operator telling the cluster to ignore all the missing parts
> > > > and to recreate the broken PGs. This would not be automatic.
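
The manual intervention described above could look roughly like the sketch below. It only builds and prints the commands unless `dry_run` is disabled. The OSD id `12` and PG id `5.1f` are hypothetical placeholders, and the helper names are invented for this example. `ceph osd lost` and `ceph osd force-create-pg` discard data permanently, so check their exact behaviour against the documentation for your Ceph release before running anything.

```python
import shlex
import subprocess


def recovery_commands(lost_osd_ids, broken_pg_ids):
    """Build the commands an operator would issue to give up on lost data."""
    cmds = []
    for osd_id in lost_osd_ids:
        # Declare the OSD permanently lost so recovery stops waiting for it.
        cmds.append(["ceph", "osd", "lost", str(osd_id),
                     "--yes-i-really-mean-it"])
    for pg_id in broken_pg_ids:
        # Recreate the PG as empty; whatever it contained is gone for good.
        cmds.append(["ceph", "osd", "force-create-pg", pg_id,
                     "--yes-i-really-mean-it"])
    return cmds


def run(cmds, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in cmds:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical ids, printed only (dry run).
    run(recovery_commands(lost_osd_ids=[12], broken_pg_ids=["5.1f"]))
```

Keeping the dry run as the default reflects the point made above: deciding which PGs to give up on is an operator decision, not something the cluster does by itself.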
> > > >
> > > >
> > > > --
> > > > May the most significant bit of your life be positive.
> > > _______________________________________________
> > > ceph-users mailing list -- ceph-users@xxxxxxx
> > > To unsubscribe send an email to ceph-users-leave@xxxxxxx