Re: How can I use not-replicated pool (replication 1 or raid-0)

I think you misunderstood Janne's reply. The main point is at the end: Ceph is not designed for an "I don't care about data" use case. If you need speed for temporary data where you can sustain data loss, go for something simpler. For example, we use BeeGFS with great success as a burst buffer for an HPC cluster. It is very lightweight and will pull out all the performance your drives can offer. In case of disaster it is easy to clean up. BeeGFS does not care about lost data; such data simply becomes inaccessible while everything else just moves on. It will not try to self-heal either. It doesn't even scrub data, so user IO never competes with admin IO.

It's pretty much your use case. We clean it up every 6-8 weeks, and if something breaks we just redeploy the whole thing from scratch. Performance is great and it's a very simple and economical system to administer. There is no need for the whole Ceph daemon machinery with its large RAM requirements and extra admin daemons.

Use Ceph for data you want to survive a nuclear blast. Don't use it for things it's not made for and then complain.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: mhnx <morphinwithyou@xxxxxxxxx>
Sent: Saturday, April 29, 2023 5:48 AM
To: Janne Johansson
Cc: Ceph Users
Subject:  Re: How can I use not-replicated pool (replication 1 or raid-0)

Hello Janne, thank you for your response.

I understand your advice, and rest assured, I have designed plenty of
EC pools and I know the mess. This is not an option because I need
SPEED.

Please let me describe my hardware first, so we have the same picture.
Server: R620
Cpu: 2 x Xeon E5-2630 v2 @ 2.60GHz
Ram: 128GB - DDR3
Disk1: 20x Samsung SSD 860 2TB
Disk2: 10x Samsung SSD 870 2TB

My SSDs do not have PLP. Because of that, every Ceph write also
waits for TRIM. I want to know how much latency we are talking
about, because I'm thinking of adding a PLP NVMe for the WAL+DB to
gain some speed.
As you can see, I'm trying to win back time even on the TRIM commands.
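
To put a number on that, I plan to measure single-depth sync-write
latency directly with fio, something like this (assuming /dev/sdX is
a spare test drive, since this overwrites it):

  fio --name=synctest --filename=/dev/sdX --direct=1 --sync=1 \
      --rw=randwrite --bs=4k --iodepth=1 --numjobs=1 \
      --runtime=60 --time_based

On drives without PLP, the latency of these 4k sync writes should
give a rough idea of the penalty every WAL write pays, compared to
the same test run against a PLP NVMe.
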
Currently I'm testing a replication 2 pool, and even that speed is
not enough for my use case.
Now I'm trying to boost the deletion speed, because I'm writing and
deleting files all the time and it never ends.
I wrote this mail because replication 1 will also reduce the deletion
load, but on top of that I'm trying to tune some MDS+OSD parameters
to increase delete speed.
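
For reference, the knobs I'm experimenting with look roughly like
this (the values are just starting points I'm guessing at, not
recommendations):

  # let OSDs delete PG objects without the throttling sleep on SSDs
  ceph config set osd osd_delete_sleep_ssd 0
  # let the MDS purge more files and issue more purge ops in parallel
  ceph config set mds mds_max_purge_files 1024
  ceph config set mds mds_max_purge_ops 32768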

Any help and idea will be great for me. Thanks.
Regards.



On Wed, 12 Apr 2023 at 10:10, Janne Johansson <icepic.dz@xxxxxxxxx> wrote:
>
> On Mon, 10 Apr 2023 at 22:31, mhnx <morphinwithyou@xxxxxxxxx> wrote:
> > Hello.
> > I have a 10 node cluster. I want to create a non-replicated pool
> > (replication 1) and I want to ask some questions about it:
> >
> > Let me tell you my use case:
> > - I don't care about losing data,
> > - All of my data is JUNK and these junk files are usually between 1KB to 32MB.
> > - These files will be deleted in 5 days.
> > - Writable space and I/O speed is more important.
> > - I have high Write/Read/Delete operations, minimum 200GB a day.
>
> That is "only" 18MB/s which should easily be doable even with
> repl=2,3,4. or EC. This of course depends on speed of drives, network,
> cpus and all that, but in itself it doesn't seem too hard to achieve
> in terms of average speeds. We have EC8+3 rgw backed by some 12-14 OSD
> hosts with hdd and nvme (for wal+db) that can ingest over 1GB/s if you
> parallelize the rgw streams, so 18MB/s seems totally doable with 10
> decent machines. Even with replication.
>
> > I'm afraid that, in any failure, I won't be able to access the whole
> > cluster. Losing data is okay but I have to ignore missing files,
>
> Even with repl=1, in case of a failure, the cluster will still aim at
> fixing itself rather than ignoring currently lost data and moving on,
> so any solution that involves "forgetting" about lost data would need
> a ceph operator telling the cluster to ignore all the missing parts
> and to recreate the broken PGs. This would not be automatic.
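>
> For what it's worth, that manual step would look something along
> these lines (assuming PG 1.0 is one of the broken ones):
>
>   ceph pg 1.0 mark_unfound_lost delete
>   ceph osd force-create-pg 1.0 --yes-i-really-mean-it
>
> i.e. you tell the cluster to forget the unfound objects and to
> recreate the PG empty, by hand, once per affected PG.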
>
>
> --
> May the most significant bit of your life be positive.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



