Sorry, I replied to the wrong email thread before, so reposting this:

I think it's time to start pointing out that the 3/30/300 logic no longer
really holds true post-Octopus:

https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/message/CKRCB3HUR7UDRLHQGC7XXZPWCWNJSBNT/

Although I suppose in a way this makes it even harder to provide a sizing
recommendation.

On Fri, 27 Nov 2020 at 04:49, Burkhard Linke <
Burkhard.Linke@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx> wrote:

> Hi,
>
> On 11/26/20 12:45 PM, Richard Thornton wrote:
> > Hi,
> >
> > Sorry to bother you all.
> >
> > It’s a home server setup.
> >
> > Three nodes (ODROID-H2+ with 32GB RAM and dual 2.5Gbit NICs), two 14TB
> > 7200rpm SATA drives and an Optane 118GB NVMe in each node (OS boots
> > from eMMC).
>
> *snipsnap*
>
> > Is there a rough CephFS calculation (each file uses x bytes of
> > metadata)? I think I should be safe with 30GB, but now I read I should
> > double that (you should allocate twice the size of the biggest layer
> > to allow for compaction); since I only have 118GB and two OSDs, I will
> > have to go for 59GB (or whatever will fit)?
>
> The recommended size of 30 GB is due to the level design of RocksDB:
> data is stored in levels of increasing size. 30 GB is a kind of sweet
> spot between 3 GB and 300 GB (too small / way too large for most use
> cases). The recommendation to double the size for compaction is OK, but
> you will waste that capacity most of the time.
>
> In our CephFS instance we have ~115,000,000 files. Metadata is stored
> on 18 SSD-based OSDs. About 30-35 GB of raw capacity is currently in
> use, almost exclusively for metadata, omap and other internal data. You
> might be able to scale this down to your use case. Our average file
> size is approx. 5 MB, so you can also put a little bit on top in your
> case.
>
> If your working set (files accessed in a given time span) is rather
> small, you also have the option to use the SSD for a block device
> caching layer like bcache or dm-cache. In this setup the whole capacity
> will be used, and data operations on the OSDs will also benefit from
> the faster SSD. Your failure domain stays the same: if the SSD dies,
> your data disks will be useless.
>
> Otherwise I would recommend using DB partitions of the recommended size
> (do not forget to include some extra space for the WAL) and using the
> remaining capacity for extra SSD-based OSDs, similar to our setup. This
> will ensure that metadata access will be fast[tm].
>
>
> Regards,
>
> Burkhard
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
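
For anyone who wants to turn the figures above into a per-file estimate,
here is a rough back-of-the-envelope sketch (plain Python). It only reuses
numbers quoted in the thread -- ~115 million files, ~30-35 GB of metadata,
a 118 GB Optane shared by two OSDs per node -- and is purely illustrative,
not an official sizing formula; as noted at the top of the thread, the
3/30/300 assumption itself may no longer apply post-Octopus.

    # Back-of-the-envelope CephFS metadata / BlueStore DB sizing, using only
    # the numbers quoted in this thread. Illustrative, not an official formula.

    files            = 115_000_000   # files in Burkhard's CephFS instance
    metadata_used_gb = 32.5          # ~30-35 GB raw capacity used (metadata, omap, ...)

    bytes_per_file = metadata_used_gb * 1e9 / files
    print(f"~{bytes_per_file:.0f} bytes of raw metadata per file")   # roughly 280 B

    # How many files would a single 30 GB DB partition hold at that rate?
    db_gb = 30
    print(f"~{db_gb * 1e9 / bytes_per_file / 1e6:.0f} million files per 30 GB DB")

    # Richard's constraint: one 118 GB Optane shared by two OSDs per node.
    # Doubling 30 GB for compaction headroom would need 2 * 2 * 30 = 120 GB,
    # which does not fit -- hence the ~59 GB per OSD fallback.
    optane_gb, osds_per_node = 118, 2
    print(f"~{optane_gb / osds_per_node:.0f} GB of Optane available per OSD")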