Re: Ceph flash deployment

With current kernels, that advice is no longer valid for all-flash
clusters: cfq is not an option at all there (the multi-queue block
layer only offers none, mq-deadline, kyber, and bfq), and whether
readahead helps depends on your workload (in other words, it can help
or hurt), so it cannot be part of a universally applicable set of
tuning recommendations. Also, look again: the title talks about
all-flash deployments, while the benchmark context talks about
7200 RPM HDDs!
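
Not a recommendation, just a way to verify: a minimal Python sketch
that reads sysfs and shows which I/O schedulers the running kernel
actually offers per block device (on blk-mq kernels cfq will simply
not be in the list; the active one is shown in brackets) together
with the current read_ahead_kb value:

    #!/usr/bin/env python3
    # Print the available I/O schedulers and the readahead setting for
    # each block device, as exposed by sysfs.
    import glob
    import os

    for queue in sorted(glob.glob("/sys/block/*/queue")):
        dev = queue.split("/")[3]
        with open(os.path.join(queue, "scheduler")) as f:
            schedulers = f.read().strip()
        with open(os.path.join(queue, "read_ahead_kb")) as f:
            read_ahead_kb = f.read().strip()
        print(f"{dev}: schedulers = {schedulers}, "
              f"read_ahead_kb = {read_ahead_kb}")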

On Wed, Nov 4, 2020 at 12:37 AM Seena Fallah <seenafallah@xxxxxxxxx> wrote:
>
> Thanks for your useful information.
>
> Could you also say whether the kernel and disk configuration advice is still valid for BlueStore or not? I mean read_ahead_kb and the disk scheduler.
>
> Thanks.
>
> On Tue, Nov 3, 2020 at 10:55 PM Alexander E. Patrakov <patrakov@xxxxxxxxx> wrote:
>>
>> On Tue, Nov 3, 2020 at 6:30 AM Seena Fallah <seenafallah@xxxxxxxxx> wrote:
>> >
>> > Hi all,
>> >
>> > Is this guide still valid for a BlueStore deployment with Nautilus or
>> > Octopus?
>> > https://tracker.ceph.com/projects/ceph/wiki/Tuning_for_All_Flash_Deployments
>>
>> Some of the guidance is of course outdated.
>>
>> E.g., at the time that page was written, 1x 40GbE was indeed state of
>> the art in the networking world, but now 100GbE network cards are
>> affordable, and with 6 NVMe drives per server even that might be a
>> bottleneck if the clients use a large block size (>64 KB) and call
>> fsync() only at the end.
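>>
>> As a rough back-of-the-envelope check, a tiny Python snippet (the
>> ~3 GB/s per drive figure is my assumption, adjust for your hardware):
>>
>>     # Aggregate NVMe bandwidth vs. a single 100GbE link (approximate;
>>     # the per-drive number is an assumption, not a measurement).
>>     nvme_drives = 6
>>     gb_per_s_per_drive = 3.0            # GB/s sequential, assumed
>>     nic_gbit_per_s = 100                # Gbit/s
>>     disk_total = nvme_drives * gb_per_s_per_drive   # 18 GB/s
>>     nic_total = nic_gbit_per_s / 8.0                # 12.5 GB/s
>>     print(f"disks: {disk_total} GB/s, network: {nic_total} GB/s")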
>>
>> Regarding NUMA tuning, Ceph has made some progress. If it finds that
>> your NVMe drive and your network card are on the same NUMA node, then,
>> with Nautilus or later, the OSD will pin itself to that NUMA node
>> automatically. In other words: choose strategically which PCIe slots
>> to use, maybe use two network cards, and you will not have to do any
>> tuning or manual pinning.
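>>
>> If you want to verify the placement yourself, a small Python sketch
>> (device names are placeholders, and the exact sysfs paths can vary
>> between NVMe drivers and platforms):
>>
>>     # Print the NUMA node a PCI device is attached to, as reported
>>     # by sysfs.  -1 means the platform did not report a node.
>>     def numa_node(path):
>>         try:
>>             with open(path) as f:
>>                 return f.read().strip()
>>         except OSError:
>>             return "unknown"
>>
>>     # Replace eth0 / nvme0 with your actual devices.
>>     print("eth0:", numa_node("/sys/class/net/eth0/device/numa_node"))
>>     print("nvme0:", numa_node("/sys/class/nvme/nvme0/device/numa_node"))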
>>
>> Partitioning the NVMe was also popular advice in the past, but now
>> that there are "osd op num shards" and "osd op num threads per shard"
>> parameters, with sensible default values, it is something that tends
>> not to help.
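>>
>> If you are curious which values your cluster is effectively running
>> with, you can ask the cluster configuration (a sketch; assumes a
>> working "ceph" CLI and an admin keyring on the node):
>>
>>     # Query the effective OSD sharding options from the cluster config.
>>     import subprocess
>>
>>     for opt in ("osd_op_num_shards", "osd_op_num_threads_per_shard"):
>>         out = subprocess.run(["ceph", "config", "get", "osd", opt],
>>                              capture_output=True, text=True, check=True)
>>         print(opt, "=", out.stdout.strip())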
>>
>> Filesystem considerations in that document obviously apply only to
>> Filestore, which is something you should not use.
>>
>> A large number of PGs per OSD helps achieve a more uniform data
>> distribution, but actually hurts performance a little bit.
>>
>> The advice regarding the "performance" cpufreq governor is still
>> valid, but you might also look at disabling the deepest idle states
>> (i.e. benchmark it for your workload specifically).
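>>
>> For completeness, a sketch of how one could inspect the governor and
>> the deepest idle state via sysfs (generic Linux interfaces, nothing
>> Ceph-specific; writing the "disable" file needs root, and whether it
>> helps is exactly what you should benchmark):
>>
>>     # Show the cpufreq governor and the deepest cpuidle state per CPU;
>>     # optionally disable that deepest state (requires root to write).
>>     import glob, os
>>
>>     for cpu in sorted(glob.glob("/sys/devices/system/cpu/cpu[0-9]*")):
>>         gov = os.path.join(cpu, "cpufreq/scaling_governor")
>>         governor = open(gov).read().strip() if os.path.exists(gov) else "n/a"
>>         states = sorted(glob.glob(os.path.join(cpu, "cpuidle/state[0-9]*")),
>>                         key=lambda p: int(p.rsplit("state", 1)[1]))
>>         if not states:
>>             continue
>>         deepest = states[-1]
>>         name = open(os.path.join(deepest, "name")).read().strip()
>>         print(f"{os.path.basename(cpu)}: governor={governor}, "
>>               f"deepest C-state={name}")
>>         # To disable the deepest state (as root), uncomment:
>>         # open(os.path.join(deepest, "disable"), "w").write("1")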
>>
>> --
>> Alexander E. Patrakov
>> CV: http://pc.cd/PLz7



-- 
Alexander E. Patrakov
CV: http://pc.cd/PLz7
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


