That's actually an interesting question. On the 5.9 kernels, cfq does not seem to be available:

# cat /sys/block/sdj/queue/scheduler
[mq-deadline] kyber bfq none

What is the recommendation here?

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Seena Fallah <seenafallah@xxxxxxxxx>
Sent: 04 November 2020 19:01:18
To: Alexander E. Patrakov
Cc: ceph-users
Subject: Re: Ceph flash deployment

I see in this thread that someone says bluestore only works well with the
cfq scheduler:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-November/031063.html

For readahead, do you have any advice on how I can measure my workload to
see whether I should increase it or not?

Thanks.

On Wed, Nov 4, 2020 at 8:00 AM Alexander E. Patrakov <patrakov@xxxxxxxxx> wrote:

> With the latest kernel, this is not valid for all-flash clusters,
> simply because cfq is not an option at all there, and readahead
> usefulness depends on your workload (in other words, it can help or
> hurt) and therefore cannot be included in a universally-applicable set
> of tuning recommendations. Also, look again: the title talks about
> all-flash deployments, while the context of the benchmark talks about
> 7200RPM HDDs!
>
> On Wed, Nov 4, 2020 at 12:37 AM Seena Fallah <seenafallah@xxxxxxxxx> wrote:
> >
> > Thanks for your useful information.
> >
> > Can you please also point out which of the kernel and disk configuration
> > settings are still valid for bluestore? I mean read_ahead_kb and the
> > disk scheduler.
> >
> > Thanks.
> >
> > On Tue, Nov 3, 2020 at 10:55 PM Alexander E. Patrakov <patrakov@xxxxxxxxx> wrote:
> >>
> >> On Tue, Nov 3, 2020 at 6:30 AM Seena Fallah <seenafallah@xxxxxxxxx> wrote:
> >> >
> >> > Hi all,
> >> >
> >> > Is this guide still valid for a bluestore deployment with nautilus or
> >> > octopus?
> >> > https://tracker.ceph.com/projects/ceph/wiki/Tuning_for_All_Flash_Deployments
> >>
> >> Some of the guidance is of course outdated.
> >>
> >> E.g., at the time of that writing, 1x 40GbE was indeed state of the
> >> art in the networking world, but now 100GbE network cards are
> >> affordable, and with 6 NVMe drives per server, even that might be a
> >> bottleneck if the clients use a large block size (>64KB) and do an
> >> fsync() only at the end.
> >>
> >> Regarding NUMA tuning, Ceph has made some progress. If it finds that your
> >> NVMe and your network card are on the same NUMA node, then, with
> >> Nautilus or later, the OSD will pin itself to that NUMA node
> >> automatically. I.e.: choose strategically which PCIe slots to use,
> >> maybe use two network cards, and you will not have to do any tuning or
> >> manual pinning.
> >>
> >> Partitioning the NVMe was also popular advice in the past, but now
> >> that there are "osd op num shards" and "osd op num threads per shard"
> >> parameters, with sensible default values, this is something that tends
> >> not to help.
> >>
> >> Filesystem considerations in that document obviously apply only to
> >> Filestore, which is something you should not use.
> >>
> >> A large number of PGs per OSD gives a more uniform data distribution,
> >> but actually hurts performance a little bit.
> >>
> >> The advice regarding the "performance" cpufreq governor is valid, but
> >> you might also look at (i.e. benchmark for your workload specifically)
> >> disabling the deepest idle states.
> >>
> >> --
> >> Alexander E. Patrakov
> >> CV: http://pc.cd/PLz7
> >
>
> --
> Alexander E. Patrakov
> CV: http://pc.cd/PLz7
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
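
For anyone who wants to check the two settings discussed in this thread (the
active I/O scheduler and read_ahead_kb) across several OSD devices, here is a
minimal Python sketch. It assumes the usual Linux sysfs layout, where
/sys/block/<dev>/queue/scheduler marks the active scheduler with square
brackets, as in the output quoted at the top of the thread. The function names
parse_scheduler and queue_settings are made up for illustration and are not
part of Ceph or any other tool. The sketch only reads the values; it makes no
recommendation about what they should be.

```python
from pathlib import Path


def parse_scheduler(text):
    """Parse one sysfs scheduler line, e.g. '[mq-deadline] kyber bfq none'.

    Returns (active, available): the bracketed entry, and every entry
    listed on the line with the brackets stripped.
    """
    active = None
    available = []
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]
            active = token
        available.append(token)
    return active, available


def queue_settings(device, sysfs_root="/sys/block"):
    """Read scheduler and readahead settings for one block device.

    'device' is a kernel name such as 'sdj'. The sysfs_root parameter is
    an illustrative hook so the function can be pointed at a test tree.
    """
    queue = Path(sysfs_root) / device / "queue"
    active, available = parse_scheduler((queue / "scheduler").read_text())
    read_ahead_kb = int((queue / "read_ahead_kb").read_text().strip())
    return {
        "device": device,
        "active_scheduler": active,
        "available_schedulers": available,
        "read_ahead_kb": read_ahead_kb,
    }
```

Calling queue_settings("sdj") on the host from the first message would report
mq-deadline as the active scheduler and show that cfq is not among the
available ones, which matches what Alexander describes for recent kernels.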