Re: Ceph Bluestore tweaks for Bcache

Hi,

I found a way to preserve the rotational=1 flag for bcache-backed OSDs across
reboots. A systemd drop-in for ceph-osd@.service runs a script that uses lsblk
to look for a bcache device somewhere below the OSD, but sets rotational=1 only
on the uppermost LVM device-mapper target. This is sufficient to keep the OSD
metadata at "rotational": "1".
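
To check the result (OSD id 2 is just an example):

    ceph osd metadata 2 | grep -e '"rotational"' -e bdev_type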

Setting rotational=1 additionally on the bcache device itself, or further up
the device stack, would certainly be possible, but would be even more
convoluted than this.

This simpler version works well when all bcache-backed OSDs are ultimately
backed by rotating media. If you mix HDD-backed and flash-backed bcaches on
the same host, you would need to dig further (see the sketch below).
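
For such a mixed host, an untested sketch of a per-OSD check that could replace
the blanket assumption in the script below (it reuses the dev_basename variable
defined there): copy the flag only if lsblk reports at least one rotational
device anywhere underneath the OSD's LVM volume.

# UNTESTED sketch for hosts mixing HDD- and flash-backed bcaches:
# only set rotational=1 if some device below the OSD reports rotational=1
if lsblk --list --inverse --noheadings -o rota "/dev/${dev_basename}" | grep -qw 1; then
    echo 1 >"/sys/block/${dev_basename}/queue/rotational"
fi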

Regards
Matthias

#-----------------------------------------------------------------------

systemd drop-in at
/etc/systemd/system/ceph-osd@.service.d/10-set-bcache-rotational-flags.conf:

[Service]
ExecStartPre=/usr/local/sbin/set-bcache-rotational-flags %i
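
After installing the drop-in and the script below, reload systemd so the
drop-in takes effect (again, OSD id 2 is just an example):

    chmod +x /usr/local/sbin/set-bcache-rotational-flags
    systemctl daemon-reload
    systemctl restart ceph-osd@2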

#-----------------------------------------------------------------------

/usr/local/sbin/set-bcache-rotational-flags:

#!/bin/sh

OSD_ID=$1
echo "# set rotational flag for osd.${OSD_ID} block device"

date
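# dump the current rotational flags of all block devices;
# ExecStartPre output ends up in the systemd journal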
grep . /sys/block/*/queue/rotational | grep -v -e '^/sys/block/loop' -e '/sys/block/sr'

sleep 1

# ASSUMPTION: on this host, any bcache device holding OSD data is
#               backed by rotating media.
#
# This simplification allows to skip determining the exact bcache
# device and the exact backing device underneath this bcache device
# from where to copy the rotational flag to the ceph OSD LVM
# volume rotational flag.
#
# Instead, set the LVM volume rotational flag to '1' if lsblk
# finds any bcache device underneath. The rotational flag of the
# bcache device itself is not modified.

dev_basename=$(readlink -f "/var/lib/ceph/osd/ceph-${OSD_ID}/block" | xargs -r basename)

if [ -n "${dev_basename}" ]; then
    bcache_major=$(awk '$2=="bcache" { print $1; }' /proc/devices)
    if [ -n "${bcache_major}" ]; then
        if lsblk --list --inverse --noheadings -o name,maj:min "/dev/${dev_basename}" | grep -q "^bcache[0-9]*[[:space:]][[:space:]]*${bcache_major}:.*$"; then
            # this OSD sits on a bcache
            r="/sys/block/${dev_basename}/queue/rotational"
            if [ -e "${r}" ]; then
                echo "# setting rotational=1 on ${r}"
                echo "1" >"${r}"
            fi
        fi
    fi
fi


#-----------------------------------------------------------------------


On Thu, Feb 02, 2023 at 12:18:55AM +0100, Matthias Ferdinand wrote:
> ceph version: 17.2.0 on Ubuntu 22.04
>               non-containerized ceph from Ubuntu repos
>               cluster started on luminous
> 
> I have been using bcache on filestore on rotating disks for many years
> without problems.  Now converting OSDs to bluestore, there are some
> strange effects.
> 
> If I create the bcache device, set its rotational flag to '1', then do
>     ceph-volume lvm create ... --crush-device-class=hdd
> the OSD comes up with the right parameters and much improved latency
> compared to OSD directly on /dev/sdX. 
> 
>     ceph osd metadata ...
> shows
>     "bluestore_bdev_type": "hdd",
>     "rotational": "1"
> 
> But after reboot, the bcache rotational flag is set to '0' again, and the OSD
> now comes up with "rotational": "0".
> Latency immediately starts to increase (and keeps increasing over the
> next days, possibly due to accumulating fragmentation).
> 
> These wrong settings stay in place even if I stop the OSD, set the
> bcache rotational flag to '1' again and restart the OSD. I have found no
> way to get back to the original settings other than destroying and
> recreating the OSD. I guess I am just not seeing something obvious, like
> from where these settings get pulled at OSD startup.
> 
> I even created udev rules to set bcache rotational=1 at boot time,
> before any ceph daemon starts, but it did not help. Something running
> after these rules reset the bcache rotational flags back to 0.
> Haven't found the culprit yet, but not sure if it even matters.
> 
> Are these OSD settings (bluestore_bdev_type, rotational) persisted
> somewhere and can they be edited and pinned?
> 
> Alternatively, can I manually set and persist the relevant bluestore
> tunables (per OSD / per device class) so as to make the bcache
> rotational flag irrelevant after the OSD is first created?
> 
> Regards
> Matthias
> 
> 
> On Fri, Apr 08, 2022 at 03:05:38PM +0300, Igor Fedotov wrote:
> > Hi Frank,
> > 
> > in fact this parameter impacts OSD behavior both at build time and during
> > regular operation. It simply replaces the hdd/ssd auto-detection with a
> > manual specification, and hence the relevant config parameters are applied. If
> > a setting, e.g. min_alloc_size, is persisted at OSD creation, it won't be
> > updated; but if a setting can be changed at run time, it will be altered.
> > 
> > So the proper usage would definitely be manual ssd/hdd mode selection before
> > the first OSD creation, keeping that mode for the whole OSD
> > lifecycle. But technically one can change the mode at any arbitrary point in
> > time, which would result in run-time settings being out of sync with the
> > creation-time ones, with some unclear side effects.
> > 
> > Please also note that this setting was originally intended mostly for
> > development/testing purposes, not regular usage. Hence it's flexible but
> > rather unsafe if used improperly.
> > 
> > 
> > Thanks,
> > 
> > Igor
> > 
> > On 4/7/2022 2:40 PM, Frank Schilder wrote:
> > > Hi Richard and Igor,
> > > 
> > > are these tweaks required at build-time (osd prepare) only or are they required for every restart?
> > > 
> > > Is this setting "bluestore debug enforce settings=hdd" in the ceph config data base or set somewhere else? How does this work if deploying HDD- and SSD-OSDs at the same time?
> > > 
> > > Ideally, all these tweaks should be applicable and settable at creation time only without affecting generic settings (that is, at the ceph-volume command line and not via config side effects). Otherwise it becomes really tedious to manage these.
> > > 
> > > For example, would the following work-flow apply the correct settings *permanently* across restarts:
> > > 
> > > 1) Prepare OSD on fresh HDD with ceph-volume lvm batch --prepare ...
> > > 2) Assign dm_cache to logical OSD volume created in step 1
> > > 3) Start OSD, restart OSDs, boot server ...
> > > 
> > > I would assume that the HDD settings are burned into the OSD in step 1 and will be used in all future (re-)starts without the need to do anything despite the device being detected as non-rotational after step 2. Is this assumption correct?
> > > 
> > > Thanks and best regards,
> > > =================
> > > Frank Schilder
> > > AIT Risø Campus
> > > Bygning 109, rum S14
> > > 
> > > ________________________________________
> > > From: Richard Bade <hitrich@xxxxxxxxx>
> > > Sent: 06 April 2022 00:43:48
> > > To: Igor Fedotov
> > > Cc: Ceph Users
> > > Subject: [Warning Possible spam]   Re: Ceph Bluestore tweaks for Bcache
> > > 
> > > Just for completeness for anyone that is following this thread. Igor
> > > added that setting in Octopus, so unfortunately I am unable to use it
> > > as I am still on Nautilus.
> > > 
> > > Thanks,
> > > Rich
> > > 
> > > On Wed, 6 Apr 2022 at 10:01, Richard Bade <hitrich@xxxxxxxxx> wrote:
> > > > Thanks Igor for the tip. I'll see if I can use this to reduce the
> > > > number of tweaks I need.
> > > > 
> > > > Rich
> > > > 
> > > > On Tue, 5 Apr 2022 at 21:26, Igor Fedotov <igor.fedotov@xxxxxxxx> wrote:
> > > > > Hi Richard,
> > > > > 
> > > > > just FYI: one can use "bluestore debug enforce settings=hdd" config
> > > > > parameter to manually enforce HDD-related settings for a BlueStore OSD.
> > > > > 
> > > > > 
> > > > > Thanks,
> > > > > 
> > > > > Igor
> > > > > 
> > > > > On 4/5/2022 1:07 AM, Richard Bade wrote:
> > > > > > Hi Everyone,
> > > > > > I just wanted to share a discovery I made about running bluestore on
> > > > > > top of Bcache in case anyone else is doing this or considering it.
> > > > > > We've run Bcache under Filestore for a long time with good results but
> > > > > > recently rebuilt all the osds on bluestore. This caused some
> > > > > > degradation in performance that I couldn't quite put my finger on.
> > > > > > Bluestore osds have some smarts where they detect the disk type.
> > > > > > Unfortunately in the case of Bcache it detects as SSD, when in fact
> > > > > > the HDD parameters are better suited.
> > > > > > I changed the following parameters to match the HDD default values and
> > > > > > immediately saw my average osd latency during normal workload drop
> > > > > > from 6ms to 2ms. Peak performance didn't really change, but a test
> > > > > > machine that I have running a constant iops workload was much more
> > > > > > stable, as was the average latency.
> > > > > > Performance has returned to Filestore levels or better.
> > > > > > Here are the parameters.
> > > > > > 
> > > > > >    ; Make sure that we use values appropriate for HDD not SSD -
> > > > > >    ; Bcache gets detected as SSD
> > > > > >    bluestore_prefer_deferred_size = 32768
> > > > > >    bluestore_compression_max_blob_size = 524288
> > > > > >    bluestore_deferred_batch_ops = 64
> > > > > >    bluestore_max_blob_size = 524288
> > > > > >    bluestore_min_alloc_size = 65536
> > > > > >    bluestore_throttle_cost_per_io = 670000
> > > > > > 
> > > > > >    ; Try to improve responsiveness when some disks are fully utilised
> > > > > >    osd_op_queue = wpq
> > > > > >    osd_op_queue_cut_off = high
> > > > > > 
> > > > > > Hopefully someone else finds this useful.
> > > > > --
> > > > > Igor Fedotov
> > > > > Ceph Lead Developer
> > > > > 
> > > > > Looking for help with your Ceph cluster? Contact us at https://croit.io
> > > > > 
> > > > > croit GmbH, Freseniusstr. 31h, 81247 Munich
> > > > > CEO: Martin Verges - VAT-ID: DE310638492
> > > > > Com. register: Amtsgericht Munich HRB 231263
> > > > > Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
> > > > > 
> > 
> > -- 
> > Igor Fedotov
> > Ceph Lead Developer
> > 
> > Looking for help with your Ceph cluster? Contact us at https://croit.io
> > 
> > croit GmbH, Freseniusstr. 31h, 81247 Munich
> > CEO: Martin Verges - VAT-ID: DE310638492
> > Com. register: Amtsgericht Munich HRB 231263
> > Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
> > 
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



