On Tue, Oct 15, 2019 at 2:42 AM Jeremi Avenant <jeremi@xxxxxxxxxx> wrote:
Good day
I'm currently administering a Ceph cluster that consists of both HDDs and SSDs. The CRUSH rule for cephfs_data (erasure coded, "ec") currently writes to both device classes (HDD+SSD). Since we're experiencing high disk latency, I would like to change it so that cephfs_metadata (non-ec) writes only to SSD and cephfs_data (ec) writes only to HDD.
1) The first option that comes to mind would be to migrate each pool to a new rule, but that would mean moving a tonne of data around. (How is disk space calculated for this? If I use 600 TB in an EC pool, do I need another 600 TB free to move it over, or does the existing pool shrink as the new one grows while the data moves?)
2) I would like to know if the alternative is possible:
i.e. delete the SSDs from the default host buckets (leaving everything else as it is) and move the metadata pool to an SSD-based CRUSH rule.
However, I'm not sure whether this is possible, since it would mean deleting leaves from buckets in our default root. And when you later add a new SSD OSD, where does it end up?
crush map - http://pastefile.fr/6f37e7e594a61d0edd9dc947349c756b
ceph osd pool ls detail - http://pastefile.fr/0f215e1252ec58c144d9abfe1688adc8
osd tree - http://pastefile.fr/2acdd377a2db021b6af2996929b85082
If anyone has any input it would be greatly appreciated.
What version of Ceph are you running? You may be able to use device classes instead of munging the CRUSH tree.
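For example (a sketch, assuming a Luminous or later release; the rule name "replicated-ssd" is just an illustration), you can check which device classes CRUSH already knows about and create a replicated rule restricted to SSDs without touching the tree layout:

# List the device classes CRUSH has detected
ceph osd crush class ls
# Create a replicated rule that only selects SSD OSDs under the default root
ceph osd crush rule create-replicated replicated-ssd default host ssd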
Updating the rule to change the destinations will only move data around (it may be a large data movement) and will only need as much space as the PGs in flight use. For instance, if your PG size is 100 GB and you use 10+2 erasure coding, each PG shard takes about 10 GB on each OSD. If osd_max_backfills = 1, you only need about 10 GB of headroom on each OSD to make the data movement; if osd_max_backfills = 2, you need about 20 GB, since two PGs may be moved onto an OSD before any PGs are deleted off of it.
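To spell out that arithmetic with the same example numbers: each OSD holds one shard per PG, so a 100 GB PG with k=10 data chunks means 100 GB / 10 = 10 GB per shard, and the headroom needed per OSD is roughly osd_max_backfills x 10 GB. If you want to cap that during the move, something like the following works at runtime (the value of 1 is just the conservative choice):

# Temporarily limit concurrent backfills per OSD across the cluster
ceph tell osd.* injectargs '--osd_max_backfills 1'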
By changing the data pool's rule to use only the HDD device class, it will migrate the data off the SSDs and onto the HDDs (only moving PG shards as needed). Then you can change the replicated rule for the metadata pool to use only SSD, and it will migrate the PG replicas off the HDDs.
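Concretely, that could look something like this (a sketch: the rule and profile names, and the k/m values, are assumptions you'd match to your existing EC profile; the pool names are the ones from your mail):

# New EC profile/rule pinned to HDD (k and m must match the pool's existing profile)
ceph osd erasure-code-profile set ec-hdd k=10 m=2 crush-failure-domain=host crush-device-class=hdd
ceph osd crush rule create-erasure ec-hdd-rule ec-hdd
ceph osd pool set cephfs_data crush_rule ec-hdd-rule
# Point the metadata pool at the SSD-only replicated rule created earlier
ceph osd pool set cephfs_metadata crush_rule replicated-ssd

Each "pool set ... crush_rule" kicks off the corresponding backfill, so you may want to do them one at a time.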
Setting the following in /etc/ceph/ceph.conf on the OSD nodes and restarting the OSDs before starting the backfill will reduce its impact:
osd op queue = wpq
osd op queue cut off = high
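Once the OSDs are back up, you can confirm the settings took effect via the admin socket, e.g. (osd.0 is just an example id):

# Verify the op queue settings on a running OSD
ceph daemon osd.0 config get osd_op_queue
ceph daemon osd.0 config get osd_op_queue_cut_off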
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1