Re: Drop of performance after Nautilus to Pacific upgrade

Hello,

We've noticed a similar issue after upgrading our test 3-node cluster from
15.2.14-1~bpo10+1 to 16.1.0-1~bpo10+1.

Quick tests using rados bench:
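For reference, the invocation was roughly the following (a sketch; the pool
name is a placeholder, -t 16 is the default concurrency, and the 4 MiB write
size and ~120 s runtime match the figures below):

    # <pool> is a placeholder; adapt to your test pool
    rados bench -p <pool> 120 write -b 4194304 -t 16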

16.2.5-1~bpo10+1:

Total time run:         133.28
Total writes made:      576
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     17.2869
Stddev Bandwidth:       34.1485
Max bandwidth (MB/sec): 204
Min bandwidth (MB/sec): 0
Average IOPS:           4
Stddev IOPS:            8.55426
Max IOPS:               51
Min IOPS:               0
Average Latency(s):     3.59873
Stddev Latency(s):      5.99964
Max latency(s):         30.6307
Min latency(s):         0.0865062


After downgrading the OSDs:


15.2.14-1~bpo10+1:

Total time run:         120.135
Total writes made:      16324
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     543.524
Stddev Bandwidth:       21.7548
Max bandwidth (MB/sec): 580
Min bandwidth (MB/sec): 436
Average IOPS:           135
Stddev IOPS:            5.43871
Max IOPS:               145
Min IOPS:               109
Average Latency(s):     0.117646
Stddev Latency(s):      0.0391269
Max latency(s):         0.544229
Min latency(s):         0.0602735

We currently run this setup (output of ceph versions):

{
    "mon": {
        "ceph version 16.2.5 (0883bdea7337b95e4b611c768c0279868462204a) pacific (stable)": 2
    },
    "mgr": {
        "ceph version 15.2.14 (cd3bb7e87a2f62c1b862ff3fd8b1eec13391a5be) octopus (stable)": 3
    },
    "osd": {
        "ceph version 15.2.14 (cd3bb7e87a2f62c1b862ff3fd8b1eec13391a5be) octopus (stable)": 35
    },
    "mds": {},
    "overall": {
        "ceph version 15.2.14 (cd3bb7e87a2f62c1b862ff3fd8b1eec13391a5be) octopus (stable)": 38,
        "ceph version 16.2.5 (0883bdea7337b95e4b611c768c0279868462204a) pacific (stable)": 2
    }
}

which resolved the performance issues. All OSDs were newly created and fully
backfilled from the other nodes, both during the upgrade and when downgrading
back to 15.2.
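For completeness, the per-OSD recreation went roughly like this (a sketch,
not a drop-in script; <id> and /dev/sdX are placeholders, and we waited for
backfill to finish before moving to the next OSD):

    systemctl stop ceph-osd@<id>
    ceph osd destroy <id> --yes-i-really-mean-it   # keeps the id and CRUSH position
    ceph-volume lvm zap --destroy /dev/sdX         # wipe the old OSD device
    ceph-volume lvm create --osd-id <id> --data /dev/sdX
    ceph -s                                        # watch until backfill completes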


Best Regards,


Martin

On 05. 09. 21 at 19:45, Luis Domingues wrote:
> Hello,
>
> I run a test cluster of 3 machines with 24 HDDs each, running bare-metal on CentOS 8. Long story short, I get a bandwidth of ~1'200 MB/s from a rados bench writing 128k objects when the cluster is installed with Nautilus.
>
> When I upgrade the cluster to Pacific (using ceph-ansible to deploy and/or upgrade), my bandwidth drops to ~400 MB/s on the same rados bench.
>
> I am kind of clueless about what makes the performance drop so much. Does anyone have ideas on where I can dig to find the root of this difference?
>
> Thanks,
> Luis Domingues

-- 
Martin Mlynář

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
