Hello,

Our Ceph cluster's performance has become horrifically slow over the past few months. Nobody here is terribly familiar with Ceph, and we've inherited this cluster without much direction.

Architecture: 40Gbps QDR IB fabric between all Ceph nodes and our oVirt VM hosts. 11 OSD nodes with a total of 163 OSDs. 14 pools, 3616 PGs, 1.19PiB raw capacity.

Ceph versions:

{
    "mon": {
        "ceph version 12.2.11 (26dc3775efc7bb286a1d6d66faee0ba30ea23eee) luminous (stable)": 3
    },
    "mgr": {
        "ceph version 12.2.11 (26dc3775efc7bb286a1d6d66faee0ba30ea23eee) luminous (stable)": 3
    },
    "osd": {
        "ceph version 12.2.11 (26dc3775efc7bb286a1d6d66faee0ba30ea23eee) luminous (stable)": 118,
        "ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)": 22,
        "ceph version 12.2.13 (584a20eb0237c657dc0567da126be145106aa47e) luminous (stable)": 19
    },
    "mds": {},
    "overall": {
        "ceph version 12.2.11 (26dc3775efc7bb286a1d6d66faee0ba30ea23eee) luminous (stable)": 124,
        "ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)": 22,
        "ceph version 12.2.13 (584a20eb0237c657dc0567da126be145106aa47e) luminous (stable)": 19
    }
}

The majority of disks are spindles, but there are also NVMe SSDs. There is a lot of variability in drive sizes - two different sets of admins added disks sized between 6TB and 16TB - and I suspect this, combined with imbalanced weighting, is to blame. Performance on the oVirt VMs can dip as low as several *kilobytes* per second (!) on reads and a few MB/sec on writes. There are also several scrub errors. In short, it's a complete wreck.

STATUS:

[root@ceph-admin davei]# ceph -s
  cluster:
    id:     1b8d958c-e50b-40ef-a681-16cfeb9390b8
    health: HEALTH_ERR
            3 scrub errors
            Possible data damage: 3 pgs inconsistent

  services:
    mon: 3 daemons, quorum ceph1,ceph2,ceph3
    mgr: ceph3(active), standbys: ceph2, ceph1
    osd: 163 osds: 159 up, 158 in

  data:
    pools:   14 pools, 3616 pgs
    objects: 46.28M objects, 174TiB
    usage:   527TiB used, 694TiB / 1.19PiB avail
    pgs:     3609 active+clean
             4    active+clean+scrubbing+deep
             3    active+clean+inconsistent

  io:
    client:  74.3MiB/s rd, 96.0MiB/s wr, 3.85kop/s rd, 3.68kop/s wr

---

HEALTH:

[root@ceph-admin davei]# ceph health detail
HEALTH_ERR 3 scrub errors; Possible data damage: 3 pgs inconsistent
OSD_SCRUB_ERRORS 3 scrub errors
PG_DAMAGED Possible data damage: 3 pgs inconsistent
    pg 2.8a is active+clean+inconsistent, acting [13,152,127]
    pg 2.ce is active+clean+inconsistent, acting [145,13,152]
    pg 2.e8 is active+clean+inconsistent, acting [150,162,42]

---

CEPH OSD DF (not going to paste it all in here): https://pastebin.com/CNW5RKWx

What else am I missing in terms of what to share with you all? And any advice on how we should reweight these OSDs to get performance back to something usable?

Thanks all,

-Dave
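P.S. So as not to just throw the problem over the wall: here is what I
was planning to try for the imbalance, pieced together from the docs.
This assumes the mgr balancer module is usable on Luminous and that all
our clients are Luminous or newer (which I gather upmap mode requires) -
please tell me if any of this is a bad idea. The 110% overload threshold
is just a number I picked:

    # See how far utilization has drifted per OSD and per host
    ceph osd df tree

    # Dry run only: report what reweight-by-utilization *would* change,
    # treating OSDs above 110% of mean utilization as overloaded
    ceph osd test-reweight-by-utilization 110

    # Or, instead, let the mgr balancer rebalance continuously
    # (upmap mode requires luminous-or-newer clients)
    ceph mgr module enable balancer
    ceph osd set-require-min-compat-client luminous
    ceph balancer mode upmap
    ceph balancer on
    ceph balancer status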
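For the three inconsistent PGs, my tentative plan is below. I've read
that a blind repair can copy from the primary even when the primary
holds the bad object, so I want to inspect before repairing (pg 2.8a
shown; same for 2.ce and 2.e8):

    # Identify which replica is actually bad before repairing
    rados list-inconsistent-obj 2.8a --format=json-pretty

    # If it's a simple read/checksum error on one replica,
    # repair should be safe
    ceph pg repair 2.8a

    # Re-verify once the repair finishes
    ceph pg deep-scrub 2.8a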
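And to check whether a few dying spindles are dragging the whole cluster
down, I was going to look at per-OSD latency (osd.13 below is just an
example pulled from the acting sets above):

    # Commit/apply latency per OSD; outliers here usually point at
    # specific failing or overloaded disks
    ceph osd perf

    # On the host carrying a suspect OSD, dump its slowest recent ops
    ceph daemon osd.13 dump_historic_ops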