Ceph OSD imbalance and performance

Hello,

Our Ceph cluster's performance has become horrifically slow over the past
few months.

Nobody here is terribly familiar with Ceph, and we inherited this cluster
without much direction.

Architecture: 40Gbps QDR InfiniBand fabric between all Ceph nodes and our
oVirt VM hosts. 11 OSD nodes with a total of 163 OSDs. 14 pools, 3616 PGs,
1.19PiB total capacity.
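
(Back-of-the-envelope, and please correct me if I'm reading this wrong:
assuming mostly 3x replication, which matches the acting sets further
down, that's roughly 3616 * 3 / 163 ≈ 67 PG replicas per OSD on average,
which I gather is on the low side of the ~100 per OSD usually recommended,
and the mixed drive sizes presumably spread even that unevenly.)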

Ceph versions:

{
  "mon": {
    "ceph version 12.2.11 (26dc3775efc7bb286a1d6d66faee0ba30ea23eee) luminous (stable)": 3
  },
  "mgr": {
    "ceph version 12.2.11 (26dc3775efc7bb286a1d6d66faee0ba30ea23eee) luminous (stable)": 3
  },
  "osd": {
    "ceph version 12.2.11 (26dc3775efc7bb286a1d6d66faee0ba30ea23eee) luminous (stable)": 118,
    "ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)": 22,
    "ceph version 12.2.13 (584a20eb0237c657dc0567da126be145106aa47e) luminous (stable)": 19
  },
  "mds": {},
  "overall": {
    "ceph version 12.2.11 (26dc3775efc7bb286a1d6d66faee0ba30ea23eee) luminous (stable)": 124,
    "ceph version 12.2.12 (1436006594665279fe734b4c15d7e08c13ebd777) luminous (stable)": 22,
    "ceph version 12.2.13 (584a20eb0237c657dc0567da126be145106aa47e) luminous (stable)": 19
  }
}

The majority of the disks are spindles, but there are also NVMe SSDs.
There is a lot of variability in drive sizes: two different sets of admins
added disks ranging from 6TB to 16TB, and I suspect this, combined with
imbalanced weighting, is to blame.
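
Related question: the NVMe and spindle OSDs may well be mixed into the
same pools. If I understand CRUSH device classes correctly, something like
this should show whether they are separated (we haven't run these yet, I'm
just going off the Luminous docs):

# List the CRUSH device classes in use (e.g. hdd, ssd, nvme)
ceph osd crush class ls

# Show the per-class shadow hierarchy, i.e. which part of the CRUSH
# tree each device class forms
ceph osd crush tree --show-shadow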

Performance on the oVirt VMs can dip as low as several *kilobytes* per
second (!) on reads and a few MB/sec on writes. There are also several
scrub errors. In short, it's a complete wreck.

STATUS:

[root@ceph-admin davei]# ceph -s
  cluster:
    id:     1b8d958c-e50b-40ef-a681-16cfeb9390b8
    health: HEALTH_ERR
            3 scrub errors
            Possible data damage: 3 pgs inconsistent

  services:
    mon: 3 daemons, quorum ceph1,ceph2,ceph3
    mgr: ceph3(active), standbys: ceph2, ceph1
    osd: 163 osds: 159 up, 158 in

  data:
    pools:   14 pools, 3616 pgs
    objects: 46.28M objects, 174TiB
    usage:   527TiB used, 694TiB / 1.19PiB avail
    pgs:     3609 active+clean
             4    active+clean+scrubbing+deep
             3    active+clean+inconsistent

  io:
    client:   74.3MiB/s rd, 96.0MiB/s wr, 3.85kop/s rd, 3.68kop/s wr
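
Side note: the status shows 163 OSDs with only 159 up and 158 in. To see
which ones are down (and which hosts they live on), I assume this is
enough:

ceph osd tree | grep -i down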

---
HEALTH:

[root@ceph-admin davei]# ceph health detail
HEALTH_ERR 3 scrub errors; Possible data damage: 3 pgs inconsistent
OSD_SCRUB_ERRORS 3 scrub errors
PG_DAMAGED Possible data damage: 3 pgs inconsistent
    pg 2.8a is active+clean+inconsistent, acting [13,152,127]
    pg 2.ce is active+clean+inconsistent, acting [145,13,152]
    pg 2.e8 is active+clean+inconsistent, acting [150,162,42]
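
For what it's worth, all three inconsistent PGs are in pool 2, and osd.13
and osd.152 each appear in two of the three acting sets, which makes me
wonder about a failing drive. Based on the docs my plan was roughly the
following, but I'd appreciate a sanity check before we actually run the
repair:

# See what is actually inconsistent in each damaged PG
# (repeat for 2.ce and 2.e8)
rados list-inconsistent-obj 2.8a --format=json-pretty

# If it's just a bad replica, have Ceph rebuild it from the good copies
ceph pg repair 2.8a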
---
CEPH OSD DF:

(not going to paste it all in here): https://pastebin.com/CNW5RKWx
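
If I'm reading that output right, the MIN/MAX VAR and STDDEV figures on
its summary line are probably the quickest measure of how bad the
imbalance actually is.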

What other information should I share to help diagnose this?

Any advice on how we should reweight these OSDs so performance improves?
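
From reading the Luminous docs, these are the two approaches I'm
considering. Please tell me if either is the wrong tool here:

# Option 1: one-shot reweighting. Dry run first to preview the changes;
# 115 means only OSDs more than 15% over the mean utilization are touched:
ceph osd test-reweight-by-utilization 115
# ...and then apply the same thing for real:
ceph osd reweight-by-utilization 115

# Option 2: the mgr balancer module in upmap mode (my understanding is
# that this requires every client to be Luminous or newer first):
ceph osd set-require-min-compat-client luminous
ceph mgr module enable balancer
ceph balancer mode upmap
ceph balancer on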

Thanks all,
-Dave