FWIW, the snapshot was in pool cephVMs01_comp, which does use compression.
How is your pg distribution on your osd devices?
Looks like the PGs are not perfectly balanced, but it doesn't seem too bad:
ceph osd df tree
ID  CLASS  WEIGHT    REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME
-1         13.10057         -   13 TiB  7.7 TiB  7.6 TiB  150 MiB   47 GiB  5.4 TiB  58.41  1.00    -          root default
-3          4.36659         -  4.4 TiB  2.6 TiB  2.5 TiB   50 MiB   16 GiB  1.8 TiB  58.43  1.00    -          host maigmo01
 0  ssd     1.74660   1.00000  1.7 TiB  971 GiB  966 GiB   15 MiB  5.8 GiB  817 GiB  54.31  0.93  123      up  osd.0
 1  ssd     1.31000   1.00000  1.3 TiB  794 GiB  790 GiB   17 MiB  4.2 GiB  547 GiB  59.23  1.01   99      up  osd.1
 2  ssd     1.31000   1.00000  1.3 TiB  847 GiB  841 GiB   18 MiB  6.0 GiB  495 GiB  63.12  1.08   99      up  osd.2
-5          4.36659         -  4.4 TiB  2.6 TiB  2.5 TiB   59 MiB   16 GiB  1.8 TiB  58.42  1.00    -          host maigmo02
 3  ssd     1.31000   1.00000  1.3 TiB  714 GiB  710 GiB   24 MiB  4.1 GiB  627 GiB  53.23  0.91   92      up  osd.3
 6  ssd     1.74660   1.00000  1.7 TiB  1.1 TiB  1.1 TiB   20 MiB  7.1 GiB  645 GiB  63.94  1.09  137      up  osd.6
 7  ssd     1.31000   1.00000  1.3 TiB  755 GiB  750 GiB   15 MiB  4.4 GiB  587 GiB  56.26  0.96   92      up  osd.7
-2          4.36739         -  4.4 TiB  2.6 TiB  2.5 TiB   41 MiB   15 GiB  1.8 TiB  58.39  1.00    -          host maigmo04
12  ssd     1.74660   1.00000  1.7 TiB  1.1 TiB  1.1 TiB   16 MiB  6.5 GiB  700 GiB  60.83  1.04  133      up  osd.12
13  ssd     1.31039   1.00000  1.3 TiB  634 GiB  631 GiB   10 MiB  2.8 GiB  708 GiB  47.24  0.81   83      up  osd.13
14  ssd     1.31039   1.00000  1.3 TiB  890 GiB  884 GiB   14 MiB  5.6 GiB  452 GiB  66.29  1.13  105      up  osd.14
                        TOTAL   13 TiB  7.7 TiB  7.6 TiB  150 MiB   47 GiB  5.4 TiB  58.41
MIN/MAX VAR: 0.81/1.13  STDDEV: 5.72
This cluster creates data at a slow rate, maybe around 300GB a year.
Maybe it's time for a reweight...
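For reference, a minimal sketch of the usual ways to even that out (just as an example, not something tested on this cluster):

# Dry run: show what reweight-by-utilization would change, without applying it
ceph osd test-reweight-by-utilization

# Or let the balancer module handle it with upmap (needs all clients >= Luminous)
ceph balancer mode upmap
ceph balancer on
ceph balancer status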
Do you have enough assigned pgs?
The autoscaler is enabled and it believes that the pools have the right
amount of PGs:
ceph osd pool autoscale-status
POOL                   SIZE    TARGET SIZE  RATE  RAW CAPACITY  RATIO   TARGET RATIO  EFFECTIVE RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE
cephVMs01              19                   3.0   13414G        0.0000                                 1.0   32                  on
cephFS01_metadata      167.2M               3.0   13414G        0.0000                                 4.0   32                  on
cephFS01_data          0                    3.0   13414G        0.0000                                 1.0   32                  on
cephDATA01             742.0G               3.0   13414G        0.1659                                 1.0   64                  on
cephMYSQL01            357.7G               3.0   13414G        0.0800                                 1.0   32                  on
device_health_metrics  249.8k               3.0   13414G        0.0000                                 1.0   1                   on
cephVMs01_comp         1790G                3.0   13414G        0.4004
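If one ever wanted more PGs for a pool than the autoscaler picks, a pool can still be sized manually; a sketch (pool name and value here are only illustrative, not a recommendation):

# Stop the autoscaler from fighting the manual setting, then bump pg_num
ceph osd pool set cephVMs01_comp pg_autoscale_mode off
ceph osd pool set cephVMs01_comp pg_num 64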
Istvan Szabo
Staff Infrastructure Engineer
---------------------------------------------------
Agoda Services Co., Ltd.
e: istvan.szabo@xxxxxxxxx
---------------------------------------------------
On 2023. Jan 27., at 23:30, Victor Rodriguez
<vrodriguez@xxxxxxxxxxxxx> wrote:
Ah yes, checked that too. Monitors and OSDs report via ceph config
show-with-defaults that bluefs_buffered_io is set to true as the default
setting (it isn't overridden somewhere).
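For reference, two ways to double-check the effective value on a running OSD (osd.0 used only as an example):

# Ask the daemon directly for the value it is running with
ceph tell osd.0 config get bluefs_buffered_io

# Or look it up in the config database, defaults included
ceph config show-with-defaults osd.0 | grep bluefs_buffered_io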
On 1/27/23 17:15, Wesley Dillingham wrote:
I hit this issue once on a Nautilus cluster and changed the OSD
parameter bluefs_buffered_io = true (it was set to false). I believe the
default of this parameter was switched from false to true in release
14.2.20; however, perhaps you could still check what your OSDs are
configured with in regard to this config item.
Respectfully,
Wes Dillingham
wes@xxxxxxxxxxxxxxxxx
LinkedIn <http://www.linkedin.com/in/wesleydillingham>
On Fri, Jan 27, 2023 at 8:52 AM Victor Rodriguez
<vrodriguez@xxxxxxxxxxxxx> wrote:
Hello,
Asking for help with an issue. Maybe someone has a clue about what's
going on.
Using Ceph 15.2.17 on Proxmox 7.3. A big VM had a snapshot and I removed
it. A bit later, nearly half of the PGs of the pool entered snaptrim and
snaptrim_wait state, as expected. The problem is that such operations
ran extremely slowly and client I/O was nearly nothing, so all VMs in the
cluster got stuck as they could not do I/O to the storage. Taking and
removing big snapshots is a normal operation that we do often, and this
is the first time I see this issue in any of my clusters.
Disks are all Samsung PM1733 and the network is 25G. It gives us plenty
of performance for the use case and we have never had an issue with the
hardware. Both disk I/O and network I/O were very low. Still, client I/O
seemed to get queued forever. Disabling snaptrim (ceph osd set
nosnaptrim) stops any active snaptrim operation and client I/O resumes
back to normal. Enabling snaptrim again makes client I/O almost halt
again.
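For reference, the flag toggling and a quick way to see which PGs are still trimming looks something like this:

ceph osd set nosnaptrim              # pause snaptrim cluster-wide
ceph osd unset nosnaptrim            # resume it
ceph pg ls snaptrim snaptrim_wait    # list PGs still in those states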
I've been playing with some settings:
ceph tell 'osd.*' injectargs '--osd-max-trimming-pgs 1'
ceph tell 'osd.*' injectargs '--osd-snap-trim-sleep 30'
ceph tell 'osd.*' injectargs '--osd-snap-trim-sleep-ssd 30'
ceph tell 'osd.*' injectargs '--osd-pg-max-concurrent-snap-trims 1'
None really seemed to help. Also tried restarting OSD services.
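As an aside, injectargs changes are runtime-only and revert when an OSD restarts. To keep the same values across restarts, the central config database could be used instead, roughly like this (same values as tried above):

ceph config set osd osd_max_trimming_pgs 1
ceph config set osd osd_snap_trim_sleep_ssd 30
ceph config set osd osd_pg_max_concurrent_snap_trims 1
# Verify what a given OSD actually ends up using
ceph tell osd.0 config get osd_snap_trim_sleep_ssd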
This cluster was upgraded from 14.2.x to 15.2.17 a couple of months ago.
Is there any setting that must be changed which may cause this problem?
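One thing that is sometimes worth confirming after a 14.x to 15.x upgrade (only a hunch, not a known fix for this) is that the cluster-wide release flag was bumped and all daemons are on the same version:

ceph osd dump | grep require_osd_release
ceph versions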
I have scheduled a maintenance window; what should I look for to
diagnose this problem?
Any help is very appreciated. Thanks in advance.
Victor
--
_______________________________________________
SOLTECSIS SOLUCIONES TECNOLOGICAS, S.L.
Víctor Rodríguez Cortés
R&D Department
Phone: 966 446 046
vrodriguez@xxxxxxxxxxxxx
www.soltecsis.com
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx