Re: Very slow snaptrim operations blocking client I/O


 



FWIW, the snapshot was in pool cephVMs01_comp, which does use compression.


How is your pg distribution on your osd devices?

Looks like the PGs are not perfectly balanced, but it doesn't seem too bad:

ceph osd df tree
ID  CLASS  WEIGHT    REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME
-1         13.10057         -   13 TiB  7.7 TiB  7.6 TiB  150 MiB   47 GiB  5.4 TiB  58.41  1.00    -          root default
-3          4.36659         -  4.4 TiB  2.6 TiB  2.5 TiB   50 MiB   16 GiB  1.8 TiB  58.43  1.00    -              host maigmo01
 0    ssd   1.74660   1.00000  1.7 TiB  971 GiB  966 GiB   15 MiB  5.8 GiB  817 GiB  54.31  0.93  123      up          osd.0
 1    ssd   1.31000   1.00000  1.3 TiB  794 GiB  790 GiB   17 MiB  4.2 GiB  547 GiB  59.23  1.01   99      up          osd.1
 2    ssd   1.31000   1.00000  1.3 TiB  847 GiB  841 GiB   18 MiB  6.0 GiB  495 GiB  63.12  1.08   99      up          osd.2
-5          4.36659         -  4.4 TiB  2.6 TiB  2.5 TiB   59 MiB   16 GiB  1.8 TiB  58.42  1.00    -              host maigmo02
 3    ssd   1.31000   1.00000  1.3 TiB  714 GiB  710 GiB   24 MiB  4.1 GiB  627 GiB  53.23  0.91   92      up          osd.3
 6    ssd   1.74660   1.00000  1.7 TiB  1.1 TiB  1.1 TiB   20 MiB  7.1 GiB  645 GiB  63.94  1.09  137      up          osd.6
 7    ssd   1.31000   1.00000  1.3 TiB  755 GiB  750 GiB   15 MiB  4.4 GiB  587 GiB  56.26  0.96   92      up          osd.7
-2          4.36739         -  4.4 TiB  2.6 TiB  2.5 TiB   41 MiB   15 GiB  1.8 TiB  58.39  1.00    -              host maigmo04
12    ssd   1.74660   1.00000  1.7 TiB  1.1 TiB  1.1 TiB   16 MiB  6.5 GiB  700 GiB  60.83  1.04  133      up          osd.12
13    ssd   1.31039   1.00000  1.3 TiB  634 GiB  631 GiB   10 MiB  2.8 GiB  708 GiB  47.24  0.81   83      up          osd.13
14    ssd   1.31039   1.00000  1.3 TiB  890 GiB  884 GiB   14 MiB  5.6 GiB  452 GiB  66.29  1.13  105      up          osd.14
                        TOTAL   13 TiB  7.7 TiB  7.6 TiB  150 MiB   47 GiB  5.4 TiB  58.41
MIN/MAX VAR: 0.81/1.13  STDDEV: 5.72

This cluster creates data at a slow rate, maybe around 300GB a year. Maybe it's time for a reweight...
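
If I do go that route, a rough sketch of what I'd try first (assuming the
balancer module behaves as expected on Octopus and the cluster's minimum
client compat allows upmap mode):

ceph balancer status
ceph osd set-require-min-compat-client luminous   # needed for upmap mode
ceph balancer mode upmap
ceph balancer on

# or the older, coarser approach (110 = only touch OSDs >10% above average):
ceph osd reweight-by-utilization 110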

Do you have enough assigned pgs?

Autoscaler is enabled and it believes that the pools have the right amount of PGs:

ceph osd pool autoscale-status
POOL                     SIZE    TARGET SIZE  RATE  RAW CAPACITY  RATIO   TARGET RATIO  EFFECTIVE RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE
cephVMs01                    19               3.0         13414G  0.0000                                  1.0      32              on
cephFS01_metadata        167.2M               3.0         13414G  0.0000                                  4.0      32              on
cephFS01_data                 0               3.0         13414G  0.0000                                  1.0      32              on
cephDATA01               742.0G               3.0         13414G  0.1659                                  1.0      64              on
cephMYSQL01              357.7G               3.0         13414G  0.0800                                  1.0      32              on
device_health_metrics    249.8k               3.0         13414G  0.0000                                  1.0       1              on
cephVMs01_comp            1790G               3.0         13414G  0.4004
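
If I ever decide the autoscaler's numbers are too conservative for a pool,
I understand the manual override would be along these lines (64 is just a
hypothetical value, not a recommendation):

ceph osd pool get cephVMs01_comp pg_num
ceph osd pool set cephVMs01_comp pg_num 64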



Istvan Szabo
Staff Infrastructure Engineer
---------------------------------------------------
Agoda Services Co., Ltd.
e: istvan.szabo@xxxxxxxxx
---------------------------------------------------

On 2023. Jan 27., at 23:30, Victor Rodriguez <vrodriguez@xxxxxxxxxxxxx> wrote:


Ah yes, checked that too. Monitors and OSDs report via ceph config
show-with-defaults that bluefs_buffered_io is set to true as the default
setting (it isn't overridden anywhere).
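
For reference, this is roughly how I checked it (osd.0 as an example; the
admin socket variant has to be run on the host where that OSD lives):

ceph config show-with-defaults osd.0 | grep bluefs_buffered_io
ceph daemon osd.0 config get bluefs_buffered_io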


On 1/27/23 17:15, Wesley Dillingham wrote:
I hit this issue once on a Nautilus cluster and changed the OSD
parameter bluefs_buffered_io = true (it was set to false). I believe the
default of this parameter was switched from false to true in release
14.2.20; still, perhaps you could check what your OSDs are
configured with in regard to this config item.
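
If it does turn out to be false, something along these lines should flip it
cluster-wide (just a sketch; if I recall correctly the OSDs need a restart
for the change to fully take effect):

ceph config set osd bluefs_buffered_io true
ceph config get osd bluefs_buffered_io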

Respectfully,

*Wes Dillingham*
wes@xxxxxxxxxxxxxxxxx
LinkedIn <http://www.linkedin.com/in/wesleydillingham>


On Fri, Jan 27, 2023 at 8:52 AM Victor Rodriguez
<vrodriguez@xxxxxxxxxxxxx> wrote:

   Hello,

   Asking for help with an issue. Maybe someone has a clue about what's
   going on.

    Using Ceph 15.2.17 on Proxmox 7.3. A big VM had a snapshot and I
    removed it. A bit later, nearly half of the PGs of the pool entered
    snaptrim and snaptrim_wait state, as expected. The problem is that
    such operations ran extremely slowly and client I/O was nearly
    nothing, so all VMs in the cluster got stuck as they could not do
    any I/O to the storage. Taking and removing big snapshots is a
    normal operation that we do often, and this is the first time I see
    this issue in any of my clusters.

    Disks are all Samsung PM1733 and the network is 25G. It gives us
    plenty of performance for the use case and we have never had an
    issue with the hardware.

    Both disk I/O and network I/O were very low. Still, client I/O
    seemed to get queued forever. Disabling snaptrim (ceph osd set
    nosnaptrim) stops any active snaptrim operation and client I/O
    resumes back to normal. Enabling snaptrim again makes client I/O
    almost halt again.
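
    In concrete terms, the toggle I've been using to get client I/O back:

    ceph osd set nosnaptrim      # pause all snap trimming; client I/O recovers
    ceph osd unset nosnaptrim    # resume trimming; client I/O stalls again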

   I've been playing with some settings:

   ceph tell 'osd.*' injectargs '--osd-max-trimming-pgs 1'
   ceph tell 'osd.*' injectargs '--osd-snap-trim-sleep 30'
   ceph tell 'osd.*' injectargs '--osd-snap-trim-sleep-ssd 30'
   ceph tell 'osd.*' injectargs '--osd-pg-max-concurrent-snap-trims 1'
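
    To double-check that the injected values actually landed (and to keep
    them across restarts), I believe the config-database equivalent would
    be something like:

    ceph config set osd osd_max_trimming_pgs 1
    ceph config set osd osd_snap_trim_sleep_ssd 30
    ceph config show osd.0 | grep -e snap_trim -e trimming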

   None really seemed to help. Also tried restarting OSD services.

    This cluster was upgraded from 14.2.x to 15.2.17 a couple of months
    ago. Is there any setting that must be changed after the upgrade
    which may be causing this problem?

    I have scheduled a maintenance window; what should I look for to
    diagnose this problem?
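
    In case it helps, this is roughly what I plan to look at during the
    window (a sketch; osd.0 is just a placeholder id):

    ceph osd perf                             # per-OSD commit/apply latency
    ceph daemon osd.0 ops                     # ops currently in flight
    ceph daemon osd.0 dump_historic_slow_ops  # recent slow ops, if any
    ceph daemon osd.0 perf dump               # bluestore/bluefs counters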

   Any help is very appreciated. Thanks in advance.

   Victor


   _______________________________________________
   ceph-users mailing list -- ceph-users@xxxxxxx
   To unsubscribe send an email to ceph-users-leave@xxxxxxx



--
_______________________________________________

SOLTECSIS SOLUCIONES TECNOLOGICAS, S.L.
Víctor Rodríguez Cortés
Departamento de I+D+I
Teléfono: 966 446 046
vrodriguez@xxxxxxxxxxxxx
www.soltecsis.com
_______________________________________________

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



