Re: SLOW_OPS problems

Hi Tim,

Do you see the behaviour across all devices, or does it only affect one
type/manufacturer?
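
For example, the drive model behind each OSD can be cross-checked against
the slow-ops reports with something like this (the OSD id in the second
command is just a placeholder):

# ceph device ls
# ceph osd metadata 3 | grep -E 'device_ids|devices'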

Joachim


www.clyso.com

Hohenzollernstr. 27, 80801 Munich

Utting a. A. | HR: Augsburg | HRB: 25866 | USt. ID-Nr.: DE2754306

Tim Sauerbein <sauerbein@xxxxxxxxxx> wrote on Sun, 29 Sept 2024, 23:32:

> Dear list,
>
> I have a small cluster (Reef 18.2.4) with 7 hosts and 3-4 OSDs each
> (960GB/1.92TB mixed Intel D3-S4610, Samsung SM883, PM897 SSDs):
>
>   cluster:
>     id:     ecff3ce8-539b-443e-a492-da428f4aa9e9
>     health: HEALTH_OK
>
>   services:
>     mon: 5 daemons, quorum titan,mangan,kalium,argon,chromium (age 2w)
>     mgr: mangan(active, since 2w), standbys: titan, argon
>     osd: 22 osds: 22 up (since 2w), 22 in (since 3M)
>
>   data:
>     pools:   2 pools, 513 pgs
>     objects: 2.76M objects, 7.0 TiB
>     usage:   16 TiB used, 15 TiB / 31 TiB avail
>     pgs:     513 active+clean
>
> The cluster stores RBD volumes for virtual machines.
>
> For a couple of months now the cluster has been reporting slow ops on some
> OSDs and marking some PGs as laggy. This happens once or twice a day,
> sometimes more often and sometimes not at all for a few days, at completely
> random times, independent of when snapshots are deleted and trimmed and
> independent of the I/O load or the load on the hosts.
>
> After about 30 seconds, during which the write speed on the VMs drops to
> zero, everything returns to normal. I cannot reproduce the slow ops
> manually by creating write load on the cluster; even writing continuously
> at 300-400 MB/s for 20 minutes does not cause any problems.
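>
> For reference, the op history on an affected OSD can be dumped via its
> admin socket while the problem is ongoing or shortly afterwards, to see
> which stage the slow ops spend their time in (osd.12 is just an example):
>
> # ceph daemon osd.12 dump_ops_in_flight
> # ceph daemon osd.12 dump_historic_slow_ops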
>
> See the attached log file for an example of a typical occurrence. I have
> also measured the write load on the disks during the problem with iostat,
> which just shows the writes stalling (also attached).
>
> The OSDs with slow ops are completely random; any of the disks shows up
> once in a while.
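>
> To see which OSDs have reported slow ops over time, the cluster log can be
> searched, for example (the line count and log path are just examples):
>
> # ceph log last 1000 info cluster | grep -i 'slow ops'
> # grep -i 'slow ops' /var/log/ceph/ceph.log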
>
> Current config (I've tried tuning the snaptrim and scrub settings, which
> didn't help):
>
> # ceph config dump
> WHO     MASK  LEVEL     OPTION                                 VALUE        RO
> global        advanced  auth_client_required                   cephx        *
> global        advanced  auth_cluster_required                  cephx        *
> global        advanced  auth_service_required                  cephx        *
> global        advanced  bdev_async_discard                     true
> global        advanced  bdev_enable_discard                    true
> global        advanced  public_network                         10.0.4.0/24  *
> mon           advanced  auth_allow_insecure_global_id_reclaim  false
> mgr           advanced  mgr/balancer/active                    true
> mgr           advanced  mgr/balancer/mode                      upmap
> mgr           unknown   mgr/pg_autoscaler/autoscale_profile    scale-up     *
> osd           basic     osd_memory_target                      4294967296
> osd           advanced  osd_pg_max_concurrent_snap_trims       1
> osd           advanced  osd_scrub_begin_hour                   23
> osd           advanced  osd_scrub_end_hour                     4
> osd           advanced  osd_scrub_sleep                        1.000000
> osd           advanced  osd_snap_trim_priority                 1
> osd           advanced  osd_snap_trim_sleep                    2.000000
> osd.0         basic     osd_mclock_max_capacity_iops_ssd       29199.674019
> osd.1         basic     osd_mclock_max_capacity_iops_ssd       31554.530141
> osd.10        basic     osd_mclock_max_capacity_iops_ssd       25949.821194
> osd.11        basic     osd_mclock_max_capacity_iops_ssd       26300.596265
> osd.12        basic     osd_mclock_max_capacity_iops_ssd       25167.331294
> osd.13        basic     osd_mclock_max_capacity_iops_ssd       21606.610828
> osd.14        basic     osd_mclock_max_capacity_iops_ssd       27894.095121
> osd.15        basic     osd_mclock_max_capacity_iops_ssd       25929.047047
> osd.16        basic     osd_mclock_max_capacity_iops_ssd       15423.600235
> osd.17        basic     osd_mclock_max_capacity_iops_ssd       25097.493934
> osd.18        basic     osd_mclock_max_capacity_iops_ssd       25966.188007
> osd.19        basic     osd_mclock_max_capacity_iops_ssd       23628.746459
> osd.2         basic     osd_mclock_max_capacity_iops_ssd       32157.280832
> osd.20        basic     osd_mclock_max_capacity_iops_ssd       22722.682745
> osd.3         basic     osd_mclock_max_capacity_iops_ssd       33951.086556
> osd.4         basic     osd_mclock_max_capacity_iops_ssd       22736.907664
> osd.5         basic     osd_mclock_max_capacity_iops_ssd       21916.777510
> osd.6         basic     osd_mclock_max_capacity_iops_ssd       29984.954749
> osd.7         basic     osd_mclock_max_capacity_iops_ssd       26757.965797
> osd.8         basic     osd_mclock_max_capacity_iops_ssd       22738.921429
> osd.9         basic     osd_mclock_max_capacity_iops_ssd       24635.156413
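>
> For completeness, the effective values on a running OSD can be
> cross-checked like this (osd.0 is just an example):
>
> # ceph config show osd.0 bdev_enable_discard
> # ceph config get osd.0 osd_mclock_max_capacity_iops_ssd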
>
>
> Any help would be much appreciated!
>
> Thanks,
> Tim
>
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


