Hi Alex,

What kind of clients do you use? Is it KVM (QEMU) using the NBD driver, the kernel client, or something else?

Regards,
Kees

On 17-11-18 20:17, Alex Litvak wrote:
> Hello everyone,
>
> I am trying to troubleshoot a cluster exhibiting huge latency spikes.
> I cannot quite catch it because it happens during light activity
> and randomly affects one OSD node out of the 3 in the pool.
>
> This is a filestore setup.
> I see some OSDs exhibit an applied latency of 400 ms, and the 1-minute
> load average shoots to 60. Client commit latency with queue shoots to
> 300 ms, and journal latency (return write ack for client; journal on an
> Intel DC-S3710 SSD) shoots to 40 ms.
>
> op_w_process_latency showed 250 ms, and client read-modify-write
> operation readable/applied latency jumped to 1.25 s on one of the OSDs.
>
> I rescheduled the scrubbing and deep scrubbing and was watching ceph -w
> activity, so it is definitely not related to that.
>
> At the same time, the node shows 98% CPU idle, no significant changes
> in memory utilization, and no errors on the network, with bandwidth
> utilization between 20 and 50 Mbit/s on the client and back-end networks.
>
> Each OSD node has 12 OSDs (2 TB spinning rust), 2 partitioned SSD
> journal disks, 32 GB RAM, and dual 6-core / 12-thread CPUs.
>
> This is perhaps the most relevant part of the ceph config:
>
> debug lockdep = 0/0
> debug context = 0/0
> debug crush = 0/0
> debug buffer = 0/0
> debug timer = 0/0
> debug journaler = 0/0
> debug osd = 0/0
> debug optracker = 0/0
> debug objclass = 0/0
> debug filestore = 0/0
> debug journal = 0/0
> debug ms = 0/0
> debug monc = 0/0
> debug tp = 0/0
> debug auth = 0/0
> debug finisher = 0/0
> debug heartbeatmap = 0/0
> debug perfcounter = 0/0
> debug asok = 0/0
> debug throttle = 0/0
>
> [osd]
> journal_dio = true
> journal_aio = true
> osd_journal = /var/lib/ceph/osd/$cluster-$id-journal/journal
> osd_journal_size = 2048 ; journal size, in megabytes
> osd crush update on start = false
> osd mount options xfs = "rw,noatime,inode64,logbsize=256k,delaylog,allocsize=4M"
> osd_op_threads = 5
> osd_disk_threads = 4
> osd_pool_default_size = 2
> osd_pool_default_min_size = 1
> osd_pool_default_pg_num = 512
> osd_pool_default_pgp_num = 512
> osd_crush_chooseleaf_type = 1
> ; osd pool_default_crush_rule = 1
> ; new options 04.12.2015
> filestore_op_threads = 4
> osd_op_num_threads_per_shard = 1
> osd_op_num_shards = 25
> filestore_fd_cache_size = 64
> filestore_fd_cache_shards = 32
> filestore_fiemap = false
> ; Reduce impact of scrub (needs cfq on osds)
> osd_disk_thread_ioprio_class = "idle"
> osd_disk_thread_ioprio_priority = 7
> osd_deep_scrub_interval = 1211600
> osd_scrub_begin_hour = 19
> osd_scrub_end_hour = 4
> osd_scrub_sleep = 0.1
>
> [client]
> rbd_cache = true
> rbd_cache_size = 67108864
> rbd_cache_max_dirty = 50331648
> rbd_cache_target_dirty = 33554432
> rbd_cache_max_dirty_age = 2
> rbd_cache_writethrough_until_flush = true
>
> OSD logs and the system log at that time show nothing interesting.
>
> Any clue as to what to look for in order to diagnose the load / latency
> spikes would be really appreciated.
>
> Thank you
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
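For what it's worth, when chasing spikes like these it can help to diff two `ceph daemon osd.N perf dump` snapshots rather than read the cumulative averages, since the lifetime avgcount/sum pairs smooth short spikes away. A minimal sketch (assuming the filestore-era counter layout, where latency counters are `{"avgcount": N, "sum": seconds}` pairs; newer releases add an `avgtime` field, and the OSD id used here is just an example):

```python
import json
import subprocess
import time


def perf_snapshot(osd_id):
    """Grab a cumulative perf dump from the OSD admin socket."""
    out = subprocess.check_output(
        ["ceph", "daemon", "osd.%d" % osd_id, "perf", "dump"])
    return json.loads(out)


def interval_latency_ms(before, after, section, counter):
    """Average latency between two snapshots, in milliseconds.

    The counters are cumulative (avgcount, sum-in-seconds) pairs, so
    dividing the deltas isolates what happened during the interval.
    """
    b = before[section][counter]
    a = after[section][counter]
    ops = a["avgcount"] - b["avgcount"]
    if ops == 0:
        return 0.0
    return (a["sum"] - b["sum"]) / ops * 1000.0


if __name__ == "__main__":
    s1 = perf_snapshot(3)  # osd.3 is a placeholder id
    time.sleep(10)
    s2 = perf_snapshot(3)
    for counter in ("op_w_process_latency", "op_w_latency"):
        print(counter, interval_latency_ms(s1, s2, "osd", counter))
```

Running that in a loop on the affected node while the spike happens should tell you whether the stall is in the journal, the filestore apply path, or queueing in front of the op threads.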