I am using libvirt for block devices (OpenStack and Proxmox KVM VMs).
I am also mounting CephFS inside the VMs and on bare-metal hosts; in
that case it is a kernel-based client.
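For reference, the kernel client is what a plain ceph mount gives you,
roughly like this (monitor address, mount point and secret file are
placeholders):

  mount -t ceph mon1:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret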
From what I can see in the pool stats, the CephFS pools show higher
utilization than the block pools during the spikes, although it is
still small.
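These numbers are just the usual per-pool views, e.g.:

  ceph df detail        # per-pool usage and object counts
  ceph osd pool stats   # per-pool client IO rates during the spikes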
On 11/17/2018 1:40 PM, Kees Meijs wrote:
Hi Alex,
What kind of clients do you use? Is it KVM (QEMU) using NBD driver,
kernel, or...?
Regards,
Kees
On 17-11-18 20:17, Alex Litvak wrote:
Hello everyone,
I am trying to troubleshoot a cluster exhibiting huge latency spikes.
I cannot quite catch it because it happens during light activity
and randomly affects one OSD node out of the 3 in the pool.
This is a FileStore cluster.
I see some OSDs exhibit an applied latency of 400 ms, and the 1-minute
load average shoots up to 60. Client commit latency with queue shoots
to 300 ms, and journal latency (return write ack for client; journal on
Intel DC S3710 SSDs) shoots to 40 ms.
op_w_process_latency showed 250 ms, and client read-modify-write
operation readable/applied latency jumped to 1.25 s on one of the OSDs.
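These figures come from the standard per-OSD counters; something like
the following reproduces them (the OSD id is a placeholder):

  ceph osd perf                  # commit/apply latency per OSD
  ceph daemon osd.12 perf dump   # op_w_process_latency, journal latency, etc.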
I rescheduled the scrubbing and deep scrubbing and was watching ceph
-w activity, so it is definitely not scrub related.
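To rule scrubbing out completely, the flags can also be set
temporarily, e.g.:

  ceph osd set noscrub
  ceph osd set nodeep-scrub
  # watch whether the spikes still occur, then
  ceph osd unset noscrub
  ceph osd unset nodeep-scrub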
At the same time the node shows 98% CPU idle, no significant change in
memory utilization, and no network errors, with bandwidth utilization
between 20 and 50 Mbit/s on both the client and back-end networks.
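For what it's worth, the disks themselves can be watched during a
spike with plain iostat, e.g.:

  iostat -xm 1   # look for await / %util spikes on the 2 TB drives and the journal SSDs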
Each OSD node has 12 OSDs (2 TB spinning disks), 2 partitioned SSD
journal disks, 32 GB RAM, and dual 6-core / 12-thread CPUs.
This is perhaps the most relevant part of the ceph config:
debug lockdep = 0/0
debug context = 0/0
debug crush = 0/0
debug buffer = 0/0
debug timer = 0/0
debug journaler = 0/0
debug osd = 0/0
debug optracker = 0/0
debug objclass = 0/0
debug filestore = 0/0
debug journal = 0/0
debug ms = 0/0
debug monc = 0/0
debug tp = 0/0
debug auth = 0/0
debug finisher = 0/0
debug heartbeatmap = 0/0
debug perfcounter = 0/0
debug asok = 0/0
debug throttle = 0/0
[osd]
journal_dio = true
journal_aio = true
osd_journal = /var/lib/ceph/osd/$cluster-$id-journal/journal
osd_journal_size = 2048 ; journal size, in megabytes
osd crush update on start = false
osd mount options xfs = "rw,noatime,inode64,logbsize=256k,delaylog,allocsize=4M"
osd_op_threads = 5
osd_disk_threads = 4
osd_pool_default_size = 2
osd_pool_default_min_size = 1
osd_pool_default_pg_num = 512
osd_pool_default_pgp_num = 512
osd_crush_chooseleaf_type = 1
; osd pool_default_crush_rule = 1
; new options 04.12.2015
filestore_op_threads = 4
osd_op_num_threads_per_shard = 1
osd_op_num_shards = 25
filestore_fd_cache_size = 64
filestore_fd_cache_shards = 32
filestore_fiemap = false
; Reduce impact of scrub (needs cfq on osds)
osd_disk_thread_ioprio_class = "idle"
osd_disk_thread_ioprio_priority = 7
osd_deep_scrub_interval = 1211600
osd_scrub_begin_hour = 19
osd_scrub_end_hour = 4
osd_scrub_sleep = 0.1
[client]
rbd_cache = true
rbd_cache_size = 67108864
rbd_cache_max_dirty = 50331648
rbd_cache_target_dirty = 33554432
rbd_cache_max_dirty_age = 2
rbd_cache_writethrough_until_flush = true
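The values actually in effect on a running OSD can be double-checked
through the admin socket (OSD id is a placeholder), e.g.:

  ceph daemon osd.0 config show | egrep 'osd_op_num_shards|filestore_op_threads|osd_disk_thread_ioprio'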
OSD logs and system log at that time show nothing interesting.
Any clue of what to look for in order to diagnose the load / latency
spikes would be really appreciated.
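For completeness, the slowest recent ops on an affected OSD can also
be dumped during a spike (OSD id is a placeholder); the output shows
per-event timestamps for each slow request:

  ceph daemon osd.5 dump_historic_ops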
Thank you
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com