Hi Alex,

What kind of clients do you use? Is it KVM (QEMU) using the NBD driver, the kernel client, or something else?

Regards,
Kees

On 17-11-18 20:17, Alex Litvak wrote:
> Hello everyone,
>
> I am trying to troubleshoot a cluster exhibiting huge latency spikes.
> I cannot quite catch it because it happens during light activity
> and randomly affects one OSD node out of the 3 in the pool.
>
> This is a filestore setup.
> I see some OSDs exhibit an applied latency of 400 ms, and the 1-minute
> load average shoots to 60. Client commit latency with queue shoots to
> 300 ms, and journal latency (return write ack for client; journal on an
> Intel DC-S3710 SSD) shoots to 40 ms.
>
> op_w_process_latency showed 250 ms, and client read-modify-write
> operation readable/applied latency jumped to 1.25 s on one of the OSDs.
>
> I rescheduled the scrubbing and deep scrubbing and was watching ceph -w
> activity, so it is definitely not related to that.
>
> At the same time, the node shows 98% CPU idle, no significant changes
> in memory utilization, and no errors on the network, with bandwidth
> utilization between 20 and 50 Mbit/s on the client and back-end networks.
>
> Each OSD node has 12 OSDs (2 TB spinning rust), 2 partitioned SSD
> journal disks, 32 GB RAM, and dual 6-core / 12-thread CPUs.
>
> This is perhaps the most relevant part of the ceph config:
>
> debug lockdep = 0/0
> debug context = 0/0
> debug crush = 0/0
> debug buffer = 0/0
> debug timer = 0/0
> debug journaler = 0/0
> debug osd = 0/0
> debug optracker = 0/0
> debug objclass = 0/0
> debug filestore = 0/0
> debug journal = 0/0
> debug ms = 0/0
> debug monc = 0/0
> debug tp = 0/0
> debug auth = 0/0
> debug finisher = 0/0
> debug heartbeatmap = 0/0
> debug perfcounter = 0/0
> debug asok = 0/0
> debug throttle = 0/0
>
> [osd]
> journal_dio = true
> journal_aio = true
> osd_journal = /var/lib/ceph/osd/$cluster-$id-journal/journal
> osd_journal_size = 2048 ; journal size, in megabytes
> osd crush update on start = false
> osd mount options xfs = "rw,noatime,inode64,logbsize=256k,delaylog,allocsize=4M"
> osd_op_threads = 5
> osd_disk_threads = 4
> osd_pool_default_size = 2
> osd_pool_default_min_size = 1
> osd_pool_default_pg_num = 512
> osd_pool_default_pgp_num = 512
> osd_crush_chooseleaf_type = 1
> ; osd pool_default_crush_rule = 1
> ; new options 04.12.2015
> filestore_op_threads = 4
> osd_op_num_threads_per_shard = 1
> osd_op_num_shards = 25
> filestore_fd_cache_size = 64
> filestore_fd_cache_shards = 32
> filestore_fiemap = false
> ; Reduce impact of scrub (needs cfq on osds)
> osd_disk_thread_ioprio_class = "idle"
> osd_disk_thread_ioprio_priority = 7
> osd_deep_scrub_interval = 1211600
> osd_scrub_begin_hour = 19
> osd_scrub_end_hour = 4
> osd_scrub_sleep = 0.1
>
> [client]
> rbd_cache = true
> rbd_cache_size = 67108864
> rbd_cache_max_dirty = 50331648
> rbd_cache_target_dirty = 33554432
> rbd_cache_max_dirty_age = 2
> rbd_cache_writethrough_until_flush = true
>
> OSD logs and the system log at that time show nothing interesting.
>
> Any clue as to what to look for in order to diagnose the load / latency
> spikes would be really appreciated.
>
> Thank you
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
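For what it's worth, when chasing spikes like these it can help to diff two `ceph daemon osd.N perf dump` snapshots rather than read the cumulative averages, since the lifetime avgcount/sum pairs smooth short spikes away. A minimal sketch (assuming the filestore-era counter layout, where latency counters are `{"avgcount": N, "sum": seconds}` pairs; newer releases add an `avgtime` field, and the OSD id used here is just an example):

```python
import json
import subprocess
import time


def perf_snapshot(osd_id):
    """Grab a cumulative perf dump from the OSD admin socket."""
    out = subprocess.check_output(
        ["ceph", "daemon", "osd.%d" % osd_id, "perf", "dump"])
    return json.loads(out)


def interval_latency_ms(before, after, section, counter):
    """Average latency between two snapshots, in milliseconds.

    The counters are cumulative (avgcount, sum-in-seconds) pairs, so
    dividing the deltas isolates what happened during the interval.
    """
    b = before[section][counter]
    a = after[section][counter]
    ops = a["avgcount"] - b["avgcount"]
    if ops == 0:
        return 0.0
    return (a["sum"] - b["sum"]) / ops * 1000.0


if __name__ == "__main__":
    s1 = perf_snapshot(3)  # osd.3 is a placeholder id
    time.sleep(10)
    s2 = perf_snapshot(3)
    for counter in ("op_w_process_latency", "op_w_latency"):
        print(counter, interval_latency_ms(s1, s2, "osd", counter))
```

Running that in a loop on the affected node while the spike happens should tell you whether the stall is in the journal, the filestore apply path, or queueing in front of the op threads.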