Are you running FileStore? (The config options you are using look like a
FileStore config.) Try out BlueStore; we've found that it greatly reduces
random latency spikes caused by filesystem weirdness.

Paul

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Sat, 17 Nov 2018 at 21:07, Alex Litvak <alexander.v.litvak@xxxxxxxxx> wrote:
>
> I am using libvirt for block devices (OpenStack and Proxmox KVM VMs).
> I am also mounting CephFS inside the VMs and on bare-metal hosts; in
> that case it is a kernel-based client.
>
> From what I can see based on pool stats, the CephFS pools show higher
> utilization than the block pools during the spikes, however it is
> still small.
>
> On 11/17/2018 1:40 PM, Kees Meijs wrote:
> > Hi Alex,
> >
> > What kind of clients do you use? Is it KVM (QEMU) using the NBD
> > driver, the kernel client, or...?
> >
> > Regards,
> > Kees
> >
> > On 17-11-18 20:17, Alex Litvak wrote:
> >> Hello everyone,
> >>
> >> I am trying to troubleshoot a cluster exhibiting huge latency spikes.
> >> I cannot quite catch it because it happens during light activity
> >> and randomly affects one OSD node out of the 3 in the pool.
> >>
> >> This is a FileStore setup.
> >> I see some OSDs exhibit an applied latency of 400 ms, and the
> >> 1-minute load average shoots to 60. Client commit latency with queue
> >> shoots to 300 ms, and journal latency (return write ack for client;
> >> journal on Intel DC S3710 SSDs) shoots to 40 ms.
> >>
> >> op_w_process_latency showed 250 ms, and client read-modify-write
> >> operation readable/applied latency jumped to 1.25 s on one of the
> >> OSDs.
> >>
> >> I rescheduled the scrubbing and deep scrubbing and was watching
> >> ceph -w activity, so it is definitely not related.
> >>
> >> At the same time the node shows 98% CPU idle, no significant changes
> >> in memory utilization, and no errors on the network, with bandwidth
> >> utilization between 20 and 50 Mbit on the client and back-end
> >> networks.
> >>
> >> The OSD node has 12 OSDs (2 TB spinning rust), 2 partitioned SSD
> >> journal disks, 32 GB RAM, and dual 6-core / 12-thread CPUs.
> >>
> >> This is perhaps the most relevant part of the ceph config:
> >>
> >> debug lockdep = 0/0
> >> debug context = 0/0
> >> debug crush = 0/0
> >> debug buffer = 0/0
> >> debug timer = 0/0
> >> debug journaler = 0/0
> >> debug osd = 0/0
> >> debug optracker = 0/0
> >> debug objclass = 0/0
> >> debug filestore = 0/0
> >> debug journal = 0/0
> >> debug ms = 0/0
> >> debug monc = 0/0
> >> debug tp = 0/0
> >> debug auth = 0/0
> >> debug finisher = 0/0
> >> debug heartbeatmap = 0/0
> >> debug perfcounter = 0/0
> >> debug asok = 0/0
> >> debug throttle = 0/0
> >>
> >> [osd]
> >> journal_dio = true
> >> journal_aio = true
> >> osd_journal = /var/lib/ceph/osd/$cluster-$id-journal/journal
> >> osd_journal_size = 2048 ; journal size, in megabytes
> >> osd crush update on start = false
> >> osd mount options xfs = "rw,noatime,inode64,logbsize=256k,delaylog,allocsize=4M"
> >> osd_op_threads = 5
> >> osd_disk_threads = 4
> >> osd_pool_default_size = 2
> >> osd_pool_default_min_size = 1
> >> osd_pool_default_pg_num = 512
> >> osd_pool_default_pgp_num = 512
> >> osd_crush_chooseleaf_type = 1
> >> ; osd pool_default_crush_rule = 1
> >> ; new options 04.12.2015
> >> filestore_op_threads = 4
> >> osd_op_num_threads_per_shard = 1
> >> osd_op_num_shards = 25
> >> filestore_fd_cache_size = 64
> >> filestore_fd_cache_shards = 32
> >> filestore_fiemap = false
> >> ; Reduce impact of scrub (needs cfq on osds)
> >> osd_disk_thread_ioprio_class = "idle"
> >> osd_disk_thread_ioprio_priority = 7
> >> osd_deep_scrub_interval = 1211600
> >> osd_scrub_begin_hour = 19
> >> osd_scrub_end_hour = 4
> >> osd_scrub_sleep = 0.1
> >>
> >> [client]
> >> rbd_cache = true
> >> rbd_cache_size = 67108864
> >> rbd_cache_max_dirty = 50331648
> >> rbd_cache_target_dirty = 33554432
> >> rbd_cache_max_dirty_age = 2
> >> rbd_cache_writethrough_until_flush = true
> >>
> >> The OSD logs and the system log at that time show nothing
> >> interesting.
> >>
> >> Any clue about what to look for in order to diagnose the load /
> >> latency spikes would be really appreciated.
> >>
> >> Thank you
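
On Paul's FileStore vs. BlueStore question: the cluster itself can answer
it. A minimal sketch, assuming a release recent enough that
"ceph osd metadata" reports the object store backend and that the jq
utility is available on the admin node:

    # Print the object store backend (filestore or bluestore) for each OSD.
    for id in $(ceph osd ls); do
        printf 'osd.%s: ' "$id"
        ceph osd metadata "$id" | jq -r '.osd_objectstore'
    done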
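
To catch which OSD is spiking and when, it can also help to sample latency
continuously instead of chasing it after the fact. A rough sketch, assuming
admin-socket access on the OSD nodes and jq again available; osd.12 is only
a placeholder id, and the exact counter names vary somewhat between
releases:

    # Cluster-wide commit/apply latency per OSD, in milliseconds:
    ceph osd perf

    # On the suspect node, pull the op and journal latency counters for
    # one OSD from its admin socket:
    ceph daemon osd.12 perf dump | \
        jq '{op_w_process_latency: .osd.op_w_process_latency,
             journal_latency: .filestore.journal_latency}'

Logging these every few seconds during a spike (e.g. under watch or a
simple loop) makes it easier to tie the latency to a specific spindle or
journal SSD.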
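
Finally, the osd_disk_thread_ioprio_class = "idle" setting only does
anything when the data disks actually use the CFQ I/O scheduler, as the
comment in the config already notes. A quick check on an OSD node (sdb is
just an example device name):

    # The scheduler shown in brackets is the active one; it must be cfq
    # for the ioprio-based scrub throttling to take effect.
    cat /sys/block/sdb/queue/scheduler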