slow pwrite64()s to ceph

"Kelly, Mark (RIS-BCT)" <Mark.Kelly@xxxxxxxxxxxxxxxxxx> · Tue, 9 Apr 2024 17:44:24 +0000

Hi,

Thank you for your time.

We are tracking some very slow pwrite64() calls to a Ceph filesystem:

20965 11:04:24.049186 <... pwrite64 resumed>) = 65536 <4.489594>
20966 11:04:24.069765 <... pwrite64 resumed>) = 65536 <4.508859>
20967 11:04:24.090354 <... pwrite64 resumed>) = 65536 <4.510256>

Other pwrite64() calls from the same program, made in other threads to other files on the same Ceph filesystem, seem fine. We cannot reproduce the problem on demand, but it happens occasionally.
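In case it helps anyone look for the same pattern, here is a minimal sketch of how slow calls like the ones in the strace excerpt above could be picked out of a larger log. It assumes strace was run with PID, timestamp, and elapsed-time output (as in the lines shown); the names `LINE_RE` and `slow_calls` and the threshold values are only illustrative.

```python
import re

# Matches strace lines such as:
#   20965 11:04:24.049186 <... pwrite64 resumed>) = 65536 <4.489594>
# Assumption: each line carries a PID, a wall-clock timestamp, and a
# trailing <seconds> field with the time spent in the syscall.
LINE_RE = re.compile(
    r"^(?P<pid>\d+)\s+(?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"(?:<\.\.\.\s+)?(?P<call>\w+)(?:\s+resumed>|\().*"
    r"=\s+(?P<ret>-?\d+).*<(?P<secs>\d+\.\d+)>\s*$"
)


def slow_calls(lines, syscall="pwrite64", threshold=1.0):
    """Return (pid, time, seconds) for `syscall` calls slower than threshold."""
    hits = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m or m.group("call") != syscall:
            continue
        secs = float(m.group("secs"))
        if secs > threshold:
            hits.append((int(m.group("pid")), m.group("time"), secs))
    return hits


sample = [
    "20965 11:04:24.049186 <... pwrite64 resumed>) = 65536 <4.489594>",
    "20966 11:04:24.069765 <... pwrite64 resumed>) = 65536 <4.508859>",
    "20967 11:04:24.090354 <... pwrite64 resumed>) = 65536 <4.510256>",
]

if __name__ == "__main__":
    for pid, when, secs in slow_calls(sample, threshold=4.5):
        print(pid, when, secs)
```

Lines for calls that are still unfinished carry no elapsed time and are simply skipped, so only completed or resumed calls are reported.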

It seems we are spending some time in ceph_aio_write() when this is happening (see the call graph below).
I've noticed that THP (Transparent Huge Pages) is enabled.
We are running version 15.2.17 on CentOS 7.9.
We do not seem to be under any significant memory pressure when this happens; there are just many threads of this app blocked on I/O in pwrite64() calls.
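For reference, a minimal sketch of how the THP state can be read on a host. It assumes the usual sysfs file /sys/kernel/mm/transparent_hugepage/enabled, whose contents mark the active mode with brackets (for example "[always] madvise never"); the function names are only illustrative.

```python
from pathlib import Path

# Assumption: the standard sysfs location for the THP setting.
THP_ENABLED = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def active_thp_mode(text):
    """Return the bracketed (active) mode from e.g. '[always] madvise never'."""
    for word in text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    return None


def read_thp_mode(path=THP_ENABLED):
    """Return the active THP mode, or None if the file cannot be read."""
    try:
        return active_thp_mode(path.read_text())
    except OSError:
        return None


if __name__ == "__main__":
    print("THP mode:", read_thp_mode())
```

Writing "never" to the same file as root turns THP off until the next reboot, which would be one way to test whether it has any effect here. This only reports or changes the setting; it does not show whether THP is involved in the slow writes.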

I am suggesting an upgrade, but until then, do you think this situation involves Ceph, and could it be improved if we disable THP?

Thanks for any advice or suggestion,
-mark

Call graph of the app while slow pwrite64()s are happening:

--87.08%--system_call_fastpath
           |--58.20%--sys_pwrite64
           |           --58.03%--vfs_write
           |                     do_sync_write
           |                     ceph_aio_write
           |                     |--54.33%--generic_file_buffered_write
           |                     |          |--27.09%--ceph_write_begin
           |                     |          |          |--17.01%--grab_cache_page_write_begin
           |                     |          |          |          |--7.51%--add_to_page_cache_lru
           |                     |          |          |          |          |--4.28%--__add_to_page_cache_locked
           |                     |          |          |          |          |           --3.95%--mem_cgroup_cache_charge
           |                     |          |          |          |          |                     mem_cgroup_charge_common
           |                     |          |          |          |          |                      --3.75%--__mem_cgroup_commit_charge
           |                     |          |          |          |           --3.23%--lru_cache_add
           |                     |          |          |          |                     __lru_cache_add
           |                     |          |          |          |                      --2.94%--pagevec_lru_move_fn
           |                     |          |          |          |                                |--0.70%--mem_cgroup_page_lruvec
           |                     |          |          |          |                                |--0.67%--__pagevec_lru_add_fn
           |                     |          |          |          |                                 --0.57%--release_pages
           |                     |          |          |          |--5.31%--__page_cache_alloc
           |                     |          |          |          |           --5.00%--alloc_pages_current
           |                     |          |          |          |                     __alloc_pages_nodemask
           |                     |          |          |          |                      --4.23%--get_page_from_freelist
           |                     |          |          |          |                                |--1.85%--__rmqueue
           |                     |          |          |          |                                |           --1.49%--list_del
           |                     |          |          |          |                                |                     __list_del_entry
           |                     |          |          |          |                                 --1.74%--list_del
           |                     |          |          |          |                                           __list_del_entry
           |                     |          |          |           --3.92%--__find_lock_page
           |                     |          |          |                     __find_get_page
           |                     |          |          |                      --3.46%--radix_tree_lookup_slot
           |                     |          |          |                                |--2.76%--radix_tree_descend
           |                     |          |          |                                 --0.70%--__radix_tree_lookup
           |                     |          |          |                                           radix_tree_descend
           |                     |          |           --9.45%--ceph_update_writeable_page
           |                     |          |                      --8.94%--readpage_nounlock
           |                     |          |                                 --8.60%--ceph_osdc_readpages
           |                     |          |                                            --8.18%--submit_request
           |                     |          |                                                      __submit_request
           |                     |          |                                                      calc_target.isra.50
           |                     |          |                                                      ceph_pg_to_up_acting_osds
           |                     |          |                                                      crush_do_rule
           |                     |          |                                                      crush_choose_firstn
           |                     |          |                                                      |--4.89%--crush_choose_firstn
           |                     |          |                                                      |          is_out.isra.2.part.3
           |                     |          |                                                       --3.30%--crush_bucket_choose
           |                     |          |--14.50%--copy_user_enhanced_fast_string
           |                     |          |--6.34%--ceph_write_end
           |                     |          |          |--3.76%--set_page_dirty
           |                     |          |          |          ceph_set_page_dirty
           |                     |          |          |           --2.86%--__set_page_dirty_nobuffers
           |                     |          |          |                     |--0.92%--_raw_spin_unlock_irqrestore
           |                     |          |          |                      --0.59%--radix_tree_tag_set
           |                     |          |           --1.88%--unlock_page
           |                     |          |                     __wake_up_bit
           |                     |          |--3.44%--iov_iter_fault_in_readable
           |                     |           --2.70%--mark_page_accessed
           |                     |--2.54%--mutex_lock
           |                     |          __mutex_lock_slowpath
           |                     |           --2.14%--schedule_preempt_disabled
           |                     |                     __schedule
           |                     |                     |
           |                     |                     |--0.82%--finish_task_switch
           |                     |                     |          __perf_event_task_sched_in
           |                     |                     |          perf_pmu_enable
           |                     |                     |          x86_pmu_enable
           |                     |                     |           --0.80%--intel_pmu_enable_all
           |                     |                     |                      --0.74%--__intel_pmu_enable_all.isra.23
           |                     |                     |                                 --0.69%--native_write_msr_safe
           |                     |                      --0.75%--__perf_event_task_sched_out
           |                      --0.56%--mutex_unlock
           |                                __mutex_unlock_slowpath
