Hi Paul,

We had a similar experience with Red Hat Ceph, and it turned out to be the
mgr progress module. I think there has been some work to fix this, though
the change I thought would affect you seems to already be in 14.2.11:
https://github.com/ceph/ceph/pull/36076

If you are on 14.2.15, you can try turning the progress module off
altogether to see if it makes a difference. From the Nautilus release
notes (https://docs.ceph.com/en/latest/releases/nautilus/):

    MGR: progress module can now be turned on/off, using the commands:
    ceph progress on and ceph progress off
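Concretely, something like this is what I'd try (the progress on/off
commands are straight from the release notes above; "ceph mgr fail" is
just the usual way to bounce the active mgr, and ceph-mon-01 is the
active mgr from your status output below):

    # turn progress reporting off, then watch whether the hanging
    # commands (autoscale-status, balancer, fs status) come back
    ceph progress off

    # if nothing changes, fail over to a standby mgr for a clean slate
    ceph mgr fail ceph-mon-01

    # re-enable progress later if it turns out not to be the culprit
    ceph progress on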
Rafael

On Wed, 25 Nov 2020 at 06:04, Paul Mezzanini <pfmeec@xxxxxxx> wrote:

> Ever since we jumped from 14.2.9 to .12 (and beyond) a lot of the ceph
> commands just hang. The mgr daemon also just stops responding to our
> Prometheus scrapes occasionally. A daemon restart and it wakes back up.
> I have nothing pointing to these being related but it feels that way.
>
> I also tried to get device health monitoring with SMART up and running
> around that upgrade time. It never seemed to be able to pull in and
> report on the health across the drives. I did see the osd process firing
> off smartctl on occasion though, so it was trying to do something.
> Again, I have nothing pointing to this being related but it feels like
> it may be.
>
> Some commands that currently hang:
> ceph osd pool autoscale-status
> ceph balancer *
> ceph iostat (oddly, this spit out a line of all 0 stats once and then hung)
> ceph fs status
> toggling ceph device monitoring on or off, and a lot of the device
> health stuff too
>
> Mgr logs on disk show flavors of this:
> 2020-11-24 13:05:07.883 7f19e2c40700 0 log_channel(audit) log [DBG] : from='mon.0 -' entity='mon.' cmd=[{,",p,r,e,f,i,x,",:, ,",o,s,d, ,p,e,r,f,",,, ,",f,o,r,m,a,t,",:, ,",j,s,o,n,",}]: dispatch
> 2020-11-24 13:05:07.895 7f19e2c40700 0 log_channel(audit) log [DBG] : from='mon.0 -' entity='mon.' cmd=[{,",p,r,e,f,i,x,",:, ,",o,s,d, ,p,o,o,l, ,s,t,a,t,s,",,, ,",f,o,r,m,a,t,",:, ,",j,s,o,n,",}]: dispatch
> 2020-11-24 13:05:08.567 7f19e1c3e700 0 log_channel(cluster) log [DBG] : pgmap v587: 17149 pgs: 1 active+remapped+backfill_wait, 2 active+clean+scrubbing, 55 active+clean+scrubbing+deep, 9 active+remapped+backfilling, 17082 active+clean; 2.1 PiB data, 3.5 PiB used, 2.9 PiB / 6.4 PiB avail; 108 MiB/s rd, 53 MiB/s wr, 1.20k op/s; 7525420/9900121381 objects misplaced (0.076%); 99 MiB/s, 40 objects/s recovering
>
> ceph status:
>   cluster:
>     id:     971a5242-f00d-421e-9bf4-5a716fcc843a
>     health: HEALTH_WARN
>             1 nearfull osd(s)
>             1 pool(s) nearfull
>
>   services:
>     mon: 3 daemons, quorum ceph-mon-01,ceph-mon-03,ceph-mon-02 (age 4h)
>     mgr: ceph-mon-01(active, since 97s), standbys: ceph-mon-03, ceph-mon-02
>     mds: cephfs:1 {0=ceph-mds-02=up:active} 3 up:standby
>     osd: 843 osds: 843 up (since 13d), 843 in (since 2w); 10 remapped pgs
>     rgw: 1 daemon active (ceph-rgw-01)
>
>   task status:
>     scrub status:
>       mds.ceph-mds-02: idle
>
>   data:
>     pools:   16 pools, 17149 pgs
>     objects: 1.61G objects, 2.1 PiB
>     usage:   3.5 PiB used, 2.9 PiB / 6.4 PiB avail
>     pgs:     6482000/9900825469 objects misplaced (0.065%)
>              17080 active+clean
>              54    active+clean+scrubbing+deep
>              9     active+remapped+backfilling
>              5     active+clean+scrubbing
>              1     active+remapped+backfill_wait
>
>   io:
>     client:   877 MiB/s rd, 1.8 GiB/s wr, 1.91k op/s rd, 3.33k op/s wr
>     recovery: 136 MiB/s, 55 objects/s
>
> ceph config dump:
> WHO            MASK  LEVEL     OPTION                                           VALUE                                              RO
> global               advanced  cluster_network                                  192.168.42.0/24                                    *
> global               advanced  mon_max_pg_per_osd                               400
> global               advanced  mon_pg_warn_max_object_skew                      -1.000000
> global               dev       mon_warn_on_pool_pg_num_not_power_of_two         false
> global               advanced  osd_max_backfills                                2
> global               advanced  osd_max_scrubs                                   4
> global               advanced  osd_scrub_during_recovery                        false
> global               advanced  public_network                                   1xx.xx.171.0/24 10.16.171.0/24                     *
> mon                  advanced  mon_allow_pool_delete                            true
> mgr                  advanced  mgr/balancer/mode                                none
> mgr                  advanced  mgr/devicehealth/enable_monitoring               false
> osd                  advanced  bluestore_compression_mode                       passive
> osd                  advanced  osd_deep_scrub_large_omap_object_key_threshold   2000000
> osd                  advanced  osd_op_queue_cut_off                             high                                               *
> osd                  advanced  osd_scrub_load_threshold                         5.000000
> mds                  advanced  mds_beacon_grace                                 300.000000
> mds                  basic     mds_cache_memory_limit                           16384000000
> mds                  advanced  mds_log_max_segments                             256
> client               advanced  rbd_default_features                             5
> client.libvirt       advanced  admin_socket                                     /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok  *
> client.libvirt       basic     log_file                                         /var/log/ceph/qemu-guest-$pid.log                  *
>
> /etc/ceph/ceph.conf is the stub file with the fsid and the mons listed.
>
> Yes, I have a drive that just started to tickle the full warn limit.
> That's what pulled me back into the "I should fix this" mode. I'm
> manually adjusting the weight on that one for the time being along with
> slowly lowering pg_num on an oversized pool. The cluster still has this
> issue when in HEALTH_OK.
>
> I'm free to do a lot of debugging and poking around even though this is
> our production cluster. The only service I refuse to play around with is
> the MDS. That one bites back. Does anyone have more ideas on where to
> look to try and figure out what's going on?
>
> --
> Paul Mezzanini
> Sr Systems Administrator / Engineer, Research Computing
> Information & Technology Services
> Finance & Administration
> Rochester Institute of Technology
> o:(585) 475-3245 | pfmeec@xxxxxxx
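PS: on the device health / SMART side, your config dump has
mgr/devicehealth/enable_monitoring set to false, which (if I am reading it
right) means the devicehealth module is currently disabled anyway. If you
want to rule that module in or out while you test, the toggles you
mentioned should be something like this (assuming the standard Nautilus
device commands; "ceph device ls" just lists what the mgr can see):

    # see which drives the mgr is tracking
    ceph device ls

    # keep SMART scraping off while chasing the hangs...
    ceph device monitoring off

    # ...and turn it back on once the mgr is responsive again
    ceph device monitoring on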
--
Rafael Lopez
Devops Systems Engineer
Monash University eResearch Centre
E: rafael.lopez@xxxxxxxxxx

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx