Interesting. There is a more forceful way to disable the progress module, which I had to use because we are on an older version. Basically, you stop the mgrs and then move the progress module files out of the way:

systemctl stop ceph-mgr.target
mv /usr/share/ceph/mgr/progress {some backup location}
systemctl start ceph-mgr.target

It's not pretty because the cluster then reports HEALTH_ERR, but the module isn't required for the cluster to function, and it made commands like `ceph fs status` responsive again for us:

    health: HEALTH_ERR
            Module 'progress' has failed: Not found or unloadable
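Spelled out as a script, the whole dance is roughly the following. Treat it as a sketch only: /usr/share/ceph/mgr is simply where the packages put the mgr modules on our hosts, and the backup directory is an arbitrary choice, so check both before running it on each host that runs a mgr.

#!/bin/bash
# Sketch: forcibly disable the mgr progress module by hiding its files.
# Run on every host that runs a ceph-mgr; the paths below are assumptions.
set -euo pipefail

module_dir=/usr/share/ceph/mgr/progress    # default package location, verify on your distro
backup_dir=/root/mgr-progress.bak          # any directory outside the mgr module tree

systemctl stop ceph-mgr.target             # stop all mgr daemons on this host
mkdir -p "$backup_dir"
mv "$module_dir" "$backup_dir"/            # the mgr can no longer load the module
systemctl start ceph-mgr.target

# Afterwards the cluster reports HEALTH_ERR ("Module 'progress' has failed:
# Not found or unloadable"), but previously hung commands such as
# `ceph fs status` should respond again.

Moving the directory back and restarting the mgrs undoes it.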
On Wed, 25 Nov 2020 at 12:02, Paul Mezzanini <pfmeec@xxxxxxx> wrote:

> "ceph progress off" is just hanging like the others.
>
> I'll fiddle with it later tonight to see if I can get it to stick when I
> bounce a daemon.
>
> --
> Paul Mezzanini
> Sr Systems Administrator / Engineer, Research Computing
> Information & Technology Services
> Finance & Administration
> Rochester Institute of Technology
> o:(585) 475-3245 | pfmeec@xxxxxxx
>
> ________________________________________
> From: Rafael Lopez <rafael.lopez@xxxxxxxxxx>
> Sent: Tuesday, November 24, 2020 6:56 PM
> To: Paul Mezzanini
> Cc: ceph-users
> Subject: Re: Many ceph commands hang. broken mgr?
>
> Hi Paul,
>
> We had a similar experience with Red Hat Ceph, and it turned out to be the
> mgr progress module. I think there is some work in flight to fix this,
> though the one I thought would impact you already seems to be in 14.2.11:
> https://github.com/ceph/ceph/pull/36076
>
> If you have 14.2.15, you can try turning off the progress module altogether
> to see if it makes a difference. From the release notes
> (https://docs.ceph.com/en/latest/releases/nautilus/):
> "MGR: progress module can now be turned on/off, using the commands: ceph
> progress on and ceph progress off."
>
> Rafael
>
> On Wed, 25 Nov 2020 at 06:04, Paul Mezzanini <pfmeec@xxxxxxx> wrote:
>
> > Ever since we jumped from 14.2.9 to .12 (and beyond), a lot of the ceph
> > commands just hang. The mgr daemon also occasionally stops responding to
> > our Prometheus scrapes; a daemon restart and it wakes back up. I have
> > nothing pointing to these being related, but it feels that way.
> >
> > I also tried to get device health monitoring with SMART up and running
> > around that upgrade time. It never seemed to be able to pull in and
> > report on health across the drives, though I did see the osd process
> > firing off smartctl on occasion, so it was trying to do something.
> > Again, I have nothing pointing to this being related, but it feels like
> > it may be.
> >
> > Some commands that currently hang:
> > ceph osd pool autoscale-status
> > ceph balancer *
> > ceph iostat (oddly, this spit out a line of all-zero stats once and then hung)
> > ceph fs status
> > toggling ceph device monitoring on or off, and a lot of the device health
> > stuff too
> >
> > Mgr logs on disk show flavors of this:
> > 2020-11-24 13:05:07.883 7f19e2c40700 0 log_channel(audit) log [DBG] :
> > from='mon.0 -' entity='mon.' cmd=[{"prefix": "osd perf", "format": "json"}]: dispatch
> > 2020-11-24 13:05:07.895 7f19e2c40700 0 log_channel(audit) log [DBG] :
> > from='mon.0 -' entity='mon.' cmd=[{"prefix": "osd pool stats", "format": "json"}]: dispatch
> > 2020-11-24 13:05:08.567 7f19e1c3e700 0 log_channel(cluster) log [DBG] :
> > pgmap v587: 17149 pgs: 1 active+remapped+backfill_wait, 2
> > active+clean+scrubbing, 55 active+clean+scrubbing+deep, 9
> > active+remapped+backfilling, 17082 active+clean; 2.1 PiB data, 3.5 PiB
> > used, 2.9 PiB / 6.4 PiB avail; 108 MiB/s rd, 53 MiB/s wr, 1.20k op/s;
> > 7525420/9900121381 objects misplaced (0.076%); 99 MiB/s, 40 objects/s
> > recovering
> >
> > ceph status:
> >   cluster:
> >     id:     971a5242-f00d-421e-9bf4-5a716fcc843a
> >     health: HEALTH_WARN
> >             1 nearfull osd(s)
> >             1 pool(s) nearfull
> >
> >   services:
> >     mon: 3 daemons, quorum ceph-mon-01,ceph-mon-03,ceph-mon-02 (age 4h)
> >     mgr: ceph-mon-01(active, since 97s), standbys: ceph-mon-03, ceph-mon-02
> >     mds: cephfs:1 {0=ceph-mds-02=up:active} 3 up:standby
> >     osd: 843 osds: 843 up (since 13d), 843 in (since 2w); 10 remapped pgs
> >     rgw: 1 daemon active (ceph-rgw-01)
> >
> >   task status:
> >     scrub status:
> >       mds.ceph-mds-02: idle
> >
> >   data:
> >     pools:   16 pools, 17149 pgs
> >     objects: 1.61G objects, 2.1 PiB
> >     usage:   3.5 PiB used, 2.9 PiB / 6.4 PiB avail
> >     pgs:     6482000/9900825469 objects misplaced (0.065%)
> >              17080 active+clean
> >              54    active+clean+scrubbing+deep
> >              9     active+remapped+backfilling
> >              5     active+clean+scrubbing
> >              1     active+remapped+backfill_wait
> >
> >   io:
> >     client:   877 MiB/s rd, 1.8 GiB/s wr, 1.91k op/s rd, 3.33k op/s wr
> >     recovery: 136 MiB/s, 55 objects/s
> >
> > ceph config dump:
> > WHO             MASK  LEVEL     OPTION                                           VALUE                                              RO
> > global                advanced  cluster_network                                  192.168.42.0/24                                    *
> > global                advanced  mon_max_pg_per_osd                               400
> > global                advanced  mon_pg_warn_max_object_skew                      -1.000000
> > global                dev       mon_warn_on_pool_pg_num_not_power_of_two         false
> > global                advanced  osd_max_backfills                                2
> > global                advanced  osd_max_scrubs                                   4
> > global                advanced  osd_scrub_during_recovery                        false
> > global                advanced  public_network                                   1xx.xx.171.0/24 10.16.171.0/24                     *
> > mon                   advanced  mon_allow_pool_delete                            true
> > mgr                   advanced  mgr/balancer/mode                                none
> > mgr                   advanced  mgr/devicehealth/enable_monitoring               false
> > osd                   advanced  bluestore_compression_mode                       passive
> > osd                   advanced  osd_deep_scrub_large_omap_object_key_threshold   2000000
> > osd                   advanced  osd_op_queue_cut_off                             high                                               *
> > osd                   advanced  osd_scrub_load_threshold                         5.000000
> > mds                   advanced  mds_beacon_grace                                 300.000000
> > mds                   basic     mds_cache_memory_limit                           16384000000
> > mds                   advanced  mds_log_max_segments                             256
> > client                advanced  rbd_default_features                             5
> > client.libvirt        advanced  admin_socket                                     /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok  *
> > client.libvirt        basic     log_file                                         /var/log/ceph/qemu-guest-$pid.log                  *
> >
> > /etc/ceph/ceph.conf is the stub file with the fsid and the mons listed.
> >
> > Yes, I have a drive that just started to tickle the nearfull warning
> > limit. That's what pulled me back into "I should fix this" mode. I'm
> > manually adjusting the weight on that one for the time being, along with
> > slowly lowering pg_num on an oversized pool. The cluster still has this
> > issue when in HEALTH_OK.
> >
> > I'm free to do a lot of debugging and poking around even though this is
> > our production cluster.
> > The only service I refuse to play around with is the MDS. That one
> > bites back. Does anyone have more ideas on where to look to try and
> > figure out what's going on?
> >
> > --
> > Paul Mezzanini
> > Sr Systems Administrator / Engineer, Research Computing
> > Information & Technology Services
> > Finance & Administration
> > Rochester Institute of Technology
> > o:(585) 475-3245 | pfmeec@xxxxxxx
>
> --
> *Rafael Lopez*
> Devops Systems Engineer
> Monash University eResearch Centre
> E: rafael.lopez@xxxxxxxxxx

--
*Rafael Lopez*
Devops Systems Engineer
Monash University eResearch Centre
T: +61 3 9905 9118
E: rafael.lopez@xxxxxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
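An aside on the interim mitigation Paul describes above (manually down-weighting the nearfull OSD and slowly shrinking an oversized pool): a rough sketch of the commands involved follows. The OSD id 123, the weight 0.90, the pool name "mypool" and the pg_num target 2048 are placeholders, not values from this thread.

# Find the nearfull OSD and its current reweight value.
ceph osd df

# Temporarily push data off that OSD (placeholder id and weight).
ceph osd reweight 123 0.90

# Shrink an oversized pool; on Nautilus the cluster merges PGs down to the
# new pg_num target gradually (placeholder pool name and target).
ceph osd pool set mypool pg_num 2048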