On Thu, Sep 19, 2019 at 2:36 AM Yoann Moulin <yoann.moulin@xxxxxxx> wrote:
>
> Hello,
>
> I have a Ceph Nautilus (14.2.1) cluster used for CephFS only, with 40x 1.8 TB SAS disks (no SSDs) across 20 servers.
>
> >   cluster:
> >     id:     778234df-5784-4021-b983-0ee1814891be
> >     health: HEALTH_WARN
> >             2 MDSs report slow requests
> >
> >   services:
> >     mon: 3 daemons, quorum icadmin006,icadmin007,icadmin008 (age 5d)
> >     mgr: icadmin008(active, since 18h), standbys: icadmin007, icadmin006
> >     mds: cephfs:3 {0=icadmin006=up:active,1=icadmin007=up:active,2=icadmin008=up:active}
> >     osd: 40 osds: 40 up (since 2w), 40 in (since 3w)
> >
> >   data:
> >     pools:   3 pools, 672 pgs
> >     objects: 36.08M objects, 19 TiB
> >     usage:   51 TiB used, 15 TiB / 65 TiB avail
> >     pgs:     670 active+clean
> >              2   active+clean+scrubbing
>
> I often get "MDSs report slow requests" and plenty of "[WRN] 3 slow requests, 0 included below; oldest blocked for > 60281.199503 secs" messages.
>
> > HEALTH_WARN 2 MDSs report slow requests
> > MDS_SLOW_REQUEST 2 MDSs report slow requests
> >     mdsicadmin007(mds.1): 3 slow requests are blocked > 30 secs
> >     mdsicadmin006(mds.0): 10 slow requests are blocked > 30 secs
>
> After some investigation, I saw that ALL the ceph-osd processes eat a lot of memory, up to 130 GB RSS each. Is this value normal? Could it be related to the slow requests? Does having HDDs only increase the probability of slow requests?
>
> > USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
> > ceph 34196 3.6 35.0 156247524 138521572 ? Ssl Jul01 4173:18 /usr/bin/ceph-osd -f --cluster apollo --id 1 --setuser ceph --setgroup ceph
> > ceph 34394 3.6 35.0 160001436 138487776 ? Ssl Jul01 4178:37 /usr/bin/ceph-osd -f --cluster apollo --id 32 --setuser ceph --setgroup ceph
> > ceph 34709 3.5 35.1 156369636 138752044 ? Ssl Jul01 4088:57 /usr/bin/ceph-osd -f --cluster apollo --id 29 --setuser ceph --setgroup ceph
> > ceph 34915 3.4 35.1 158976936 138715900 ? Ssl Jul01 3950:45 /usr/bin/ceph-osd -f --cluster apollo --id 3 --setuser ceph --setgroup ceph
> > ceph 34156 3.4 35.1 158280768 138714484 ? Ssl Jul01 3984:11 /usr/bin/ceph-osd -f --cluster apollo --id 30 --setuser ceph --setgroup ceph
> > ceph 34378 3.7 35.1 155162420 138708096 ? Ssl Jul01 4312:12 /usr/bin/ceph-osd -f --cluster apollo --id 8 --setuser ceph --setgroup ceph
> > ceph 34161 3.5 35.0 159606788 138523652 ? Ssl Jul01 4128:17 /usr/bin/ceph-osd -f --cluster apollo --id 16 --setuser ceph --setgroup ceph
> > ceph 34380 3.6 35.1 161465372 138670168 ? Ssl Jul01 4238:20 /usr/bin/ceph-osd -f --cluster apollo --id 35 --setuser ceph --setgroup ceph
> > ceph 33822 3.7 35.1 163456644 138734036 ? Ssl Jul01 4342:05 /usr/bin/ceph-osd -f --cluster apollo --id 15 --setuser ceph --setgroup ceph
> > ceph 34003 3.8 35.0 161868584 138531208 ? Ssl Jul01 4427:32 /usr/bin/ceph-osd -f --cluster apollo --id 38 --setuser ceph --setgroup ceph
> > ceph 9753 2.8 24.2 96923856 95580776 ? Ssl Sep02 700:25 /usr/bin/ceph-osd -f --cluster apollo --id 31 --setuser ceph --setgroup ceph
> > ceph 10120 2.5 24.0 96130340 94856244 ? Ssl Sep02 644:50 /usr/bin/ceph-osd -f --cluster apollo --id 7 --setuser ceph --setgroup ceph
> > ceph 36204 3.6 35.0 159394476 138592124 ? Ssl Jul01 4185:36 /usr/bin/ceph-osd -f --cluster apollo --id 18 --setuser ceph --setgroup ceph
> > ceph 36427 3.7 34.4 155699060 136076432 ? Ssl Jul01 4298:26 /usr/bin/ceph-osd -f --cluster apollo --id 36 --setuser ceph --setgroup ceph
> > ceph 36622 4.1 35.1 158219408 138724688 ? Ssl Jul01 4779:14 /usr/bin/ceph-osd -f --cluster apollo --id 19 --setuser ceph --setgroup ceph
> > ceph 36881 4.0 35.1 157748752 138719064 ? Ssl Jul01 4669:54 /usr/bin/ceph-osd -f --cluster apollo --id 37 --setuser ceph --setgroup ceph
> > ceph 34649 3.7 35.1 159601580 138652012 ? Ssl Jul01 4337:20 /usr/bin/ceph-osd -f --cluster apollo --id 14 --setuser ceph --setgroup ceph
> > ceph 34881 3.8 35.1 158632412 138764376 ? Ssl Jul01 4433:50 /usr/bin/ceph-osd -f --cluster apollo --id 33 --setuser ceph --setgroup ceph
> > ceph 34646 4.2 35.1 155029328 138732376 ? Ssl Jul01 4831:24 /usr/bin/ceph-osd -f --cluster apollo --id 17 --setuser ceph --setgroup ceph
> > ceph 34881 4.1 35.1 156801676 138763588 ? Ssl Jul01 4710:19 /usr/bin/ceph-osd -f --cluster apollo --id 39 --setuser ceph --setgroup ceph
> > ceph 36766 3.7 35.1 158070740 138703240 ? Ssl Jul01 4341:42 /usr/bin/ceph-osd -f --cluster apollo --id 13 --setuser ceph --setgroup ceph
> > ceph 37013 3.5 35.0 157767668 138272248 ? Ssl Jul01 4094:12 /usr/bin/ceph-osd -f --cluster apollo --id 34 --setuser ceph --setgroup ceph
> > ceph 35007 3.4 35.1 160318780 138756404 ? Ssl Jul01 3963:21 /usr/bin/ceph-osd -f --cluster apollo --id 2 --setuser ceph --setgroup ceph
> > ceph 35217 3.5 35.1 159023744 138626680 ? Ssl Jul01 4041:50 /usr/bin/ceph-osd -f --cluster apollo --id 22 --setuser ceph --setgroup ceph
> > ceph 36962 3.2 35.1 158692228 138730292 ? Ssl Jul01 3772:35 /usr/bin/ceph-osd -f --cluster apollo --id 5 --setuser ceph --setgroup ceph
> > ceph 2991351 2.6 22.9 92011392 90761128 ? Ssl Sep02 666:32 /usr/bin/ceph-osd -f --cluster apollo --id 21 --setuser ceph --setgroup ceph
> > ceph 35503 3.2 35.0 158784940 138502100 ? Ssl Jul01 3766:33 /usr/bin/ceph-osd -f --cluster apollo --id 25 --setuser ceph --setgroup ceph
> > ceph 35683 3.6 35.1 160927812 138678080 ? Ssl Jul01 4233:17 /usr/bin/ceph-osd -f --cluster apollo --id 4 --setuser ceph --setgroup ceph
> > ceph 36969 3.7 35.1 158701188 138745028 ? Ssl Jul01 4348:06 /usr/bin/ceph-osd -f --cluster apollo --id 20 --setuser ceph --setgroup ceph
> > ceph 1902641 2.5 24.1 96688368 95438808 ? Ssl Sep02 633:45 /usr/bin/ceph-osd -f --cluster apollo --id 0 --setuser ceph --setgroup ceph
> > ceph 35576 3.7 35.1 156262424 138750552 ? Ssl Jul01 4338:09 /usr/bin/ceph-osd -f --cluster apollo --id 27 --setuser ceph --setgroup ceph
> > ceph 1901746 2.5 24.8 99300108 98051192 ? Ssl Sep02 641:52 /usr/bin/ceph-osd -f --cluster apollo --id 6 --setuser ceph --setgroup ceph
> > ceph 35735 3.7 35.1 156027400 138738076 ? Ssl Jul01 4350:00 /usr/bin/ceph-osd -f --cluster apollo --id 24 --setuser ceph --setgroup ceph
> > ceph 35929 3.7 35.0 160626040 138511872 ? Ssl Jul01 4361:54 /usr/bin/ceph-osd -f --cluster apollo --id 9 --setuser ceph --setgroup ceph
> > ceph 35699 3.1 35.1 158773084 138728576 ? Ssl Jul01 3631:13 /usr/bin/ceph-osd -f --cluster apollo --id 10 --setuser ceph --setgroup ceph
> > ceph 2941709 2.5 24.2 97125336 95906728 ? Ssl Sep02 638:11 /usr/bin/ceph-osd -f --cluster apollo --id 28 --setuser ceph --setgroup ceph
> > ceph 38429 3.2 35.1 156638164 138712612 ? Ssl Jul01 3687:45 /usr/bin/ceph-osd -f --cluster apollo --id 12 --setuser ceph --setgroup ceph
> > ceph 38651 3.3 35.1 159650296 138735924 ? Ssl Jul01 3835:51 /usr/bin/ceph-osd -f --cluster apollo --id 26 --setuser ceph --setgroup ceph
> > ceph 35890 2.9 35.1 156923512 138734428 ? Ssl Jul01 3361:21 /usr/bin/ceph-osd -f --cluster apollo --id 11 --setuser ceph --setgroup ceph
> > ceph 36129 3.3 35.1 158782748 138739248 ? Ssl Jul01 3845:41 /usr/bin/ceph-osd -f --cluster apollo --id 23 --setuser ceph --setgroup ceph
>
> Some logs:
>
> > 2019-09-19 08:52:33.960242 mds.icadmin006 [WRN] 10 slow requests, 0 included below; oldest blocked for > 62427.674399 secs
> > 2019-09-19 08:52:37.527465 mds.icadmin007 [WRN] 3 slow requests, 0 included below; oldest blocked for > 62431.241789 secs
> > 2019-09-19 08:52:42.527581 mds.icadmin007 [WRN] 3 slow requests, 0 included below; oldest blocked for > 62436.241899 secs
> > 2019-09-19 08:52:38.960358 mds.icadmin006 [WRN] 10 slow requests, 0 included below; oldest blocked for > 62432.674515 secs
> > 2019-09-19 08:52:43.960476 mds.icadmin006 [WRN] 10 slow requests, 0 included below; oldest blocked for > 62437.674620 secs
> > 2019-09-19 08:52:47.527663 mds.icadmin007 [WRN] 3 slow requests, 0 included below; oldest blocked for > 62441.241987 secs
> > 2019-09-19 08:52:52.527770 mds.icadmin007 [WRN] 3 slow requests, 2 included below; oldest blocked for > 62446.242061 secs
> > 2019-09-19 08:52:52.527777 mds.icadmin007 [WRN] slow request 61444.792236 seconds old, received at 2019-09-18 17:48:47.735459: internal op exportdir:mds.1:13 currently failed to wrlock, waiting
> > 2019-09-19 08:52:52.527783 mds.icadmin007 [WRN] slow request 61444.792163 seconds old, received at 2019-09-18 17:48:47.735533: internal op exportdir:mds.1:14 currently failed to wrlock, waiting
> > 2019-09-19 08:52:48.960590 mds.icadmin006 [WRN] 10 slow requests, 0 included below; oldest blocked for > 62442.674748 secs
> > 2019-09-19 08:52:53.960684 mds.icadmin006 [WRN] 10 slow requests, 2 included below; oldest blocked for > 62447.674825 secs
> > 2019-09-19 08:52:53.960692 mds.icadmin006 [WRN] slow request 61441.895507 seconds old, received at 2019-09-18 17:48:52.065114: rejoin:mds.1:13 currently dispatched
> > 2019-09-19 08:52:53.960697 mds.icadmin006 [WRN] slow request 61441.895489 seconds old, received at 2019-09-18 17:48:52.065131: rejoin:mds.1:14 currently dispatched
> > 2019-09-19 08:52:57.527852 mds.icadmin007 [WRN] 3 slow requests, 0 included below; oldest blocked for > 62451.242174 secs
> > 2019-09-19 08:53:02.527972 mds.icadmin007 [WRN] 3 slow requests, 0 included below; oldest blocked for > 62456.242289 secs
> > 2019-09-19 08:52:58.960777 mds.icadmin006 [WRN] 10 slow requests, 0 included below; oldest blocked for > 62452.674936 secs
> > 2019-09-19 08:53:03.960853 mds.icadmin006 [WRN] 10 slow requests, 0 included below; oldest blocked for > 62457.675011 secs
> > 2019-09-19 08:53:07.528033 mds.icadmin007 [WRN] 3 slow requests, 0 included below; oldest blocked for > 62461.242354 secs
> > 2019-09-19 08:53:12.528177 mds.icadmin007 [WRN] 3 slow requests, 0 included below; oldest blocked for > 62466.242487 secs
> > 2019-09-19 08:53:08.960965 mds.icadmin006 [WRN] 10 slow requests, 0 included below; oldest blocked for > 62462.675123 secs
> > 2019-09-19 08:53:13.961034 mds.icadmin006 [WRN] 10 slow requests, 0 included below; oldest blocked for > 62467.675195 secs
> > 2019-09-19 08:53:17.528276 mds.icadmin007 [WRN] 3 slow requests, 0 included below; oldest blocked for > 62471.242592 secs
> > 2019-09-19 08:53:22.528407 mds.icadmin007 [WRN] 3 slow requests, 0 included below; oldest blocked for > 62476.242729 secs
> > 2019-09-19 08:53:18.961149 mds.icadmin006 [WRN] 10 slow requests, 0 included below; oldest blocked for > 62472.675310 secs
> > 2019-09-19 08:53:23.961234 mds.icadmin006 [WRN] 10 slow requests, 0 included below; oldest blocked for > 62477.675392 secs
> > 2019-09-19 08:53:27.528509 mds.icadmin007 [WRN] 3 slow requests, 0 included below; oldest blocked for > 62481.242832 secs
> > 2019-09-19 08:53:32.528651 mds.icadmin007 [WRN] 3 slow requests, 0 included below; oldest blocked for > 62486.242961 secs
> > 2019-09-19 08:53:28.961314 mds.icadmin006 [WRN] 10 slow requests, 0 included below; oldest blocked for > 62482.675471 secs
> > 2019-09-19 08:53:33.961393 mds.icadmin006 [WRN] 10 slow requests, 0 included below; oldest blocked for > 62487.675549 secs
> > 2019-09-19 08:53:37.528706 mds.icadmin007 [WRN] 3 slow requests, 0 included below; oldest blocked for > 62491.243031 secs
> > 2019-09-19 08:53:42.528790 mds.icadmin007 [WRN] 3 slow requests, 0 included below; oldest blocked for > 62496.243105 secs
> > 2019-09-19 08:53:38.961476 mds.icadmin006 [WRN] 10 slow requests, 1 included below; oldest blocked for > 62492.675617 secs
> > 2019-09-19 08:53:38.961485 mds.icadmin006 [WRN] slow request 61441.151061 seconds old, received at 2019-09-18 17:49:37.810351: client_request(client.21441:176429 getattr pAsLsXsFs #0x10000f2b1b3 2019-09-18 17:49:37.806002 caller_uid=204878, caller_gid=11233{}) currently failed to rdlock, waiting
> > 2019-09-19 08:53:43.961569 mds.icadmin006 [WRN] 10 slow requests, 0 included below; oldest blocked for > 62497.675728 secs
> > 2019-09-19 08:53:47.528891 mds.icadmin007 [WRN] 3 slow requests, 0 included below; oldest blocked for > 62501.243214 secs
> > 2019-09-19 08:53:52.529021 mds.icadmin007 [WRN] 3 slow requests, 0 included below; oldest blocked for > 62506.243337 secs
> > 2019-09-19 08:53:48.961685 mds.icadmin006 [WRN] 10 slow requests, 0 included below; oldest blocked for > 62502.675839 secs
> > 2019-09-19 08:53:53.961792 mds.icadmin006 [WRN] 10 slow requests, 0 included below; oldest blocked for > 62507.675948 secs
> > 2019-09-19 08:53:57.529113 mds.icadmin007 [WRN] 3 slow requests, 0 included below; oldest blocked for > 62511.243437 secs
> > 2019-09-19 08:54:02.529224 mds.icadmin007 [WRN] 3 slow requests, 0 included below; oldest blocked for > 62516.243546 secs
> > 2019-09-19 08:53:58.961866 mds.icadmin006 [WRN] 10 slow requests, 0 included below; oldest blocked for > 62512.676025 secs
> > 2019-09-19 08:54:03.961939 mds.icadmin006 [WRN] 10 slow requests, 0 included below; oldest blocked for > 62517.676099 secs
>
> Thanks for your help.

If you haven't set

    osd op queue cut off = high

in /etc/ceph/ceph.conf on your OSDs, I'd give that a try. It should help quite a bit with pure-HDD clusters.
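Concretely, a minimal sketch of the change (one caveat: your ps output shows the OSDs running with --cluster apollo, so the conf file on your nodes is presumably /etc/ceph/apollo.conf rather than ceph.conf):

    [osd]
    osd op queue cut off = high

As far as I know this option is only read at OSD startup, so restart the OSDs afterwards, one node at a time, waiting for the cluster to settle in between:

    systemctl restart ceph-osd.target

Since you're on Nautilus, you should also be able to store it in the cluster's central config database instead of editing the file on every node (I believe a restart is still needed for it to take effect):

    ceph config set osd osd_op_queue_cut_off high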
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1