Re: Slow request on node reboot

Hello,

On Fri, 15 Jul 2016 00:28:37 +0200 Luis Ramirez wrote:

> Hi,
> 
> I have a cluster with 3 MON nodes and 5 OSD nodes. If I reboot one of
> the OSD nodes, I get slow requests waiting for active.
> 
> 2016-07-14 19:39:07.996942 osd.33 10.255.128.32:6824/7404 888 : cluster 
> [WRN] slow request 60.627789 seconds old, received at 2016-07-14 
> 19:38:07.369009: osd_op(client.593241.0:3283308 3.d8215fdb (undecoded) 
> ondisk+write+known_if_redirected e11409) currently waiting for active
> 2016-07-14 19:39:07.996950 osd.33 10.255.128.32:6824/7404 889 : cluster 
> [WRN] slow request 60.623972 seconds old, received at 2016-07-14 
> 19:38:07.372826: osd_op(client.593241.0:3283309 3.d8215fdb (undecoded) 
> ondisk+write+known_if_redirected e11411) currently waiting for active
> 2016-07-14 19:39:07.996958 osd.33 10.255.128.32:6824/7404 890 : cluster 
> [WRN] slow request 240.631544 seconds old, received at 2016-07-14 
> 19:35:07.365255: osd_op(client.593241.0:3283269 3.d8215fdb (undecoded) 
> ondisk+write+known_if_redirected e11384) currently waiting for active
> 2016-07-14 19:39:07.996965 osd.33 10.255.128.32:6824/7404 891 : cluster 
> [WRN] slow request 30.625102 seconds old, received at 2016-07-14 
> 19:38:37.371697: osd_op(client.593241.0:3283315 3.d8215fdb (undecoded) 
> ondisk+write+known_if_redirected e11433) currently waiting for active
> 2016-07-14 19:39:12.997985 osd.33 10.255.128.32:6824/7404 893 : cluster 
> [WRN] 83 slow requests, 4 included below; oldest blocked for > 
> 395.971587 secs
> 
> And the service does not recover until the node finishes restarting
> successfully. Could anyone shed some light on what I'm doing wrong?
> 

First of all, do all your pools have a size=3 and a min_size=2?
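
A quick way to check, with "rbd" standing in for whatever pools you
actually have:

---
# Show replication settings for all pools in one go; each pool line
# contains "replicated size N ... min_size M":
ceph osd dump | grep 'replicated size'

# Or query one pool directly:
ceph osd pool get rbd size
ceph osd pool get rbd min_size
---

If min_size equals size, PGs go inactive (and I/O blocks) as soon as a
single replica is down, which would match the "waiting for active" state
in your log.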

What kind of clients does your cluster have (RBD images, CephFS, RGW)?

How do you reboot that OSD node?

Normally, when you stop OSDs via their initscript or systemd unit, they
announce their shutdown and are removed gracefully, so PGs re-peer and
clients resume right away instead of waiting for lengthy timeouts.
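
For a planned reboot you can additionally tell the cluster not to start
recovery while the node is down. A minimal sketch, assuming a systemd
release (with sysvinit, "service ceph stop osd" as below does the same
job):

---
# Keep the MONs from marking the rebooting node's OSDs "out" and
# triggering recovery/backfill:
ceph osd set noout

# Stop the OSDs cleanly so they announce their shutdown to the MONs:
systemctl stop ceph-osd.target

reboot

# Once the node and its OSDs are back up:
ceph osd unset noout
---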

See this example from my test cluster. The output is from "rados bench",
and I stopped all OSDs on one node (via "service ceph stop osd") starting
at second 58.
Note that shutting down each of the 4 OSDs on that node takes about 1-2
seconds.
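
For reference, a run like the one below can be reproduced with something
along these lines (pool name, runtime, and op count are assumptions,
though -t 32 would roughly match the ~31 in-flight ops shown):

---
# 120-second write benchmark against the "rbd" pool with 32
# concurrent operations:
rados -p rbd bench 120 write -t 32
---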

Then there are about 10 seconds of things sorting themselves out, after
which things continue normally.
No slow request warnings.
---
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
    55      31      1022       991   72.0578       116    2.6782   1.74562
    56      31      1041      1010    72.128        76  0.955143   1.73901
    57      31      1066      1035   72.6166       100  0.972699   1.72883
    58      31      1084      1053   72.6058        72  0.549388   1.72471
    59      31      1100      1069   72.4597        64   0.75425   1.72927
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
    60      31      1118      1087   72.4519        72    2.2628   1.72937
    61      31      1131      1100   72.1164        52   2.92359    1.7259
    62      31      1141      1110   71.5983        40   1.68941    1.7285
    63      31      1149      1118   70.9697        32   1.30379   1.73533
    64      31      1153      1122   70.1108        16   3.05046   1.73568
    65      31      1156      1125   69.2167        12   2.82071   1.73744
    66      31      1158      1127   68.2892         8   3.01163   1.73965
    67      31      1158      1127     67.27         0         -   1.73965
    68      31      1159      1128   66.3396         2   5.11638   1.74264
    69      31      1161      1130   65.4941         8   8.64385   1.75326
    70      31      1161      1130   64.5585         0         -   1.75326
    71      31      1161      1130   63.6492         0         -   1.75326
    72      31      1161      1130   62.7652         0         -   1.75326
    73      31      1163      1132    62.015         2   13.7002   1.77289
    74      31      1163      1132   61.1769         0         -   1.77289
    75      31      1163      1132   60.3613         0         -   1.77289
    76      31      1163      1132   59.5671         0         -   1.77289
    77      31      1163      1132   58.7935         0         -   1.77289
    78      31      1163      1132   58.0397         0         -   1.77289
    79      31      1163      1132   57.3051         0         -   1.77289
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
    80      31      1163      1132   56.5888         0         -   1.77289
    81      31      1163      1132   55.8901         0         -   1.77289
    82      31      1163      1132   55.2086         0         -   1.77289
    83      31      1163      1132   54.5434         0         -   1.77289
    84      31      1163      1132   53.8941         0         -   1.77289
    85      31      1164      1133   53.3071  0.333333   22.5502   1.79123
    86      31      1170      1139   52.9663        24   21.7306   1.90575
    87      31      1174      1143   52.5414        16   26.7337   1.98175
    88      31      1184      1153   52.3988        40   1.92565   2.07644
    89      31      1189      1158   52.0347        20   1.12557   2.10756
    90      31      1201      1170   51.9898        48  0.767024    2.1907
    91      31      1214      1183   51.9898        52  0.652047   2.24676
    92      31      1227      1196   51.9898        52   28.9226   2.28787
    93      31      1240      1209   51.9898        52   32.7307   2.35555
    94      31      1261      1230   52.3302        84  0.482482   2.40575
    95      31      1283      1252   52.7054        88   1.31267   2.39677
    96      31      1300      1269   52.8647        68  0.796716   2.38455
---

Note that with another test via CephFS and a different "rados bench" run I
was able to create some slow requests, but they cleared up very quickly
and definitely did not require any of the OSDs to be brought back up.

Christian
-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Rakuten Communications
http://www.gol.com/