pausing scrub crashed scrub daemon on nodes

Hi,

I am running GlusterFS 3.10.1 on a single cluster of 30 nodes with 36 bricks each plus 10 nodes with 16 bricks each.

By default I keep the scrub process paused so that I can run it manually. I started an on-demand scrub for the first time and it ran fine, but after a while I decided to pause it because of high CPU usage and users reporting that folder listings were slow. The pause produced the messages below on some of the nodes.
Also, the scrub daemon no longer appears in the volume status output for some nodes.
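For reference, the sequence of gluster bitrot commands I used was roughly the following (volume name `glustervol`, and the biweekly/lazy settings, are taken from the logs below):

```shell
# Scrub tunables as configured before the incident (per the log lines below)
gluster volume bitrot glustervol scrub-frequency biweekly
gluster volume bitrot glustervol scrub-throttle lazy

# Start a manual scrub run
gluster volume bitrot glustervol scrub ondemand

# Pause it once CPU usage became a problem -- this is the step
# after which the crashes below appeared
gluster volume bitrot glustervol scrub pause
```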

Error message type 1
--

[2017-09-01 10:04:45.840248] I [bit-rot.c:1683:notify] 0-glustervol-bit-rot-0: BitRot scrub ondemand called
[2017-09-01 10:05:05.094948] I [glusterfsd-mgmt.c:52:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2017-09-01 10:05:06.401792] I [glusterfsd-mgmt.c:52:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2017-09-01 10:05:07.544524] I [MSGID: 118035] [bit-rot-scrub.c:1297:br_scrubber_scale_up] 0-glustervol-bit-rot-0: Scaling up scrubbers [0 => 36]
[2017-09-01 10:05:07.552893] I [MSGID: 118048] [bit-rot-scrub.c:1547:br_scrubber_log_option] 0-glustervol-bit-rot-0: SCRUB TUNABLES:: [Frequency: biweekly, Throttle: lazy]
[2017-09-01 10:05:07.552942] I [MSGID: 118038] [bit-rot-scrub.c:948:br_fsscan_schedule] 0-glustervol-bit-rot-0: Scrubbing is scheduled to run at 2017-09-15 10:05:07
[2017-09-01 10:05:07.553457] I [glusterfsd-mgmt.c:1778:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing
[2017-09-01 10:05:20.953815] I [bit-rot.c:1683:notify] 0-glustervol-bit-rot-0: BitRot scrub ondemand called
[2017-09-01 10:05:20.953845] I [MSGID: 118038] [bit-rot-scrub.c:1085:br_fsscan_ondemand] 0-glustervol-bit-rot-0: Ondemand Scrubbing scheduled to run at 2017-09-01 10:05:21
[2017-09-01 10:05:22.216937] I [MSGID: 118044] [bit-rot-scrub.c:615:br_scrubber_log_time] 0-glustervol-bit-rot-0: Scrubbing started at 2017-09-01 10:05:22
[2017-09-01 10:05:22.306307] I [glusterfsd-mgmt.c:52:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2017-09-01 10:05:24.684900] I [glusterfsd-mgmt.c:1778:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing
[2017-09-06 08:37:26.422267] I [glusterfsd-mgmt.c:52:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2017-09-06 08:37:28.351821] I [glusterfsd-mgmt.c:52:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2017-09-06 08:37:30.350786] I [MSGID: 118034] [bit-rot-scrub.c:1342:br_scrubber_scale_down] 0-glustervol-bit-rot-0: Scaling down scrubbers [36 => 0]
pending frames:
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash:
2017-09-06 08:37:30
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.10.1
/usr/lib/libglusterfs.so.0(_gf_msg_backtrace_nomem+0x78)[0x7fda0ab0b4f8]
/usr/lib/libglusterfs.so.0(gf_print_trace+0x324)[0x7fda0ab14914]
/lib/x86_64-linux-gnu/libc.so.6(+0x36d40)[0x7fda09ef9d40]
/usr/lib/libglusterfs.so.0(syncop_readv_cbk+0x17)[0x7fda0ab429e7]
/usr/lib/glusterfs/3.10.1/xlator/protocol/client.so(+0x2db4b)[0x7fda04986b4b]
/usr/lib/libgfrpc.so.0(rpc_clnt_handle_reply+0x90)[0x7fda0a8d5490]
/usr/lib/libgfrpc.so.0(rpc_clnt_notify+0x1e7)[0x7fda0a8d5777]
/usr/lib/libgfrpc.so.0(rpc_transport_notify+0x23)[0x7fda0a8d17d3]
/usr/lib/glusterfs/3.10.1/rpc-transport/socket.so(+0x7194)[0x7fda05826194]
/usr/lib/glusterfs/3.10.1/rpc-transport/socket.so(+0x9635)[0x7fda05828635]
/usr/lib/libglusterfs.so.0(+0x83db0)[0x7fda0ab64db0]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8182)[0x7fda0a290182]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fda09fbd47d]
--------------

Error message type 2
--

[2017-09-01 10:01:20.387248] I [MSGID: 118035] [bit-rot-scrub.c:1297:br_scrubber_scale_up] 0-glustervol-bit-rot-0: Scaling up scrubbers [0 => 36]
[2017-09-01 10:01:20.392544] I [MSGID: 118048] [bit-rot-scrub.c:1547:br_scrubber_log_option] 0-glustervol-bit-rot-0: SCRUB TUNABLES:: [Frequency: biweekly, Throttle: lazy]
[2017-09-01 10:01:20.392571] I [MSGID: 118038] [bit-rot-scrub.c:948:br_fsscan_schedule] 0-glustervol-bit-rot-0: Scrubbing is scheduled to run at 2017-09-15 10:01:20
[2017-09-01 10:01:20.392727] I [glusterfsd-mgmt.c:1778:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing
[2017-09-01 10:01:35.078694] I [bit-rot.c:1683:notify] 0-glustervol-bit-rot-0: BitRot scrub ondemand called
[2017-09-01 10:01:35.078735] I [MSGID: 118038] [bit-rot-scrub.c:1085:br_fsscan_ondemand] 0-glustervol-bit-rot-0: Ondemand Scrubbing scheduled to run at 2017-09-01 10:01:36
[2017-09-01 10:01:36.355827] I [glusterfsd-mgmt.c:52:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2017-09-01 10:01:37.018622] I [MSGID: 118044] [bit-rot-scrub.c:615:br_scrubber_log_time] 0-glustervol-bit-rot-0: Scrubbing started at 2017-09-01 10:01:37
[2017-09-01 10:01:37.601774] I [glusterfsd-mgmt.c:1778:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing
[2017-09-06 08:33:37.738627] I [glusterfsd-mgmt.c:52:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2017-09-06 08:33:39.812894] I [glusterfsd-mgmt.c:52:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2017-09-06 08:33:41.828432] I [MSGID: 118034] [bit-rot-scrub.c:1342:br_scrubber_scale_down] 0-glustervol-bit-rot-0: Scaling down scrubbers [36 => 0]
[2017-09-06 08:33:41.884031] I [MSGID: 118051] [bit-rot-ssm.c:80:br_scrub_ssm_state_stall] 0-glustervol-bit-rot-0: Volume is under active scrubbing. Pausing scrub..
[2017-09-06 08:34:26.477106] C [rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired] 0-glustervol-client-970: server 192.168.0.21:49177 has not responded in the last 42 seconds, disconnecting.
[2017-09-06 08:34:29.477438] C [rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired] 0-glustervol-client-980: server 192.168.0.21:49178 has not responded in the last 42 seconds, disconnecting.
[2017-09-06 08:34:37.478198] C [rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired] 0-glustervol-client-1040: server 192.168.0.21:49184 has not responded in the last 42 seconds, disconnecting.
[2017-09-06 08:34:40.478550] C [rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired] 0-glustervol-client-1070: server 192.168.0.21:49187 has not responded in the last 42 seconds, disconnecting.
[2017-09-06 08:34:56.480200] C [rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired] 0-glustervol-client-990: server 192.168.0.21:49179 has not responded in the last 42 seconds, disconnecting.
[2017-09-06 08:34:59.480520] C [rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired] 0-glustervol-client-760: server 192.168.0.21:49156 has not responded in the last 42 seconds, disconnecting.
[2017-09-06 08:35:01.480751] C [rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired] 0-glustervol-client-1020: server 192.168.0.21:49182 has not responded in the last 42 seconds, disconnecting.
[2017-09-06 08:35:05.481223] C [rpc-clnt-ping.c:160:rpc_clnt_ping_timer_expired] 0-glustervol-client-790: server 192.168.0.21:49159 has not responded in the last 42 seconds, disconnecting.
[2017-09-06 09:03:43.637208] E [rpc-clnt.c:200:call_bail] 0-glusterfs: bailing out frame type(GlusterFS Handshake) op(GETSPEC(2)) xid = 0x8 sent = 2017-09-06 08:33:39.813002. timeout = 1800 for 127.0.0.1:24007
[2017-09-06 09:03:44.637338] E [rpc-clnt.c:200:call_bail] 0-glustervol-client-760: bailing out frame type(GlusterFS 3.3) op(READ(12)) xid = 0x160f941 sent = 2017-09-06 08:33:41.843336. timeout = 1800 for 192.168.0.21:49156
[2017-09-06 09:03:44.637726] W [MSGID: 114031] [client-rpc-fops.c:2992:client3_3_readv_cbk] 0-glustervol-client-760: remote operation failed [Transport endpoint is not connected]
pending frames:
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
frame : type(0) op(0)
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash:
2017-09-06 09:03:44
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.10.1
/usr/lib/libglusterfs.so.0(_gf_msg_backtrace_nomem+0x78)[0x7f26721934f8]
/usr/lib/libglusterfs.so.0(gf_print_trace+0x324)[0x7f267219c914]
/lib/x86_64-linux-gnu/libc.so.6(+0x36d40)[0x7f2671581d40]
/usr/lib/libglusterfs.so.0(syncop_readv_cbk+0x17)[0x7f26721ca9e7]
/usr/lib/glusterfs/3.10.1/xlator/protocol/client.so(+0x2db4b)[0x7f2667dd3b4b]
/usr/lib/libgfrpc.so.0(+0xf92c)[0x7f2671f5c92c]
/usr/lib/libglusterfs.so.0(+0x36eb2)[0x7f267219feb2]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8182)[0x7f2671918182]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f267164547d]
-------


My questions are:

1. To resume the scrub process, should I restart the glusterd service on the nodes where the scrub daemon is not running, or do a volume force start?

2. If resumed, will it continue from where it stopped?

3. I assume scrub assigns threads based on the number of bricks in the node; an option in the gluster volume command to change this would help.
   In my case each node has 12 CPUs (Intel Xeon, 6 cores + HT), and while scrub was running it consumed 99% of every CPU.
   Alternatively, it should be intelligent enough to scale down based on the CPUs available on the node.

4. What caused this crash?
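In case it is useful, the two recovery paths I am weighing in question 1 would look roughly like this (assuming the affected volume is `glustervol`; my understanding is that `start force` respawns any missing daemons without disturbing connected clients, but I would like confirmation):

```shell
# Option A: simply resume the paused scrubber
gluster volume bitrot glustervol scrub resume

# Option B: respawn missing daemons (including the bitrot/scrub daemon)
# on all nodes without stopping the volume
gluster volume start glustervol force

# Then verify the scrub daemon shows up again on every node
gluster volume status glustervol
gluster volume bitrot glustervol scrub status
```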


regards
Amudhan P
_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://lists.gluster.org/mailman/listinfo/gluster-users
