On 03/04/2015 10:29 AM, Emmanuel Dreyfus wrote:
Emmanuel Dreyfus <manu@xxxxxxxxxx> wrote:
It seems there is something very weird going on here: it fails because in
afr_inode_refresh_subvol_cbk (after a lookup) we get a valid reply from
brick 0 with op_ret = 0. But the brick 0 server process was killed, so
that makes no sense.
Looking at a kernel trace I can now tell that the brick0 server process
indeed gets a SIGKILL, but then glusterd spawns a new process for brick0
that answers the requests.
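A quick way to double-check from the shell that the brick really was
respawned (a rough sketch; it assumes the brick path contains "brick0" and
that only one such glusterfsd runs on the node):

  old_pid=$(pgrep -f 'glusterfsd.*brick0')   # PID of the original brick process
  kill -KILL "$old_pid"                      # the same SIGKILL the test sends
  sleep 2
  pgrep -af 'glusterfsd.*brick0'             # a different, non-empty PID here means the brick was respawned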
The glusterd log confirms this: first it starts the two bricks:
[glusterd-pmap.c:227:pmap_registry_bind] 0-pmap: adding brick /d/backends/brick0 on port 49152
[glusterd-pmap.c:227:pmap_registry_bind] 0-pmap: adding brick /d/backends/brick1 on port 49153
-> Killing brick0
[glusterd-handler.c:4388:__glusterd_brick_rpc_notify] 0-management: Brick nbslave73.cloud.gluster.org:/d/backends/brick0 has disconnected from glusterd.
-> And here it restarts!
[glusterd-pmap.c:227:pmap_registry_bind] 0-pmap: adding brick /d/backends/brick0 on port 49152
-> The test terminates and kills all bricks:
[glusterd-pmap.c:271:pmap_registry_remove] 0-pmap: removing brick /d/backends/brick0 on port 49152
[glusterd-pmap.c:271:pmap_registry_remove] 0-pmap: removing brick /d/backends/brick1 on port 49153
Hence, could it be a glusterd bug? Why would it restart a brick on its own?
Not sure, CC'ing Atin who might be able to shed some light on the
glusterd logs. If the brick gets restarted as you say, the brick log
will also contain something like "I [glusterfsd.c:1959:main]
0-/usr/local/sbin/glusterfsd: Started running
/usr/local/sbin/glusterfsd" and the graph information etc. Does it? And
does volume status show the brick as online again?
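For example, something like this would answer both questions (the volume
name "patchy" and the brick log path are assumptions based on the usual
regression-test setup; adjust them to yours):

  # Is the brick listed as Online again, with a new PID?
  gluster volume status patchy

  # Did a new brick process start and log its startup line?
  grep "Started running" /var/log/glusterfs/bricks/d-backends-brick0.log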
-Ravi