Re: fractured/split glusterfs - 2 up, 2 down for an hour


 



There are some other anomalies as well. Even when the files are visible and readable, many dirs are unwritable and/or undeletable.

 

for example:

====

Sat Jan 04 18:36:17 [0.02 0.08 0.12] root@hpc-s:/bio/mmacchie

1104 $ mkdir hjmtest

mkdir: cannot create directory `hjmtest': Invalid argument

 

Sat Jan 04 18:36:23 [0.02 0.08 0.12] root@hpc-s:/bio/mmacchie

====

The client log says this for that operation (note the offset times: UTC in the log vs. local in the prompt):

<http://pastie.org/8602365>
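
To line up the prompt timestamps with the log timestamps: in the excerpts in this post the log runs 8 hours ahead of the prompt. A minimal sketch for converting a log timestamp to local time, assuming GNU date is available; the function name utc_to_local and the fixed-offset POSIX TZ string PST8 are illustrative choices, not something from the logs themselves:

```shell
# Convert a UTC log timestamp to local wall-clock time.
# $1: timestamp as it appears in the log, e.g. "2014-01-05 02:42:09.548263"
# $2: POSIX TZ string for the local zone (assumption: "PST8" = UTC-8, no DST)
# Requires GNU date for the -d option.
utc_to_local() {
    ts=${1%.*}    # drop fractional seconds, if any
    LC_ALL=C TZ="$2" date -d "$ts UTC" '+%a %b %d %H:%M:%S'
}

# Usage:
#   utc_to_local '2014-01-05 02:42:09.548263' PST8
```

The output uses the same layout as the shell prompt, so the two can be compared directly.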

 

And in many subdirs, other dirs can be made, but not deleted:

 

Sat Jan 04 18:41:45 [0.00 0.04 0.09] root@hpc-s:/bio/mmacchie/Nematodes2/phast/steiner_motifs/mmacchie_recovered

1109 $ mkdir j1

 

Sat Jan 04 18:42:00 [0.00 0.03 0.09] root@hpc-s:/bio/mmacchie/Nematodes2/phast/steiner_motifs/mmacchie_recovered

1110 $ rmdir j1

rmdir: failed to remove `j1': Transport endpoint is not connected

 

Sat Jan 04 18:42:09 [0.08 0.05 0.09] root@hpc-s:/bio/mmacchie/Nematodes2/phast/steiner_motifs/mmacchie_recovered

 

With the client log saying:

====

[2014-01-05 02:42:09.548263] W [client-rpc-fops.c:526:client3_3_stat_cbk] 0-gl-client-2: remote operation failed: Transport endpoint is not connected

[2014-01-05 02:42:09.549314] W [client-rpc-fops.c:2624:client3_3_lookup_cbk] 0-gl-client-2: remote operation failed: Transport endpoint is not connected. Path: /bio/mmacchie/Nematodes2/phast/steiner_motifs/mmacchie_recovered/j1 (aebbf21f-37fe-4edc-be8a-0f57b057b516)

[2014-01-05 02:42:09.550124] W [client-rpc-fops.c:2541:client3_3_opendir_cbk] 0-gl-client-2: remote operation failed: Transport endpoint is not connected. Path: /bio/mmacchie/Nematodes2/phast/steiner_motifs/mmacchie_recovered/j1 (aebbf21f-37fe-4edc-be8a-0f57b057b516)

[2014-01-05 02:42:09.552439] W [fuse-bridge.c:1193:fuse_unlink_cbk] 0-glusterfs-fuse: 5805445: RMDIR() /bio/mmacchie/Nematodes2/phast/steiner_motifs/mmacchie_recovered/j1 => -1 (Transport endpoint is not connected)

[2014-01-05 02:42:12.175860] W [socket.c:514:__socket_rwv] 0-gl-client-2: readv failed (No data available)

[2014-01-05 02:42:15.181365] W [socket.c:514:__socket_rwv] 0-gl-client-2: readv failed (No data available)

[2014-01-05 02:42:18.186668] W [socket.c:514:__socket_rwv] 0-gl-client-2: readv failed (No data available)

====

 

This is odd: how can a dir be created OK, but then the fs loses track of it when asked to delete it?

 

And that dir (j1) can have /files/ created and deleted inside of it, but not other /dirs/ (same result as the parent dir).

 

In looking thru the client log, I see instances of this:

====

[2014-01-05 02:27:20.721043] W [client-rpc-fops.c:2624:client3_3_lookup_cbk] 0-gl-client-2: remote operation failed: Transport endpoint is not connected. Path: /bio/mmacchie/Nematodes (00000000-0000-0000-0000-000000000000)

[2014-01-05 02:27:20.769058] I [dht-layout.c:630:dht_layout_normalize] 0-gl-dht: found anomalies in /bio/mmacchie/Nematodes. holes=2 overlaps=0

[2014-01-05 02:27:20.769090] W [dht-selfheal.c:900:dht_selfheal_directory] 0-gl-dht: 1 subvolumes down -- not fixing

[2014-01-05 02:27:20.784335] W [client-rpc-fops.c:2624:client3_3_lookup_cbk] 0-gl-client-2:

====

more at: <http://pastie.org/8602381>

 

This is alarming, since the log says:

[2014-01-05 02:27:20.769090] W [dht-selfheal.c:900:dht_selfheal_directory] 0-gl-dht: 1 subvolumes down -- not fixing
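
A rough way to see which client translator the failures cluster on is to tally the "remote operation failed" warnings per translator name. This is only a sketch: it assumes the client log layout shown in the excerpts above, where the fifth whitespace-separated field is the translator name (e.g. "0-gl-client-2:"). The function name count_failed_by_subvol is made up for illustration.

```shell
# Tally "remote operation failed" warnings per client translator.
# stdin: a glusterfs client log.
# Assumption: field 5 is the translator name followed by a colon,
# as in "[date time] W [file:line:func] 0-gl-client-2: message".
count_failed_by_subvol() {
    awk '/remote operation failed/ {
             name = $5
             sub(/:$/, "", name)   # strip the trailing colon
             n[name]++
         }
         END { for (k in n) print n[k], k }' | sort -rn
}

# Usage (substitute the path of your own client log):
#   count_failed_by_subvol < CLIENT_LOG
```

If every failure lands on a single name such as 0-gl-client-2, that would fit the "1 subvolumes down" message and point at one brick connection rather than at the whole volume.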

 

All my servers and bricks appear to be up and online:

 

Sat Jan 04 18:54:09 [0.76 0.30 0.20] root@biostor1:~

1003 $ gluster volume status gl detail | egrep "Brick|Online"

Brick : Brick bs2:/raid1

Online : Y

Brick : Brick bs2:/raid2

Online : Y

Brick : Brick bs3:/raid1

Online : Y

Brick : Brick bs3:/raid2

Online : Y

Brick : Brick bs4:/raid1

Online : Y

Brick : Brick bs4:/raid2

Online : Y

Brick : Brick bs1:/raid1

Online : Y

Brick : Brick bs1:/raid2

Online : Y
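
Since every brick reports Online while the client keeps failing on gl-client-2, it may help to work out which brick that translator refers to. The sketch below rests on an assumption that should be checked against the client volfile: that client-N corresponds to the "Brick<N+1>:" line in `gluster volume info` output. The function name brick_for_client is hypothetical.

```shell
# Print the brick that client translator index N would map to.
# ASSUMPTION: <volname>-client-N corresponds to the "Brick<N+1>:" line
# of `gluster volume info <volname>`; verify against the client volfile.
# stdin: output of `gluster volume info <volname>`
# $1: client index N (e.g. 2 for gl-client-2)
brick_for_client() {
    awk -v want="Brick$(( $1 + 1 )):" '$1 == want { print $2 }'
}

# Usage:
#   gluster volume info gl | brick_for_client 2
```

With the brick identified, the brick log on that server is the next place to look for connection errors.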

 

 

The gluster server logs seem to be fairly quiet through this. The following contains the logs for the last day or so from the 4 servers, filtered with the command below to eliminate the 'socket.c:2788' errors:

 

grep -v socket.c:2788 /var/log/glusterfs/etc-glusterfs-glusterd.vol.log

 

<http://pastie.org/8602412>

 

hjm

 

 

On Saturday, January 04, 2014 10:45:29 PM Vijay Bellur wrote:

> On 01/04/2014 07:21 AM, harry mangalam wrote:

> > This is a distributed-only glusterfs on 4 servers with 2 bricks each on

> > an IPoIB network.

> >

> > Thanks to a misconfigured autoupdate script, when 3.4.2 was released

> > today, my gluster servers tried to update themselves. 2 succeeded, but

> > then failed to restart, the other 2 failed to update and kept running.

> >

> > Not realizing the sequence of events, I restarted the 2 that failed to

> > restart, which gave my fs 2 servers running 3.4.1 and 2 running 3.4.2.

> >

> > When I realized this after about 30m, I shut everything down and updated

> > the 2 remaining to 3.4.2 and then restarted but now I'm getting lots of

> > reports of file errors of the type 'endpoints not connected' and the like:

> >

> > [2014-01-04 01:31:18.593547] W

> > [client-rpc-fops.c:2624:client3_3_lookup_cbk] 0-gl-client-2: remote

> > operation failed: Transport endpoint is not connected. Path: /bio/fishm/test_cuffdiff.sh

> > (00000000-0000-0000-0000-000000000000)

> >

> > [2014-01-04 01:31:18.594928] W

> > [client-rpc-fops.c:2624:client3_3_lookup_cbk] 0-gl-client-2: remote

> > operation failed: Transport endpoint is not connected. Path: /bio/fishm/test_cuffdiff.sh

> > (00000000-0000-0000-0000-000000000000)

> >

> > [2014-01-04 01:31:18.595818] W

> > [client-rpc-fops.c:2624:client3_3_lookup_cbk] 0-gl-client-2: remote

> > operation failed: Transport endpoint is not connected. Path: /bio/fishm/.#test_cuffdiff.sh

> > (14c3b612-e952-4aec-ae18-7f3dbb422dcc)

> >

> > [2014-01-04 01:31:18.597381] W

> > [client-rpc-fops.c:2624:client3_3_lookup_cbk] 0-gl-client-2: remote

> > operation failed: Transport endpoint is not connected. Path: /bio/fishm/test_cuffdiff.sh

> > (00000000-0000-0000-0000-000000000000)

> >

> > [2014-01-04 01:31:18.598212] W

> > [client-rpc-fops.c:814:client3_3_statfs_cbk] 0-gl-client-2: remote

> > operation failed: Transport endpoint is not connected

> >

> > [2014-01-04 01:31:18.598236] W [dht-diskusage.c:45:dht_du_info_cbk]

> > 0-gl-dht: failed to get disk info from gl-client-2

> >

> > [2014-01-04 01:31:19.912210] W [socket.c:514:__socket_rwv]

> > 0-gl-client-2: readv failed (No data available)

> >

> > [2014-01-04 01:31:22.912717] W [socket.c:514:__socket_rwv]

> > 0-gl-client-2: readv failed (No data available)

> >

> > [2014-01-04 01:31:25.913208] W [socket.c:514:__socket_rwv]

> > 0-gl-client-2: readv failed (No data available)

> >

> > The servers at the same time provided the following error 'E' messages:

> >

> > Fri Jan 03 17:46:42 [0.20 0.12 0.13] root@biostor1:~

> >

> > 1008 $ grep ' E ' /var/log/glusterfs/bricks/raid1.log |grep '2014-01-03'

> >

> > [2014-01-03 06:11:36.251786] E [server-helpers.c:751:server_alloc_frame]

> > (-->/usr/lib64/libgfrpc.so.0(rpcsvc_notify+0x103) [0x3161e090d3]

> > (-->/usr/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x245)

> > [0x3161e08f85]

> > (-->/usr/lib64/glusterfs/3.4.1/xlator/protocol/server.so(server3_3_lookup+

> > 0xa0) [0x7fa60e577170]))) 0-server: invalid argument: conn

> >

> > [2014-01-03 06:11:36.251813] E

> > [rpcsvc.c:450:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed

> > to complete successfully

> >

> > [2014-01-03 17:48:44.236127] E [rpc-transport.c:253:rpc_transport_load]

> > 0-rpc-transport: /usr/lib64/glusterfs/3.4.1/rpc-transport/rdma.so:

> > cannot open shared object file: No such file or directory

> >

> > [2014-01-03 19:15:26.643378] E [rpc-transport.c:253:rpc_transport_load]

> > 0-rpc-transport: /usr/lib64/glusterfs/3.4.2/rpc-transport/rdma.so:

> > cannot open shared object file: No such file or directory

>

> rdma.so seems to be missing here. Is glusterfs-rdma-3.4.2-1 rpm

> installed on the servers?

>

> -Vijay

>

> _______________________________________________

> Gluster-users mailing list

> Gluster-users@xxxxxxxxxxx

> http://supercolony.gluster.org/mailman/listinfo/gluster-users

 

---

Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine

[m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487

415 South Circle View Dr, Irvine, CA, 92697 [shipping]

MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)

---

 
