Jeff- We traced the problem to an array bounds error in a FORTRAN program, which resulted in it spontaneously producing a 250GB sparse file. We think glusterd crashed when the program was terminated halfway through, but there is evidence in the client logs that it stopped responding temporarily when the program was allowed to run to completion. I don't think that particular problem will happen again now that we know what causes it, but the user whose program was responsible said that GlusterFS should be able to handle that sort of problem because memory errors are common in a research environment, and that the fact that it can't means it is not very robust. I won't say whether I agree or disagree with that statement, but I promised to pass it on to the developers.

The reason I originally linked this problem to a recent volume expansion is that I have been seeing layout-related errors in the client and NFS logs since that expansion. Here is a sample from two separate clients, dating back to a few minutes after the add-brick operation was carried out.

[2012-02-07 14:47:07.462200] W [fuse-resolve.c:328:fuse_resolve_path_deep] 0-fuse: /users/tjp/QuikSCAT_daily_v4/daily2hrws20040619v4.nc: no gfid found
[2012-02-07 14:47:07.483662] W [fuse-resolve.c:328:fuse_resolve_path_deep] 0-fuse: /users/tjp/QuikSCAT_daily_v4/mingmt: no gfid found
[2012-02-07 14:47:25.312719] I [dht-layout.c:682:dht_layout_dir_mismatch] 1-atmos-dht: subvol: atmos-replicate-4; inode layout - 1651910495 - 1982292593; disk layout - 2761050402 - 3067833779
[2012-02-07 14:47:25.312917] I [dht-common.c:524:dht_revalidate_cbk] 1-atmos-dht: mismatching layouts for /users
[2012-02-07 15:16:41.256252] I [dht-layout.c:581:dht_layout_normalize] 0-atmos-dht: found anomalies in /. holes=1 overlaps=0
[2012-02-07 15:16:41.256285] I [dht-common.c:362:dht_lookup_root_dir_cbk] 0-atmos-dht: fixing assignment on /
[2012-02-07 19:12:58.451690] W [fuse-resolve.c:273:fuse_resolve_deep_cbk] 0-fuse: /users/*: no gfid found
[2012-02-07 19:12:58.466364] W [fuse-resolve.c:273:fuse_resolve_deep_cbk] 0-fuse: /users/*: no gfid found
[2012-02-07 19:12:58.495660] W [fuse-resolve.c:328:fuse_resolve_path_deep] 0-fuse: /users/lsrf: no gfid found

Unlike the sparse file incident this week, there were no accompanying error messages reporting "subvolumes down" or "no child is up" in the period following the volume expansion. At the time I put the layout errors down to the fact that the fix-layout operation had not yet completed. Is that a possible explanation, do you think? When the errors were still occurring several days later I thought perhaps that fix-layout hadn't completed properly and ran it again. Now I am wondering if the layout errors following the volume expansion were the result of any of the other factors I suggested in my original posting. Are any of those the likely cause, do you think?

Dodgy FORTRAN programs aside, the layout errors have not recurred on most of the clients since I did the following on Tuesday night (roughly the command sequence sketched after the list):

1) Ran fsck.ext4 on all the bricks
2) Restarted glusterd on all the servers
3) Removed the pair of bricks recently added
4) Added that pair of bricks again, using "bdan14.nerc-essc.ac.uk" instead of "bdan14" as one of the host names
5) Re-ran fix-layout
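This is roughly what that sequence looked like on the command line, sketched from memory rather than copied from my history. The volume name "atmos" is taken from the log prefixes above, but the device name, the brick paths and "otherserver" below are placeholders rather than the real ones, and the exact remove-brick/add-brick syntax may differ between GlusterFS releases.

    # 1) check each brick filesystem (device name is a placeholder)
    fsck.ext4 -f /dev/sdb1

    # 2) restart the management daemon on every server
    /etc/init.d/glusterd restart

    # 3) remove the recently added replica pair
    #    (brick paths and "otherserver" are placeholders)
    gluster volume remove-brick atmos bdan14:/brickpath otherserver:/brickpath

    # 4) add the pair back, using the fully qualified name for bdan14
    gluster volume add-brick atmos bdan14.nerc-essc.ac.uk:/brickpath otherserver:/brickpath

    # 5) re-run the layout fix and keep an eye on its progress
    gluster volume rebalance atmos fix-layout start
    gluster volume rebalance atmos status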
I said the errors had not occurred on _most_ of the nodes. Last night the machine where I was running a self-heal on the volume hung, and after restarting it I discovered that the GlusterFS client had suddenly reported "11 subvolumes down", followed by multiple reports of "anomalies", with a few "failed to get fd ctx. EBADFD" warnings thrown in for good measure. Preceding that there were lots of these...

[2012-02-24 18:27:20.21981] C [client-handshake.c:121:rpc_client_ping_timer_expired] 0-atmos-client-27: server 192.171.166.96:24041 has not responded in the last 42 seconds, disconnecting.

...followed by lots of these...

[2012-02-24 18:56:48.378887] E [rpc-clnt.c:197:call_bail] 0-atmos-client-27: bailing out frame type(GlusterFS Handshake) op(PING(3)) xid = 0x11551459x sent = 2012-02-24 18:26:35.833590. timeout = 1800
[2012-02-24 18:56:48.409872] W [client-handshake.c:264:client_ping_cbk] 0-atmos-client-27: timer must have expired

...and then a load of these...

[2012-02-24 18:58:28.782252] C [rpc-clnt.c:436:rpc_clnt_fill_request_info] 0-atmos-client-27: cannot lookup the saved frame corresponding to xid (11551458)
[2012-02-24 18:58:28.782677] W [socket.c:1327:__socket_read_reply] 0-: notify for event MAP_XID failed
[2012-02-24 18:58:28.782738] I [client.c:1883:client_rpc_notify] 0-atmos-client-27: disconnected

It looks as if there was one of each of the above for all the bricks in the volume. There was no evidence of anything happening on the servers during all this, and the other clients didn't report any problems, so this one client just appears to have gone haywire by itself. I hope this was just a consequence of the DHT self-heal process dealing with a multitude of errors following the recent upheaval, but right now I'm not inclined to disagree with the dodgy FORTRAN program owner's views on the robustness of GlusterFS. I upset him by asking if he had been running his dodgy program again...

-Dan.

On 02/23/2012 05:21 PM, Jeff Darcy wrote:
> On 02/23/2012 11:45 AM, Dan Bretherton wrote:
>>> The main question is therefore why
>>> we're losing connectivity to these servers.
>> Could there be a hardware issue? I have replaced the network cables for
>> the two servers but I don't really know what else to check. The network
>> switch hasn't recorded any errors for those two ports. There isn't
>> anything sinister in /var/log/messages.
>>
>> It seems a bit of a coincidence that both servers lost connection at
>> exactly the same time. The only thing the users have started doing
>> differently recently is processing a large number of small text files.
>> There is one particular application they are running that processes this
>> data, but the load on the Glusterfs servers doesn't go up when it is
>> running.
> It does seem like a weird coincidence. About the only thing I can think of is
> that there's some combination of events that occurs on those two servers but
> not the others. For example, what if there's some file that happens to live on
> that replica pair, and which is accessed in some particularly pathological way?
> I used to see something like that with some astrophysics code that would try
> to open and truncate the same file from each of a thousand nodes simultaneously
> each time it started. Needless to say, this caused a few problems. ;) Maybe
> there's something about this new job type that similarly "converges" on one
> file for configuration, logging, something like that?