Dear Xavi,

thank you so much. The volume name is "data":

# gluster volume info data

Volume Name: data
Type: Distributed-Replicate
Volume ID: e3a99db0-8643-41c1-b4a1-6a728bb1d08c
Status: Started
Number of Bricks: 5 x 2 = 10
Transport-type: tcp
Bricks:
Brick1: pedrillo.bo.infn.it:/storage/1/data
Brick2: pedrillo.bo.infn.it:/storage/2/data
Brick3: pedrillo.bo.infn.it:/storage/5/data
Brick4: pedrillo.bo.infn.it:/storage/6/data
Brick5: pedrillo.bo.infn.it:/storage/arc1/data
Brick6: pedrillo.bo.infn.it:/storage/arc2/data
Brick7: osmino:/storageOsmino/1/data
Brick8: osmino:/storageOsmino/2/data
Brick9: osmino:/storageOsmino/4/data
Brick10: osmino:/storageOsmino/5/data

# gluster volume status data

Status of volume: data
Gluster process                                 Port    Online  Pid
------------------------------------------------------------------------------
Brick pedrillo.bo.infn.it:/storage/1/data       24009   Y       1732
Brick pedrillo.bo.infn.it:/storage/2/data       24010   Y       1738
Brick pedrillo.bo.infn.it:/storage/5/data       24013   Y       1747
Brick pedrillo.bo.infn.it:/storage/6/data       24014   Y       1758
Brick pedrillo.bo.infn.it:/storage/arc1/data    24015   Y       1770
Brick pedrillo.bo.infn.it:/storage/arc2/data    24016   Y       1838
Brick osmino:/storageOsmino/1/data              24009   Y       1173
Brick osmino:/storageOsmino/2/data              24010   Y       1179
Brick osmino:/storageOsmino/4/data              24011   Y       1185
Brick osmino:/storageOsmino/5/data              24012   Y       1191
NFS Server on localhost                         38467   Y       1847
Self-heal Daemon on localhost                   N/A     Y       1855
NFS Server on castor.bo.infn.it                 38467   Y       1582
Self-heal Daemon on castor.bo.infn.it           N/A     Y       1588
NFS Server on pollux.bo.infn.it                 38467   Y       1583
Self-heal Daemon on pollux.bo.infn.it           N/A     Y       1589
NFS Server on osmino                            38467   Y       1197
Self-heal Daemon on osmino                      N/A     Y       1203

Then I selected a file, as you suggested. It is an executable, "leggi_particelle", that should be located at /data/stefano/leggi_particelle/leggi_particelle. It is not listed there:

# ll /data/stefano/leggi_particelle
total 61
drwxr-xr-x  3 stefano user 20480 May 29 05:00 ./
drwxr-xr-x 14 stefano user 20480 May 28 11:32 ../
-rwxr-xr-x  1 stefano user   286 Feb 25 17:24 Espec.plt*
lrwxrwxrwx  1 stefano user    53 Feb 13 11:30 parametri.cpp
drwxr-xr-x  3 stefano user 20480 May 24 17:16 test/

but look at this:

# ll /storage/5/data/stefano/leggi_particelle/
total 892
drwxr-xr-x  3 stefano user   4096 May 24 17:16 ./
drwxr-xr-x 14 stefano user   4096 May 28 11:32 ../
lrwxrwxrwx  2 stefano user     50 Apr 11 19:20 filtro.cpp
lrwxrwxrwx  2 stefano user     70 Apr 11 19:20 leggi_binario_ALaDyn_fortran.h
-rwxr-xr-x  2 stefano user 705045 May 22 18:24 leggi_particelle*
-rwxr-xr-x  2 stefano user  61883 Dec 16 17:20 leggi_particelle.old01*
-rwxr-xr-x  2 stefano user 106014 Apr 11 19:20 leggi_particelle.old03*
---------T  2 root    root      0 May 24 17:16 parametri.cpp
drwxr-xr-x  3 stefano user   4096 Apr 11 19:19 test/

# ll /storage/6/data/stefano/leggi_particelle/
total 892
drwxr-xr-x  3 stefano user   4096 May 24 17:16 ./
drwxr-xr-x 14 stefano user   4096 May 28 11:32 ../
lrwxrwxrwx  2 stefano user     50 Apr 11 19:20 filtro.cpp
lrwxrwxrwx  2 stefano user     70 Apr 11 19:20 leggi_binario_ALaDyn_fortran.h
-rwxr-xr-x  2 stefano user 705045 May 22 18:24 leggi_particelle*
-rwxr-xr-x  2 stefano user  61883 Dec 16 17:20 leggi_particelle.old01*
-rwxr-xr-x  2 stefano user 106014 Apr 11 19:20 leggi_particelle.old03*
---------T  2 root    root      0 May 24 17:16 parametri.cpp
drwxr-xr-x  3 stefano user   4096 Apr 11 19:19 test/

So, as you can see, "leggi_particelle" is there on the fifth and sixth bricks (some files of this folder live on other bricks; "ls" on the FUSE mount point only shows the ones on the first brick).
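By the way, the zero-byte "---------T" parametri.cpp owned by root on bricks 5 and 6 looks to me like a DHT link file rather than the real file. If that is right (and assuming the xattr name has not changed in 3.3.1), the brick it points to should be readable with:

# getfattr -n trusted.glusterfs.dht.linkto -e text /storage/5/data/stefano/leggi_particelle/parametri.cpp

I can send that output too if you think it is relevant.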
The copies of "leggi_particelle" on bricks 5 and 6 are the same executable, and both work. Above all, I found that calling /data/stefano/leggi_particelle/leggi_particelle correctly launches the executable! So "ls" does not find it, but the OS does. Is it just a very bad "bug" in ls? I don't think so, because mounting the volume remotely over NFS hides the same files.

Anyway, back to what you asked. This is the tail of data.log:

[2013-05-31 10:00:02.236397] W [rpc-transport.c:174:rpc_transport_load] 0-rpc-transport: missing 'option transport-type'. defaulting to "socket"
[2013-05-31 10:00:02.283271] I [cli-rpc-ops.c:504:gf_cli3_1_get_volume_cbk] 0-cli: Received resp to get vol: 0
[2013-05-31 10:00:02.283465] I [cli-rpc-ops.c:757:gf_cli3_1_get_volume_cbk] 0-cli: Returning: 0
[2013-05-31 10:00:02.283613] I [cli-rpc-ops.c:504:gf_cli3_1_get_volume_cbk] 0-cli: Received resp to get vol: 0
[2013-05-31 10:00:02.283816] I [cli-rpc-ops.c:757:gf_cli3_1_get_volume_cbk] 0-cli: Returning: 0
[2013-05-31 10:00:02.283919] I [cli-rpc-ops.c:504:gf_cli3_1_get_volume_cbk] 0-cli: Received resp to get vol: 0
[2013-05-31 10:00:02.283943] I [cli-rpc-ops.c:757:gf_cli3_1_get_volume_cbk] 0-cli: Returning: 0
[2013-05-31 10:00:02.283951] I [input.c:46:cli_batch] 0-: Exiting with: 0
[2013-05-31 10:00:07.279855] W [rpc-transport.c:174:rpc_transport_load] 0-rpc-transport: missing 'option transport-type'. defaulting to "socket"
[2013-05-31 10:00:07.326202] I [cli-rpc-ops.c:504:gf_cli3_1_get_volume_cbk] 0-cli: Received resp to get vol: 0
[2013-05-31 10:00:07.326484] I [cli-rpc-ops.c:757:gf_cli3_1_get_volume_cbk] 0-cli: Returning: 0
[2013-05-31 10:00:07.326493] I [input.c:46:cli_batch] 0-: Exiting with: 0
[2013-05-31 10:01:29.718428] W [rpc-transport.c:174:rpc_transport_load] 0-rpc-transport: missing 'option transport-type'. defaulting to "socket"
[2013-05-31 10:01:29.767990] I [input.c:46:cli_batch] 0-: Exiting with: 0

Running many "ls" or "ll" commands did not add anything to it.

This is the tail of storage-5-data.log and storage-6-data.log (they are almost identical):

[2013-05-31 07:59:19.090790] I [server-handshake.c:571:server_setvolume] 0-data-server: accepted client from pedrillo-2510-2013/05/31-07:59:15:067773-data-client-3-0 (version: 3.3.1)
[2013-05-31 08:00:56.935205] I [server-handshake.c:571:server_setvolume] 0-data-server: accepted client from pollux-2361-2013/05/31-08:00:52:937577-data-client-3-0 (version: 3.3.1)
[2013-05-31 08:01:03.611506] I [server-handshake.c:571:server_setvolume] 0-data-server: accepted client from castor-2629-2013/05/31-08:00:59:614003-data-client-3-0 (version: 3.3.1)
[2013-05-31 08:02:15.940950] I [server-handshake.c:571:server_setvolume] 0-data-server: accepted client from osmino-1844-2013/05/31-08:02:11:932993-data-client-3-0 (version: 3.3.1)

Except for the warning in data.log, they look legitimate.
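Since nothing new appears in these logs while "ls" is running, I could also mount the volume a second time with a more verbose client log and repeat the listing there. If I read the mount.glusterfs options correctly, something like this should do it (the mount point /mnt/data-debug is just an example):

# mkdir -p /mnt/data-debug
# mount -t glusterfs -o log-level=DEBUG,log-file=/var/log/glusterfs/data-debug.log pedrillo.bo.infn.it:/data /mnt/data-debug
# ls /mnt/data-debug/stefano/leggi_particelle
# umount /mnt/data-debug

Just tell me if that debug log would be useful.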
Here are the attributes:

# getfattr -m. -e hex -d /storage/5/data/stefano/leggi_particelle/leggi_particelle
getfattr: Removing leading '/' from absolute path names
# file: storage/5/data/stefano/leggi_particelle/leggi_particelle
trusted.afr.data-client-2=0x000000000000000000000000
trusted.afr.data-client-3=0x000000000000000000000000
trusted.gfid=0x883c343b9366478da660843da8f6b87c

# getfattr -m. -e hex -d /storage/6/data/stefano/leggi_particelle/leggi_particelle
getfattr: Removing leading '/' from absolute path names
# file: storage/6/data/stefano/leggi_particelle/leggi_particelle
trusted.afr.data-client-2=0x000000000000000000000000
trusted.afr.data-client-3=0x000000000000000000000000
trusted.gfid=0x883c343b9366478da660843da8f6b87c

# getfattr -m. -e hex -d /storage/5/data/stefano/leggi_particelle
getfattr: Removing leading '/' from absolute path names
# file: storage/5/data/stefano/leggi_particelle
trusted.afr.data-client-2=0x000000000000000000000000
trusted.afr.data-client-3=0x000000000000000000000000
trusted.gfid=0xb62a16f0bdb94e3f8563ccfb278c2105
trusted.glusterfs.dht=0x00000001000000003333333366666665

# getfattr -m. -e hex -d /storage/6/data/stefano/leggi_particelle
getfattr: Removing leading '/' from absolute path names
# file: storage/6/data/stefano/leggi_particelle
trusted.afr.data-client-2=0x000000000000000000000000
trusted.afr.data-client-3=0x000000000000000000000000
trusted.gfid=0xb62a16f0bdb94e3f8563ccfb278c2105
trusted.glusterfs.dht=0x00000001000000003333333366666665
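If it helps, I can also dump the layout of the parent directory on every brick, since you asked for all of them and I have only shown bricks 5 and 6 here. If I decode the trusted.glusterfs.dht value correctly, the last two 32-bit words (0x33333333 and 0x66666665 above) are the start and end of the hash range assigned to that replica pair, so the five pairs together should cover the whole 32-bit hash space. Something like:

on pedrillo:
# for b in 1 2 5 6 arc1 arc2; do getfattr -m. -e hex -d /storage/$b/data/stefano/leggi_particelle; done

on osmino:
# for b in 1 2 4 5; do getfattr -m. -e hex -d /storageOsmino/$b/data/stefano/leggi_particelle; done

Just say the word and I will send that output as well.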
Sorry for the very long email, but your help is very much appreciated and I hope I have been clear enough.

Best regards,

Stefano


On Fri, May 31, 2013 at 4:34 PM, Xavier Hernandez <xhernandez at datalab.es> wrote:

> Hi Stefano,
>
> it would help to see the results of the following commands:
>
> gluster volume info <volname>
> gluster volume status <volname>
>
> It would also be interesting to see fragments of the logs containing
> warnings or errors generated while the 'ls' command was executing (if any).
> The logs from the mount point are normally located at
> /var/log/glusterfs/<mount point>.log. Logs from the bricks are at
> /var/log/glusterfs/bricks/<brick name>.log.
>
> Also, identify a file (I'll call it <file>, and its parent directory
> <parent>) that does not appear in an 'ls' from the mount point. Then
> execute the following command on all bricks:
>
> getfattr -m. -e hex -d <brick root>/<parent>
>
> On all bricks that contain the file, execute the following command:
>
> getfattr -m. -e hex -d <brick root>/<parent>/<file>
>
> This information might help to see what is happening.
>
> Best regards,
>
> Xavi
>
> Al 31/05/13 08:53, En/na Stefano Sinigardi ha escrit:
>
>> Dear all,
>> thanks again for your support.
>> The files are already exactly duplicated on the bricks (diff confirms it),
>> so I would rather not touch them directly, in order not to make the volume
>> worse than it already is.
>> Thanks to your help I found out that triggering the self-heal no longer
>> requires running a find over all the files; there is now the
>>
>> gluster volume heal VOLNAME full
>>
>> command. So I ran it and it said "Launching Heal operation on volume data
>> has been successful. Use heal info commands to check status". But then
>>
>> gluster volume heal VOLNAME info
>>
>> reported every entry at zero, as did "gluster volume heal VOLNAME info
>> heal-failed" and "gluster volume heal VOLNAME info split-brain". That
>> should be good news, shouldn't it?
>> On the other hand, "gluster volume heal VOLNAME info healed" reported
>> 1023 healed files per pair of bricks (it is very strange that the number
>> is always the same; is it an upper bound?). What is sure is that all of
>> them are missing from the volume itself (I am not sure whether only 1023
>> per pair are missing from the total; it could be more).
>> I thought that at least those files would have become visible by now, but
>> they have not. How can I check the logs for this healing process? "ls -ltr
>> /var/log/glusterfs/" shows that no logs are being touched, and even the
>> most recent ones only report the launch of the heal command.
>> I rebooted the nodes but nothing changed: the files are still missing from
>> the FUSE mount point but not from the bricks. I relaunched the healing and
>> again 1023 files per pair of bricks were self-healed. I think they are the
>> same ones as before, and they are still missing from the FUSE mount point
>> (I checked a few of those reported by "gluster volume heal VOLNAME info
>> healed").
>>
>> Do you have any other suggestion? Things are looking very bad for me now
>> because of the time I am making other people lose (as I said, we do not
>> have a system administrator and I do this just "for fun", but people still
>> look at you as the one responsible)...
>>
>> Best regards,
>>
>> Stefano
>> _______________________________________________
>> Gluster-users mailing list
>> Gluster-users at gluster.org
>> http://supercolony.gluster.org/mailman/listinfo/gluster-users
>