Dear Xavi,

thank you so much. The volume name is "data":

# gluster volume info data

Volume Name: data
Type: Distributed-Replicate
Volume ID: e3a99db0-8643-41c1-b4a1-6a728bb1d08c
Status: Started
Number of Bricks: 5 x 2 = 10
Transport-type: tcp
Bricks:
Brick1: pedrillo.bo.infn.it:/storage/1/data
Brick2: pedrillo.bo.infn.it:/storage/2/data
Brick3: pedrillo.bo.infn.it:/storage/5/data
Brick4: pedrillo.bo.infn.it:/storage/6/data
Brick5: pedrillo.bo.infn.it:/storage/arc1/data
Brick6: pedrillo.bo.infn.it:/storage/arc2/data
Brick7: osmino:/storageOsmino/1/data
Brick8: osmino:/storageOsmino/2/data
Brick9: osmino:/storageOsmino/4/data
Brick10: osmino:/storageOsmino/5/data

# gluster volume status data

Status of volume: data
Gluster process                                 Port    Online  Pid
------------------------------------------------------------------------------
Brick pedrillo.bo.infn.it:/storage/1/data       24009   Y       1732
Brick pedrillo.bo.infn.it:/storage/2/data       24010   Y       1738
Brick pedrillo.bo.infn.it:/storage/5/data       24013   Y       1747
Brick pedrillo.bo.infn.it:/storage/6/data       24014   Y       1758
Brick pedrillo.bo.infn.it:/storage/arc1/data    24015   Y       1770
Brick pedrillo.bo.infn.it:/storage/arc2/data    24016   Y       1838
Brick osmino:/storageOsmino/1/data              24009   Y       1173
Brick osmino:/storageOsmino/2/data              24010   Y       1179
Brick osmino:/storageOsmino/4/data              24011   Y       1185
Brick osmino:/storageOsmino/5/data              24012   Y       1191
NFS Server on localhost                         38467   Y       1847
Self-heal Daemon on localhost                   N/A     Y       1855
NFS Server on castor.bo.infn.it                 38467   Y       1582
Self-heal Daemon on castor.bo.infn.it           N/A     Y       1588
NFS Server on pollux.bo.infn.it                 38467   Y       1583
Self-heal Daemon on pollux.bo.infn.it           N/A     Y       1589
NFS Server on osmino                            38467   Y       1197
Self-heal Daemon on osmino                      N/A     Y       1203

Then I selected a file, as you suggested. It is an executable, "leggi_particelle", that should be located at /data/stefano/leggi_particelle/leggi_particelle. It is not listed there:

# ll /data/stefano/leggi_particelle
total 61
drwxr-xr-x  3 stefano user 20480 May 29 05:00 ./
drwxr-xr-x 14 stefano user 20480 May 28 11:32 ../
-rwxr-xr-x  1 stefano user   286 Feb 25 17:24 Espec.plt*
lrwxrwxrwx  1 stefano user    53 Feb 13 11:30 parametri.cpp
drwxr-xr-x  3 stefano user 20480 May 24 17:16 test/

but look at this:

# ll /storage/5/data/stefano/leggi_particelle/
total 892
drwxr-xr-x  3 stefano user   4096 May 24 17:16 ./
drwxr-xr-x 14 stefano user   4096 May 28 11:32 ../
lrwxrwxrwx  2 stefano user     50 Apr 11 19:20 filtro.cpp
lrwxrwxrwx  2 stefano user     70 Apr 11 19:20 leggi_binario_ALaDyn_fortran.h
-rwxr-xr-x  2 stefano user 705045 May 22 18:24 leggi_particelle*
-rwxr-xr-x  2 stefano user  61883 Dec 16 17:20 leggi_particelle.old01*
-rwxr-xr-x  2 stefano user 106014 Apr 11 19:20 leggi_particelle.old03*
---------T  2 root    root      0 May 24 17:16 parametri.cpp
drwxr-xr-x  3 stefano user   4096 Apr 11 19:19 test/

# ll /storage/6/data/stefano/leggi_particelle/
total 892
drwxr-xr-x  3 stefano user   4096 May 24 17:16 ./
drwxr-xr-x 14 stefano user   4096 May 28 11:32 ../
lrwxrwxrwx  2 stefano user     50 Apr 11 19:20 filtro.cpp
lrwxrwxrwx  2 stefano user     70 Apr 11 19:20 leggi_binario_ALaDyn_fortran.h
-rwxr-xr-x  2 stefano user 705045 May 22 18:24 leggi_particelle*
-rwxr-xr-x  2 stefano user  61883 Dec 16 17:20 leggi_particelle.old01*
-rwxr-xr-x  2 stefano user 106014 Apr 11 19:20 leggi_particelle.old03*
---------T  2 root    root      0 May 24 17:16 parametri.cpp
drwxr-xr-x  3 stefano user   4096 Apr 11 19:19 test/

So, as you can see, "leggi_particelle" is there on the fifth and sixth bricks (some files of this folder live on other bricks; "ls" on the FUSE mount point only shows the ones on the first brick).
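By the way, the zero-byte "---------T" parametri.cpp owned by root on bricks 5 and 6 looks to me like a DHT link file rather than the real file. If that is right (and assuming the xattr name has not changed in 3.3.1), the brick it points to should be readable with:

# getfattr -n trusted.glusterfs.dht.linkto -e text /storage/5/data/stefano/leggi_particelle/parametri.cpp

I can send that output too if you think it is relevant.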
The copies of "leggi_particelle" on bricks 5 and 6 are the same executable, and both work. Above all, I found that calling /data/stefano/leggi_particelle/leggi_particelle correctly launches the executable! So "ls" does not find it, but the OS does. Is it just a very bad "bug" in ls? I don't think so, because mounting the volume remotely over NFS hides the same files.

Anyway, back to what you asked. This is the tail of data.log:

[2013-05-31 10:00:02.236397] W [rpc-transport.c:174:rpc_transport_load] 0-rpc-transport: missing 'option transport-type'. defaulting to "socket"
[2013-05-31 10:00:02.283271] I [cli-rpc-ops.c:504:gf_cli3_1_get_volume_cbk] 0-cli: Received resp to get vol: 0
[2013-05-31 10:00:02.283465] I [cli-rpc-ops.c:757:gf_cli3_1_get_volume_cbk] 0-cli: Returning: 0
[2013-05-31 10:00:02.283613] I [cli-rpc-ops.c:504:gf_cli3_1_get_volume_cbk] 0-cli: Received resp to get vol: 0
[2013-05-31 10:00:02.283816] I [cli-rpc-ops.c:757:gf_cli3_1_get_volume_cbk] 0-cli: Returning: 0
[2013-05-31 10:00:02.283919] I [cli-rpc-ops.c:504:gf_cli3_1_get_volume_cbk] 0-cli: Received resp to get vol: 0
[2013-05-31 10:00:02.283943] I [cli-rpc-ops.c:757:gf_cli3_1_get_volume_cbk] 0-cli: Returning: 0
[2013-05-31 10:00:02.283951] I [input.c:46:cli_batch] 0-: Exiting with: 0
[2013-05-31 10:00:07.279855] W [rpc-transport.c:174:rpc_transport_load] 0-rpc-transport: missing 'option transport-type'. defaulting to "socket"
[2013-05-31 10:00:07.326202] I [cli-rpc-ops.c:504:gf_cli3_1_get_volume_cbk] 0-cli: Received resp to get vol: 0
[2013-05-31 10:00:07.326484] I [cli-rpc-ops.c:757:gf_cli3_1_get_volume_cbk] 0-cli: Returning: 0
[2013-05-31 10:00:07.326493] I [input.c:46:cli_batch] 0-: Exiting with: 0
[2013-05-31 10:01:29.718428] W [rpc-transport.c:174:rpc_transport_load] 0-rpc-transport: missing 'option transport-type'. defaulting to "socket"
[2013-05-31 10:01:29.767990] I [input.c:46:cli_batch] 0-: Exiting with: 0

Running many "ls" or "ll" commands did not add anything to it.

This is the tail of storage-5-data.log and storage-6-data.log (they are almost identical):

[2013-05-31 07:59:19.090790] I [server-handshake.c:571:server_setvolume] 0-data-server: accepted client from pedrillo-2510-2013/05/31-07:59:15:067773-data-client-3-0 (version: 3.3.1)
[2013-05-31 08:00:56.935205] I [server-handshake.c:571:server_setvolume] 0-data-server: accepted client from pollux-2361-2013/05/31-08:00:52:937577-data-client-3-0 (version: 3.3.1)
[2013-05-31 08:01:03.611506] I [server-handshake.c:571:server_setvolume] 0-data-server: accepted client from castor-2629-2013/05/31-08:00:59:614003-data-client-3-0 (version: 3.3.1)
[2013-05-31 08:02:15.940950] I [server-handshake.c:571:server_setvolume] 0-data-server: accepted client from osmino-1844-2013/05/31-08:02:11:932993-data-client-3-0 (version: 3.3.1)

Except for the warning in data.log, they look legitimate.
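Since nothing new appears in these logs while "ls" is running, I could also mount the volume a second time with a more verbose client log and repeat the listing there. If I read the mount.glusterfs options correctly, something like this should do it (the mount point /mnt/data-debug is just an example):

# mkdir -p /mnt/data-debug
# mount -t glusterfs -o log-level=DEBUG,log-file=/var/log/glusterfs/data-debug.log pedrillo.bo.infn.it:/data /mnt/data-debug
# ls /mnt/data-debug/stefano/leggi_particelle
# umount /mnt/data-debug

Just tell me if that debug log would be useful.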
Here are the attributes:

# getfattr -m. -e hex -d /storage/5/data/stefano/leggi_particelle/leggi_particelle
getfattr: Removing leading '/' from absolute path names
# file: storage/5/data/stefano/leggi_particelle/leggi_particelle
trusted.afr.data-client-2=0x000000000000000000000000
trusted.afr.data-client-3=0x000000000000000000000000
trusted.gfid=0x883c343b9366478da660843da8f6b87c

# getfattr -m. -e hex -d /storage/6/data/stefano/leggi_particelle/leggi_particelle
getfattr: Removing leading '/' from absolute path names
# file: storage/6/data/stefano/leggi_particelle/leggi_particelle
trusted.afr.data-client-2=0x000000000000000000000000
trusted.afr.data-client-3=0x000000000000000000000000
trusted.gfid=0x883c343b9366478da660843da8f6b87c

# getfattr -m. -e hex -d /storage/5/data/stefano/leggi_particelle
getfattr: Removing leading '/' from absolute path names
# file: storage/5/data/stefano/leggi_particelle
trusted.afr.data-client-2=0x000000000000000000000000
trusted.afr.data-client-3=0x000000000000000000000000
trusted.gfid=0xb62a16f0bdb94e3f8563ccfb278c2105
trusted.glusterfs.dht=0x00000001000000003333333366666665

# getfattr -m. -e hex -d /storage/6/data/stefano/leggi_particelle
getfattr: Removing leading '/' from absolute path names
# file: storage/6/data/stefano/leggi_particelle
trusted.afr.data-client-2=0x000000000000000000000000
trusted.afr.data-client-3=0x000000000000000000000000
trusted.gfid=0xb62a16f0bdb94e3f8563ccfb278c2105
trusted.glusterfs.dht=0x00000001000000003333333366666665
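If it helps, I can also dump the layout of the parent directory on every brick, since you asked for all of them and I have only shown bricks 5 and 6 here. If I decode the trusted.glusterfs.dht value correctly, the last two 32-bit words (0x33333333 and 0x66666665 above) are the start and end of the hash range assigned to that replica pair, so the five pairs together should cover the whole 32-bit hash space. Something like:

on pedrillo:
# for b in 1 2 5 6 arc1 arc2; do getfattr -m. -e hex -d /storage/$b/data/stefano/leggi_particelle; done

on osmino:
# for b in 1 2 4 5; do getfattr -m. -e hex -d /storageOsmino/$b/data/stefano/leggi_particelle; done

Just say the word and I will send that output as well.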
Sorry for the very long email, but your help is very much appreciated and I hope I have been clear enough.

Best regards,

Stefano


On Fri, May 31, 2013 at 4:34 PM, Xavier Hernandez <xhernandez at datalab.es> wrote:

> Hi Stefano,
>
> it would help to see the results of the following commands:
>
> gluster volume info <volname>
> gluster volume status <volname>
>
> It would also be interesting to see fragments of the logs containing
> warnings or errors generated while the 'ls' command was executing (if any).
> The logs from the mount point are normally located at
> /var/log/glusterfs/<mount point>.log. Logs from the bricks are at
> /var/log/glusterfs/bricks/<brick name>.log.
>
> Also, identify a file (I'll call it <file>, and its parent directory
> <parent>) that does not appear in an 'ls' from the mount point. Then
> execute the following command on all bricks:
>
> getfattr -m. -e hex -d <brick root>/<parent>
>
> On all bricks that contain the file, execute the following command:
>
> getfattr -m. -e hex -d <brick root>/<parent>/<file>
>
> This information might help to see what is happening.
>
> Best regards,
>
> Xavi
>
> Al 31/05/13 08:53, En/na Stefano Sinigardi ha escrit:
>
>> Dear all,
>> thanks again for your support.
>> The files are already exactly duplicated on the bricks (diff confirms it),
>> so I would rather not touch them directly, in order not to make the volume
>> worse than it already is.
>> Thanks to your help I found out that triggering the self-heal no longer
>> requires running a find over all the files; there is now the
>>
>> gluster volume heal VOLNAME full
>>
>> command. So I ran it and it said "Launching Heal operation on volume data
>> has been successful. Use heal info commands to check status". But then
>>
>> gluster volume heal VOLNAME info
>>
>> reported every entry at zero, as did "gluster volume heal VOLNAME info
>> heal-failed" and "gluster volume heal VOLNAME info split-brain". That
>> should be good news, shouldn't it?
>> On the other hand, "gluster volume heal VOLNAME info healed" reported
>> 1023 healed files per pair of bricks (it is very strange that the number
>> is always the same; is it an upper bound?). What is sure is that all of
>> them are missing from the volume itself (I am not sure whether only 1023
>> per pair are missing from the total; it could be more).
>> I thought that at least those files would have become visible by now, but
>> they have not. How can I check the logs for this healing process? "ls -ltr
>> /var/log/glusterfs/" shows that no logs are being touched, and even the
>> most recent ones only report the launch of the heal command.
>> I rebooted the nodes but nothing changed: the files are still missing from
>> the FUSE mount point but not from the bricks. I relaunched the healing and
>> again 1023 files per pair of bricks were self-healed. I think they are the
>> same ones as before, and they are still missing from the FUSE mount point
>> (I checked a few of those reported by "gluster volume heal VOLNAME info
>> healed").
>>
>> Do you have any other suggestion? Things are looking very bad for me now
>> because of the time I am making other people lose (as I said, we do not
>> have a system administrator and I do this just "for fun", but people still
>> look at you as the one responsible)...
>>
>> Best regards,
>>
>> Stefano
>> _______________________________________________
>> Gluster-users mailing list
>> Gluster-users at gluster.org
>> http://supercolony.gluster.org/mailman/listinfo/gluster-users
>