Hi Strahil
The versions are:
CentOS Linux release 7.7.1908
glusterfs 7.3
I am setting performance.md-cache-timeout and performance.nl-cache-timeout to 120s
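For reference, the timeout change above could be applied roughly as follows. This is only a sketch: the volume name tapeless comes from the volume info quoted further down, while the function name set_cache_timeouts and the DRY_RUN switch are made up for illustration. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: lower the md-cache and nl-cache timeouts on a Gluster volume.
# set_cache_timeouts and DRY_RUN are illustrative names, not part of gluster.

set_cache_timeouts() {
    vol=$1      # volume name, e.g. tapeless
    secs=$2     # new timeout in seconds, e.g. 120
    for opt in performance.md-cache-timeout performance.nl-cache-timeout; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            # Default: only show what would be run.
            echo "gluster volume set $vol $opt $secs"
        else
            # Apply the option; stop at the first failure.
            gluster volume set "$vol" "$opt" "$secs" || return 1
        fi
    done
}
```

Calling "set_cache_timeouts tapeless 120" prints the two commands; with DRY_RUN=0 in the environment it runs them. Afterwards, "gluster volume get tapeless performance.md-cache-timeout" should report the new value.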
The weird thing is that it always happens on the same mount where the operation (copy, mv) was performed. My intuition is that any cache-related problem should show up on the nodes that may not be aware of the operation, shouldn't it?
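To confirm which mount is affected, a check like the one below could be run against the same directory on each node's FUSE mount. It is a rough sketch, not a gluster tool: the function names report_unstatable and find_ghost_entries are invented here, and it assumes file names contain no newlines. It prints entries that the directory listing returns but that cannot be stat'ed, which is what ls -l shows as "-??????????".

```shell
#!/bin/sh
# Sketch: find directory entries that are listed but cannot be stat'ed.
# Function names are illustrative. Assumes no newlines in file names.

# Read entry names on stdin (one per line) and print the full path of
# every name that cannot be stat'ed inside directory $1.
report_unstatable() {
    dir=$1
    while IFS= read -r name; do
        path=$dir/$name
        # -e follows symlinks, -L checks the link itself, so a dangling
        # symlink is not reported; only names that fail both tests are.
        if [ ! -e "$path" ] && [ ! -L "$path" ]; then
            printf '%s\n' "$path"
        fi
    done
}

# List the names in directory $1 and check each of them.
find_ghost_entries() {
    ls -A -- "$1" 2>/dev/null | report_unstatable "$1"
}
```

For example, running find_ghost_entries on the affected directory on each node's mount and comparing the output shows which mounts are affected. A healthy mount prints nothing.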
Regards,
Martin
On Tue, Oct 27, 2020 at 4:01 PM Strahil Nikolov <hunter86_bg@xxxxxxxxx> wrote:
Have you tried reducing the cache timeouts?
I can't find your gluster version in the thread. Can you share the OS and gluster versions again?
Best Regards,
Strahil Nikolov
On Tuesday, 27 October 2020, 19:23:28 GMT+2, Martín Lorenzo <mlorenzo@xxxxxxxxx> wrote:
Hi Strahil, today we have the same number of clients on all nodes, but the problem persists. I have the impression that it gets more frequent as the server capacity fills up; we now have at least one incident per day.
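The per-brick client counts can be tallied from the output of "gluster volume status VOL clients" with something like the sketch below. It assumes that the output contains lines of the form "Brick : <brick>" and "Clients connected : <n>", which is how I understand the CLI prints them; if your version formats them differently, the patterns need adjusting. The function name clients_per_brick is made up for illustration.

```shell
#!/bin/sh
# Sketch: print "<client count><TAB><brick>" for each brick.
# Reads the output of "gluster volume status VOL clients" on stdin.
# Assumes "Brick : ..." and "Clients connected : ..." lines (see above).

clients_per_brick() {
    awk -F' : ' '
        /^Brick : /             { brick = $2 }            # remember current brick
        /^Clients connected : / { print $2 "\t" brick }   # emit its client count
    '
}
```

Usage: gluster volume status tapeless clients | clients_per_brick
A brick whose count differs from the others stands out immediately in the first column.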
Regards,
Martin
On Mon, Oct 26, 2020 at 8:09 AM Martín Lorenzo <mlorenzo@xxxxxxxxx> wrote:
> Hi Strahil, thanks for your reply.
> One node had 13 clients and the rest had 14. I've just restarted the services on that node and it now has 14 as well; let's see what happens.
> Regarding the Samba repos, I wasn't aware of them; I was using the CentOS main repo. I'll check them out.
> Best Regards,
> Martin
>
>
> On Tue, Oct 20, 2020 at 3:19 PM Strahil Nikolov <hunter86_bg@xxxxxxxxx> wrote:
>> Do you have the same number of clients connected to each brick?
>>
>> I guess something like this can show it:
>>
>> gluster volume status VOL clients
>> gluster volume status VOL client-list
>>
>> Best Regards,
>> Strahil Nikolov
>>
>> On Tuesday, 20 October 2020, 15:41:45 GMT+3, Martín Lorenzo <mlorenzo@xxxxxxxxx> wrote:
>>
>> Hi, I have the following problem: I have a distributed-replicated cluster set up with Samba and CTDB on top of FUSE mount points.
>> I am seeing inconsistencies across the FUSE mounts: users report that files disappear after being copied or moved. When I look at the mount points on each node, they don't show the same data.
>>
>> #### faulty mount point ####
>> [root@gluster6 ARRIBA GENTE martes 20 de octubre]# ll
>> ls: cannot access PANEO VUELTA A CLASES CON TAPABOCAS.mpg: No such file or directory
>> ls: cannot access PANEO NIÑOS ESCUELAS CON TAPABOCAS.mpg: No such file or directory
>> total 633723
>> drwxr-xr-x. 5 arribagente PN 4096 Oct 19 10:52 COMERCIAL AG martes 20 de octubre
>> -rw-r--r--. 1 arribagente PN 648927236 Jun 3 07:16 PANEO FACHADA PALACIO LEGISLATIVO DRONE DIA Y NOCHE.mpg
>> -?????????? ? ? ? ? ? PANEO NIÑOS ESCUELAS CON TAPABOCAS.mpg
>> -?????????? ? ? ? ? ? PANEO VUELTA A CLASES CON TAPABOCAS.mpg
>>
>>
>> #### healthy mount point ####
>> [root@gluster7 ARRIBA GENTE martes 20 de octubre]# ll
>> total 3435596
>> drwxr-xr-x. 5 arribagente PN 4096 Oct 19 10:52 COMERCIAL AG martes 20 de octubre
>> -rw-r--r--. 1 arribagente PN 648927236 Jun 3 07:16 PANEO FACHADA PALACIO LEGISLATIVO DRONE DIA Y NOCHE.mpg
>> -rw-r--r--. 1 arribagente PN 2084415492 Aug 18 09:14 PANEO NIÑOS ESCUELAS CON TAPABOCAS.mpg
>> -rw-r--r--. 1 arribagente PN 784701444 Sep 4 07:23 PANEO VUELTA A CLASES CON TAPABOCAS.mpg
>>
>> - So far the only way to solve this is to create a directory on the healthy mount point, under the same path:
>> [root@gluster7 ARRIBA GENTE martes 20 de octubre]# mkdir hola
>>
>> - When you then refresh the other mount point, the issue is resolved:
>> [root@gluster6 ARRIBA GENTE martes 20 de octubre]# ll
>> total 3435600
>> drwxr-xr-x. 5 arribagente PN 4096 Oct 19 10:52 COMERCIAL AG martes 20 de octubre
>> drwxr-xr-x. 2 root root 4096 Oct 20 08:45 hola
>> -rw-r--r--. 1 arribagente PN 648927236 Jun 3 07:16 PANEO FACHADA PALACIO LEGISLATIVO DRONE DIA Y NOCHE.mpg
>> -rw-r--r--. 1 arribagente PN 2084415492 Aug 18 09:14 PANEO NIÑOS ESCUELAS CON TAPABOCAS.mpg
>> -rw-r--r--. 1 arribagente PN 784701444 Sep 4 07:23 PANEO VUELTA A CLASES CON TAPABOCAS.mpg
>>
>> Interestingly, the error occurs on the mount point where the files were copied. The files don't show up as pending heal entries. Around 15 people use these mounts over Samba, and I get this issue reported roughly every two days.
>>
>> I have an older cluster with similar issues, on a different gluster version but with a very similar topology (4 bricks: initially two bricks, then expanded).
>> Please note that the bricks aren't all the same size (although each brick matches its replica), so my other suspicion is that rebalancing has something to do with it.
>>
>> I'm trying to reproduce it on a small virtualized cluster, so far without success.
>>
>> Here are the cluster details:
>> Four nodes, replica 2, plus one arbiter node hosting 2 arbiter bricks.
>>
>> One pair of bricks has ~20 TB capacity and the other pair has ~48 TB.
>> Volume Name: tapeless
>> Type: Distributed-Replicate
>> Volume ID: 53bfa86d-b390-496b-bbd7-c4bba625c956
>> Status: Started
>> Snapshot Count: 0
>> Number of Bricks: 2 x (2 + 1) = 6
>> Transport-type: tcp
>> Bricks:
>> Brick1: gluster6.glustersaeta.net:/data/glusterfs/tapeless/brick_6/brick
>> Brick2: gluster7.glustersaeta.net:/data/glusterfs/tapeless/brick_7/brick
>> Brick3: kitchen-store.glustersaeta.net:/data/glusterfs/tapeless/brick_1a/brick (arbiter)
>> Brick4: gluster12.glustersaeta.net:/data/glusterfs/tapeless/brick_12/brick
>> Brick5: gluster13.glustersaeta.net:/data/glusterfs/tapeless/brick_13/brick
>> Brick6: kitchen-store.glustersaeta.net:/data/glusterfs/tapeless/brick_2a/brick (arbiter)
>> Options Reconfigured:
>> features.quota-deem-statfs: on
>> performance.client-io-threads: on
>> nfs.disable: on
>> transport.address-family: inet
>> features.quota: on
>> features.inode-quota: on
>> features.cache-invalidation: on
>> features.cache-invalidation-timeout: 600
>> performance.cache-samba-metadata: on
>> performance.stat-prefetch: on
>> performance.cache-invalidation: on
>> performance.md-cache-timeout: 600
>> network.inode-lru-limit: 200000
>> performance.nl-cache: on
>> performance.nl-cache-timeout: 600
>> performance.readdir-ahead: on
>> performance.parallel-readdir: on
>> performance.cache-size: 1GB
>> client.event-threads: 4
>> server.event-threads: 4
>> performance.normal-prio-threads: 16
>> performance.io-thread-count: 32
>> performance.write-behind-window-size: 8MB
>> storage.batch-fsync-delay-usec: 0
>> cluster.data-self-heal: on
>> cluster.metadata-self-heal: on
>> cluster.entry-self-heal: on
>> cluster.self-heal-daemon: on
>> performance.write-behind: on
>> performance.open-behind: on
>>
>> Log section from the faulty mount point. I think the [File exists] entries come from people trying to copy the missing files over and over.
>>
>>
>> [2020-10-20 11:31:03.034220] I [MSGID: 108031] [afr-common.c:2581:afr_local_discovery_cbk] 0-tapeless-replicate-0: selecting local read_child tapeless-client-0
>> [2020-10-20 11:32:06.684329] I [MSGID: 108031] [afr-common.c:2581:afr_local_discovery_cbk] 0-tapeless-replicate-0: selecting local read_child tapeless-client-0
>> [2020-10-20 11:33:02.191863] I [MSGID: 108031] [afr-common.c:2581:afr_local_discovery_cbk] 0-tapeless-replicate-0: selecting local read_child tapeless-client-0
>> [2020-10-20 11:34:05.841608] I [MSGID: 108031] [afr-common.c:2581:afr_local_discovery_cbk] 0-tapeless-replicate-0: selecting local read_child tapeless-client-0
>> [2020-10-20 11:35:20.736633] I [MSGID: 108026] [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] 0-tapeless-replicate-1: performing metadata selfheal on 958dbd7a-3cd7-4b66-9038-76e5c5669644
>> [2020-10-20 11:35:20.741213] I [MSGID: 108026] [afr-self-heal-common.c:1750:afr_log_selfheal] 0-tapeless-replicate-1: Completed metadata selfheal on 958dbd7a-3cd7-4b66-9038-76e5c5669644. sources=[0] 1 sinks=2
>> [2020-10-20 11:35:04.278043] I [MSGID: 108031] [afr-common.c:2581:afr_local_discovery_cbk] 0-tapeless-replicate-0: selecting local read_child tapeless-client-0
>> The message "I [MSGID: 108026] [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] 0-tapeless-replicate-1: performing metadata selfheal on 958dbd7a-3cd7-4b66-9038-76e5c5669644" repeated 3 times between [2020-10-20 11:35:20.736633] and [2020-10-20 11:35:26.733298]
>> The message "I [MSGID: 108026] [afr-self-heal-common.c:1750:afr_log_selfheal] 0-tapeless-replicate-1: Completed metadata selfheal on 958dbd7a-3cd7-4b66-9038-76e5c5669644. sources=[0] 1 sinks=2 " repeated 3 times between [2020-10-20 11:35:20.741213] and [2020-10-20 11:35:26.737629]
>> [2020-10-20 11:36:02.548350] I [MSGID: 108031] [afr-common.c:2581:afr_local_discovery_cbk] 0-tapeless-replicate-0: selecting local read_child tapeless-client-0
>> [2020-10-20 11:36:57.365537] I [MSGID: 108026] [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] 0-tapeless-replicate-1: performing metadata selfheal on f4907af2-1775-4c46-89b5-e9776df6d5c7
>> [2020-10-20 11:36:57.370824] I [MSGID: 108026] [afr-self-heal-common.c:1750:afr_log_selfheal] 0-tapeless-replicate-1: Completed metadata selfheal on f4907af2-1775-4c46-89b5-e9776df6d5c7. sources=[0] 1 sinks=2
>> [2020-10-20 11:37:01.363925] I [MSGID: 108026] [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] 0-tapeless-replicate-1: performing metadata selfheal on f4907af2-1775-4c46-89b5-e9776df6d5c7
>> [2020-10-20 11:37:01.368069] I [MSGID: 108026] [afr-self-heal-common.c:1750:afr_log_selfheal] 0-tapeless-replicate-1: Completed metadata selfheal on f4907af2-1775-4c46-89b5-e9776df6d5c7. sources=[0] 1 sinks=2
>> The message "I [MSGID: 108031] [afr-common.c:2581:afr_local_discovery_cbk] 0-tapeless-replicate-0: selecting local read_child tapeless-client-0" repeated 3 times between [2020-10-20 11:36:02.548350] and [2020-10-20 11:37:36.389208]
>> [2020-10-20 11:38:07.367113] I [MSGID: 108031] [afr-common.c:2581:afr_local_discovery_cbk] 0-tapeless-replicate-0: selecting local read_child tapeless-client-0
>> [2020-10-20 11:39:01.595981] I [MSGID: 108031] [afr-common.c:2581:afr_local_discovery_cbk] 0-tapeless-replicate-0: selecting local read_child tapeless-client-0
>> [2020-10-20 11:40:04.184899] I [MSGID: 108031] [afr-common.c:2581:afr_local_discovery_cbk] 0-tapeless-replicate-0: selecting local read_child tapeless-client-0
>> [2020-10-20 11:41:07.833470] I [MSGID: 108031] [afr-common.c:2581:afr_local_discovery_cbk] 0-tapeless-replicate-0: selecting local read_child tapeless-client-0
>> [2020-10-20 11:42:01.871621] I [MSGID: 108031] [afr-common.c:2581:afr_local_discovery_cbk] 0-tapeless-replicate-0: selecting local read_child tapeless-client-0
>> [2020-10-20 11:43:04.399194] I [MSGID: 108031] [afr-common.c:2581:afr_local_discovery_cbk] 0-tapeless-replicate-0: selecting local read_child tapeless-client-0
>> [2020-10-20 11:44:04.558647] I [MSGID: 108031] [afr-common.c:2581:afr_local_discovery_cbk] 0-tapeless-replicate-0: selecting local read_child tapeless-client-0
>> [2020-10-20 11:44:15.953600] W [MSGID: 114031] [client-rpc-fops_v2.c:2114:client4_0_create_cbk] 0-tapeless-client-5: remote operation failed. Path: /PN/arribagente/PLAYER 2020/ARRIBA GENTE martes 20 de octubre/PANEO NIÑOS ESCUELAS CON TAPABOCAS.mpg [File exists]
>> [2020-10-20 11:44:15.953819] W [MSGID: 114031] [client-rpc-fops_v2.c:2114:client4_0_create_cbk] 0-tapeless-client-2: remote operation failed. Path: /PN/arribagente/PLAYER 2020/ARRIBA GENTE martes 20 de octubre/PANEO NIÑOS ESCUELAS CON TAPABOCAS.mpg [File exists]
>> [2020-10-20 11:44:15.954072] W [MSGID: 114031] [client-rpc-fops_v2.c:2114:client4_0_create_cbk] 0-tapeless-client-3: remote operation failed. Path: /PN/arribagente/PLAYER 2020/ARRIBA GENTE martes 20 de octubre/PANEO NIÑOS ESCUELAS CON TAPABOCAS.mpg [File exists]
>> [2020-10-20 11:44:15.954680] W [fuse-bridge.c:2606:fuse_create_cbk] 0-glusterfs-fuse: 31043294: /PN/arribagente/PLAYER 2020/ARRIBA GENTE martes 20 de octubre/PANEO NIÑOS ESCUELAS CON TAPABOCAS.mpg => -1 (File exists)
>> [2020-10-20 11:44:15.963175] W [fuse-bridge.c:2606:fuse_create_cbk] 0-glusterfs-fuse: 31043306: /PN/arribagente/PLAYER 2020/ARRIBA GENTE martes 20 de octubre/PANEO NIÑOS ESCUELAS CON TAPABOCAS.mpg => -1 (File exists)
>> [2020-10-20 11:44:15.971839] W [fuse-bridge.c:2606:fuse_create_cbk] 0-glusterfs-fuse: 31043318: /PN/arribagente/PLAYER 2020/ARRIBA GENTE martes 20 de octubre/PANEO NIÑOS ESCUELAS CON TAPABOCAS.mpg => -1 (File exists)
>> [2020-10-20 11:44:16.010242] W [fuse-bridge.c:2606:fuse_create_cbk] 0-glusterfs-fuse: 31043403: /PN/arribagente/PLAYER 2020/ARRIBA GENTE martes 20 de octubre/PANEO NIÑOS ESCUELAS CON TAPABOCAS.mpg => -1 (File exists)
>> [2020-10-20 11:44:16.020291] W [fuse-bridge.c:2606:fuse_create_cbk] 0-glusterfs-fuse: 31043415: /PN/arribagente/PLAYER 2020/ARRIBA GENTE martes 20 de octubre/PANEO NIÑOS ESCUELAS CON TAPABOCAS.mpg => -1 (File exists)
>> [2020-10-20 11:44:16.028857] W [fuse-bridge.c:2606:fuse_create_cbk] 0-glusterfs-fuse: 31043427: /PN/arribagente/PLAYER 2020/ARRIBA GENTE martes 20 de octubre/PANEO NIÑOS ESCUELAS CON TAPABOCAS.mpg => -1 (File exists)
>> The message "W [MSGID: 114031] [client-rpc-fops_v2.c:2114:client4_0_create_cbk] 0-tapeless-client-5: remote operation failed. Path: /PN/arribagente/PLAYER 2020/ARRIBA GENTE martes 20 de octubre/PANEO NIÑOS ESCUELAS CON TAPABOCAS.mpg [File exists]" repeated 5 times between [2020-10-20 11:44:15.953600] and [2020-10-20 11:44:16.027785]
>> The message "W [MSGID: 114031] [client-rpc-fops_v2.c:2114:client4_0_create_cbk] 0-tapeless-client-2: remote operation failed. Path: /PN/arribagente/PLAYER 2020/ARRIBA GENTE martes 20 de octubre/PANEO NIÑOS ESCUELAS CON TAPABOCAS.mpg [File exists]" repeated 5 times between [2020-10-20 11:44:15.953819] and [2020-10-20 11:44:16.028331]
>> The message "W [MSGID: 114031] [client-rpc-fops_v2.c:2114:client4_0_create_cbk] 0-tapeless-client-3: remote operation failed. Path: /PN/arribagente/PLAYER 2020/ARRIBA GENTE martes 20 de octubre/PANEO NIÑOS ESCUELAS CON TAPABOCAS.mpg [File exists]" repeated 5 times between [2020-10-20 11:44:15.954072] and [2020-10-20 11:44:16.028355]
>> [2020-10-20 11:45:03.572106] I [MSGID: 108031] [afr-common.c:2581:afr_local_discovery_cbk] 0-tapeless-replicate-0: selecting local read_child tapeless-client-0
>> [2020-10-20 11:45:40.080010] I [MSGID: 108031] [afr-common.c:2581:afr_local_discovery_cbk] 0-tapeless-replicate-0: selecting local read_child tapeless-client-0
>> The message "I [MSGID: 108031] [afr-common.c:2581:afr_local_discovery_cbk] 0-tapeless-replicate-0: selecting local read_child tapeless-client-0" repeated 2 times between [2020-10-20 11:45:40.080010] and [2020-10-20 11:47:10.871801]
>> [2020-10-20 11:48:03.913129] I [MSGID: 108031] [afr-common.c:2581:afr_local_discovery_cbk] 0-tapeless-replicate-0: selecting local read_child tapeless-client-0
>> [2020-10-20 11:49:05.082165] I [MSGID: 108031] [afr-common.c:2581:afr_local_discovery_cbk] 0-tapeless-replicate-0: selecting local read_child tapeless-client-0
>> [2020-10-20 11:50:06.725722] I [MSGID: 108031] [afr-common.c:2581:afr_local_discovery_cbk] 0-tapeless-replicate-0: selecting local read_child tapeless-client-0
>> [2020-10-20 11:51:04.254685] I [MSGID: 108031] [afr-common.c:2581:afr_local_discovery_cbk] 0-tapeless-replicate-0: selecting local read_child tapeless-client-0
>> [2020-10-20 11:52:07.903617] I [MSGID: 108031] [afr-common.c:2581:afr_local_discovery_cbk] 0-tapeless-replicate-0: selecting local read_child tapeless-client-0
>> [2020-10-20 11:53:01.420513] I [MSGID: 108026] [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] 0-tapeless-replicate-0: performing metadata selfheal on 3c316533-5f47-4267-ac19-58b3be305b94
>> [2020-10-20 11:53:01.428657] I [MSGID: 108026] [afr-self-heal-common.c:1750:afr_log_selfheal] 0-tapeless-replicate-0: Completed metadata selfheal on 3c316533-5f47-4267-ac19-58b3be305b94. sources=[0] sinks=1 2
>> The message "I [MSGID: 108031] [afr-common.c:2581:afr_local_discovery_cbk] 0-tapeless-replicate-0: selecting local read_child tapeless-client-0" repeated 3 times between [2020-10-20 11:52:07.903617] and [2020-10-20 11:53:12.037835]
>> [2020-10-20 11:54:02.208354] I [MSGID: 108031] [afr-common.c:2581:afr_local_discovery_cbk] 0-tapeless-replicate-0: selecting local read_child tapeless-client-0
>> [2020-10-20 11:55:04.360284] I [MSGID: 108031] [afr-common.c:2581:afr_local_discovery_cbk] 0-tapeless-replicate-0: selecting local read_child tapeless-client-0
>> [2020-10-20 11:56:09.508092] I [MSGID: 108031] [afr-common.c:2581:afr_local_discovery_cbk] 0-tapeless-replicate-0: selecting local read_child tapeless-client-0
>> [2020-10-20 11:57:02.580970] I [MSGID: 108031] [afr-common.c:2581:afr_local_discovery_cbk] 0-tapeless-replicate-0: selecting local read_child tapeless-client-0
>> [2020-10-20 11:58:06.230698] I [MSGID: 108031] [afr-common.c:2581:afr_local_discovery_cbk] 0-tapeless-replicate-0: selecting local read_child tapeless-client-0
>>
>>
>> Let me know if you need anything else. Thank you for your support!
>> Best Regards,
>> Martin Lorenzo
>>
>>
>> ________
>>
>>
>>
>> Community Meeting Calendar:
>>
>> Schedule -
>> Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
>> Bridge: https://bluejeans.com/441850968
>>
>> Gluster-users mailing list
>> Gluster-users@xxxxxxxxxxx
>> https://lists.gluster.org/mailman/listinfo/gluster-users
>>
>