Re: transport endpoint not connected on just 2 files

Volume stop was not necessary. Every time you access the file, Gluster will check the permissions, ACLs, and extended file attributes, and then allow or deny access.

I'm really surprised that this situation ever happened, and it's most probably worth a GitHub issue if you are using the latest version of Gluster.

Best Regards,
Strahil Nikolov


On Tuesday, 7 June 2022 at 14:50:23 GMT+3, Kingsley Tart <gluster@xxxxxxxxxxxxxxxxxxx> wrote:


Hi,

Thanks - sorry for the late reply - I was suddenly swamped with other work, and then it was a UK holiday.

I've tried rsync -A -X with the volume stopped, then restarted it. Will see whether it heals.
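
For the record, this is roughly what I ran (hostnames and paths are my setup; brick A held the copy I kept, and I ran this on the brick C host):

```shell
# -a does the usual archive-mode preservation; -A additionally keeps
# ACLs and -X keeps extended attributes (including the trusted.* ones).
SRC="root@gluster9a:/data/brick/gw-runqueues/runners/gw3"
DST="/data/brick/gw-runqueues/runners/"
# Printed here rather than executed; run the printed command on brick C:
echo rsync -aAX "$SRC" "$DST"
```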

Cheers,
Kingsley.

On Mon, 2022-05-30 at 18:41 +0000, Strahil Nikolov wrote:
Make a backup of the file from all bricks. Based on the info, two of the bricks have the same copy, while brick C has a different one (gfid mismatch).

I would use the mtime to identify the latest version and keep that copy, but I have no clue what kind of application you have.

Usually, it's not recommended to manipulate bricks directly, but in this case it might be necessary. The simplest way is to move the file on brick C (the only one that is different) out of the way, but if you need exactly that copy, you can rsync/scp it to the other 2 bricks instead.
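
As a rough sketch of the "move it away" approach (paths and the gfid below are taken from your getfattr output for brick C - double-check everything before running anything destructive): each brick also keeps a hard link to the file under .glusterfs/, named after the gfid, and that link has to go too or the heal may not recreate the file cleanly.

```shell
# Sketch only - run on the brick C server.
BRICK=/data/brick/gw-runqueues
# trusted.gfid you showed for gw3 on brick C, minus the 0x prefix:
GFID_HEX=d73992aee03e4021824b1baced973df3
# A gfid is just a UUID; the brick stores a hard link at
# .glusterfs/<first 2 hex chars>/<next 2 hex chars>/<uuid>:
UUID="${GFID_HEX:0:8}-${GFID_HEX:8:4}-${GFID_HEX:12:4}-${GFID_HEX:16:4}-${GFID_HEX:20:12}"
LINK=".glusterfs/${GFID_HEX:0:2}/${GFID_HEX:2:2}/${UUID}"
echo "$LINK"   # -> .glusterfs/d7/39/d73992ae-e03e-4021-824b-1baced973df3
# Then, roughly (left commented on purpose):
#   mkdir -p /root/quarantine
#   mv "$BRICK/runners/gw3" /root/quarantine/
#   rm -f "$BRICK/$LINK"
#   gluster volume heal gw-runqueues
```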


Best Regards,
Strahil Nikolov

On Fri, May 27, 2022 at 11:45, Kingsley Tart wrote:
Hi, thanks.

OK that's interesting. Picking one of the files, on bricks A and B I see this (and all of the values are identical between bricks A and B):

trusted.afr.dirty=0x000000000000000000000000
trusted.afr.gw-runqueues-client-2=0x000000010000000200000000
trusted.gfid=0xa40bb83ff3784ae09c997d272296a7a9
trusted.gfid2path.06eddbe9be9c7c75=0x30323665396561652d613661662d346365642d623863632d6261353037333339646364372f677733
trusted.glusterfs.mdata=0x01000000000000000000000000628ec57700000000007168bb00000000628ec576000000000000000000000000628ec5760000000000000000

and on brick C I see this:

trusted.gfid=0xd73992aee03e4021824b1baced973df3
trusted.gfid2path.06eddbe9be9c7c75=0x30323665396561652d613661662d346365642d623863632d6261353037333339646364372f677733
trusted.glusterfs.mdata=0x01000000000000000000000000628ec5230000000030136ca000000000628ec523000000000000000000000000628ec5230000000000000000

So brick C is missing the trusted.afr attributes, and the trusted.gfid and mdata values differ.

What do I need to do to fix this?

Cheers,
Kingsley.

On Fri, 2022-05-27 at 03:59 +0000, Strahil Nikolov wrote:
Check the file attributes on all bricks:

getfattr -d -e hex -m. /data/brick/gw-runqueues/<path to file>


Best Regards,
Strahil Nikolov

On Thu, May 26, 2022 at 16:05, Kingsley Tart wrote:
Hi,

I've got a strange issue where on all clients I've tested (4 so far) I
get "transport endpoint is not connected" on two files in a directory,
whereas other files in the same directory can be read fine.

Any ideas?

On one of the servers (all same version):

# gluster --version
glusterfs 9.1

On one of the clients (same thing with all of them) - problem with
files "gw3" and "gw11":

[root@gw6 btl]# cd /mnt/runqueues/runners/
[root@gw6 runners]# ls -la
ls: cannot access gw11: Transport endpoint is not connected
ls: cannot access gw3: Transport endpoint is not connected
total 8
drwxr-xr-x  2 root root 4096 May 26 09:48 .
drwxr-xr-x 13 root root 4096 Apr 12  2021 ..
-rw-r--r--  1 root root    0 May 26 09:49 gw1
-rw-r--r--  1 root root    0 May 26 09:49 gw10
-?????????  ? ?    ?      ?            ? gw11
-rw-r--r--  1 root root    0 May 26 09:49 gw2
-?????????  ? ?    ?      ?            ? gw3
-rw-r--r--  1 root root    0 May 26 09:49 gw4
-rw-r--r--  1 root root    0 May 26 09:49 gw6
-rw-r--r--  1 root root    0 May 26 09:49 gw7
[root@gw6 runners]# cat *
cat: gw11: Transport endpoint is not connected
cat: gw3: Transport endpoint is not connected
[root@gw6 runners]#


Querying on a server shows those two problematic files:

# gluster volume heal gw-runqueues info
Brick gluster9a:/data/brick/gw-runqueues
/runners
/runners/gw11
/runners/gw3
Status: Connected
Number of entries: 3

Brick gluster9b:/data/brick/gw-runqueues
/runners
/runners/gw11
/runners/gw3
Status: Connected
Number of entries: 3

Brick gluster9c:/data/brick/gw-runqueues
Status: Connected
Number of entries: 0


However, several hours later there's no obvious change. The servers have
hardly any load and the volume is tiny. From a client:

# find /mnt/runqueues | wc -l
35


glfsheal-gw-runqueues.log from server gluster9a:

glfsheal-gw-runqueues.log from server gluster9b:


Any pointers would be much appreciated!

Cheers,
Kingsley.

________



Community Meeting Calendar:

Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://meet.google.com/cpu-eiue-hvk
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-users
