Hi Richard,
Thanks for the information. As you said, there is a gfid mismatch for the file: on brick-1 and brick-2 the gfids are the same, and on brick-3 the gfid is different.
This is not considered a split-brain because we have two good copies here.
Gluster 3.10 does not have a method to resolve this situation other than manual
intervention [1]. Basically what you need to do is remove the file and the gfid
hardlink from brick-3 (considering the brick-3 entry as the bad copy). Then when
you do a lookup for the file from the mount, it will recreate the entry on that brick.
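To make that concrete, here is a rough sketch of the steps from [1], assuming (based on the getfattr output below) that the copy on sphere-four is the bad one, since its trusted.gfid differs from the other two bricks. Please double-check the gfid and paths on your setup before removing anything; the client mount point used below is just a placeholder.

On sphere-four, remove the file from the brick and its gfid hardlink under .glusterfs (the hardlink path is <brick>/.glusterfs/<first two hex chars>/<next two hex chars>/<gfid in uuid form>):

# rm /srv/gluster_home/brick/romanoch/.mozilla/firefox/vzzqqxrm.default-1396429081309/sessionstore-backups/recovery.baklz4
# rm /srv/gluster_home/brick/.glusterfs/da/1c/da1c94b1-6435-44b1-8d5b-6f4654f60bf5

Then trigger a lookup from a client mount so the entry gets recreated from the good copies, and let heal finish:

# stat /mnt/home/romanoch/.mozilla/firefox/vzzqqxrm.default-1396429081309/sessionstore-backups/recovery.baklz4
# gluster volume heal home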
From 3.12 we have methods to resolve this situation with the cli option [2] and
with favorite-child-policy [3]. For the time being you can use [1] to resolve this,
and if you can consider upgrading to 3.12, that would give you options to handle
these scenarios.
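For reference, with 3.12 the resolution would look roughly like one of the following (volume and file names taken from this thread; please check the split-brain resolution documentation for the exact syntax on your version, and note the file path is given relative to the volume root):

# gluster volume heal home split-brain latest-mtime /romanoch/.mozilla/firefox/vzzqqxrm.default-1396429081309/sessionstore-backups/recovery.baklz4
# gluster volume heal home split-brain source-brick sphere-five:/srv/gluster_home/brick /romanoch/.mozilla/firefox/vzzqqxrm.default-1396429081309/sessionstore-backups/recovery.baklz4

or, to let the cluster pick a winner automatically:

# gluster volume set home cluster.favorite-child-policy mtime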
[1] http://docs.gluster.org/en/latest/Troubleshooting/split-brain/#fixing-directory-entry-split-brain
[2] https://review.gluster.org/#/c/17485/
[3] https://review.gluster.org/#/c/16878/
HTH,
Karthik
On Thu, Oct 26, 2017 at 12:40 PM, Richard Neuboeck <hawk@xxxxxxxxxxxxxxxx> wrote:
Hi Karthik,
thanks for taking a look at this. I haven't been working with gluster long
enough to make heads or tails of the logs. The logs are attached to
this mail and here is the other information:
# gluster volume info home
Volume Name: home
Type: Replicate
Volume ID: fe6218ae-f46b-42b3-a467-5fc6a36ad48a
Status: Started
Snapshot Count: 1
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: sphere-six:/srv/gluster_home/brick
Brick2: sphere-five:/srv/gluster_home/brick
Brick3: sphere-four:/srv/gluster_home/brick
Options Reconfigured:
features.barrier: disable
cluster.quorum-type: auto
cluster.server-quorum-type: server
nfs.disable: on
performance.readdir-ahead: on
transport.address-family: inet
features.cache-invalidation: on
features.cache-invalidation-timeout: 600
performance.stat-prefetch: on
performance.cache-samba-metadata: on
performance.cache-invalidation: on
performance.md-cache-timeout: 600
network.inode-lru-limit: 90000
performance.cache-size: 1GB
performance.client-io-threads: on
cluster.lookup-optimize: on
cluster.readdir-optimize: on
features.quota: on
features.inode-quota: on
features.quota-deem-statfs: on
cluster.server-quorum-ratio: 51%
[root@sphere-four ~]# getfattr -d -e hex -m .
/srv/gluster_home/brick/romanoch/.mozilla/firefox/vzzqqxrm.default-1396429081309/sessionstore-backups/recovery.baklz4
getfattr: Removing leading '/' from absolute path names
# file: srv/gluster_home/brick/romanoch/.mozilla/firefox/vzzqqxrm.default-1396429081309/sessionstore-backups/recovery.baklz4
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.bit-rot.version=0x020000000000000059df20a40006f989
trusted.gfid=0xda1c94b1643544b18d5b6f4654f60bf5
trusted.glusterfs.quota.48e9eea6-cda6-4e53-bb4a-72059debf4c2.contri.1=0x0000000000009a000000000000000001
trusted.pgfid.48e9eea6-cda6-4e53-bb4a-72059debf4c2=0x00000001
[root@sphere-five ~]# getfattr -d -e hex -m .
/srv/gluster_home/brick/romanoch/.mozilla/firefox/vzzqqxrm.default-1396429081309/sessionstore-backups/recovery.baklz4
getfattr: Removing leading '/' from absolute path names
# file: srv/gluster_home/brick/romanoch/.mozilla/firefox/vzzqqxrm.default-1396429081309/sessionstore-backups/recovery.baklz4
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.home-client-4=0x000000010000000100000000
trusted.bit-rot.version=0x020000000000000059df1f310006ce63
trusted.gfid=0xea8ecfd195fd4e48b994fd0a2da226f9
trusted.glusterfs.quota.48e9eea6-cda6-4e53-bb4a-72059debf4c2.contri.1=0x0000000000009a000000000000000001
trusted.pgfid.48e9eea6-cda6-4e53-bb4a-72059debf4c2=0x00000001
[root@sphere-six ~]# getfattr -d -e hex -m .
/srv/gluster_home/brick/romanoch/.mozilla/firefox/vzzqqxrm.default-1396429081309/sessionstore-backups/recovery.baklz4
getfattr: Removing leading '/' from absolute path names
# file: srv/gluster_home/brick/romanoch/.mozilla/firefox/vzzqqxrm.default-1396429081309/sessionstore-backups/recovery.baklz4
security.selinux=0x73797374656d5f753a6f626a6563745f723a756e6c6162656c65645f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.home-client-4=0x000000010000000100000000
trusted.bit-rot.version=0x020000000000000059df11cd000548ec
trusted.gfid=0xea8ecfd195fd4e48b994fd0a2da226f9
trusted.glusterfs.quota.48e9eea6-cda6-4e53-bb4a-72059debf4c2.contri.1=0x0000000000009a000000000000000001
trusted.pgfid.48e9eea6-cda6-4e53-bb4a-72059debf4c2=0x00000001
Cheers
Richard
On 26.10.17 07:41, Karthik Subrahmanya wrote:
> Hey Richard,
>
> Could you share the following information please?
> 1. gluster volume info <volname>
> 2. getfattr output of that file from all the bricks
> getfattr -d -e hex -m . <brickpath/filepath>
> 3. glustershd & glfsheal logs
>
> Regards,
> Karthik
>
> On Thu, Oct 26, 2017 at 10:21 AM, Amar Tumballi <atumball@xxxxxxxxxx> wrote:
>
> On a side note, try the recently released health report tool and see if
> it diagnoses any issues in your setup. Currently you may have to run
> it on all three machines.
>
>
>
> On 26-Oct-2017 6:50 AM, "Amar Tumballi" <atumball@xxxxxxxxxx> wrote:
>
> Thanks for this report. This week many of the developers are at
> Gluster Summit in Prague; we will be checking this and responding next
> week. Hope that's fine.
>
> Thanks,
> Amar
>
>
> On 25-Oct-2017 3:07 PM, "Richard Neuboeck"
> <hawk@xxxxxxxxxxxxxxxx> wrote:
>
> Hi Gluster Gurus,
>
> I'm using a gluster volume as home for our users. The volume is
> replica 3, running on CentOS 7, gluster version 3.10
> (3.10.6-1.el7.x86_64). Clients are running Fedora 26 and also
> gluster 3.10 (3.10.6-3.fc26.x86_64).
>
> During the data backup I got an I/O error on one file. Manually
> checking for this file on a client confirms this:
>
> ls -l romanoch/.mozilla/firefox/vzzqqxrm.default-1396429081309/sessionstore-backups/
> ls: cannot access
> 'romanoch/.mozilla/firefox/vzzqqxrm.default-1396429081309/sessionstore-backups/recovery.baklz4':
> Input/output error
> total 2015
> -rw-------. 1 romanoch tbi 998211 Sep 15 18:44 previous.js
> -rw-------. 1 romanoch tbi 65222 Oct 17 17:57 previous.jsonlz4
> -rw-------. 1 romanoch tbi 149161 Oct 1 13:46 recovery.bak
> -?????????? ? ? ? ? ? recovery.baklz4
>
> Out of curiosity I checked all the bricks for this file. It's
> present there. Making a checksum shows that the file is
> different on
> one of the three replica servers.
>
> Querying healing information shows that the file should be
> healed:
> # gluster volume heal home info
> Brick sphere-six:/srv/gluster_home/brick
> /romanoch/.mozilla/firefox/vzzqqxrm.default-1396429081309/sessionstore-backups/recovery.baklz4
>
> Status: Connected
> Number of entries: 1
>
> Brick sphere-five:/srv/gluster_home/brick
> /romanoch/.mozilla/firefox/vzzqqxrm.default-1396429081309/sessionstore-backups/recovery.baklz4
> Status: Connected
> Number of entries: 1
>
> Brick sphere-four:/srv/gluster_home/brick
> Status: Connected
> Number of entries: 0
>
> Manually triggering heal doesn't report an error but also
> does not
> heal the file.
> # gluster volume heal home
> Launching heal operation to perform index self heal on
> volume home
> has been successful
>
> Same with a full heal
> # gluster volume heal home full
> Launching heal operation to perform full self heal on volume
> home
> has been successful
>
> According to the split brain query that's not the problem:
> # gluster volume heal home info split-brain
> Brick sphere-six:/srv/gluster_home/brick
> Status: Connected
> Number of entries in split-brain: 0
>
> Brick sphere-five:/srv/gluster_home/brick
> Status: Connected
> Number of entries in split-brain: 0
>
> Brick sphere-four:/srv/gluster_home/brick
> Status: Connected
> Number of entries in split-brain: 0
>
>
> I have no idea why this situation arose in the first place
> and also
> no idea how to solve this problem. I would highly
> appreciate any
> helpful feedback I can get.
>
> The only mention in the logs matching this file is a rename
> operation:
> /var/log/glusterfs/bricks/srv-gluster_home-brick.log:[2017-10-23
> 09:19:11.561661] I [MSGID: 115061]
> [server-rpc-fops.c:1022:server_rename_cbk] 0-home-server:
> 5266153:
> RENAME
> /romanoch/.mozilla/firefox/vzzqqxrm.default-1396429081309/sessionstore-backups/recovery.jsonlz4
> (48e9eea6-cda6-4e53-bb4a-72059debf4c2/recovery.jsonlz4) ->
> /romanoch/.mozilla/firefox/vzzqqxrm.default-1396429081309/sessionstore-backups/recovery.baklz4
> (48e9eea6-cda6-4e53-bb4a-72059debf4c2/recovery.baklz4), client:
> romulus.tbi.univie.ac.at-11894-2017/10/18-07:06:07:206366-home-client-3-0-0,
> error-xlator: home-posix [No data available]
>
> I enabled directory quotas the same day this problem showed
> up but
> I'm not sure how quotas could have an effect like this
> (maybe unless
> the limit is reached but that's also not the case).
>
> Thanks again if anyone has an idea.
> Cheers
> Richard
> --
> /dev/null
>
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users@xxxxxxxxxxx
> http://lists.gluster.org/mailman/listinfo/gluster-users

_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://lists.gluster.org/mailman/listinfo/gluster-users