Hi Nithya,

That's what I'm getting from file3:

getfattr -d -m. -e hex $file3
# file: $file3
trusted.ec.config=0x0000080602000200
trusted.ec.dirty=0x00000000000000000000000000000000
trusted.ec.size=0x00000000006c8aba
trusted.ec.version=0x000000000000000f0000000000000019
trusted.gfid=0x47d6124290e844e2b733740134a657ce
trusted.gfid2path.60d8a15c6ccaf15b=0x36363732366635372d396533652d343337372d616637382d6366353061636434306265322f616c676f732e63707974686f6e2d33356d2d7838365f36342d6c696e75782d676e752e736f
trusted.glusterfs.quota.66726f57-9e3e-4377-af78-cf50acd40be2.contri.3=0x00000000001b24000000000000000001
trusted.pgfid.66726f57-9e3e-4377-af78-cf50acd40be2=0x00000001

So, no dht attribute. I think.

That's what I found in the rebalance logs. rebalance.log.3 was another rebalance that, to our knowledge, finished without problems. I included the results from both rebalances, just in case. There is no mention of this file in the logs of the other servers.

root@gluster06:/var/log/glusterfs# zgrep $file3 $VOLUME-rebalance.log*
$VOLUME-rebalance.log.1:[2019-01-12 07:17:06.243620] I [MSGID: 109045] [dht-common.c:2456:dht_lookup_cbk] 0-$VOLUME-dht: linkfile not having link subvol for $file3
$VOLUME-rebalance.log.1:[2019-01-12 07:17:06.275213] I [MSGID: 109069] [dht-common.c:1410:dht_lookup_unlink_of_false_linkto_cbk] 0-$VOLUME-dht: lookup_unlink returned with op_ret -> 0 and op-errno -> 0 for $file3
$VOLUME-rebalance.log.1:[2019-01-12 07:17:06.307754] I [dht-rebalance.c:1570:dht_migrate_file] 0-$VOLUME-dht: $file3: attempting to move from $VOLUME-readdir-ahead-6 to $VOLUME-readdir-ahead-8
$VOLUME-rebalance.log.1:[2019-01-12 07:17:06.341451] I [dht-rebalance.c:1570:dht_migrate_file] 0-$VOLUME-dht: $file3: attempting to move from $VOLUME-readdir-ahead-6 to $VOLUME-readdir-ahead-8
$VOLUME-rebalance.log.1:[2019-01-12 07:17:06.488473] I [MSGID: 109022] [dht-rebalance.c:2218:dht_migrate_file] 0-$VOLUME-dht: completed migration of $file3 from subvolume $VOLUME-readdir-ahead-6 to $VOLUME-readdir-ahead-8
$VOLUME-rebalance.log.1:[2019-01-12 07:17:06.494803] W [MSGID: 109023] [dht-rebalance.c:2094:dht_migrate_file] 0-$VOLUME-dht: Migrate file failed:$file3: failed to get xattr from $VOLUME-readdir-ahead-6 [No such file or directory]
$VOLUME-rebalance.log.1:[2019-01-12 07:17:06.499016] W [dht-rebalance.c:2159:dht_migrate_file] 0-$VOLUME-dht: $file3: failed to perform removexattr on $VOLUME-readdir-ahead-8 (No data available)
$VOLUME-rebalance.log.1:[2019-01-12 07:17:06.499776] W [MSGID: 109023] [dht-rebalance.c:2179:dht_migrate_file] 0-$VOLUME-dht: $file3: failed to do a stat on $VOLUME-readdir-ahead-6 [No such file or directory]
$VOLUME-rebalance.log.1:[2019-01-12 07:17:06.500900] I [MSGID: 109022] [dht-rebalance.c:2218:dht_migrate_file] 0-$VOLUME-dht: completed migration of $file3 from subvolume $VOLUME-readdir-ahead-6 to $VOLUME-readdir-ahead-8
$VOLUME-rebalance.log.3.gz:[2018-12-10 23:18:43.145616] I [dht-rebalance.c:1570:dht_migrate_file] 0-$VOLUME-dht: $file3: attempting to move from $VOLUME-disperse-6 to $VOLUME-disperse-8
$VOLUME-rebalance.log.3.gz:[2018-12-10 23:18:43.150303] W [MSGID: 109023] [dht-rebalance.c:1013:__dht_check_free_space] 0-$VOLUME-dht: data movement of file {blocks:13896 name:($file3)} would result in dst node ($VOLUME-disperse-8:23116260576) having lower disk space than the source node ($VOLUME-disperse-6:23521698592).Skipping file.
$VOLUME-rebalance.log.3.gz:[2018-12-10 23:18:43.153051] I [MSGID: 109126] [dht-rebalance.c:2812:gf_defrag_migrate_single_file] 0-$VOLUME-dht: File migration skipped for $file3.
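
For completeness, this is roughly how one could query just the linkto attribute directly on the brick, since that is the one that matters here. A sketch only; /$brick/$file3 stands for the file's path on the brick (as in the ls output further down), and trusted.glusterfs.dht.linkto is the attribute Nithya refers to below:

# Run on the brick server, against the brick path rather than the mount point.
# If the xattr exists, its value (the name of the subvolume it points to) is printed;
# if it is missing, getfattr reports "No such attribute".
getfattr -n trusted.glusterfs.dht.linkto -e text /$brick/$file3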

Kind regards,

Gudrun

On Thursday, 31.01.2019, 14:46 +0530, Nithya Balachandran wrote:
> 
> On Wed, 30 Jan 2019 at 19:12, Gudrun Mareike Amedick <g.amedick@xxxxxxxxxxxxxx> wrote:
> > Hi,
> > 
> > a bit additional info inline
> > 
> > On Monday, 28.01.2019, 10:23 +0100, Frank Ruehlemann wrote:
> > > On Monday, 28.01.2019, 09:50 +0530, Nithya Balachandran wrote:
> > > > 
> > > > On Fri, 25 Jan 2019 at 20:51, Gudrun Mareike Amedick <g.amedick@xxxxxxxxxxxxxx> wrote:
> > > > > 
> > > > > Hi all,
> > > > > 
> > > > > we have a problem with a distributed dispersed volume (GlusterFS 3.12). We
> > > > > have files that lost their permissions or gained sticky bits. The files
> > > > > themselves seem to be okay.
> > > > > 
> > > > > It looks like this:
> > > > > 
> > > > > # ls -lah $file1
> > > > > ---------- 1 www-data www-data 45M Jan 12 07:01 $file1
> > > > > 
> > > > > # ls -lah $file2
> > > > > -rw-rwS--T 1 $user $group 11K Jan 9 11:48 $file2
> > > > > 
> > > > > # ls -lah $file3
> > > > > ---------T 1 $user $group 6.8M Jan 12 08:17 $file3
> > > > > 
> > > > These are linkto files (internal dht files) and should not be visible on
> > > > the mount point. Are they consistently visible like this or do they revert
> > > > to the proper permissions after some time?
> > > They didn't heal yet, even after more than 4 weeks. Therefore we decided
> > > to recommend our users to fix their files by setting the correct
> > > permissions again, which worked without problems. But for analysis
> > > reasons we still have some broken files nobody touched yet.
> > > 
> > > We know these linkto files but they were never visible to clients. We
> > > did these ls commands on a client, not on a brick.
> > 
> > They have linkfile permissions, but on the brick side it looks like this:
> > 
> > root@gluster06:~# ls -lah /$brick/$file3
> > ---------T 2 $user $group 1.7M Jan 12 08:17 /$brick/$file3
> > 
> > That seems to be too big for a linkfile. Also, there is no file it could link to. There's no other file with that name at that path on any other
> > subvolume.
> This sounds like the rebalance failed to transition the file from a linkto to a data file once the migration was complete. Please check the
> rebalance logs on all nodes for any messages that refer to this file.
> If you still see any such files, please check its xattrs directly on the brick. You should see one called trusted.glusterfs.dht.linkto. Let me
> know if that is missing.
> 
> Regards,
> Nithya
> 
> > > > > This is not what the permissions are supposed to look like. They were 644 or
> > > > > 660 before. And they definitely had no sticky bits.
> > > > > The permissions on the bricks match what I see on the client side. So I think
> > > > > the original permissions are lost without a chance to recover them, right?
> > > > > 
> > > > > With some files with weird-looking permissions (but not with all of them),
> > > > > I can do this:
> > > > > # ls -lah $path/$file4
> > > > > -rw-r--r-- 1 $user $group 6.0G Oct 11 09:34 $path/$file4
> > > > > # ls -lah $path | grep $file4
> > > > > -rw-r-Sr-T 1 $user $group 6.0G Oct 11 09:34 $file4
> > > > > 
> > > > > So, the permissions I see depend on how I'm querying them. The permissions
> > > > > on the brick side agree with the latter result; stat sees the former.
> > > > > I'm not sure how that works.
> > > > > 
> > > > The S and T bits indicate that a file is being migrated. The difference
> > > > seems to be because of the way lookup versus readdirp handle this - this
> > > > looks like a bug. Lookup will strip out the internal permissions set. I
> > > > don't think readdirp does. This is happening because a rebalance is in
> > > > progress.
> > > There is no active rebalance. At least, none is visible in "gluster volume
> > > rebalance $VOLUME status".
> > > 
> > > And the last line in the rebalance log file of this volume is:
> > > "[2019-01-11 02:14:50.101944] W … received signum (15), shutting down"
> > > 
> > > > > We know for at least a part of those files that they were okay on December
> > > > > 19th. We got the first reports of weird-looking permissions on January
> > > > > 12th. In between, there was a rebalance running (January 7th to January
> > > > > 11th). During that rebalance, a node was offline for a longer period of time
> > > > > due to hardware issues. The output of "gluster volume heal $VOLUME info"
> > > > > shows no files though.
> > > > > 
> > > > > For all files with broken permissions we found so far, the following lines
> > > > > are in the rebalance log:
> > > > > 
> > > > > [2019-01-07 09:31:11.004802] I [MSGID: 109045] [dht-common.c:2456:dht_lookup_cbk] 0-$VOLUME-dht: linkfile not having link subvol for $file5
> > > > > [2019-01-07 09:31:11.262273] I [MSGID: 109069] [dht-common.c:1410:dht_lookup_unlink_of_false_linkto_cbk] 0-$VOLUME-dht: lookup_unlink returned with op_ret -> 0 and op-errno -> 0 for $file5
> > > > > [2019-01-07 09:31:11.266014] I [dht-rebalance.c:1570:dht_migrate_file] 0-$VOLUME-dht: $file5: attempting to move from $VOLUME-readdir-ahead-0 to $VOLUME-readdir-ahead-5
> > > > > [2019-01-07 09:31:11.278120] I [dht-rebalance.c:1570:dht_migrate_file] 0-$VOLUME-dht: $file5: attempting to move from $VOLUME-readdir-ahead-0 to $VOLUME-readdir-ahead-5
> > > > > [2019-01-07 09:31:11.732175] W [dht-rebalance.c:2159:dht_migrate_file] 0-$VOLUME-dht: $file5: failed to perform removexattr on $VOLUME-readdir-ahead-0 (No data available)
> > > > > [2019-01-07 09:31:11.737319] W [MSGID: 109023] [dht-rebalance.c:2179:dht_migrate_file] 0-$VOLUME-dht: $file5: failed to do a stat on $VOLUME-readdir-ahead-0 [No such file or directory]
> > > > > [2019-01-07 09:31:11.744382] I [MSGID: 109022] [dht-rebalance.c:2218:dht_migrate_file] 0-$VOLUME-dht: completed migration of $file5 from subvolume $VOLUME-readdir-ahead-0 to $VOLUME-readdir-ahead-5
> > > > > [2019-01-07 09:31:11.744676] I [MSGID: 109022] [dht-rebalance.c:2218:dht_migrate_file] 0-$VOLUME-dht: completed migration of $file5 from subvolume $VOLUME-readdir-ahead-0 to $VOLUME-readdir-ahead-5
> > > > > 
> > > > > I've searched the brick logs for $file5 with broken permissions and found
> > > > > this on all bricks from (I think) the subvolume $VOLUME-readdir-ahead-5:
> > > > > 
> > > > > [2019-01-07 09:32:13.821545] I [MSGID: 113030] [posix.c:2171:posix_unlink] 0-$VOLUME-posix: open-fd-key-status: 0 for $file5
> > > > > [2019-01-07 09:32:13.821609] I [MSGID: 113031] [posix.c:2084:posix_skip_non_linkto_unlink] 0-posix: linkto_xattr status: 0 for $file5
> > > > > 
> > > > > Also, we noticed that many directories got their modification time
> > > > > updated. It was set to the rebalance date. Is that supposed to happen?
> > > > > 
> > > > > We had parallel-readdir enabled during the rebalance. We disabled it since
> > > > > we had empty directories that couldn't be deleted. I was able to delete
> > > > > those dirs after that.
> > > > > 
> > > > Was this disabled during the rebalance? parallel-readdirp changes the
> > > > volume graph for clients but not for the rebalance process, causing it to
> > > > fail to find the linkto subvols.
> > > Yes, parallel-readdirp was enabled during the rebalance. But we disabled
> > > it after some files were invisible on the client side again.
> > 
> > The timetable looks like this:
> > 
> > December 12th: parallel-readdir enabled
> > January 7th: rebalance started
> > January 11th/12th: rebalance finished (varied a bit, some servers were faster)
> > January 15th: parallel-readdir disabled
> > 
> > > > > Also, we have directories that lost their GFID on some bricks. Again.
> > > > > 
> > > > Is this the missing symlink problem that was reported earlier?
> > > > 
> > Looks like it. I had a dir with a missing GFID on one brick, I couldn't see some files on the client side, I recreated the GFID symlink and
> > everything was fine again.
> > And in the brick log, I had this entry (with 1d372a8a-4958-4700-8ef1-fa4f756baad3 being the GFID of the dir in question):
> > 
> > [2019-01-13 17:57:55.020859] W [MSGID: 113103] [posix.c:301:posix_lookup] 0-$VOLUME-posix: Found stale gfid handle
> > /srv/glusterfs/bricks/$brick/data/.glusterfs/1d/37/1d372a8a-4958-4700-8ef1-fa4f756baad3, removing it. [No such file or directory]
> > 
> > Very familiar. At least, I know how to fix that :D (a rough sketch of the fix is appended below, after the quoted thread)
> > 
> > Kind regards
> > 
> > Gudrun
> > > > Regards,
> > > > Nithya
> > > > 
> > > > > What happened? Can we do something to fix this? And could that happen
> > > > > again?
> > > > > 
> > > > > We want to upgrade to 4.1 soon. Is it safe to do that or could it make
> > > > > things worse?
> > > > > 
> > > > > Kind regards
> > > > > 
> > > > > Gudrun Amedick
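
PS: Since the GFID symlink fix comes up above, this is roughly what recreating such a symlink looks like for a directory. A sketch only, assuming the usual .glusterfs layout where a directory's GFID entry is a symlink to its parent's GFID path plus the directory name; <pp>/<qq>, <parent-gfid> and $dirname are placeholders, and the GFID shown is the one from the stale-handle log line above:

# On the affected brick, relative to the brick root ($brick as used in this thread).
cd /srv/glusterfs/bricks/$brick/data/.glusterfs/1d/37
# <pp>/<qq> are the first two byte pairs of the parent directory's GFID.
ln -s ../../<pp>/<qq>/<parent-gfid>/$dirname 1d372a8a-4958-4700-8ef1-fa4f756baad3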
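
For reference, the two checks discussed in the thread (whether a rebalance is still active, and how parallel-readdir is set) are along these lines; a sketch only, with $VOLUME being the placeholder used throughout:

# Per-node rebalance state; anything other than "in progress" means no migration is running.
gluster volume rebalance $VOLUME status
# Show the current value of the option, then switch it off if it is still enabled.
gluster volume get $VOLUME performance.parallel-readdir
gluster volume set $VOLUME performance.parallel-readdir off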
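
And for the affected files themselves, resetting the modes as Frank describes is plain chmod. A sketch only; the thread says the originals were 644 or 660, so the mapping to $file1 and $file2 here is purely illustrative:

# A full numeric mode on a regular file also clears the stray setgid (S) and sticky (T) bits.
chmod 644 $file1
chmod 660 $file2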
_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-users