I re-added gluster-users to get some more eyes on this.

----- Original Message -----
> From: "Christoph Schäbel" <christoph.schaebel@xxxxxxxxxxxx>
> To: "Ben Turner" <bturner@xxxxxxxxxx>
> Sent: Wednesday, August 30, 2017 8:18:31 AM
> Subject: Re: GFID attr is missing after adding large amounts of data
>
> Hello Ben,
>
> thank you for offering your help.
>
> Here are outputs from all the gluster commands I could think of.
> Note that we had to remove the terabytes of data to keep the system
> operational, because it is a live system.
>
> # gluster volume status
>
> Status of volume: gv0
> Gluster process                          TCP Port  RDMA Port  Online  Pid
> ------------------------------------------------------------------------------
> Brick 10.191.206.15:/mnt/brick1/gv0      49154     0          Y       2675
> Brick 10.191.198.15:/mnt/brick1/gv0      49154     0          Y       2679
> Self-heal Daemon on localhost            N/A       N/A        Y       12309
> Self-heal Daemon on 10.191.206.15        N/A       N/A        Y       2670
>
> Task Status of Volume gv0
> ------------------------------------------------------------------------------
> There are no active volume tasks

OK, so your bricks are all online and you have two nodes with one brick per node.

>
> # gluster volume info
>
> Volume Name: gv0
> Type: Replicate
> Volume ID: 5e47d0b8-b348-45bb-9a2a-800f301df95b
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 1 x 2 = 2
> Transport-type: tcp
> Bricks:
> Brick1: 10.191.206.15:/mnt/brick1/gv0
> Brick2: 10.191.198.15:/mnt/brick1/gv0
> Options Reconfigured:
> transport.address-family: inet
> performance.readdir-ahead: on
> nfs.disable: on

You are using a replicate volume with two copies of your data, and it looks like you are running with the defaults, as I don't see any tuning.

>
> # gluster peer status
>
> Number of Peers: 1
>
> Hostname: 10.191.206.15
> Uuid: 030a879d-da93-4a48-8c69-1c552d3399d2
> State: Peer in Cluster (Connected)
>
>
> # gluster --version
>
> glusterfs 3.8.11 built on Apr 11 2017 09:50:39
> Repository revision: git://git.gluster.com/glusterfs.git
> Copyright (c) 2006-2011 Gluster Inc. <http://www.gluster.com>
> GlusterFS comes with ABSOLUTELY NO WARRANTY.
> You may redistribute copies of GlusterFS under the terms of the GNU General
> Public License.

You are running Gluster 3.8, which is the latest upstream release marked stable.

>
> # df -h
>
> Filesystem               Size  Used  Avail  Use%  Mounted on
> /dev/mapper/vg00-root     75G  5.7G    69G    8%  /
> devtmpfs                 1.9G     0   1.9G    0%  /dev
> tmpfs                    1.9G     0   1.9G    0%  /dev/shm
> tmpfs                    1.9G   17M   1.9G    1%  /run
> tmpfs                    1.9G     0   1.9G    0%  /sys/fs/cgroup
> /dev/sda1                477M  151M   297M   34%  /boot
> /dev/mapper/vg10-brick1  8.0T  700M   8.0T    1%  /mnt/brick1
> localhost:/gv0           8.0T  768M   8.0T    1%  /mnt/glusterfs_client
> tmpfs                    380M     0   380M    0%  /run/user/0
>

Your brick is:

  /dev/mapper/vg10-brick1  8.0T  700M  8.0T  1%  /mnt/brick1

The block device is 8TB. Can you tell me more about your brick? Is it a single disk or a RAID? If it's a RAID, can you tell me about the disks? I am interested in:

- Size of disks
- RAID type
- Stripe size
- RAID controller

I also see:

  localhost:/gv0  8.0T  768M  8.0T  1%  /mnt/glusterfs_client

So you are mounting your volume on the local node. Is this the mount where you are writing data to?
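If it helps, here is a minimal sketch of commands to gather those brick-layout details (this assumes the LVM/XFS layout created by the setup script quoted below; hardware-RAID controller tools vary by vendor and are not shown):

  lsblk -o NAME,SIZE,TYPE,ROTA,MOUNTPOINT   # block devices backing the brick
  pvs; vgs; lvs                             # LVM layout behind /dev/mapper/vg10-brick1
  xfs_info /mnt/brick1                      # XFS geometry; sunit/swidth show stripe alignment
  cat /proc/mdstat                          # only relevant if software (md) RAID is in use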
>
>
> The setup of the servers is done via shell script on CentOS 7 containing the
> following commands:
>
> yum install -y centos-release-gluster
> yum install -y glusterfs-server
>
> mkdir /mnt/brick1
> ssm create -s 999G -n brick1 --fstype xfs -p vg10 /dev/sdb /mnt/brick1

I haven't used system-storage-manager before. Do you know if it takes care of properly tuning your storage stack (if you have a RAID, that is)? If you don't have a RAID it's probably not that big of a deal; if you do have a RAID we should make sure everything is aware of your stripe size and tune appropriately.

> echo "/dev/mapper/vg10-brick1 /mnt/brick1 xfs defaults 1 2" >> /etc/fstab
> mount -a && mount
> mkdir /mnt/brick1/gv0
>
> gluster peer probe OTHER_SERVER_IP
>
> gluster pool list
> gluster volume create gv0 replica 2 OWN_SERVER_IP:/mnt/brick1/gv0 OTHER_SERVER_IP:/mnt/brick1/gv0
> gluster volume start gv0
> gluster volume info gv0
> gluster volume set gv0 network.ping-timeout "10"
> gluster volume info gv0
>
> # mount as client for archiving cronjob, is already in fstab
> mount -a
>
> # mount via fuse-client
> mkdir -p /mnt/glusterfs_client
> echo "localhost:/gv0 /mnt/glusterfs_client glusterfs defaults,_netdev 0 0" >> /etc/fstab
> mount -a
>
>
> We untar multiple files (around 1300 tar files), each around 2.7 GB in size.
> The tar files are not compressed.
> We untar the files with a shell script containing the following:
>
> #! /bin/bash
> for f in *.tar; do tar xfP $f; done

Your script looks good. I am not that familiar with the tar flag "P", but it looks to mean:

  -P, --absolute-names
      Don't strip leading slashes from file names when creating archives.

I don't see anything strange here, everything looks OK.

> The script is run as user root; the processes glusterd, glusterfs and
> glusterfsd also run under user root.
>
> Each tar file consists of a single folder with multiple folders and files in it.
> The folder tree looks like this (note that the "=" is part of the folder name):
>
> 1498780800/
> - timeframe_hour=1498780800/ (about 25 of these folders)
> -- type=1/ (about 25 folders total)
> --- data-x.gz.parquet (between 100MB and 1kb in size)
> --- data-x.gz.parquet.crc (around 1kb in size)
> -- …
> - ...
>
> Unfortunately I cannot share the file contents with you.

That's no problem, I'll try to recreate this in the lab.

> We have not seen any other issues with glusterfs when untarring just a few of
> those files. I just tried writing a 100GB file with dd and did not see any
> issues there; the file is replicated and the GFID attribute is set correctly
> on both nodes.

ACK. I do this all the time; if you saw an issue here I would be worried about your setup.

> We are not able to reproduce this in our lab environment, which is a clone
> (actual cloned VMs) of the other system, but it only has around 1TB of storage.
> Do you think this could be an issue with the number of files generated by
> tar (over 1.5 million files)?
> What I can say is that it is not an issue with inodes; that I checked when
> all the files were unpacked on the live system.

Hmm, I am not sure. It's strange that you can't repro this on your other config. In the lab I have a ton of space to work with, so I can run a ton of data in my repro.

> If you need anything else, let me know.

Can you help clarify your reproducer so I can give it a go in the lab?
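As a side note, here is a minimal sketch for spotting brick files that are missing the trusted.gfid xattr (the brick path is the one from this thread; run as root on each node and adjust to your layout):

  # List regular files under the brick that have no trusted.gfid xattr,
  # skipping the .glusterfs metadata directory.
  find /mnt/brick1/gv0 -path '*/.glusterfs' -prune -o -type f -print0 |
  while IFS= read -r -d '' f; do
      getfattr -n trusted.gfid -e hex --absolute-names "$f" 2>/dev/null | grep -q trusted.gfid \
          || echo "missing trusted.gfid: $f"
  done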
From what I can tell you have:

1498780800/                                               <-- Just a string of numbers; this is the root dir of your tarball
- timeframe_hour=1498780800/ (about 25 of these folders)  <-- This is the second-level dir of your tarball; there are ~25 of these dirs that mention a timeframe and an hour
-- type=1/ (about 25 folders total)                       <-- This is the 3rd level of your tar; there are about 25 different type=$X dirs
--- data-x.gz.parquet (between 100MB and 1kb in size)     <-- This is your actual data. Is there just one pair of these files per dir, or multiple?
--- data-x.gz.parquet.crc (around 1kb in size)            <-- This is a checksum for the above file?

I have almost everything I need for my reproducer, can you answer the above questions about the data?

-b

>
> Thank you for your help,
> Christoph
>
> On 29.08.2017 at 06:36, Ben Turner <bturner@xxxxxxxxxx> wrote:
> >
> > Also include gluster v status, I want to check the status of your bricks
> > and SHD processes.
> >
> > -b
> >
> > ----- Original Message -----
> >> From: "Ben Turner" <bturner@xxxxxxxxxx>
> >> To: "Christoph Schäbel" <christoph.schaebel@xxxxxxxxxxxx>
> >> Cc: gluster-users@xxxxxxxxxxx
> >> Sent: Tuesday, August 29, 2017 12:35:05 AM
> >> Subject: Re: GFID attr is missing after adding large amounts of data
> >>
> >> This is strange, a couple of questions:
> >>
> >> 1. What volume type is this? What tuning have you done? gluster v info
> >> output would be helpful here.
> >>
> >> 2. How big are your bricks?
> >>
> >> 3. Can you write me a quick reproducer so I can try this in the lab? Is it
> >> just a single multi-TB file you are untarring or many? If you give me the
> >> steps to repro, and I hit it, we can get a bug open.
> >>
> >> 4. Other than this, are you seeing any other problems? What if you untar
> >> smaller file(s)? Can you read and write to the volume with, say, dd
> >> without any problems?
> >>
> >> It sounds like you have some other issues affecting things here; there is
> >> no reason why you shouldn't be able to untar and write multiple TBs of
> >> data to gluster. Go ahead and answer those questions and I'll see what I
> >> can do to help you out.
> >>
> >> -b
> >>
> >> ----- Original Message -----
> >>> From: "Christoph Schäbel" <christoph.schaebel@xxxxxxxxxxxx>
> >>> To: gluster-users@xxxxxxxxxxx
> >>> Sent: Monday, August 28, 2017 3:55:31 AM
> >>> Subject: GFID attr is missing after adding large amounts of data
> >>>
> >>> Hi Cluster Community,
> >>>
> >>> we are seeing some problems when adding multiple terabytes of data to a
> >>> 2-node replicated GlusterFS installation.
> >>>
> >>> The version is 3.8.11 on CentOS 7.
> >>> The machines are connected via 10Gbit LAN and are running 24/7. The OS is
> >>> virtualized on VMware.
> >>>
> >>> After a restart of node-1 we see that the log files are growing to
> >>> multiple gigabytes a day.
> >>>
> >>> Also there seem to be problems with the replication.
> >>> The setup worked fine until sometime after we added the additional data
> >>> (around 3 TB in size) to node-1. We added the data to a mountpoint via
> >>> the client, not directly to the brick.
> >>> What we did is add tar files via a client-mount and then untar them while
> >>> in the client-mount folder.
> >>> The brick (/mnt/brick1/gv0) is using the XFS filesystem.
> >>>
> >>> When checking the file attributes of one of the files mentioned in the
> >>> brick logs, I can see that the gfid attribute is missing on node-1.
> >>> On node-2 the file does not even exist.
> >>>
> >>> getfattr -m . -d -e hex mnt/brick1/gv0/.glusterfs/40/59/40598e46-9868-4d7c-b494-7b978e67370a/type=type1/part-r-00002-4846e211-c81d-4c08-bb5e-f22fa5a4b404.gz.parquet
> >>>
> >>> # file: mnt/brick1/gv0/.glusterfs/40/59/40598e46-9868-4d7c-b494-7b978e67370a/type=type1/part-r-00002-4846e211-c81d-4c08-bb5e-f22fa5a4b404.gz.parquet
> >>> security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a756e6c6162656c65645f743a733000
> >>>
> >>> We repeated this scenario a second time with a fresh setup and got the
> >>> same results.
> >>>
> >>> Does anyone know what we are doing wrong?
> >>>
> >>> Is there maybe a problem with glusterfs and tar?
> >>>
> >>>
> >>> Log excerpts:
> >>>
> >>>
> >>> glustershd.log
> >>>
> >>> [2017-07-26 15:31:36.290908] I [MSGID: 108026] [afr-self-heal-entry.c:833:afr_selfheal_entry_do] 0-gv0-replicate-0: performing entry selfheal on fe5c42ac-5fda-47d4-8221-484c8d826c06
> >>> [2017-07-26 15:31:36.294289] W [MSGID: 114031] [client-rpc-fops.c:2933:client3_3_lookup_cbk] 0-gv0-client-1: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [No data available]
> >>> [2017-07-26 15:31:36.298287] I [MSGID: 108026] [afr-self-heal-entry.c:833:afr_selfheal_entry_do] 0-gv0-replicate-0: performing entry selfheal on e31ae2ca-a3d2-4a27-a6ce-9aae24608141
> >>> [2017-07-26 15:31:36.300695] W [MSGID: 114031] [client-rpc-fops.c:2933:client3_3_lookup_cbk] 0-gv0-client-1: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [No data available]
> >>> [2017-07-26 15:31:36.303626] I [MSGID: 108026] [afr-self-heal-entry.c:833:afr_selfheal_entry_do] 0-gv0-replicate-0: performing entry selfheal on 2cc9dafe-64d3-454a-a647-20deddfaebfe
> >>> [2017-07-26 15:31:36.305763] W [MSGID: 114031] [client-rpc-fops.c:2933:client3_3_lookup_cbk] 0-gv0-client-1: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [No data available]
> >>> [2017-07-26 15:31:36.308639] I [MSGID: 108026] [afr-self-heal-entry.c:833:afr_selfheal_entry_do] 0-gv0-replicate-0: performing entry selfheal on cbabf9ed-41be-4d08-9cdb-5734557ddbea
> >>> [2017-07-26 15:31:36.310819] W [MSGID: 114031] [client-rpc-fops.c:2933:client3_3_lookup_cbk] 0-gv0-client-1: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [No data available]
> >>> [2017-07-26 15:31:36.315057] I [MSGID: 108026] [afr-self-heal-entry.c:833:afr_selfheal_entry_do] 0-gv0-replicate-0: performing entry selfheal on 8a3c1c16-8edf-40f0-b2ea-8e70c39e1a69
> >>> [2017-07-26 15:31:36.317196] W [MSGID: 114031] [client-rpc-fops.c:2933:client3_3_lookup_cbk] 0-gv0-client-1: remote operation failed. Path: (null) (00000000-0000-0000-0000-000000000000) [No data available]
> >>>
> >>>
> >>> bricks/mnt-brick1-gv0.log
> >>>
> >>> [2017-07-26 15:31:36.287831] E [MSGID: 115050] [server-rpc-fops.c:156:server_lookup_cbk] 0-gv0-server: 6153546: LOOKUP <gfid:d99930df-6b47-4b55-9af3-c767afd6584c>/part-r-00001-becc67f0-1665-47b6-8566-fa0245f560ad.gz.parquet (d99930df-6b47-4b55-9af3-c767afd6584c/part-r-00001-becc67f0-1665-47b6-8566-fa0245f560ad.gz.parquet) ==> (No data available) [No data available]
> >>> [2017-07-26 15:31:36.294202] E [MSGID: 113002] [posix.c:266:posix_lookup] 0-gv0-posix: buf->ia_gfid is null for /mnt/brick1/gv0/.glusterfs/e7/2d/e72d9005-b958-432b-b4a9-37aaadd9d2df/type=type1/part-r-00001-becc67f0-1665-47b6-8566-fa0245f560ad.gz.parquet [No data available]
> >>> [2017-07-26 15:31:36.294235] E [MSGID: 115050] [server-rpc-fops.c:156:server_lookup_cbk] 0-gv0-server: 6153564: LOOKUP <gfid:fe5c42ac-5fda-47d4-8221-484c8d826c06>/part-r-00001-becc67f0-1665-47b6-8566-fa0245f560ad.gz.parquet (fe5c42ac-5fda-47d4-8221-484c8d826c06/part-r-00001-becc67f0-1665-47b6-8566-fa0245f560ad.gz.parquet) ==> (No data available) [No data available]
> >>> [2017-07-26 15:31:36.300611] E [MSGID: 113002] [posix.c:266:posix_lookup] 0-gv0-posix: buf->ia_gfid is null for /mnt/brick1/gv0/.glusterfs/33/d4/33d47146-bc30-49dd-ada8-475bb75435bf/type=type2/part-r-00002-becc67f0-1665-47b6-8566-fa0245f560ad.gz.parquet [No data available]
> >>> [2017-07-26 15:31:36.300645] E [MSGID: 115050] [server-rpc-fops.c:156:server_lookup_cbk] 0-gv0-server: 6153582: LOOKUP <gfid:e31ae2ca-a3d2-4a27-a6ce-9aae24608141>/part-r-00002-becc67f0-1665-47b6-8566-fa0245f560ad.gz.parquet (e31ae2ca-a3d2-4a27-a6ce-9aae24608141/part-r-00002-becc67f0-1665-47b6-8566-fa0245f560ad.gz.parquet) ==> (No data available) [No data available]
> >>> [2017-07-26 15:31:36.305671] E [MSGID: 113002] [posix.c:266:posix_lookup] 0-gv0-posix: buf->ia_gfid is null for /mnt/brick1/gv0/.glusterfs/33/d4/33d47146-bc30-49dd-ada8-475bb75435bf/type=type1/part-r-00002-becc67f0-1665-47b6-8566-fa0245f560ad.gz.parquet [No data available]
> >>> [2017-07-26 15:31:36.305711] E [MSGID: 115050] [server-rpc-fops.c:156:server_lookup_cbk] 0-gv0-server: 6153600: LOOKUP <gfid:2cc9dafe-64d3-454a-a647-20deddfaebfe>/part-r-00002-becc67f0-1665-47b6-8566-fa0245f560ad.gz.parquet (2cc9dafe-64d3-454a-a647-20deddfaebfe/part-r-00002-becc67f0-1665-47b6-8566-fa0245f560ad.gz.parquet) ==> (No data available) [No data available]
> >>> [2017-07-26 15:31:36.310735] E [MSGID: 113002] [posix.c:266:posix_lookup] 0-gv0-posix: buf->ia_gfid is null for /mnt/brick1/gv0/.glusterfs/df/71/df715321-3078-47c8-bf23-dec47abe46d7/type=type2/part-r-00002-becc67f0-1665-47b6-8566-fa0245f560ad.gz.parquet [No data available]
> >>> [2017-07-26 15:31:36.310767] E [MSGID: 115050] [server-rpc-fops.c:156:server_lookup_cbk] 0-gv0-server: 6153618: LOOKUP <gfid:cbabf9ed-41be-4d08-9cdb-5734557ddbea>/part-r-00002-becc67f0-1665-47b6-8566-fa0245f560ad.gz.parquet (cbabf9ed-41be-4d08-9cdb-5734557ddbea/part-r-00002-becc67f0-1665-47b6-8566-fa0245f560ad.gz.parquet) ==> (No data available) [No data available]
> >>> [2017-07-26 15:31:36.317113] E [MSGID: 113002] [posix.c:266:posix_lookup] 0-gv0-posix: buf->ia_gfid is null for /mnt/brick1/gv0/.glusterfs/df/71/df715321-3078-47c8-bf23-dec47abe46d7/type=type3/part-r-00002-becc67f0-1665-47b6-8566-fa0245f560ad.gz.parquet [No data available]
> >>> [2017-07-26 15:31:36.317146] E [MSGID: 115050] [server-rpc-fops.c:156:server_lookup_cbk] 0-gv0-server: 6153636: LOOKUP <gfid:8a3c1c16-8edf-40f0-b2ea-8e70c39e1a69>/part-r-00002-becc67f0-1665-47b6-8566-fa0245f560ad.gz.parquet (8a3c1c16-8edf-40f0-b2ea-8e70c39e1a69/part-r-00002-becc67f0-1665-47b6-8566-fa0245f560ad.gz.parquet) ==> (No data available) [No data available]
> >>>
> >>>
> >>> Regards,
> >>> Christoph
> >>> _______________________________________________
> >>> Gluster-users mailing list
> >>> Gluster-users@xxxxxxxxxxx
> >>> http://lists.gluster.org/mailman/listinfo/gluster-users
> >>>
> >>
> >

_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://lists.gluster.org/mailman/listinfo/gluster-users