Re: GFID attir is missing after adding large amounts of data

Christoph Schäbel <christoph.schaebel@xxxxxxxxxxxx> · Fri, 1 Sep 2017 10:20:25 +0200

My answers inline.

> Am 01.09.2017 um 04:19 schrieb Ben Turner <bturner@xxxxxxxxxx>:
> 
> I re-added gluster-users to get some more eye on this.
> 
> ----- Original Message -----
>> From: "Christoph Schäbel" <christoph.schaebel@xxxxxxxxxxxx>
>> To: "Ben Turner" <bturner@xxxxxxxxxx>
>> Sent: Wednesday, August 30, 2017 8:18:31 AM
>> Subject: Re:  GFID attir is missing after adding large amounts of	data
>> 
>> Hello Ben,
>> 
>> thank you for offering your help.
>> 
>> Here are outputs from all the gluster commands I could think of.
>> Note that we had to remove the terrabytes of data to keep the system
>> operational, because it is a live system.
>> 
>> # gluster volume status
>> 
>> Status of volume: gv0
>> Gluster process                             TCP Port  RDMA Port  Online  Pid
>> ------------------------------------------------------------------------------
>> Brick 10.191.206.15:/mnt/brick1/gv0         49154     0          Y       2675
>> Brick 10.191.198.15:/mnt/brick1/gv0         49154     0          Y       2679
>> Self-heal Daemon on localhost               N/A       N/A        Y
>> 12309
>> Self-heal Daemon on 10.191.206.15           N/A       N/A        Y       2670
>> 
>> Task Status of Volume gv0
>> ------------------------------------------------------------------------------
>> There are no active volume tasks
> 
> OK so your bricks are all online, you have two nodes with 1 brick per node.

Yes

> 
>> 
>> # gluster volume info
>> 
>> Volume Name: gv0
>> Type: Replicate
>> Volume ID: 5e47d0b8-b348-45bb-9a2a-800f301df95b
>> Status: Started
>> Snapshot Count: 0
>> Number of Bricks: 1 x 2 = 2
>> Transport-type: tcp
>> Bricks:
>> Brick1: 10.191.206.15:/mnt/brick1/gv0
>> Brick2: 10.191.198.15:/mnt/brick1/gv0
>> Options Reconfigured:
>> transport.address-family: inet
>> performance.readdir-ahead: on
>> nfs.disable: on
> 
> You are using a replicate volume with 2 copies of your data, it looks like you are using the defaults as I don't see any tuning.

The only thing we tuned is the network.ping-timeout, we set this to 10 seconds (if this is not the default anyways)

> 
>> 
>> # gluster peer status
>> 
>> Number of Peers: 1
>> 
>> Hostname: 10.191.206.15
>> Uuid: 030a879d-da93-4a48-8c69-1c552d3399d2
>> State: Peer in Cluster (Connected)
>> 
>> 
>> # gluster —version
>> 
>> glusterfs 3.8.11 built on Apr 11 2017 09:50:39
>> Repository revision: git://git.gluster.com/glusterfs.git
>> Copyright (c) 2006-2011 Gluster Inc. <http://www.gluster.com>
>> GlusterFS comes with ABSOLUTELY NO WARRANTY.
>> You may redistribute copies of GlusterFS under the terms of the GNU General
>> Public License.
> 
> You are running Gluster 3.8 which is the latest upstream release marked stable.
> 
>> 
>> # df -h
>> 
>> Filesystem               Size  Used Avail Use% Mounted on
>> /dev/mapper/vg00-root     75G  5.7G   69G   8% /
>> devtmpfs                 1.9G     0  1.9G   0% /dev
>> tmpfs                    1.9G     0  1.9G   0% /dev/shm
>> tmpfs                    1.9G   17M  1.9G   1% /run
>> tmpfs                    1.9G     0  1.9G   0% /sys/fs/cgroup
>> /dev/sda1                477M  151M  297M  34% /boot
>> /dev/mapper/vg10-brick1  8.0T  700M  8.0T   1% /mnt/brick1
>> localhost:/gv0           8.0T  768M  8.0T   1% /mnt/glusterfs_client
>> tmpfs                    380M     0  380M   0% /run/user/0
>> 
> 
> Your brick is:
> 
> /dev/mapper/vg10-brick1  8.0T  700M  8.0T   1% /mnt/brick1
> 
> The block device is 8TB.  Can you tell me more about your brick?  Is it a single disk or a RAID?  If its a RAID can you tell me about the disks?  I am interested in:
> 
> -Size of disks
> -RAID type
> -Stripe size
> -RAID controller

Not sure about the disks, because it comes from a large storage system (not the cheap NAS kind, but the really expensive rack kind) which is then used by VMWare to present a single Volume to my virtual machine. I am pretty sure that on the storage system there is some kind of RAID going on, but I am not sure if that does have an effect on the "virtual“ disk that is presented to my VM. To the VM the disk does not look like a RAID, as far as I can tell.

# lvdisplay 
  --- Logical volume --- 
  LV Path                /dev/vg10/brick1 
  LV Name                brick1 
  VG Name                vg10 
  LV UUID                OEvHEG-m5zc-2MQ1-3gNd-o2gh-q405-YWG02j 
  LV Write Access        read/write 
  LV Creation host, time localhost, 2017-01-26 09:44:08 +0000 
  LV Status              available 
  # open                 1 
  LV Size                8.00 TiB 
  Current LE             2096890 
  Segments               1 
  Allocation             inherit 
  Read ahead sectors     auto 
  - currently set to     8192 
  Block device           253:1 

  --- Logical volume --- 
  LV Path                /dev/vg00/root 
  LV Name                root 
  VG Name                vg00 
  LV UUID                3uyF7l-Xhfa-6frx-qjsP-Iy0u-JdbQ-Me03AS 
  LV Write Access        read/write 
  LV Creation host, time localhost, 2016-12-15 14:24:08 +0000 
  LV Status              available 
  # open                 1 
  LV Size                74.49 GiB 
  Current LE             19069 
  Segments               1 
  Allocation             inherit 
  Read ahead sectors     auto 
  - currently set to     8192 
  Block device           253:0 

# ssm list 
----------------------------------------------------------- 
Device         Free      Used      Total  Pool  Mount point 
----------------------------------------------------------- 
/dev/fd0                         4.00 KB 
/dev/sda                        80.00 GB        PARTITIONED 
/dev/sda1                      500.00 MB        /boot 
/dev/sda2  20.00 MB  74.49 GB   74.51 GB  vg00 
/dev/sda3                        5.00 GB        SWAP 
/dev/sdb    1.02 GB   8.00 TB    8.00 TB  vg10 
----------------------------------------------------------- 
------------------------------------------------- 
Pool  Type  Devices      Free      Used     Total 
------------------------------------------------- 
vg00  lvm   1        20.00 MB  74.49 GB  74.51 GB 
vg10  lvm   1         1.02 GB   8.00 TB   8.00 TB 
------------------------------------------------- 
------------------------------------------------------------------------------------ 
Volume            Pool  Volume size  FS      FS size       Free  Type    Mount point 
------------------------------------------------------------------------------------ 
/dev/vg00/root    vg00     74.49 GB  xfs    74.45 GB   69.36 GB  linear  / 
/dev/vg10/brick1  vg10      8.00 TB  xfs     8.00 TB    8.00 TB  linear  /mnt/brick1 
/dev/sda1                 500.00 MB  ext4  500.00 MB  300.92 MB  part    /boot 
------------------------------------------------------------------------------------ 

> 
> I also see:
> 
> localhost:/gv0           8.0T  768M  8.0T   1% /mnt/glusterfs_client
> 
> So you are mounting your volume on the local node, is this the mount where you are writing data to?

Yes, this is the mount I am writing to.

> 
>> 
>> 
>> The setup of the servers is done via shell script on CentOS 7 containing the
>> following commands:
>> 
>> yum install -y centos-release-gluster
>> yum install -y glusterfs-server
>> 
>> mkdir /mnt/brick1
>> ssm create -s 999G -n brick1 --fstype xfs -p vg10 /dev/sdb /mnt/brick1
> 
> I haven't used system-storage-manager before, do you know if it takes care of properly tuning your storage stack(if you have a RAID that is)?  If you don't have a RAID its prolly not that big of a deal, if you do have a RAID we should make sure everything is aware of your stripe size and tune appropriately.

I am not sure if ssm does any tuning by default, but since there does not seem to be a RAID (at least for the VM) I don’t think tuning is necessary.

> 
>> 
>> echo "/dev/mapper/vg10-brick1   /mnt/brick1 xfs defaults    1   2" >>
>> /etc/fstab
>> mount -a && mount
>> mkdir /mnt/brick1/gv0
>> 
>> gluster peer probe OTHER_SERVER_IP
>> 
>> gluster pool list
>> gluster volume create gv0 replica 2 OWN_SERVER_IP:/mnt/brick1/gv0
>> OTHER_SERVER_IP:/mnt/brick1/gv0
>> gluster volume start gv0
>> gluster volume info gv0
>> gluster volume set gv0 network.ping-timeout "10"
>> gluster volume info gv0
>> 
>> # mount as client for archiving cronjob, is already in fstab
>> mount -a
>> 
>> # mount via fuse-client
>> mkdir -p /mnt/glusterfs_client
>> echo "localhost:/gv0	/mnt/glusterfs_client	glusterfs	defaults,_netdev	0	0" >>
>> /etc/fstab
>> mount -a
>> 
>> 
>> We untar multiple files (around 1300 tar files) each around 2,7GB in size.
>> The tar files are not compressed.
>> We untar the files with a shell script containing the following:
>> 
>> #! /bin/bash
>> for f in *.tar; do tar xfP $f; done
> 
> Your script looks good, I am not that familiar with the tar flag "P" but it looks to mean:
> 
>       -P, --absolute-names
>              Don't strip leading slashes from file names when creating archives.
> 
> I don't see anything strange here, everything looks OK.
> 
>> 
>> The script is run as user root, the processes glusterd, glusterfs and
>> glusterfsd also run under user root.
>> 
>> Each tar file consists of a single folder with multiple folders and files in
>> it.
>> The folder tree looks like this (note that the "=“ is part of the folder
>> name):
>> 
>> 1498780800/
>> - timeframe_hour=1498780800/ (about 25 of these folders)
>> -- type=1/ (about 25 folders total)
>> --- data-x.gz.parquet (between 100MB and 1kb in size)
>> --- data-x.gz.parquet.crc (around 1kb in size)
>> -- …
>> - ...
>> 
>> Unfortunately I cannot share the file contents with you.
> 
> Thats no problem, I'll try to recreate this in the lab.
> 
>> 
>> We have not seen any other issues with glusterfs, when untaring just a few of
>> those files. I just tried writing a 100GB with dd and did not see any issues
>> there, the file is replicated and the GFID attribute is set correctly on
>> both nodes.
> 
> ACK.  I do this all the time, if you saw an issue here I would be worried about your setup.
> 
>> 
>> We are not able to reproduce this in our lab environment which is a clone
>> (actual cloned VMs) of the other system, but it only has around 1TB of
>> storage.
>> Do you think this could be an issue with the number of files which is
>> generated by tar (over 1.5 million files). ?
>> What I can say is that it is not an issue with inodes, that I checked when
>> all the files where unpacked on the live system.
> 
> Hmm I am not sure.  Its strange that you can't repro this on your other config, in the lab I have a ton of space to work with so I can run a ton of data in my repro.
> 
>> 
>> If you need anything else, let me know.
> 
> Can you help clarify your reproducer so I can give it a go in the lab?  From what I can tell you have:
> 
> 1498780800/    <-- Just a string of numbers, this is the root dir of your tarball
> - timeframe_hour=1498780800/ (about 25 of these folders)    <-- This is the second level dir of your tarball, there are ~25 of these dirs that mention a timeframe and an hour
> -- type=1/ (about 25 folders total)    <-- This is the 3rd level of your tar, there are about 25 different type=$X dirs
> --- data-x.gz.parquet (between 100MB and 1kb in size)    <-- This is your actual data.  Is there just 1 pair of these file per dir or multiple?
> --- data-x.gz.parquet.crc (around 1kb in size)    <-- This is a checksum for the above file?
> 
> I have almost everything I need for my reproducer, can you answer the above questions about the data?

Yes this is all correct. There is just 1 pair in the last level, and the *.crc file is a checksum file.

Thank you for your help,
Christoph

_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://lists.gluster.org/mailman/listinfo/gluster-users