My answers inline.

> On 01.09.2017 at 04:19, Ben Turner <bturner@xxxxxxxxxx> wrote:
>
> I re-added gluster-users to get some more eyes on this.
>
> ----- Original Message -----
>> From: "Christoph Schäbel" <christoph.schaebel@xxxxxxxxxxxx>
>> To: "Ben Turner" <bturner@xxxxxxxxxx>
>> Sent: Wednesday, August 30, 2017 8:18:31 AM
>> Subject: Re: GFID attr is missing after adding large amounts of data
>>
>> Hello Ben,
>>
>> thank you for offering your help.
>>
>> Here are outputs from all the gluster commands I could think of.
>> Note that we had to remove the terabytes of data to keep the system
>> operational, because it is a live system.
>>
>> # gluster volume status
>>
>> Status of volume: gv0
>> Gluster process                        TCP Port  RDMA Port  Online  Pid
>> ------------------------------------------------------------------------------
>> Brick 10.191.206.15:/mnt/brick1/gv0    49154     0          Y       2675
>> Brick 10.191.198.15:/mnt/brick1/gv0    49154     0          Y       2679
>> Self-heal Daemon on localhost          N/A       N/A        Y       12309
>> Self-heal Daemon on 10.191.206.15      N/A       N/A        Y       2670
>>
>> Task Status of Volume gv0
>> ------------------------------------------------------------------------------
>> There are no active volume tasks
>
> OK so your bricks are all online, you have two nodes with 1 brick per node.

Yes.

>
>>
>> # gluster volume info
>>
>> Volume Name: gv0
>> Type: Replicate
>> Volume ID: 5e47d0b8-b348-45bb-9a2a-800f301df95b
>> Status: Started
>> Snapshot Count: 0
>> Number of Bricks: 1 x 2 = 2
>> Transport-type: tcp
>> Bricks:
>> Brick1: 10.191.206.15:/mnt/brick1/gv0
>> Brick2: 10.191.198.15:/mnt/brick1/gv0
>> Options Reconfigured:
>> transport.address-family: inet
>> performance.readdir-ahead: on
>> nfs.disable: on
>
> You are using a replicate volume with 2 copies of your data, it looks like you are using the defaults as I don't see any tuning.

The only thing we tuned is network.ping-timeout, which we set to 10 seconds (if that is not the default anyway).

>
>>
>> # gluster peer status
>>
>> Number of Peers: 1
>>
>> Hostname: 10.191.206.15
>> Uuid: 030a879d-da93-4a48-8c69-1c552d3399d2
>> State: Peer in Cluster (Connected)
>>
>>
>> # gluster --version
>>
>> glusterfs 3.8.11 built on Apr 11 2017 09:50:39
>> Repository revision: git://git.gluster.com/glusterfs.git
>> Copyright (c) 2006-2011 Gluster Inc. <http://www.gluster.com>
>> GlusterFS comes with ABSOLUTELY NO WARRANTY.
>> You may redistribute copies of GlusterFS under the terms of the GNU General
>> Public License.
>
> You are running Gluster 3.8 which is the latest upstream release marked stable.
>
>>
>> # df -h
>>
>> Filesystem               Size  Used  Avail  Use%  Mounted on
>> /dev/mapper/vg00-root     75G  5.7G    69G    8%  /
>> devtmpfs                 1.9G     0   1.9G    0%  /dev
>> tmpfs                    1.9G     0   1.9G    0%  /dev/shm
>> tmpfs                    1.9G   17M   1.9G    1%  /run
>> tmpfs                    1.9G     0   1.9G    0%  /sys/fs/cgroup
>> /dev/sda1                477M  151M   297M   34%  /boot
>> /dev/mapper/vg10-brick1  8.0T  700M   8.0T    1%  /mnt/brick1
>> localhost:/gv0           8.0T  768M   8.0T    1%  /mnt/glusterfs_client
>> tmpfs                    380M     0   380M    0%  /run/user/0
>>
>
> Your brick is:
>
> /dev/mapper/vg10-brick1  8.0T  700M  8.0T  1%  /mnt/brick1
>
> The block device is 8TB. Can you tell me more about your brick? Is it a single disk or a RAID? If it's a RAID can you tell me about the disks? I am interested in:
>
> -Size of disks
> -RAID type
> -Stripe size
> -RAID controller

Not sure about the disks, because the storage comes from a large storage system (not the cheap NAS kind, but the really expensive rack kind), which VMware then uses to present a single volume to my virtual machine.
I am pretty sure that on the storage system there is some kind of RAID going on, but I am not sure whether that has any effect on the "virtual" disk that is presented to my VM. To the VM the disk does not look like a RAID, as far as I can tell.
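
One quick check I could run from inside the VM (assuming the brick device is /dev/sdb, as in the ssm output below): if the hypervisor exposed any stripe geometry to the guest, it would show up in the block layer's I/O size hints. Zeros or plain sector-sized values there would confirm that no RAID geometry is visible to the VM.

# lsblk -o NAME,SIZE,MIN-IO,OPT-IO,PHY-SEC,ROTA /dev/sdb
# cat /sys/block/sdb/queue/minimum_io_size
# cat /sys/block/sdb/queue/optimal_io_size
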
# lvdisplay

  --- Logical volume ---
  LV Path                /dev/vg10/brick1
  LV Name                brick1
  VG Name                vg10
  LV UUID                OEvHEG-m5zc-2MQ1-3gNd-o2gh-q405-YWG02j
  LV Write Access        read/write
  LV Creation host, time localhost, 2017-01-26 09:44:08 +0000
  LV Status              available
  # open                 1
  LV Size                8.00 TiB
  Current LE             2096890
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     8192
  Block device           253:1

  --- Logical volume ---
  LV Path                /dev/vg00/root
  LV Name                root
  VG Name                vg00
  LV UUID                3uyF7l-Xhfa-6frx-qjsP-Iy0u-JdbQ-Me03AS
  LV Write Access        read/write
  LV Creation host, time localhost, 2016-12-15 14:24:08 +0000
  LV Status              available
  # open                 1
  LV Size                74.49 GiB
  Current LE             19069
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     8192
  Block device           253:0

# ssm list
-----------------------------------------------------------
Device     Free      Used      Total      Pool  Mount point
-----------------------------------------------------------
/dev/fd0                       4.00 KB
/dev/sda                      80.00 GB          PARTITIONED
/dev/sda1                    500.00 MB          /boot
/dev/sda2  20.00 MB  74.49 GB  74.51 GB   vg00
/dev/sda3                      5.00 GB          SWAP
/dev/sdb    1.02 GB   8.00 TB   8.00 TB   vg10
-----------------------------------------------------------
-------------------------------------------------
Pool  Type  Devices  Free      Used      Total
-------------------------------------------------
vg00  lvm   1        20.00 MB  74.49 GB  74.51 GB
vg10  lvm   1         1.02 GB   8.00 TB   8.00 TB
-------------------------------------------------
------------------------------------------------------------------------------------
Volume            Pool  Volume size  FS    FS size    Free       Type    Mount point
------------------------------------------------------------------------------------
/dev/vg00/root    vg00  74.49 GB     xfs   74.45 GB   69.36 GB   linear  /
/dev/vg10/brick1  vg10  8.00 TB      xfs   8.00 TB    8.00 TB    linear  /mnt/brick1
/dev/sda1               500.00 MB    ext4  500.00 MB  300.92 MB  part    /boot
------------------------------------------------------------------------------------

>
> I also see:
>
> localhost:/gv0  8.0T  768M  8.0T  1%  /mnt/glusterfs_client
>
> So you are mounting your volume on the local node, is this the mount where you are writing data to?

Yes, this is the mount I am writing to.

>
>>
>>
>> The setup of the servers is done via shell script on CentOS 7 containing the
>> following commands:
>>
>> yum install -y centos-release-gluster
>> yum install -y glusterfs-server
>>
>> mkdir /mnt/brick1
>> ssm create -s 999G -n brick1 --fstype xfs -p vg10 /dev/sdb /mnt/brick1
>
> I haven't used system-storage-manager before, do you know if it takes care of properly tuning your storage stack (if you have a RAID, that is)? If you don't have a RAID it's prolly not that big of a deal, if you do have a RAID we should make sure everything is aware of your stripe size and tune appropriately.

I am not sure if ssm does any tuning by default, but since there does not seem to be a RAID (at least for the VM) I don't think tuning is necessary.
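
To double-check that, this is what I would look at on the existing brick (a read-only sanity check, nothing gluster-specific): if ssm had passed any stripe geometry to mkfs.xfs, it would show up as non-zero sunit/swidth values in the xfs_info output.

# xfs_info /mnt/brick1
# mount | grep brick1
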
> >> >> echo "/dev/mapper/vg10-brick1 /mnt/brick1 xfs defaults 1 2" >> >> /etc/fstab >> mount -a && mount >> mkdir /mnt/brick1/gv0 >> >> gluster peer probe OTHER_SERVER_IP >> >> gluster pool list >> gluster volume create gv0 replica 2 OWN_SERVER_IP:/mnt/brick1/gv0 >> OTHER_SERVER_IP:/mnt/brick1/gv0 >> gluster volume start gv0 >> gluster volume info gv0 >> gluster volume set gv0 network.ping-timeout "10" >> gluster volume info gv0 >> >> # mount as client for archiving cronjob, is already in fstab >> mount -a >> >> # mount via fuse-client >> mkdir -p /mnt/glusterfs_client >> echo "localhost:/gv0 /mnt/glusterfs_client glusterfs defaults,_netdev 0 0" >> >> /etc/fstab >> mount -a >> >> >> We untar multiple files (around 1300 tar files) each around 2,7GB in size. >> The tar files are not compressed. >> We untar the files with a shell script containing the following: >> >> #! /bin/bash >> for f in *.tar; do tar xfP $f; done > > Your script looks good, I am not that familiar with the tar flag "P" but it looks to mean: > > -P, --absolute-names > Don't strip leading slashes from file names when creating archives. > > I don't see anything strange here, everything looks OK. > >> >> The script is run as user root, the processes glusterd, glusterfs and >> glusterfsd also run under user root. >> >> Each tar file consists of a single folder with multiple folders and files in >> it. >> The folder tree looks like this (note that the "=“ is part of the folder >> name): >> >> 1498780800/ >> - timeframe_hour=1498780800/ (about 25 of these folders) >> -- type=1/ (about 25 folders total) >> --- data-x.gz.parquet (between 100MB and 1kb in size) >> --- data-x.gz.parquet.crc (around 1kb in size) >> -- … >> - ... >> >> Unfortunately I cannot share the file contents with you. > > Thats no problem, I'll try to recreate this in the lab. > >> >> We have not seen any other issues with glusterfs, when untaring just a few of >> those files. I just tried writing a 100GB with dd and did not see any issues >> there, the file is replicated and the GFID attribute is set correctly on >> both nodes. > > ACK. I do this all the time, if you saw an issue here I would be worried about your setup. > >> >> We are not able to reproduce this in our lab environment which is a clone >> (actual cloned VMs) of the other system, but it only has around 1TB of >> storage. >> Do you think this could be an issue with the number of files which is >> generated by tar (over 1.5 million files). ? >> What I can say is that it is not an issue with inodes, that I checked when >> all the files where unpacked on the live system. > > Hmm I am not sure. Its strange that you can't repro this on your other config, in the lab I have a ton of space to work with so I can run a ton of data in my repro. > >> >> If you need anything else, let me know. > > Can you help clarify your reproducer so I can give it a go in the lab? From what I can tell you have: > > 1498780800/ <-- Just a string of numbers, this is the root dir of your tarball > - timeframe_hour=1498780800/ (about 25 of these folders) <-- This is the second level dir of your tarball, there are ~25 of these dirs that mention a timeframe and an hour > -- type=1/ (about 25 folders total) <-- This is the 3rd level of your tar, there are about 25 different type=$X dirs > --- data-x.gz.parquet (between 100MB and 1kb in size) <-- This is your actual data. Is there just 1 pair of these file per dir or multiple? > --- data-x.gz.parquet.crc (around 1kb in size) <-- This is a checksum for the above file? 
Thank you for your help,
Christoph

_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://lists.gluster.org/mailman/listinfo/gluster-users