On Sun, Aug 16, 2015 at 01:35:24AM +0200, Geoffrey Letessier wrote:
> Hi,
>
> Since I upgraded GlusterFS from 3.5.3 to 3.7.x, trying to solve my quota miscalculation and poor performance (as advised by the user support team), we have been out of production for roughly 7 weeks because of the many v3.7.x issues we have met:
>
> - T-file apparitions. I notice a lot of T files (with permissions ---------T) located in my brick paths. Vijay explained to me that T-files appear when a rename is performed or when an add-brick/remove-brick is performed; but the problem is, since I completely re-created the volume (with RAID initialization, etc.) and imported my data into it, I have renamed nothing and never added nor removed any brick.
> So why are these T-files present in my new volume? For example, for my /derreumaux_team directory, I have 13891 real files and 704 T-files in total in the brick paths…
> How can I clean them up while avoiding side effects?

Apart from the reasons that you mention above, there are a few others. A
T-file (or linkfile) is a marker that tells Gluster clients that the actual
file is on a different brick/subvolume than the elastic hashing expects. I'm
no expert on the hashing, but I think at least these two things cause
linkfiles too:

- hardlinks: the filename is used for the hashing; a new hardlink has a
  different name/path than the original file, but the contents are the same.
  The new hardlink may 'hash' to a different brick, and a linkfile there
  points back to the original file.
- during creation of a new file, the brick/subvolume that should receive the
  new file is offline. Creation then succeeds on a different brick/subvolume.

Depending on the reason for the creation of the linkfiles, there is no need
to remove them. The linkfiles will get created again whenever the elastic
hash points to a brick where the file should be located, but the file is
actually located somewhere else.
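If you want to inspect the linkfiles on a brick, something like the sketch
below can help (the brick path is only an assumed example; linkfiles are
empty files with only the sticky bit set, and they carry a
trusted.glusterfs.dht.linkto xattr naming the subvolume that holds the real
file):

```shell
#!/bin/sh
# List DHT linkfiles on a brick: mode 1000 (---------T), size 0.
# BRICK is an assumed example path; adjust it to your environment.
BRICK=${BRICK:-/export/brick1}

find "$BRICK" -path "$BRICK/.glusterfs" -prune -o \
     -type f -perm 1000 -size 0 -print |
while read -r f; do
    echo "linkfile: $f"
    # The xattr value names the subvolume holding the real file
    # (getfattr is part of the attr package).
    getfattr -n trusted.glusterfs.dht.linkto --only-values "$f" 2>/dev/null
    echo
done
```

Counting the output of such a find against the number of real files should
give numbers comparable to the 13891/704 you mention.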
> The first time I noticed this kind of file, it was after having set a quota below the real path size, which resulted in some quota explosions (quota daemon failure) and T-file apparitions...
>
> - 7006 files in split-brain status after having transferred the data back (30TB, 6.2M files) from a backup server into my freshly created volume. Thanks to Mathieu Chateau, who helped put me on the right road (GFID vs real file path), this problem has been fixed manually.
>
> - log issue. After having created only one file (35GB), I can see more than 186000 new lines in the brick log files. I can stop them by setting brick-log-level to CRITICAL, but I guess this issue gravely impacts the IO performance and throughput. Vijay told me he fixed this problem in the code, but I apparently need to wait for the new release to take advantage of it… Very nice for production!
>
> Actually, if I don't set brick-log-level to CRITICAL, I can fill my /var partition (10GB) in less than 1 day running some tests/benchmarks on the volume…

I highly recommend that you file a bug for the excessive logging. Some log
messages might be more useful for developers than for sysadmins and should
not be logged by default. We have changed several of these before, and will
continue to fix them whenever users report them.

  https://bugzilla.redhat.com/enter_bug.cgi?product=GlusterFS&component=logging

It is also possible that there still is an issue in your environment that
needs to be addressed. When you file a bug for these log messages, please
include the steps/procedure you followed to restore the data into the new
volume. It could be that the process is missing a step or something like
that.

> - volume healing issue: slightly fewer than 14000 files were in a bad state (# gluster volume heal <vol_home> info) and a new forced heal of the volume made no change. Thanks to Krutika and Pranith, this problem is now fixed.
>
> - du/df/stat/etc. hangs caused by the RDMA protocol.
> This problem seems to not occur anymore since I upgraded GlusterFS from v3.7.2 to v3.7.3. It was probably due to the brick crashes (a few minutes or a few days after [re]starting the volume) we had with the RDMA transport type. I noticed it only with v3.7.2.

Good to hear that at least these two issues have been fixed now.

> - quota problem: after having forced (with success) the quota re-calculation (with a simple du for each defined quota), and after a couple of days with good values, the quota daemon failed again (some quota explosions, etc.)

Do you have a bug report for this issue? I am sure the developers working on
quota should be looking into this unexplained behaviour.

> - a lot of warnings during tar operations on replicated volumes:
> tar: linux-4.1-rc6/sound/soc/codecs/wm8962.c: file changed as we read it

This is a commonly reported problem. You can find more details about it in
this email from Vijay:

  http://www.gluster.org/pipermail/gluster-devel/2014-December/043356.html

Using "tar -P ..." should prevent those messages.

> - low I/O performance and throughput:
>
> 1- if I enable the quota feature, my IO throughput is divided by 2. So, for the moment, I have disabled this feature… (only since I upgraded GlusterFS to a 3.7.x version)
> 2- since I upgraded GlusterFS from 3.5.3 to 3.7.3, my I/O performance and throughput are lower than before, as you can read below.
> (keeping in mind that I have disabled the quota feature)
>
> IO operation tests with a Linux kernel archive (80MB tarball, ~53000 files, 550MB uncompressed):
>
> ------------------------------------------------------------------------
> |                          PRODUCTION HARDWARE                         |
> ------------------------------------------------------------------------
> |             | UNTAR   | DU    | FIND   | GREP   | TAR    | RM     |
> ------------------------------------------------------------------------
> | native FS   | ~16s    | ~0.1s | ~0.1s  | ~0.1s  | ~24s   | ~3s    |
> ------------------------------------------------------------------------
> |                        GlusterFS version 3.5.3                       |
> ------------------------------------------------------------------------
> | distributed | ~2m57s  | ~23s  | ~22s   | ~49s   | ~50s   | ~54s   |
> ------------------------------------------------------------------------
> | dist-repl   | ~29m56s | ~1m5s | ~1m04s | ~1m32s | ~1m31s | ~2m40s |
> ------------------------------------------------------------------------
> |                        GlusterFS version 3.7.3                       |
> ------------------------------------------------------------------------
> | distributed | ~2m49s  | ~20s  | ~29s   | ~58s   | ~60s   | ~41s   |
> ------------------------------------------------------------------------
> | dist-repl   | ~28m24s | ~51s  | ~37s   | ~1m16s | ~1m14s | ~1m17s |
> ------------------------------------------------------------------------
> *:
> - distributed: 4 bricks (2 bricks on 2 servers)
> - dist-repl: 4 bricks (2 bricks on 2 servers) per replica, 2 replicas
> - native FS: each brick path (XFS)

Most of these results are indeed a little worse than before. Some of the
distribute-only numbers actually show quite some performance degradation.
You mentioned that the impact of quotad is huge. The tests that compare
quota enabled vs. disabled are not in the table, though.
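In case it helps anyone reproduce these numbers, the sequence behind the
table can be scripted roughly like this (TARBALL and WORKDIR are assumed
example locations, not taken from your setup; `time -p` prints the elapsed
seconds the rows refer to):

```shell
#!/bin/bash
# Rough sketch of the UNTAR/DU/FIND/GREP/TAR/RM sequence from the table.
# TARBALL and WORKDIR are assumed example paths; adjust to your setup.
TARBALL=${TARBALL:-/tmp/linux-4.1-rc6.tar.gz}
WORKDIR=${WORKDIR:-/mnt/vol_home/bench}

mkdir -p "$WORKDIR" && cd "$WORKDIR" || exit 1

time -p tar xf "$TARBALL"                      # UNTAR
time -p du -sh .                               # DU
time -p find . -type f > /dev/null             # FIND
time -p grep -r 'module_init' . > /dev/null    # GREP
time -p tar cf /tmp/rearchived.tar .           # TAR (archive written outside WORKDIR)
time -p rm -rf ./*                             # RM
```

Running it once with quota enabled and once with it disabled would give the
quota comparison that is currently missing from the table.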
>
> And the craziest thing is that I did the same test on a crash-test storage cluster (2 old Dell servers, every brick a single 2TB 7.2k hard drive, 2 bricks per server) and its performance exceeds that of the production hardware (4 recent servers, 2 bricks each, each brick a 24TB RAID6 with good LSI RAID controllers, 1 controller per brick):
>
> ------------------------------------------------------------------------
> |                          CRASHTEST HARDWARE                          |
> ------------------------------------------------------------------------
> |             | UNTAR   | DU    | FIND  | GREP   | TAR    | RM     |
> ------------------------------------------------------------------------
> | native FS   | ~19s    | ~0.2s | ~0.1s | ~1.2s  | ~29s   | ~2s    |
> ------------------------------------------------------------------------
> | single      | ~3m45s  | ~43s  | ~47s  |        | ~3m10s | ~3m15s |
> ------------------------------------------------------------------------
> | single v2*  | ~3m24s  | ~13s  | ~33s  | ~1m10s | ~46s   | ~48s   |
> ------------------------------------------------------------------------
> | single NFS  | ~23m51s | ~3s   | ~1s   | ~27s   | ~36s   | ~13s   |
> ------------------------------------------------------------------------
> | replicated  | ~5m10s  | ~59s  | ~1m6s |        | ~1m19s | ~1m49s |
> ------------------------------------------------------------------------
> | distributed | ~4m18s  | ~41s  | ~57s  |        | ~2m24s | ~1m38s |
> ------------------------------------------------------------------------
> | dist-repl   | ~7m1s   | ~19s  | ~31s  | ~1m34s | ~1m26s | ~2m11s |
> ------------------------------------------------------------------------
> | FhGFS(dist) | ~3m33s  | ~15s  | ~2s   | ~1m31s | ~1m31s | ~52s   |
> ------------------------------------------------------------------------
> *: with default parameters
>
> Concerning the throughput (for both write and read operations): on the production hardware it was around 600MB/s (dist-repl volume) and 1.1GB/s
> (distributed volume) with GlusterFS version 3.5.3 and the TCP network transport type (RDMA never worked in my storage cluster before GlusterFS 3.7.x).
> Now, it is around 500-600MB/s with RDMA and 150-300MB/s with TCP (for the dist-repl volume), and around 600-700MB/s with RDMA and 500-600MB/s with TCP for the distributed volume.

I'm not aware of any changes that could cause such a huge impact. Ben Turner
did run a lot of performance tests with different versions. Maybe he has an
idea what could be wrong.

> Could you help get our HPC center back into production by solving the above-mentioned issues? Or do you advise me to downgrade to v3.5.3 (the most stable version I have known since I started using GlusterFS in production)? Or move on? ;-)

I can not say how easy it is to downgrade to 3.5. The changes in quota for
later versions would need to be undone, I think.

From my understanding, you currently have two pending problems:

1. the log messages: these need a bug report and more investigation
2. a major performance regression: not sure if this is a known issue, or
   whether it has been reported

Kind regards,
Niels

>
> Thanks in advance.
> Geoffrey
> ------------------------------------------------------
> Geoffrey Letessier
> IT manager & systems engineer
> UPR 9080 - CNRS - Laboratoire de Biochimie Théorique
> Institut de Biologie Physico-Chimique
> 13, rue Pierre et Marie Curie - 75005 Paris
> Tel: 01 58 41 50 93 - eMail: geoffrey.letessier@xxxxxxx
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users@xxxxxxxxxxx
> http://www.gluster.org/mailman/listinfo/gluster-users