Re: [Gluster-devel] Weird full heal on Distributed-Disperse volume with sharding

Xavi Hernandez <xhernandez@xxxxxxxxxx> · Wed, 30 Sep 2020 09:56:05 +0200

Hi Dmitry,

On Wed, Sep 30, 2020 at 9:21 AM Dmitry Antipov <dmantipov@xxxxxxxxx> wrote:
On 9/30/20 8:58 AM, Xavi Hernandez wrote:

> This is normal. A dispersed volume writes encoded fragments of each block in each brick. In this case it's a 2+1 configuration, so each block is divided into 2 fragments. A third fragment is generated for redundancy and stored on the third brick.

OK. But for a Distributed-Replicate 2 x 3 setup with 64K shards, a 4M file should be split into (4096 / 64) * 3 = 192 shards, not 189. So why 189?

In fact, there aren't 189 shards. There are 63 shards replicated 3 times each (63 * 3 = 189 files on the bricks). Shard 0 is not inside the .shard directory. It's placed in the directory where the file was created. So there are a total of 64 chunks of 64 KiB = 4 MiB.
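The arithmetic above can be checked with a minimal Python sketch. The function name and its return tuple are only illustrative, and it assumes the behaviour described here: chunk 0 stays with the base file and every remaining chunk is stored once per replica.

```python
KIB = 1024
MIB = 1024 * KIB

def shard_layout(file_size, shard_size, replicas):
    """Return (total_chunks, shards_in_dot_shard, shard_files_on_bricks)."""
    total_chunks = -(-file_size // shard_size)  # ceiling division
    in_dot_shard = max(total_chunks - 1, 0)     # chunk 0 lives next to the base file
    return total_chunks, in_dot_shard, in_dot_shard * replicas

total, hidden, on_bricks = shard_layout(4 * MIB, 64 * KIB, replicas=3)
print(total, hidden, on_bricks)  # 64 63 189
```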

And if all bricks are considered equal and have enough free space, shards distribution {24, 24, 24, 39, 39, 39} looks suboptimal.

Shards are distributed exactly like regular files. This means that they are balanced based on a random distribution (with some correction when free space is not equal, but this is irrelevant now). Random distributions tend to balance the number of files very well, but only with a big number of files. Statistics on a small number of files may be biased.

If you keep adding new files to the volume, the balance will improve.
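To get a feeling for how the balance improves with the number of files, here is a small simulation. It is only a sketch: it models placement as a uniform random choice between subvolumes, which is a simplification of what the distribution layer really does, and the names place_files and imbalance are made up for the example.

```python
import random

def place_files(num_files, num_subvols, rng):
    """Assign each file to a subvolume chosen uniformly at random."""
    counts = [0] * num_subvols
    for _ in range(num_files):
        counts[rng.randrange(num_subvols)] += 1
    return counts

def imbalance(counts):
    """Gap between fullest and emptiest subvolume, relative to the mean."""
    mean = sum(counts) / len(counts)
    return (max(counts) - min(counts)) / mean

rng = random.Random(0)  # fixed seed so runs are repeatable
for n in (63, 1_000, 100_000):
    counts = place_files(n, 2, rng)
    print(n, counts, round(imbalance(counts), 3))
```

For the 63 shards discussed here, a 24/39 split gives a relative gap of about 0.48. With many more files the gap typically becomes a small fraction of the mean.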

Why not {31, 32, 31, 32, 31, 32}? Isn't it a bug?

This can't happen. When you create a 2 x 3 replicated volume, you are creating 2 independent replica 3 subvolumes. The first replica set is composed of the first 3 bricks, and the second of the last 3. The distribution layer chooses on which replica set to put each file.
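The grouping can be sketched in a few lines of Python. The brick names are placeholders and the helper names are hypothetical. The sketch only illustrates that consecutive bricks form a replica set and that every brick of a set holds the same number of files, which is why the counts come in groups of three.

```python
def replica_sets(bricks, replica_count):
    """Group bricks, in volume-creation order, into replica sets."""
    if len(bricks) % replica_count != 0:
        raise ValueError("brick count must be a multiple of the replica count")
    return [bricks[i:i + replica_count]
            for i in range(0, len(bricks), replica_count)]

def per_brick_counts(files_per_set, replica_count):
    """Every brick in a replica set stores a copy of each file in that set."""
    return [n for n in files_per_set for _ in range(replica_count)]

bricks = ["brick0", "brick1", "brick2", "brick3", "brick4", "brick5"]
print(replica_sets(bricks, 3))
print(per_brick_counts([24, 39], 3))  # [24, 24, 24, 39, 39, 39]
```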

It's not a bug. It's by design. Gluster can work with multiple clients creating files simultaneously. To force a perfect distribution, all of them would have to synchronize to decide where to create each file. This would have a significant performance impact. Instead of that, distribution is done randomly, which allows each client to work independently and it will balance files pretty well in the long term.

> This is not right. A disperse 2+1 configuration only supports a single failure. Wiping 2 fragments from the same file makes the file unrecoverable. Disperse works using the Reed-Solomon erasure code, which requires at least 2 healthy fragments to recover the data (in a 2+1 configuration).

It seems that I missed the point that all bricks are considered equal, regardless of the physical host they're attached to.

All bricks are considered equal inside a single replica/disperse set. A 2 x (2 + 1) configuration has 2 independent disperse sets, so only one brick from each of them may fail without data loss. If you want to support any 2 brick failures, you need to use a 1 x (4 + 2) configuration. In this case there's a single disperse set which tolerates up to 2 brick failures.
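The difference between the two configurations can be checked by brute force. This is a minimal sketch with hypothetical helper names. It assumes bricks are numbered in creation order and that each disperse set is made of consecutive bricks, and it treats a set as lost when more bricks fail than its redundancy.

```python
from itertools import combinations

def survives(num_sets, data, redundancy, failed):
    """True if no disperse set loses more bricks than its redundancy."""
    width = data + redundancy
    for s in range(num_sets):
        members = range(s * width, (s + 1) * width)
        lost = sum(1 for b in members if b in failed)
        if lost > redundancy:
            return False
    return True

def tolerates_any(num_sets, data, redundancy, k):
    """True if every possible combination of k failed bricks is survivable."""
    total = num_sets * (data + redundancy)
    return all(survives(num_sets, data, redundancy, set(c))
               for c in combinations(range(total), k))

print(tolerates_any(2, 2, 1, 1))  # True:  2 x (2 + 1), any single brick
print(tolerates_any(2, 2, 1, 2))  # False: two bricks may hit the same set
print(tolerates_any(1, 4, 2, 2))  # True:  1 x (4 + 2), any two bricks
```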

So, for the Distributed-Disperse 2 x (2 + 1) setup with 3 hosts, 2 bricks each, and two files, A and B, it's possible to have the following layout:

Host0:                  Host1:                  Host2:
|- Brick0: A0 B0        |- Brick0: A1           |- Brick0: A2
|- Brick1: B1           |- Brick1: B2           |- Brick1:

No, this won't happen. A single file will go either to brick0 of all hosts or brick1 of all hosts. They won't be mixed.

This setup can tolerate single brick failure but not single host failure because if Host0 is down, two fragments of B will be lost and so B becomes unrecoverable (but A is not).

If this is so, is it possible/hard to enforce 'one fragment per *host*' behavior? If we can guarantee the following:

Host0:                  Host1:                  Host2:
|- Brick0: A0           |- Brick0: A1           |- Brick0: A2
|- Brick1: B1           |- Brick1: B2           |- Brick1: B0

This is how it currently works. You only need to take care of creating the volume with the bricks in the right order. In this case the order should be H0/B0, H1/B0, H2/B0, H0/B1, H1/B1, H2/B1. Anyway, if you create the volume using an incorrect order and two bricks of the same disperse set are placed in the same host, the operation will complain about it. This will only be accepted by gluster if you create the volume with the 'force' option.
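A brick order with one fragment per host can be generated and checked with a short Python sketch. The H0/B0 labels follow the notation used here, the helper names are hypothetical, and the sketch assumes the number of hosts equals the width of a disperse set (3 for a 2 + 1 configuration). It does not reproduce the check that gluster itself performs.

```python
def brick_order(hosts, bricks_per_host):
    """List bricks so that each consecutive group has one brick per host."""
    return [f"{h}/B{b}" for b in range(bricks_per_host) for h in hosts]

def sets_share_host(order, set_width):
    """True if any disperse set has two bricks on the same host."""
    for i in range(0, len(order), set_width):
        hosts_in_set = [brick.split("/")[0] for brick in order[i:i + set_width]]
        if len(set(hosts_in_set)) != len(hosts_in_set):
            return True
    return False

good = brick_order(["H0", "H1", "H2"], 2)
print(good)  # ['H0/B0', 'H1/B0', 'H2/B0', 'H0/B1', 'H1/B1', 'H2/B1']
print(sets_share_host(good, 3))  # False

# Listing bricks host by host puts two bricks of one set on the same host.
bad = ["H0/B0", "H0/B1", "H1/B0", "H1/B1", "H2/B0", "H2/B1"]
print(sets_share_host(bad, 3))   # True
```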

Regards,

Xavi

this setup can tolerate both single brick and single host failures.

Dmitry

________

Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-users