Hi,
On 01/08/2015 09:51 AM, RASTELLI Alessandro wrote:
Hi Xavi,
now there are some files on nodes1-2-3 and others on nodes4-5, so I think I'm going to destroy and re-create the volume from scratch (I can afford it now).
If the data is not needed, this is the best way to remove all problems.
However, if you continue testing and arrive at this situation again, I
would be very interested to know what operations you performed, with as
many details about your workload as you can provide. Maybe there's a bug
causing this problem.
In your opinion, having 5 nodes with 10x 4TB disks each, what's the best way to dimension the bricks?
We have now configured a disperse FS with 2 bricks per node per volume (2x 4TB RAID0 each). If I'm not wrong, we can afford losing 2 bricks (= an entire node).
Would it be better to use a distributed FS with 1 brick per node (10x 4TB RAID5 each)?
Or do you have other suggestions?
The best configuration depends on your specific hardware characteristics
and your needs or preferences.
The main factor is the MTBF/AFR of the disks (Mean Time Between Failures
/ Annualized Failure Rate).
The relationship between MTBF (in hours) and AFR is defined by (assuming
disks are working uninterrupted all year, i.e. 8760 hours):
AFR = 1 - exp(-8760 / MTBF)
AFR is the probability that a single disk fails in one year.
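To make the formula concrete, here is a minimal Python sketch of the conversion in both directions. The function names and the 300,000-hour MTBF are hypothetical illustration values, not figures from any datasheet.

```python
import math

HOURS_PER_YEAR = 8760  # 365 days * 24 hours, disks assumed powered on all year


def afr_from_mtbf(mtbf_hours: float) -> float:
    """Probability that a single disk fails within one year."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


def mtbf_from_afr(afr: float) -> float:
    """Inverse relation: MTBF in hours for a given annual failure probability."""
    return -HOURS_PER_YEAR / math.log(1.0 - afr)


# Hypothetical example: a disk rated at 300,000 hours MTBF.
print(f"AFR for 300,000 h MTBF: {afr_from_mtbf(300_000):.2%}")
# MTBF that corresponds to the 3% AFR assumed in the rest of this mail.
print(f"MTBF for a 3% AFR: {mtbf_from_afr(0.03):,.0f} hours")
```

A 3% AFR corresponds to an MTBF of a bit under 300,000 hours, which is why that value is a convenient starting point.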
In your environment, each server has 10 disks. If we assume an AFR of
3%, we can calculate some failure probabilities in different
configurations (all probabilities are in one year):
Failure probability of a single disk: 3.00%
Failure probability of a RAID-0 with 5 disks: 14.13%
Failure probability of a RAID-0 with 10 disks: 26.26%
Failure probability of a RAID-5 with 5 disks: 0.85%
Failure probability of a RAID-5 with 10 disks: 3.45% *
Failure probability of a RAID-6 with 5 disks: 0.03%
Failure probability of a RAID-6 with 10 disks: 0.28%
Failure probability of a RAID-50 with 10 disks: 1.69%
* Note that a RAID-5 of 10 disks has a higher probability of failure
than a single disk in this case.
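The figures above can be reproduced with a binomial model, assuming disk failures are independent and ignoring rebuild windows. This is only a sketch under those assumptions; the function name is hypothetical, and RAID-50 is modelled as two 5-disk RAID-5 sets striped together.

```python
from math import comb


def p_array_fails(disks: int, tolerated: int, afr: float) -> float:
    """Probability that more than `tolerated` of `disks` fail within a year."""
    survive = sum(
        comb(disks, k) * afr**k * (1 - afr) ** (disks - k)
        for k in range(tolerated + 1)
    )
    return 1 - survive


AFR = 0.03
raid5_5 = p_array_fails(5, 1, AFR)
rows = {
    "RAID-0, 5 disks": p_array_fails(5, 0, AFR),
    "RAID-0, 10 disks": p_array_fails(10, 0, AFR),
    "RAID-5, 5 disks": raid5_5,
    "RAID-5, 10 disks": p_array_fails(10, 1, AFR),
    "RAID-6, 5 disks": p_array_fails(5, 2, AFR),
    "RAID-6, 10 disks": p_array_fails(10, 2, AFR),
    # The stripe is lost as soon as either RAID-5 set is lost.
    "RAID-50, 10 disks": 1 - (1 - raid5_5) ** 2,
}
for name, p in rows.items():
    print(f"{name:<20} {p:.2%}")
```

RAID-0 tolerates no failures, RAID-5 tolerates one and RAID-6 tolerates two, so a single function with a `tolerated` parameter covers all three.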
Once you have calculated the failure probability for your hardware
configuration, this probability can be considered as the AFR of a single
disk used as a brick for gluster.
Then you can calculate the failure probability of the gluster volume
(assuming you have an AFR of 3.45% using a RAID-5 of 10 disks):
Failure probability of a Disperse 3:1: 0.35%
Failure probability of a Disperse 5:1: 1.11%
Failure probability of a Disperse 6:2: 0.08%
Failure probability of a Disperse 10:2: 0.41%
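The same binomial reasoning applies one level up, with bricks taking the place of disks. The sketch below reads N:R as N bricks in total of which R may fail; that reading is an assumption, but it reproduces the four figures listed above. The function name is hypothetical.

```python
from math import comb


def p_volume_fails(bricks: int, redundancy: int, brick_afr: float) -> float:
    """Probability that more than `redundancy` of `bricks` fail within a year."""
    survive = sum(
        comb(bricks, k) * brick_afr**k * (1 - brick_afr) ** (bricks - k)
        for k in range(redundancy + 1)
    )
    return 1 - survive


BRICK_AFR = 0.0345  # one brick = one RAID-5 of 10 disks, as assumed above

for bricks, redundancy in [(3, 1), (5, 1), (6, 2), (10, 2)]:
    p = p_volume_fails(bricks, redundancy, BRICK_AFR)
    print(f"Disperse {bricks}:{redundancy}  {p:.2%}")
```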
Gluster has the possibility of using Distribute. Distribute is similar
to a RAID-0 (it combines multiple subvolumes into a single one), but if
one subvolume fails, only data stored in that subvolume is lost (in a
RAID-0, if a single disk fails, the entire RAID is lost).
This doesn't reduce the probability of failure, but it reduces the
impact of that failure (it's much harder to lose all data):
Failure prob of a Distributed-Dispersed 2x3:1: 0.6956% (1 subvol)
                                               0.0012% (2 subvol)
Failure prob of a Distributed-Dispersed 2x5:1: 2.1966% (1 subvol)
                                               0.0123% (2 subvol)
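As a worked example of the distribute case, the sketch below computes the chance of losing exactly one or both subvolumes of a 2x3:1 volume. It assumes independent failures and uses the unrounded RAID-5 brick AFR (about 3.45%) rather than the rounded value; the helper names are hypothetical.

```python
from math import comb


def p_exactly(n: int, k: int, p: float) -> float:
    """Probability that exactly k of n independent units fail."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def p_more_than(n: int, tolerated: int, p: float) -> float:
    """Probability that more than `tolerated` of n independent units fail."""
    return 1 - sum(p_exactly(n, k, p) for k in range(tolerated + 1))


brick_afr = p_more_than(10, 1, 0.03)   # RAID-5 of 10 disks at 3% disk AFR
subvol = p_more_than(3, 1, brick_afr)  # one Disperse 3:1 subvolume
one_lost = p_exactly(2, 1, subvol)     # only part of the data is lost
both_lost = p_exactly(2, 2, subvol)    # all data is lost

print(f"2x3:1, one subvolume lost:   {one_lost:.4%}")
print(f"2x3:1, both subvolumes lost: {both_lost:.4%}")
```

Losing something becomes roughly twice as likely as with a single 3:1 subvolume, while losing everything becomes far less likely.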
Of course all these numbers are only statistical. A batch of defective
drives or servers can ruin any configuration.
You should also consider the time needed to rebuild a brick if a RAID
fails. If you create RAID-5 of 10 disks, for example, gluster will need
to recover up to 36 TB of information (if brick was full). Using smaller
RAIDs reduces this amount of data.
If you use a single RAID to store multiple bricks, you will get multiple
brick failures in case of a RAID or server failure. In any case it's not
recommended to have more than one brick of the same subvolume in the
same server. It's better to use distribute in this case (a 10:2
configuration where a single server failure causes 2 bricks to fail, is
almost equivalent to a 5:1 configuration with respect to probabilities,
especially if disks are configured in a RAID-0).
I wouldn't recommend using RAID-0 with gluster. Instead of creating a
RAID-0 of 2 disks, it's better to create 2 bricks belonging to two
different gluster subvolumes and use distribute.
Failure probability of one brick using RAID-0 of two disks: 5.91%
Failure probability of two bricks using two disks:          5.82% (1 subvol)
                                                            0.09% (2 subvol)
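The arithmetic behind this comparison is short enough to spell out. This is a sketch assuming a 3% disk AFR and independent failures; the variable names are only illustrative.

```python
AFR = 0.03  # annual failure probability of one disk

# RAID-0 of two disks: the single brick is lost if either disk fails.
raid0_pair = 1 - (1 - AFR) ** 2

# Two independent one-disk bricks in different subvolumes.
one_brick = 2 * AFR * (1 - AFR)  # exactly one brick fails
both_bricks = AFR**2             # both bricks fail

print(f"RAID-0 brick lost:  {raid0_pair:.2%}")
print(f"one of two bricks:  {one_brick:.2%}")
print(f"both of two bricks: {both_bricks:.2%}")
```

The total chance of some failure is the same in both layouts (5.82% + 0.09% = 5.91%), but with separate bricks most failures affect only half of the data.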
RAID-5 or RAID-6 can be useful for single disk failure because the disk
can be recovered locally in the server without having to read data from
other servers. Only a more critical failure will require that gluster
rebuilds brick contents. However bigger RAIDs have greater failure
probabilities (though they waste less physical disk space).
You must also consider the cost of growing a volume. Disperse and
Replicate need to grow in multiples of the subvolume size. This means
that if you create a 3:1 configuration you will need to add 3 new bricks
if you want to get more space. If you start with a 10:2 configuration
you will need to add 10 new bricks to get more space.
In your case I would recommend using two RAID-5 of 5 disks each, or a
single RAID-6 of 10 disks, in each server. You can also opt to not use
any RAID and have 10 independent disks in each server. I would also
create relatively small bricks (for example 4TB each) and use a
distributed-dispersed 5:1, with one brick of each subvolume in each server.
With this configuration, if you lose one RAID or an entire server, you
will only lose, at most, one brick of each subvolume.
Probability of failure using RAID-6: 0.0076% (1 subvol)
Probability of failure using RAID-5 (5 disks): 0.07% (1 subvol)
Probability of failure without RAID: 0.85% (1 subvol)
Probability of failure using RAID-5 (10 disks): 1.11% (1 subvol)
Of course it's better using RAID, but you also waste more space:
Available space using RAID-6: 128 TB
Available space using RAID-5 (5 disks): 128 TB
Available space using RAID-5 (10 disks): 144 TB
Available space without RAID: 160 TB
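The capacity figures follow from two successive overheads: RAID parity inside each server, then disperse redundancy across servers. A minimal sketch, assuming 5 servers with 10 disks of 4 TB each as described in the question; the names are hypothetical.

```python
DISK_TB = 4
SERVERS = 5

# Data-bearing disks left per server after RAID parity, out of 10 physical disks.
DATA_DISKS_PER_SERVER = {
    "RAID-6 (10 disks)": 10 - 2,
    "RAID-5 (2 x 5 disks)": 2 * (5 - 1),
    "RAID-5 (10 disks)": 10 - 1,
    "no RAID": 10,
}


def usable_tb(data_disks_per_server: int, bricks: int, redundancy: int) -> float:
    """Usable volume size once the disperse redundancy has been subtracted."""
    raid_capacity = SERVERS * data_disks_per_server * DISK_TB
    return raid_capacity * (bricks - redundancy) / bricks


for layout, data_disks in DATA_DISKS_PER_SERVER.items():
    print(f"{layout:<22} {usable_tb(data_disks, bricks=5, redundancy=1):.0f} TB")
```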
Using RAID you will recover integrity faster when only one or two disks
fail. But it will take more time when gluster has to recover more than
one brick (all bricks contained in the failed RAID).
You can also use disperse with redundancy 2. In your case it should be a
5:2. This configuration is not considered optimal, but it's possible
that with your workload it performs quite well (you should test it).
With this configuration I would recommend no RAID at all, or RAID-5 with
5 disks at most.
Probability of failure using RAID-6: 0.00002% (1 subvol)
Probability of failure using RAID-5 (5 disks): 0.00060% (1 subvol)
Probability of failure without RAID: 0.026% (1 subvol)
Probability of failure using RAID-5 (10 disks): 0.039% (1 subvol)
Available space using RAID-6: 96 TB
Available space using RAID-5 (5 disks): 96 TB
Available space using RAID-5 (10 disks): 108 TB
Available space without RAID: 120 TB
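The two probability tables for 5:1 and 5:2 chain the earlier steps: disk AFR gives the brick AFR for each RAID layout, which in turn gives the subvolume failure probability. A sketch under the same independence assumption, with hypothetical names:

```python
from math import comb


def p_more_than(n: int, tolerated: int, p: float) -> float:
    """Probability that more than `tolerated` of n independent units fail."""
    ok = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(tolerated + 1))
    return 1 - ok


DISK_AFR = 0.03
# Step 1: annual failure probability of one brick for each local layout.
brick_afr = {
    "RAID-6 (10 disks)": p_more_than(10, 2, DISK_AFR),
    "RAID-5 (5 disks)": p_more_than(5, 1, DISK_AFR),
    "no RAID": DISK_AFR,
    "RAID-5 (10 disks)": p_more_than(10, 1, DISK_AFR),
}


def subvol_failure(layout: str, bricks: int, redundancy: int) -> float:
    """Step 2: failure probability of one disperse subvolume on that layout."""
    return p_more_than(bricks, redundancy, brick_afr[layout])


for redundancy in (1, 2):
    for layout in brick_afr:
        p = subvol_failure(layout, 5, redundancy)
        print(f"5:{redundancy}  {layout:<18} {p:.5%}")
```

Going from redundancy 1 to redundancy 2 lowers the failure probability by one to two orders of magnitude for every layout, at the cost of the capacity difference listed above.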
Hope this helps a little to decide the best configuration for you.
Xavi
Thanks
A.
-----Original Message-----
From: Xavier Hernandez [mailto:xhernandez@xxxxxxxxxx]
Sent: Wednesday, 7 January 2015 18:14
To: RASTELLI Alessandro
Cc: gluster-users@xxxxxxxxxxx; CAZZANIGA Stefano; UBERTINI Gabriele; TECHNOLOGY - Supporto Sistemi OTT e Cloud; ORLANDO Luca
Subject: Re: Input/Output Error when deleting folder
If that file is missing only from gluster03-mi, and it has the same attributes in all remaining bricks, self-heal should recover it automatically.
Are there differences in the extended attributes of the file on the bricks that have it?
On 01/07/2015 05:22 PM, RASTELLI Alessandro wrote:
It worked... partially :)
Now I can access the folders again, but I can't delete them because
there are a couple of files in them (which I don't need). The files exist only on node1,2,4,5, but not on node3:
[root@gluster02-mi ~]# getfattr -m. -e hex -d /brick1/recorder/Rec218/Rec_218_1_part_14656.ts
getfattr: Removing leading '/' from absolute path names
# file: brick1/recorder/Rec218/Rec_218_1_part_14656.ts
trusted.ec.config=0x0000080a02000200
trusted.ec.size=0x0000000034400000
trusted.ec.version=0x0000000000001a20
trusted.gfid=0x8d5da5a1cd1949618a5b96657857ceb6
[root@gluster03-mi ~]# getfattr -m. -e hex -d /brick1/recorder/Rec218/Rec_218_1_part_14656.ts
getfattr: /brick1/recorder/Rec218/Rec_218_1_part_14656.ts: No such file or directory
How do I proceed?
Thanks
-----Original Message-----
From: Xavier Hernandez [mailto:xhernandez@xxxxxxxxxx]
Sent: Wednesday, 7 January 2015 16:45
To: RASTELLI Alessandro
Cc: gluster-users@xxxxxxxxxxx; CAZZANIGA Stefano; UBERTINI Gabriele;
TECHNOLOGY - Supporto Sistemi OTT e Cloud; ORLANDO Luca
Subject: Re: Input/Output Error when deleting folder
Sorry, the command should be:
setfattr -n trusted.ec.version -v 0x0000000000000001 <brick path>/Rec218
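The `bad input encoding` error reported for `0x1` is presumably because a hex value has to describe whole bytes. That is an assumption about setfattr's parser, though it is consistent with the 16-digit values that getfattr prints for trusted.ec.version in this thread. A small sketch for producing a correctly padded value (the helper name is hypothetical):

```python
def xattr_hex(value: int, width_bytes: int = 8) -> str:
    """Format an integer as a fixed-width big-endian hex string for `setfattr -v`."""
    return "0x" + value.to_bytes(width_bytes, "big").hex()


print(xattr_hex(1))  # 0x0000000000000001

# Decoding works the other way around, e.g. for a version read with getfattr.
print(int("0x000000000000693a", 16))  # 26938
```

The 8-byte width is taken from the values shown in this thread, not from any gluster documentation.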
On 01/07/2015 04:34 PM, RASTELLI Alessandro wrote:
See my answers below:
1.
[root@gluster03-mi ~]# ls -l /brick1/recorder/.glusterfs/a9/d9/a9d904af-0d9e-4018-acb2-881bd8b3c2e4
ls: cannot access /brick1/recorder/.glusterfs/a9/d9/a9d904af-0d9e-4018-acb2-881bd8b3c2e4: No such file or directory
[root@gluster03-mi ~]# ls -l /brick1/recorder/.glusterfs/bd/a8/bda849fc-a556-469e-ad84-ed074f2c1bcd
lrwxrwxrwx 1 root root 55 Dec 17 17:37 /brick1/recorder/.glusterfs/bd/a8/bda849fc-a556-469e-ad84-ed074f2c1bcd -> ../../00/00/00000000-0000-0000-0000-000000000001/Rec218
[root@gluster03-mi ~]# ls -l /brick2/recorder/.glusterfs/a9/d9/a9d904af-0d9e-4018-acb2-881bd8b3c2e4
ls: cannot access /brick2/recorder/.glusterfs/a9/d9/a9d904af-0d9e-4018-acb2-881bd8b3c2e4: No such file or directory
[root@gluster03-mi ~]# ls -l /brick2/recorder/.glusterfs/bd/a8/bda849fc-a556-469e-ad84-ed074f2c1bcd
lrwxrwxrwx 1 root root 55 Dec 17 17:37 /brick2/recorder/.glusterfs/bd/a8/bda849fc-a556-469e-ad84-ed074f2c1bcd -> ../../00/00/00000000-0000-0000-0000-000000000001/Rec218
2.
/Rec218 is supposed to be empty (or rather, I don't need to restore the
files). I stopped the volume, but when executing the command I get an error:
[root@gluster01-mi ~]# setfattr -n trusted.ec.version -v 0x1 /brick1/recorder/Rec218
bad input encoding
Regards
A.
-----Original Message-----
From: Xavier Hernandez [mailto:xhernandez@xxxxxxxxxx]
Sent: Wednesday, 7 January 2015 16:08
To: RASTELLI Alessandro
Cc: gluster-users@xxxxxxxxxxx; CAZZANIGA Stefano; UBERTINI Gabriele;
TECHNOLOGY - Supporto Sistemi OTT e Cloud; ORLANDO Luca
Subject: Re: Input/Output Error when deleting folder
I see two problems here:
1. Something very strange has happened on gluster03-mi. It contains
the directory, but it's not the same one that exists on the other
bricks (8 bricks have gfid a9d904af-0d9e-4018-acb2-881bd8b3c2e4,
while that node has gfid bda849fc-a556-469e-ad84-ed074f2c1bcd).
Whatever happened here has affected both bricks of that node in the same way.
What do these commands return on gluster03-mi:
ls -l /brick1/recorder/.glusterfs/a9/d9/a9d904af-0d9e-4018-acb2-881bd8b3c2e4
ls -l /brick1/recorder/.glusterfs/bd/a8/bda849fc-a556-469e-ad84-ed074f2c1bcd
ls -l /brick2/recorder/.glusterfs/a9/d9/a9d904af-0d9e-4018-acb2-881bd8b3c2e4
ls -l /brick2/recorder/.glusterfs/bd/a8/bda849fc-a556-469e-ad84-ed074f2c1bcd
2. It seems that node gluster04-mi was stopped (or rebooted, or it
failed) while an operation that modifies the directory contents was being executed, so it has lost an update and is out of sync (both bricks on the same server have missed one update, so it seems clear that it's not a brick problem but a server problem).
The global result of all this is that you have 4 failed bricks on a configuration that only supports 2 failed bricks.
BTW, having two or more bricks on the same server is not recommended because a single server failure causes multiple bricks to be lost. In this case a directory can be recovered, but if this happens to a file, it won't be 100% recoverable.
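The `ls` commands in point 1 look up each gfid under the brick's `.glusterfs` directory, using the first two and the next two hex digits as subdirectories. The sketch below derives that path from the value of the trusted.gfid xattr; the layout is inferred from the paths in this thread and the function name is hypothetical.

```python
import uuid


def gfid_index_path(brick_root: str, gfid_xattr_hex: str) -> str:
    """Build the .glusterfs entry path for a gfid as printed by getfattr -e hex."""
    digits = gfid_xattr_hex[2:] if gfid_xattr_hex.startswith("0x") else gfid_xattr_hex
    gfid = str(uuid.UUID(digits))  # canonical form with hyphens, lowercase
    return f"{brick_root}/.glusterfs/{gfid[:2]}/{gfid[2:4]}/{gfid}"


print(gfid_index_path("/brick1/recorder", "0xa9d904af0d9e4018acb2881bd8b3c2e4"))
print(gfid_index_path("/brick1/recorder", "0xbda849fca556469ead84ed074f2c1bcd"))
```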
Are there any files inside /Rec218?
If you are going to delete the directory and all its contents, and the
brick contents on gluster03-mi are the same as on the other servers,
the following commands should be safe (otherwise let me know before
doing anything):
Before starting you must be sure that nothing is creating or deleting entries inside /Rec218. It would be even better if this could be done with volume stopped.
On each brick (including gluster03-mi):
setfattr -n trusted.ec.version -v 0x1 <brick path>/Rec218
On bricks in gluster03-mi:
setfattr -n trusted.gfid -v 0xa9d904af0d9e4018acb2881bd8b3c2e4 <brick path>/Rec218
setfattr -n trusted.glusterfs.dht -v 0x000000010000000000000000ffffffff <brick path>/Rec218
On client:
check that the directory is accessible and its contents seem ok. If so:
rm -rf <mount point>/Rec218
If you have a way to reproduce this situation, let me know.
Xavi
On 01/07/2015 03:31 PM, RASTELLI Alessandro wrote:
[root@gluster01-mi ~]# getfattr -m. -e hex -d /brick1/recorder/Rec218
getfattr: Removing leading '/' from absolute path names
# file: brick1/recorder/Rec218
trusted.ec.version=0x000000000000693a
trusted.gfid=0xa9d904af0d9e4018acb2881bd8b3c2e4
trusted.glusterfs.dht=0x000000010000000000000000ffffffff
[root@gluster01-mi ~]# getfattr -m. -e hex -d /brick2/recorder/Rec218
getfattr: Removing leading '/' from absolute path names
# file: brick2/recorder/Rec218
trusted.ec.version=0x000000000000693a
trusted.gfid=0xa9d904af0d9e4018acb2881bd8b3c2e4
trusted.glusterfs.dht=0x000000010000000000000000ffffffff
[root@gluster02-mi ~]# getfattr -m. -e hex -d /brick1/recorder/Rec218
getfattr: Removing leading '/' from absolute path names
# file: brick1/recorder/Rec218
trusted.ec.version=0x000000000000693a
trusted.gfid=0xa9d904af0d9e4018acb2881bd8b3c2e4
trusted.glusterfs.dht=0x000000010000000000000000ffffffff
[root@gluster02-mi ~]# getfattr -m. -e hex -d /brick2/recorder/Rec218
getfattr: Removing leading '/' from absolute path names
# file: brick2/recorder/Rec218
trusted.ec.version=0x000000000000693a
trusted.gfid=0xa9d904af0d9e4018acb2881bd8b3c2e4
trusted.glusterfs.dht=0x000000010000000000000000ffffffff
[root@gluster03-mi ~]# getfattr -m. -e hex -d /brick1/recorder/Rec218
getfattr: Removing leading '/' from absolute path names
# file: brick1/recorder/Rec218
trusted.gfid=0xbda849fca556469ead84ed074f2c1bcd
[root@gluster03-mi ~]# getfattr -m. -e hex -d /brick2/recorder/Rec218
getfattr: Removing leading '/' from absolute path names
# file: brick2/recorder/Rec218
trusted.gfid=0xbda849fca556469ead84ed074f2c1bcd
[root@gluster04-mi ~]# getfattr -m. -e hex -d /brick1/recorder/Rec218
getfattr: Removing leading '/' from absolute path names
# file: brick1/recorder/Rec218
trusted.ec.version=0x0000000000006939
trusted.gfid=0xa9d904af0d9e4018acb2881bd8b3c2e4
trusted.glusterfs.dht=0x000000010000000000000000ffffffff
[root@gluster04-mi ~]# getfattr -m. -e hex -d /brick2/recorder/Rec218
getfattr: Removing leading '/' from absolute path names
# file: brick2/recorder/Rec218
trusted.ec.version=0x0000000000006939
trusted.gfid=0xa9d904af0d9e4018acb2881bd8b3c2e4
trusted.glusterfs.dht=0x000000010000000000000000ffffffff
[root@gluster05-mi ~]# getfattr -m. -e hex -d /brick1/recorder/Rec218
getfattr: Removing leading '/' from absolute path names
# file: brick1/recorder/Rec218
trusted.ec.version=0x000000000000693a
trusted.gfid=0xa9d904af0d9e4018acb2881bd8b3c2e4
trusted.glusterfs.dht=0x000000010000000000000000ffffffff
[root@gluster05-mi ~]# getfattr -m. -e hex -d /brick2/recorder/Rec218
getfattr: Removing leading '/' from absolute path names
# file: brick2/recorder/Rec218
trusted.ec.version=0x000000000000693a
trusted.gfid=0xa9d904af0d9e4018acb2881bd8b3c2e4
trusted.glusterfs.dht=0x000000010000000000000000ffffffff
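With ten bricks to compare, it can help to let a script point out which ones disagree with the majority. The sketch below hardcodes the values from the getfattr output above (it does not run getfattr itself) and lists the bricks whose value differs from the most common one; the names are hypothetical.

```python
from collections import Counter

GOOD_GFID = "0xa9d904af0d9e4018acb2881bd8b3c2e4"
ODD_GFID = "0xbda849fca556469ead84ed074f2c1bcd"

# Values transcribed from the getfattr output above; None marks an xattr
# that was not printed for that brick.
xattrs = {}
for brick in ("brick1", "brick2"):
    for node in ("gluster01-mi", "gluster02-mi", "gluster05-mi"):
        xattrs[f"{node}:/{brick}"] = {
            "ec.version": "0x000000000000693a",
            "gfid": GOOD_GFID,
        }
    xattrs[f"gluster03-mi:/{brick}"] = {"ec.version": None, "gfid": ODD_GFID}
    xattrs[f"gluster04-mi:/{brick}"] = {
        "ec.version": "0x0000000000006939",
        "gfid": GOOD_GFID,
    }


def outliers(key):
    """Bricks whose value for `key` differs from the most common value."""
    majority, _count = Counter(a[key] for a in xattrs.values()).most_common(1)[0]
    return sorted(name for name, a in xattrs.items() if a[key] != majority)


print("gfid mismatch:      ", outliers("gfid"))
print("ec.version mismatch:", outliers("ec.version"))
```

Four bricks differ in trusted.ec.version (both on gluster03-mi and both on gluster04-mi), and the two on gluster03-mi also carry a different gfid.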
_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-users