Re: Input/Output Error when deleting folder

Hi,

On 01/08/2015 09:51 AM, RASTELLI Alessandro wrote:
Hi Xavi,
Now there are some files on nodes 1-2-3 and others on nodes 4-5, so I think I'm going to destroy and re-create the volume from scratch (I can afford it now).

If the data is not needed, this is the best way to remove all problems. However, if you continue testing and arrive at this situation again, I would be very interested to know what operations you performed, along with as many details about your workload as you can give. Maybe there's a bug causing this problem.


In your opinion, having 5 nodes with 10x 4TB disks each, what's the best way to dimension the bricks?
Right now we have configured a disperse FS with 2 bricks per node per volume (2x 4TB RAID0 each); if I'm not wrong, we can afford to lose 2 bricks (= an entire node).
Would it be better to use a distributed FS with 1 brick per node (10x 4TB RAID5 each)?
Or do you have other suggestions?

The best configuration depends on your specific hardware characteristics and your needs or preferences.

The main factor is the MTBF/AFR of the disks (Mean Time Between Failures / Annualized Failure Rate).

The relationship between MTBF (in hours) and AFR is given by the following formula (assuming the disks run uninterrupted all year, i.e. 8760 hours):

    AFR = 1 - exp(-8760 / MTBF)

AFR is the probability that a single disk fails in one year.
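
As a quick illustration, the conversion can be written in a few lines of Python. This is only a sketch of the formula above: the function names are made up for the example, and MTBF is assumed to be expressed in hours.

```python
import math

HOURS_PER_YEAR = 8760  # 365 days * 24 hours, as in the formula above


def afr_from_mtbf(mtbf_hours: float) -> float:
    """Probability that one disk fails within a year of continuous use."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


def mtbf_from_afr(afr: float) -> float:
    """Inverse of afr_from_mtbf: MTBF in hours for a given AFR."""
    return -HOURS_PER_YEAR / math.log(1.0 - afr)


# An AFR of 3% corresponds to an MTBF of roughly 287,600 hours.
print(f"MTBF for AFR 3%: {mtbf_from_afr(0.03):.0f} hours")
```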

In your environment, each server has 10 disks. If we assume an AFR of 3%, we can calculate some failure probabilities in different configurations (all probabilities are in one year):

Failure probability of a single disk:            3.00%
Failure probability of a RAID-0  with  5 disks: 14.13%
Failure probability of a RAID-0  with 10 disks: 26.26%
Failure probability of a RAID-5  with  5 disks:  0.85%
Failure probability of a RAID-5  with 10 disks:  3.45% *
Failure probability of a RAID-6  with  5 disks:  0.03%
Failure probability of a RAID-6  with 10 disks:  0.28%
Failure probability of a RAID-50 with 10 disks:  1.69%

* Note that a RAID-5 of 10 disks has a higher probability of failure than a single disk in this case.
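
The figures in this table can be reproduced with a binomial model. The sketch below assumes that disk failures are independent and that no failed disk is replaced or rebuilt during the year; the function name is only illustrative.

```python
from math import comb


def array_failure(p: float, disks: int, tolerated: int) -> float:
    """Probability that more than `tolerated` of `disks` disks fail in a year.

    Assumes independent failures with per-disk probability `p` and ignores
    any disk replaced or rebuilt during the year.
    """
    survive = sum(
        comb(disks, k) * p**k * (1 - p) ** (disks - k)
        for k in range(tolerated + 1)
    )
    return 1.0 - survive


AFR = 0.03
raid5_5 = array_failure(AFR, 5, 1)
results = {
    "RAID-0  with  5 disks": array_failure(AFR, 5, 0),
    "RAID-0  with 10 disks": array_failure(AFR, 10, 0),
    "RAID-5  with  5 disks": raid5_5,
    "RAID-5  with 10 disks": array_failure(AFR, 10, 1),
    "RAID-6  with  5 disks": array_failure(AFR, 5, 2),
    "RAID-6  with 10 disks": array_failure(AFR, 10, 2),
    # RAID-50: a RAID-0 over two 5-disk RAID-5 groups fails if either group fails.
    "RAID-50 with 10 disks": 1 - (1 - raid5_5) ** 2,
}
for name, value in results.items():
    print(f"{name}: {value:6.2%}")
```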

Once you have calculated the failure probability for your hardware configuration, this probability can be considered as the AFR of a single disk used as a brick for gluster.

Then you can calculate the failure probability of the gluster volume (assuming you have an AFR of 3.45% using a RAID-5 of 10 disks):

Failure probability of a Disperse  3:1:  0.35%
Failure probability of a Disperse  5:1:  1.11%
Failure probability of a Disperse  6:2:  0.08%
Failure probability of a Disperse 10:2:  0.41%
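
The same binomial model applies one level up, with bricks in place of disks. The sketch below reads "N:R" as N bricks in total, of which R are redundancy (this reading reproduces the figures above), and again assumes independent failures. The function name is only illustrative.

```python
from math import comb


def more_than(p: float, n: int, r: int) -> float:
    """Probability that more than `r` of `n` independent units fail."""
    return 1.0 - sum(
        comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(r + 1)
    )


# Each brick is a RAID-5 of 10 disks with a per-disk AFR of 3% (about 3.45%).
brick_afr = more_than(0.03, 10, 1)

for bricks, redundancy in [(3, 1), (5, 1), (6, 2), (10, 2)]:
    volume = more_than(brick_afr, bricks, redundancy)
    print(f"Disperse {bricks:2d}:{redundancy}: {volume:.2%}")
```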

Gluster has the possibility of using Distribute. Distribute is similar to a RAID-0 (it combines multiple subvolumes into a single one), but if one subvolume fails, only data stored in that subvolume is lost (in a RAID-0, if a single disk fails, the entire RAID is lost).

This doesn't reduce the probability of failure, but it reduces the impact of that failure (it's much harder to lose all data):

Failure prob of a Distributed-Dispersed 2x3:1: 0.6956% (1 subvol)
                                               0.0012% (2 subvol)
Failure prob of a Distributed-Dispersed 2x5:1: 2.20% (1 subvol)
                                               0.012% (2 subvol)
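
As an illustration of how the "1 subvol" and "2 subvol" figures are obtained, the sketch below reproduces the 2x3:1 case: the probability that exactly one, or both, of two independent Disperse 3:1 subvolumes are lost. Independence between subvolumes is an assumption, and the helper names are only illustrative.

```python
from math import comb


def more_than(p: float, n: int, r: int) -> float:
    """Probability that more than `r` of `n` independent units fail."""
    return 1.0 - sum(
        comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(r + 1)
    )


def exactly(p: float, n: int, k: int) -> float:
    """Probability that exactly `k` of `n` independent units fail."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


brick_afr = more_than(0.03, 10, 1)       # brick = RAID-5 of 10 disks
subvol_afr = more_than(brick_afr, 3, 1)  # one Disperse 3:1 subvolume

one_lost = exactly(subvol_afr, 2, 1)   # only part of the data is lost
both_lost = exactly(subvol_afr, 2, 2)  # all data is lost

print(f"2x3:1, 1 subvol lost:  {one_lost:.4%}")
print(f"2x3:1, 2 subvols lost: {both_lost:.4%}")
```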

Of course all these numbers are only statistical. A batch of defective drives or servers can ruin any configuration.

You should also consider the time needed to rebuild a brick if a RAID fails. If you create a RAID-5 of 10 disks, for example, gluster will need to recover up to 36 TB of information (if the brick was full). Using smaller RAIDs reduces this amount of data.

If you use a single RAID to store multiple bricks, you will get multiple brick failures in case of a RAID or server failure. In any case, it's not recommended to have more than one brick of the same subvolume on the same server. It's better to use distribute in this case (a 10:2 configuration where a single server failure causes 2 bricks to fail is almost equivalent to a 5:1 configuration with respect to probabilities, especially if disks are configured in a RAID-0).

I wouldn't recommend using RAID-0 with gluster. Instead of creating a RAID-0 of 2 disks, it's better to create 2 bricks belonging to two different gluster subvolumes and use distribute.

Failure probability of one brick using RAID-0 of two disks:  5.91%
Failure probability of two bricks using two disks:  5.82% (1 subvol)
                                                    0.09% (2 subvol)

RAID-5 or RAID-6 can be useful for single disk failure because the disk can be recovered locally in the server without having to read data from other servers. Only a more critical failure will require that gluster rebuilds brick contents. However bigger RAIDs have greater failure probabilities (though they waste less physical disk space).

You must also consider the cost of growing a volume. Disperse and Replicate need to grow in multiples of the subvolume size. This means that if you create a 3:1 configuration you will need to add 3 new bricks if you want to get more space. If you start with a 10:2 configuration you will need to add 10 new bricks to get more space.
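
A small sketch of this growth rule follows. The function name, the 4 TB brick size and the 10 TB of extra space are illustrative assumptions, not values taken from any gluster tool.

```python
import math


def bricks_to_add(extra_tb: float, brick_tb: float,
                  bricks_per_subvol: int, redundancy: int) -> int:
    """Bricks needed to gain at least `extra_tb` of usable space.

    Space can only be added in whole subvolumes, so the result is always
    a multiple of `bricks_per_subvol`.
    """
    usable_per_subvol = (bricks_per_subvol - redundancy) * brick_tb
    subvols = math.ceil(extra_tb / usable_per_subvol)
    return subvols * bricks_per_subvol


# Adding 10 TB of usable space with 4 TB bricks:
print("3:1  ->", bricks_to_add(10, 4, 3, 1), "bricks")
print("10:2 ->", bricks_to_add(10, 4, 10, 2), "bricks")
```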

In your case I would recommend using two RAID-5 of 5 disks each, or a single RAID-6 of 10 disks, in each server. You can also opt to not use any RAID and have 10 independent disks in each server. I would also create relatively small bricks (for example 4TB each) and use a distributed-dispersed 5:1, with one brick of each subvolume in each server.

With this configuration, if you lose one RAID or an entire server, you will only lose, at most, one brick of each subvolume.

Probability of failure using RAID-6:            0.0076% (1 subvol)
Probability of failure using RAID-5 (5 disks):  0.07% (1 subvol)
Probability of failure without RAID:            0.85% (1 subvol)
Probability of failure using RAID-5 (10 disks): 1.11% (1 subvol)

Of course using RAID is safer, but you also waste more space:

Available space using RAID-6:            128 TB
Available space using RAID-5 (5 disks):  128 TB
Available space using RAID-5 (10 disks): 144 TB
Available space without RAID:            160 TB
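
The arithmetic behind these figures is sketched below for 5 servers with 10 disks of 4 TB each and a dispersed 5:1 layout. The function and parameter names are only illustrative; setting redundancy to 2 gives the space for a 5:2 layout instead.

```python
def usable_tb(servers: int, disks_per_server: int, disk_tb: int,
              raid_groups: int, parity_per_group: int,
              bricks_per_subvol: int, redundancy: int) -> float:
    """Usable volume space after RAID parity and disperse redundancy."""
    data_disks = disks_per_server - raid_groups * parity_per_group
    raw_tb = servers * data_disks * disk_tb
    return raw_tb * (bricks_per_subvol - redundancy) / bricks_per_subvol


# (RAID groups per server, parity disks per group) for each layout.
layouts = {
    "RAID-6":            (1, 2),
    "RAID-5 (5 disks)":  (2, 1),
    "RAID-5 (10 disks)": (1, 1),
    "without RAID":      (0, 0),
}
for name, (groups, parity) in layouts.items():
    space = usable_tb(5, 10, 4, groups, parity, 5, 1)
    print(f"{name:18s} {space:.0f} TB")
```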

Using RAID you will recover integrity faster when only one or two disks fail. But it will take more time when gluster has to recover more than one brick (all bricks contained in the failed RAID).

You can also use disperse with redundancy 2. In your case it should be a 5:2. This configuration is not considered optimal, but it's possible that with your workload it performs quite well (you should test it). With this configuration I wouldn't recommend any RAID, or RAID-5 with 5 disks at most.

Probability of failure using RAID-6:            0.00002% (1 subvol)
Probability of failure using RAID-5 (5 disks):  0.00060% (1 subvol)
Probability of failure without RAID:            0.026% (1 subvol)
Probability of failure using RAID-5 (10 disks): 0.039% (1 subvol)

Available space using RAID-6:             96 TB
Available space using RAID-5 (5 disks):   96 TB
Available space using RAID-5 (10 disks): 108 TB
Available space without RAID:            120 TB

Hope this helps a little to decide the best configuration for you.

Xavi


Thanks
A.

-----Original Message-----
From: Xavier Hernandez [mailto:xhernandez@xxxxxxxxxx]
Sent: Wednesday, 7 January 2015 18:14
To: RASTELLI Alessandro
Cc: gluster-users@xxxxxxxxxxx; CAZZANIGA Stefano; UBERTINI Gabriele; TECHNOLOGY - Supporto Sistemi OTT e Cloud; ORLANDO Luca
Subject: Re:  Input/Output Error when deleting folder

If that file is missing only from gluster03-mi, and it has the same attributes in all remaining bricks, self-heal should recover it automatically.

Are there differences in the extended attributes of the file on the bricks that have it?

On 01/07/2015 05:22 PM, RASTELLI Alessandro wrote:
It worked... partially :)
Now I can access the folders again, but I can't delete them because there are a couple of files in them (which I don't need). The files exist only on nodes 1, 2, 4 and 5, but not on node 3:

[root@gluster02-mi ~]# getfattr -m. -e hex -d /brick1/recorder/Rec218/Rec_218_1_part_14656.ts
getfattr: Removing leading '/' from absolute path names
# file: brick1/recorder/Rec218/Rec_218_1_part_14656.ts
trusted.ec.config=0x0000080a02000200
trusted.ec.size=0x0000000034400000
trusted.ec.version=0x0000000000001a20
trusted.gfid=0x8d5da5a1cd1949618a5b96657857ceb6

[root@gluster03-mi ~]# getfattr -m. -e hex -d /brick1/recorder/Rec218/Rec_218_1_part_14656.ts
getfattr: /brick1/recorder/Rec218/Rec_218_1_part_14656.ts: No such file or directory

How do I proceed?
Thanks

-----Original Message-----
From: Xavier Hernandez [mailto:xhernandez@xxxxxxxxxx]
Sent: Wednesday, 7 January 2015 16:45
To: RASTELLI Alessandro
Cc: gluster-users@xxxxxxxxxxx; CAZZANIGA Stefano; UBERTINI Gabriele;
TECHNOLOGY - Supporto Sistemi OTT e Cloud; ORLANDO Luca
Subject: Re:  Input/Output Error when deleting folder

Sorry, the command should be:

       setfattr -n trusted.ec.version -v 0x0000000000000001 <brick path>/Rec218
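
The "bad input encoding" error reported for "-v 0x1" is most likely because setfattr decodes a 0x-prefixed value as pairs of hexadecimal digits, so an odd number of digits cannot be decoded. The Python sketch below builds a full-width value; it assumes trusted.ec.version is an 8-byte big-endian number, as the 16-digit values shown in this thread suggest, and the function name is only illustrative.

```python
def xattr_hex_u64(value: int) -> str:
    """Format an integer as a 0x-prefixed, 16-digit hex string (8 bytes, big-endian)."""
    if not 0 <= value < 2**64:
        raise ValueError("value does not fit in 64 bits")
    return "0x" + value.to_bytes(8, "big").hex()


# Value to pass to `setfattr -n trusted.ec.version -v ...`
print(xattr_hex_u64(1))
```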

On 01/07/2015 04:34 PM, RASTELLI Alessandro wrote:
See my answers below:
1.
[root@gluster03-mi ~]# ls -l /brick1/recorder/.glusterfs/a9/d9/a9d904af-0d9e-4018-acb2-881bd8b3c2e4
ls: cannot access /brick1/recorder/.glusterfs/a9/d9/a9d904af-0d9e-4018-acb2-881bd8b3c2e4: No such file or directory
[root@gluster03-mi ~]# ls -l /brick1/recorder/.glusterfs/bd/a8/bda849fc-a556-469e-ad84-ed074f2c1bcd
lrwxrwxrwx 1 root root 55 Dec 17 17:37 /brick1/recorder/.glusterfs/bd/a8/bda849fc-a556-469e-ad84-ed074f2c1bcd -> ../../00/00/00000000-0000-0000-0000-000000000001/Rec218
[root@gluster03-mi ~]# ls -l /brick2/recorder/.glusterfs/a9/d9/a9d904af-0d9e-4018-acb2-881bd8b3c2e4
ls: cannot access /brick2/recorder/.glusterfs/a9/d9/a9d904af-0d9e-4018-acb2-881bd8b3c2e4: No such file or directory
[root@gluster03-mi ~]# ls -l /brick2/recorder/.glusterfs/bd/a8/bda849fc-a556-469e-ad84-ed074f2c1bcd
lrwxrwxrwx 1 root root 55 Dec 17 17:37 /brick2/recorder/.glusterfs/bd/a8/bda849fc-a556-469e-ad84-ed074f2c1bcd -> ../../00/00/00000000-0000-0000-0000-000000000001/Rec218

2.
/Rec218 is supposed to be empty (or at least I don't need to restore the files). I stopped the volume, but when executing the command I get an error:
[root@gluster01-mi ~]# setfattr -n trusted.ec.version -v 0x1 /brick1/recorder/Rec218
bad input encoding

Regards
A.



-----Original Message-----
From: Xavier Hernandez [mailto:xhernandez@xxxxxxxxxx]
Sent: Wednesday, 7 January 2015 16:08
To: RASTELLI Alessandro
Cc: gluster-users@xxxxxxxxxxx; CAZZANIGA Stefano; UBERTINI Gabriele;
TECHNOLOGY - Supporto Sistemi OTT e Cloud; ORLANDO Luca
Subject: Re:  Input/Output Error when deleting folder

I see two problems here:

1. Something very strange has happened on gluster03-mi. It contains the directory, but it's not the same one that exists on the other bricks (8 bricks have gfid a9d904af-0d9e-4018-acb2-881bd8b3c2e4, while that node has gfid bda849fc-a556-469e-ad84-ed074f2c1bcd).

Whatever happened here has affected both bricks of that node in the same way.

What do these commands return on gluster03-mi?

ls -l /brick1/recorder/.glusterfs/a9/d9/a9d904af-0d9e-4018-acb2-881bd8b3c2e4
ls -l /brick1/recorder/.glusterfs/bd/a8/bda849fc-a556-469e-ad84-ed074f2c1bcd

ls -l /brick2/recorder/.glusterfs/a9/d9/a9d904af-0d9e-4018-acb2-881bd8b3c2e4
ls -l /brick2/recorder/.glusterfs/bd/a8/bda849fc-a556-469e-ad84-ed074f2c1bcd

2. It seems that node gluster04-mi was stopped (or rebooted, or failed) while an operation that modifies the directory contents was being executed, so it has missed an update and is out of sync (both bricks on the same server have missed one update, so it seems clear that it's not a brick problem but a server problem).

The global result of all this is that you have 4 failed bricks on a configuration that only supports 2 failed bricks.
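
The count of four failed bricks can be checked mechanically from the getfattr output quoted further down. The sketch below hard-codes the (gfid, trusted.ec.version) pair reported for each brick and treats the most common pair as the healthy state; that majority rule is an assumption of this sketch, not a description of how gluster itself decides.

```python
from collections import Counter

MAIN_GFID = "a9d904af0d9e4018acb2881bd8b3c2e4"
OTHER_GFID = "bda849fca556469ead84ed074f2c1bcd"

# (gfid, trusted.ec.version) per brick, copied from the getfattr output.
# None means the attribute is absent on that brick.
bricks = {}
for node in (1, 2, 5):
    for brick in (1, 2):
        bricks[f"gluster0{node}-mi:/brick{brick}"] = (MAIN_GFID, 0x693A)
for brick in (1, 2):
    bricks[f"gluster03-mi:/brick{brick}"] = (OTHER_GFID, None)
    bricks[f"gluster04-mi:/brick{brick}"] = (MAIN_GFID, 0x6939)

# The most common state is treated as the healthy one.
healthy, healthy_count = Counter(bricks.values()).most_common(1)[0]
failed = sorted(name for name, state in bricks.items() if state != healthy)

print(f"{healthy_count} bricks agree, {len(failed)} differ:")
for name in failed:
    print("  ", name)
```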

BTW, having two or more bricks on the same server is not recommended because a single server failure causes multiple bricks to be lost. In this case a directory can be recovered, but if this happens to a file, it won't be 100% recoverable.

Are there any files inside /Rec218?

If you are going to delete the directory and all its contents, and the brick contents in gluster03-mi are the same as in the other servers, the following commands should be safe (otherwise let me know before doing anything):

Before starting, you must be sure that nothing is creating or deleting entries inside /Rec218. It would be even better if this could be done with the volume stopped.

On each brick (including gluster03-mi):
        setfattr -n trusted.ec.version -v 0x1 <brick path>/Rec218

On bricks in gluster03-mi:
        setfattr -n trusted.gfid -v 0xa9d904af0d9e4018acb2881bd8b3c2e4 <brick path>/Rec218
        setfattr -n trusted.glusterfs.dht -v 0x000000010000000000000000ffffffff <brick path>/Rec218

On client:
        check that the directory is accessible and its contents seem ok. If so:
            rm -rf <mount point>/Rec218

If you have a way to reproduce this situation, let me know.

Xavi

On 01/07/2015 03:31 PM, RASTELLI Alessandro wrote:
[root@gluster01-mi ~]# getfattr -m. -e hex -d /brick1/recorder/Rec218
getfattr: Removing leading '/' from absolute path names
# file: brick1/recorder/Rec218
trusted.ec.version=0x000000000000693a
trusted.gfid=0xa9d904af0d9e4018acb2881bd8b3c2e4
trusted.glusterfs.dht=0x000000010000000000000000ffffffff

[root@gluster01-mi ~]# getfattr -m. -e hex -d /brick2/recorder/Rec218
getfattr: Removing leading '/' from absolute path names
# file: brick2/recorder/Rec218
trusted.ec.version=0x000000000000693a
trusted.gfid=0xa9d904af0d9e4018acb2881bd8b3c2e4
trusted.glusterfs.dht=0x000000010000000000000000ffffffff


[root@gluster02-mi ~]# getfattr -m. -e hex -d /brick1/recorder/Rec218
getfattr: Removing leading '/' from absolute path names
# file: brick1/recorder/Rec218
trusted.ec.version=0x000000000000693a
trusted.gfid=0xa9d904af0d9e4018acb2881bd8b3c2e4
trusted.glusterfs.dht=0x000000010000000000000000ffffffff

[root@gluster02-mi ~]# getfattr -m. -e hex -d /brick2/recorder/Rec218
getfattr: Removing leading '/' from absolute path names
# file: brick2/recorder/Rec218
trusted.ec.version=0x000000000000693a
trusted.gfid=0xa9d904af0d9e4018acb2881bd8b3c2e4
trusted.glusterfs.dht=0x000000010000000000000000ffffffff


[root@gluster03-mi ~]# getfattr -m. -e hex -d /brick1/recorder/Rec218
getfattr: Removing leading '/' from absolute path names
# file: brick1/recorder/Rec218
trusted.gfid=0xbda849fca556469ead84ed074f2c1bcd

[root@gluster03-mi ~]# getfattr -m. -e hex -d /brick2/recorder/Rec218
getfattr: Removing leading '/' from absolute path names
# file: brick2/recorder/Rec218
trusted.gfid=0xbda849fca556469ead84ed074f2c1bcd


[root@gluster04-mi ~]# getfattr -m. -e hex -d /brick1/recorder/Rec218
getfattr: Removing leading '/' from absolute path names
# file: brick1/recorder/Rec218
trusted.ec.version=0x0000000000006939
trusted.gfid=0xa9d904af0d9e4018acb2881bd8b3c2e4
trusted.glusterfs.dht=0x000000010000000000000000ffffffff

[root@gluster04-mi ~]# getfattr -m. -e hex -d /brick2/recorder/Rec218
getfattr: Removing leading '/' from absolute path names
# file: brick2/recorder/Rec218
trusted.ec.version=0x0000000000006939
trusted.gfid=0xa9d904af0d9e4018acb2881bd8b3c2e4
trusted.glusterfs.dht=0x000000010000000000000000ffffffff


[root@gluster05-mi ~]# getfattr -m. -e hex -d /brick1/recorder/Rec218
getfattr: Removing leading '/' from absolute path names
# file: brick1/recorder/Rec218
trusted.ec.version=0x000000000000693a
trusted.gfid=0xa9d904af0d9e4018acb2881bd8b3c2e4
trusted.glusterfs.dht=0x000000010000000000000000ffffffff

[root@gluster05-mi ~]# getfattr -m. -e hex -d /brick2/recorder/Rec218
getfattr: Removing leading '/' from absolute path names
# file: brick2/recorder/Rec218
trusted.ec.version=0x000000000000693a
trusted.gfid=0xa9d904af0d9e4018acb2881bd8b3c2e4
trusted.glusterfs.dht=0x000000010000000000000000ffffffff



