On 16/07/15 11:58, Vedran Furač wrote:
I'm experimenting with Ceph for caching. It's configured with size=1 (so no redundancy/replication) and exported to clients via CephFS, and I'm wondering what happens if an SSD dies and all of its data is lost. I see that files are stored as 4MB chunks in PGs. Is a whole file saved through CephFS (all of its chunks) kept in a single PG (or at least in multiple PGs within a single OSD), or might it be spread over multiple OSDs? In the latter case an SSD failure would effectively mean losing more data than fits on a single drive, or even worse, massive corruption potentially affecting most of the content. Note that losing a single drive and all of its data (so 1% in the case of 100 drives) isn't an issue for me. However, losing much more, or files being silently corrupted with holes in them, is unacceptable; I would then have to go with some erasure coding.
There is no locality in where the objects in a file go. They will be spread uniformly across your PGs, which are spread uniformly across your OSDs.
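To make that concrete, here is a rough Python sketch of how a file's 4MB chunks become independently placed objects. The object naming (<inode-hex>.<index-hex>) matches what you see in the data pool; the hash and modulo below are stand-ins for Ceph's actual rjenkins/stable_mod/CRUSH placement, so treat it as an illustration of the idea rather than the real mapping.

    import zlib

    def object_names(inode, file_size, object_size=4 * 1024 * 1024):
        """Yield the RADOS object names backing a CephFS file
        (default layout: 4MB objects, no custom striping)."""
        n_objects = max(1, (file_size + object_size - 1) // object_size)
        for i in range(n_objects):
            yield "%x.%08x" % (inode, i)

    def pg_of(object_name, pg_num):
        """Stand-in for Ceph's placement: a hash of the object name
        modulo pg_num.  Ceph really uses rjenkins + stable_mod + CRUSH."""
        return zlib.crc32(object_name.encode()) % pg_num

    # A 100MB file ends up in ~25 objects scattered over many PGs/OSDs:
    for name in object_names(0x10000000acb, 100 * 1024 * 1024):
        print(name, "-> PG", pg_of(name, pg_num=1024))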
In a filesystem containing large files, the loss of a PG in the data pool will lead to holes scattered throughout files. While the PG is unavailable, reads on the affected files will block trying to access the inaccessible objects. If you "repair" the system by deleting the objects that were on the lost PG, those regions of files will show up as zeros. If you lose the first object in a file, this will also potentially break hardlinks to the file.
Working out which files were affected by a PG loss is an O(N_files * size) operation, because it involves calculating all the object names in all the files, hashing them, and seeing if the hash falls into the dead PG. We don't have a tool that does that (yet).
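For what it's worth, a brute-force version of that scan is simple to sketch, even though it has to touch every file. Assuming you can walk the CephFS mount and read each file's inode number and size, and that files use the default 4MB-object layout, something like the following would flag candidate files. The mount point, PG id and pg_num are placeholders, and pg_of() is the same stand-in hash as above; a real tool would need Ceph's own placement (e.g. by asking "ceph osd map <pool> <objname>") instead.

    import os
    import zlib

    OBJECT_SIZE = 4 * 1024 * 1024   # default CephFS layout

    def pg_of(object_name, pg_num):
        # Stand-in for Ceph's real rjenkins/stable_mod placement.
        return zlib.crc32(object_name.encode()) % pg_num

    def files_touching_pg(root, dead_pg, pg_num):
        """Walk a CephFS mount and report files with at least one
        object whose (stand-in) placement lands in dead_pg."""
        for dirpath, _, filenames in os.walk(root):
            for fname in filenames:
                path = os.path.join(dirpath, fname)
                st = os.stat(path)
                n_objects = max(1, (st.st_size + OBJECT_SIZE - 1) // OBJECT_SIZE)
                names = ("%x.%08x" % (st.st_ino, i) for i in range(n_objects))
                if any(pg_of(n, pg_num) == dead_pg for n in names):
                    yield path

    for path in files_touching_pg("/mnt/cephfs", dead_pg=123, pg_num=1024):
        print(path)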
You can mitigate this a little bit by creating multiple data pools on different groups of OSDs, and associating different filesystem directories with different pools, but it's a pretty hacky thing to do.
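If you do go down that route, the directory-to-pool association is done with the layout virtual xattrs. A minimal sketch, assuming the extra pool already exists, has a CRUSH rule pointing at the OSD group you want, and has already been added as a CephFS data pool (ceph fs add_data_pool on recent releases, ceph mds add_data_pool on older ones); the mount point and pool name here are placeholders:

    import os

    # Placeholders: adjust to your mount point and pool name.
    DIR = "/mnt/cephfs/scratch"
    POOL = "cephfs_scratch_data"

    os.makedirs(DIR, exist_ok=True)

    # New files created under DIR will have their objects written to POOL.
    # Existing files keep whatever layout they were created with.
    os.setxattr(DIR, "ceph.dir.layout.pool", POOL.encode())

    print(os.getxattr(DIR, "ceph.dir.layout.pool").decode())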
You mention erasure coding; bear in mind that you can't use an EC pool directly as a cephfs data pool, you can only use it with a non-EC cache tier on top of it.
John