Re: Failures with Ceph without redundancy/replication

On 16/07/15 11:58, Vedran Furač wrote:
I'm experimenting with Ceph for caching. It's configured with size=1 (so
no redundancy/replication) and exported via CephFS to clients. Now I'm
wondering what happens if an SSD dies and all of its data is lost. I see
files stored as 4MB chunks in PGs; do we know whether a whole file saved
through CephFS (all its chunks) ends up in a single PG (or at least in
multiple PGs within a single OSD), or might it be spread over multiple
OSDs? In that case an SSD failure would effectively mean losing more
data than fits on a single drive, or even worse, massive corruption
potentially affecting most of the content. Note that losing a single
drive and all of its data (so 1% in the case of 100 drives) isn't an
issue for me. However, losing much more, or files being silently
corrupted with holes in them, is unacceptable. I would then have to go
with some erasure coding.


There is no locality in where the objects in a file go. They will be spread uniformly across your PGs, which are spread uniformly across your OSDs.
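To make that concrete: a CephFS file is striped into RADOS objects named from its inode number and chunk index (4MB each with the default layout), and each object name is hashed independently. Here is a rough Python sketch (assuming the default layout and a data pool called "cephfs_data", both of which you would need to adjust) that prints where each chunk of one file lands, using the standard "ceph osd map" command:

#!/usr/bin/env python
# Rough sketch: show which PG/OSDs each chunk of a CephFS file maps to.
# Assumes the default layout (4 MB objects, no custom striping) and a data
# pool named "cephfs_data" -- adjust both for your cluster.
import os
import subprocess

POOL = "cephfs_data"            # assumed data pool name
OBJECT_SIZE = 4 * 1024 * 1024   # default CephFS object size

def chunk_objects(path):
    # CephFS data objects are named "<inode hex>.<chunk index, 8 hex digits>".
    st = os.stat(path)
    n_chunks = max(1, (st.st_size + OBJECT_SIZE - 1) // OBJECT_SIZE)
    return ["%x.%08x" % (st.st_ino, i) for i in range(n_chunks)]

for obj in chunk_objects("/mnt/cephfs/some/large/file"):
    # "ceph osd map <pool> <object>" reports the PG and OSDs for an object.
    print(subprocess.check_output(["ceph", "osd", "map", POOL, obj]).decode().strip())

Run that on a few large files and you should see consecutive chunks landing in different PGs on different OSDs.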

In a filesystem containing large files, the loss of a PG in the data pool will lead to holes scattered throughout files. While the PG is unavailable, reads on those files will block trying to access the inaccessible objects. If you "repair" the system by deleting the objects that were on the lost PG, those regions of the files will read back as zeros. If you lose the first object in a file, this will also potentially break hard links to the file.

Working out which files were affected by a PG loss is an O(N_files * size) operation, because it involves calculating all the object names in all the files, hashing them, and seeing if the hash falls into the dead PG. We don't have a tool that does that (yet).
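For what it's worth, the brute-force version of that scan is easy to sketch, just slow. Something along these lines (again assuming the default layout and a "cephfs_data" pool; the lost PG ID and mount point are placeholders, and the output format of "ceph osd map" may vary slightly between releases):

#!/usr/bin/env python
# Brute-force scan for files touching a lost PG. This is the O(N_files * size)
# walk described above, and it forks "ceph osd map" once per object, so treat
# it as an illustration rather than something to run on a big filesystem as-is.
import os
import subprocess

POOL = "cephfs_data"            # assumed data pool name
OBJECT_SIZE = 4 * 1024 * 1024   # default CephFS object size
LOST_PG = "1.2f"                # placeholder: PG reported lost by "ceph health detail"
MOUNT = "/mnt/cephfs"           # placeholder: your CephFS mount point

def chunk_objects(path):
    st = os.stat(path)
    n_chunks = max(1, (st.st_size + OBJECT_SIZE - 1) // OBJECT_SIZE)
    return ["%x.%08x" % (st.st_ino, i) for i in range(n_chunks)]

def pg_of(obj):
    # Typical output:
    #   osdmap eN pool 'cephfs_data' (1) object 'X' -> pg 1.d7ab7f36 (1.2f) -> up ...
    out = subprocess.check_output(["ceph", "osd", "map", POOL, obj]).decode()
    return out.split("-> pg ")[1].split("(")[1].split(")")[0]

for root, dirs, files in os.walk(MOUNT):
    for name in files:
        path = os.path.join(root, name)
        if any(pg_of(obj) == LOST_PG for obj in chunk_objects(path)):
            print(path)

In practice you would want to compute the placement locally rather than forking "ceph osd map" once per object, but the shape of the scan is the same.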

You can mitigate this a little bit by creating multiple data pools on different groups of OSDs, and associating different filesystem directories with different pools, but it's a pretty hacky thing to do.
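Concretely, that workaround means adding an extra pool (whose CRUSH rule covers the OSD group you want) as a CephFS data pool and then pointing a directory's layout at it. A rough outline, with placeholder names (on newer releases the first step is "ceph fs add_data_pool", and setfattr from the shell does the same as the setxattr call):

#!/usr/bin/env python
# Rough outline of the "pool per directory tree" workaround. Pool and
# directory names are placeholders; the extra pool must already exist.
import os
import subprocess

EXTRA_POOL = "cephfs_data_b"        # hypothetical extra data pool
DIRECTORY = "/mnt/cephfs/group-b"   # directory to pin to that pool

# Make the pool usable as a CephFS data pool.
subprocess.check_call(["ceph", "mds", "add_data_pool", EXTRA_POOL])

# Point the directory's layout at the new pool. Only files created after this
# change land in the new pool; existing files stay where they are.
os.setxattr(DIRECTORY, "ceph.dir.layout.pool", EXTRA_POOL.encode())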

You mention erasure coding; bear in mind that you can't use an EC pool directly as a CephFS data pool. You can only use one with a non-EC cache tier on top of it.
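If you do go that route, the shape of the setup is: create the EC pool, create a replicated cache pool, put the cache in front of it in writeback mode, then add the EC pool as the filesystem's data pool. A rough sketch follows; pool names, PG counts and the EC profile are placeholders, and a real cache tier also needs sizing/hit_set settings that I'm leaving out here.

#!/usr/bin/env python
# Sketch of fronting an EC pool with a replicated cache tier so it can back
# CephFS. All names and numbers here are placeholders.
import subprocess

def ceph(*args):
    subprocess.check_call(["ceph"] + list(args))

ceph("osd", "erasure-code-profile", "set", "ecprofile", "k=4", "m=2")
ceph("osd", "pool", "create", "cephfs_ec", "128", "128", "erasure", "ecprofile")
ceph("osd", "pool", "create", "cephfs_cache", "128", "128", "replicated")

# The replicated pool becomes a writeback cache tier in front of the EC pool.
ceph("osd", "tier", "add", "cephfs_ec", "cephfs_cache")
ceph("osd", "tier", "cache-mode", "cephfs_cache", "writeback")
ceph("osd", "tier", "set-overlay", "cephfs_ec", "cephfs_cache")

# The tiered EC pool can then be added as a CephFS data pool.
ceph("mds", "add_data_pool", "cephfs_ec")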

John
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



