On 16/07/15 11:58, Vedran Furač wrote:
I'm experimenting with Ceph for caching. It's configured with size=1 (so no redundancy/replication) and exported to clients via CephFS, and I'm wondering what happens if an SSD dies and all of its data is lost. I see that files are stored as 4MB chunks in PGs. Is a whole file saved through CephFS (all of its chunks) kept in a single PG (or at least in multiple PGs within a single OSD), or might it be spread over multiple OSDs? In the latter case an SSD failure would effectively mean losing more data than fits on a single drive, or even worse, massive corruption potentially affecting most of the content. Note that losing a single drive and all of its data (so 1% in the case of 100 drives) isn't an issue for me. However, losing much more, or files being silently corrupted with holes in them, is unacceptable; I would then have to go with some erasure coding.
There is no locality in where the objects in a file go. They will be spread uniformly across your PGs, which are spread uniformly across your OSDs.
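To make that concrete, here is a rough Python sketch of how a file's 4MB chunks become independently placed objects. The object naming (<inode-hex>.<index-hex>) matches what you see in the data pool; the hash and modulo below are stand-ins for Ceph's actual rjenkins/stable_mod/CRUSH placement, so treat it as an illustration of the idea rather than the real mapping.

    import zlib

    def object_names(inode, file_size, object_size=4 * 1024 * 1024):
        """Yield the RADOS object names backing a CephFS file
        (default layout: 4MB objects, no custom striping)."""
        n_objects = max(1, (file_size + object_size - 1) // object_size)
        for i in range(n_objects):
            yield "%x.%08x" % (inode, i)

    def pg_of(object_name, pg_num):
        """Stand-in for Ceph's placement: a hash of the object name
        modulo pg_num.  Ceph really uses rjenkins + stable_mod + CRUSH."""
        return zlib.crc32(object_name.encode()) % pg_num

    # A 100MB file ends up in ~25 objects scattered over many PGs/OSDs:
    for name in object_names(0x10000000acb, 100 * 1024 * 1024):
        print(name, "-> PG", pg_of(name, pg_num=1024))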
In a filesystem containing large files, the loss of a PG in the data pool will lead to holes scattered throughout files. While the PG is unavailable, reads on the affected files will block trying to access the inaccessible objects. If you "repair" the system by deleting the objects that were on the lost PG, those regions of files will show up as zeros. If you lose the first object in a file, this will also potentially break hardlinks to the file.
Working out which files were affected by a PG loss is an O(N_files * size) operation, because it involves calculating all the object names in all the files, hashing them, and seeing if the hash falls into the dead PG. We don't have a tool that does that (yet).
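For what it's worth, a brute-force version of that scan is simple to sketch, even though it has to touch every file. Assuming you can walk the CephFS mount and read each file's inode number and size, and that files use the default 4MB-object layout, something like the following would flag candidate files. The mount point, PG id and pg_num are placeholders, and pg_of() is the same stand-in hash as above; a real tool would need Ceph's own placement (e.g. by asking "ceph osd map <pool> <objname>") instead.

    import os
    import zlib

    OBJECT_SIZE = 4 * 1024 * 1024   # default CephFS layout

    def pg_of(object_name, pg_num):
        # Stand-in for Ceph's real rjenkins/stable_mod placement.
        return zlib.crc32(object_name.encode()) % pg_num

    def files_touching_pg(root, dead_pg, pg_num):
        """Walk a CephFS mount and report files with at least one
        object whose (stand-in) placement lands in dead_pg."""
        for dirpath, _, filenames in os.walk(root):
            for fname in filenames:
                path = os.path.join(dirpath, fname)
                st = os.stat(path)
                n_objects = max(1, (st.st_size + OBJECT_SIZE - 1) // OBJECT_SIZE)
                names = ("%x.%08x" % (st.st_ino, i) for i in range(n_objects))
                if any(pg_of(n, pg_num) == dead_pg for n in names):
                    yield path

    for path in files_touching_pg("/mnt/cephfs", dead_pg=123, pg_num=1024):
        print(path)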
You can mitigate this a little bit by creating multiple data pools on different groups of OSDs, and associating different filesystem directories with different pools, but it's a pretty hacky thing to do.
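If you do go down that route, the directory-to-pool association is done with the layout virtual xattrs. A minimal sketch, assuming the extra pool already exists, has a CRUSH rule pointing at the OSD group you want, and has already been added as a CephFS data pool (ceph fs add_data_pool on recent releases, ceph mds add_data_pool on older ones); the mount point and pool name here are placeholders:

    import os

    # Placeholders: adjust to your mount point and pool name.
    DIR = "/mnt/cephfs/scratch"
    POOL = "cephfs_scratch_data"

    os.makedirs(DIR, exist_ok=True)

    # New files created under DIR will have their objects written to POOL.
    # Existing files keep whatever layout they were created with.
    os.setxattr(DIR, "ceph.dir.layout.pool", POOL.encode())

    print(os.getxattr(DIR, "ceph.dir.layout.pool").decode())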
You mention erasure coding; bear in mind that you can't use an EC pool directly as a cephfs data pool, you can only use it with a non-EC cache tier on top of it.
John