Re: Looking to Use Ceph

On 01/03/2013 06:29 AM, emyr.james wrote:
Hi,

I'm thinking of starting to use Ceph, initially for evaluation, to see how it compares to our existing Lustre file system. One thing I would like confirmation of is how Ceph stores large files. If I store a large file in CephFS, is it automatically split up into chunks, with the various chunks stored and replicated automatically across the whole cluster, or does it store the whole file as one object on one individual OSD and then keep replicas of the whole file on a small number of other OSDs? What is the typical block size used if files are split up, and can this be configured?

Regards,

Emyr


Hi Emyr,

If you are using CephFS (i.e., the POSIX file system component), large files are by default broken up into 4MB objects. The object size is configurable. Each object is distributed pseudo-randomly across the different OSDs (and if you have a replication level > 1, object replicas obey the rules defined in your CRUSH map). Replication is set on a per-pool basis and happens automatically. If an OSD goes down and replication is in use, Ceph will attempt to heal itself by re-replicating the objects that were on the down OSD onto the remaining ones.

For a more in-depth explanation about objects and striping, see:
http://ceph.com/docs/master/dev/file-striping/
http://ceph.com/docs/master/architecture/
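
To make the layout concrete, here is a rough Python sketch (not Ceph code; the parameter names and the mapping follow the file-striping doc above, and the defaults are just the common 4MB layout) of how a byte offset in a CephFS file maps to an object index:

# Sketch of CephFS-style striping: which object a file offset lands in.
# Defaults assume stripe_unit == object_size == 4 MB, stripe_count == 1.

def object_for_offset(offset, object_size=4 << 20, stripe_unit=4 << 20,
                      stripe_count=1):
    """Return (object_index, offset_within_object) for a byte offset."""
    stripe_units_per_object = object_size // stripe_unit
    block_no = offset // stripe_unit          # which stripe unit overall
    stripe_no = block_no // stripe_count      # which "row" of the stripe
    stripe_pos = block_no % stripe_count      # which object within the set
    object_set = stripe_no // stripe_units_per_object
    object_index = object_set * stripe_count + stripe_pos
    offset_in_object = ((stripe_no % stripe_units_per_object) * stripe_unit
                        + offset % stripe_unit)
    return object_index, offset_in_object

# With the defaults, a 10 MB file spans three objects:
print(object_for_offset(0))            # (0, 0)
print(object_for_offset(5 << 20))      # (1, 1048576)
print(object_for_offset(9 << 20))      # (2, 1048576)

With the defaults this reduces to offset // 4MB, but non-default stripe units and counts let you spread a single file's writes across several objects at once.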

One thing you should know is that Ceph's journal is similar to ext4's "data=journal" mode in that data is always written to the journal before it goes to disk. If I remember correctly, ldiskfs by default uses ext3/4's "data=ordered" mode, which only writes the data once. The upshot is that Ceph needs to do more writes for the same amount of data than Lustre, but there is a lower chance of data corruption while the data is being written. When not network bound, Ceph will likely be slower for long sequential writes on the same hardware than a highly tuned Lustre system, but in theory it is faster for short, bursty traffic, since writes can be acknowledged as soon as they hit the journal (which requires fewer seeks than writing the data out to the underlying filesystem). By putting OSD journals on high-throughput SSDs you can mitigate the sequential-write penalty and get the best of both worlds, though you need more PCIe and controller throughput, and you potentially lose a bit of capacity and read throughput if you have to reduce your OSD count to add the SSD journals. PCIe SSDs may become a very interesting option for journals as the price comes down.
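
As a back-of-envelope illustration (the throughput figures below are made up for the example, not benchmarks), here is roughly why journal placement matters for sequential writes:

# Hypothetical per-OSD sequential write ceilings.
disk_seq_write = 120.0   # MB/s, assumed spinning data disk
ssd_seq_write = 400.0    # MB/s, assumed SSD used for the journal

# Journal co-located on the data disk: every byte is written twice to the
# same spindle (journal + filestore), so sequential throughput roughly halves.
colocated = disk_seq_write / 2

# Journal on a separate SSD: the data disk sees only one copy, so the
# limit is whichever device saturates first.
ssd_journal = min(disk_seq_write, ssd_seq_write)

print("journal on data disk: ~%.0f MB/s per OSD" % colocated)
print("journal on SSD:       ~%.0f MB/s per OSD" % ssd_journal)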

Thanks,
Mark

