Re: Looking to Use Ceph

On 01/03/2013 06:29 AM, emyr.james wrote:
Hi,

I'm thinking of starting to use Ceph, initially for evaluation, to see how it compares to our existing Lustre file system. One thing I would like confirmation of is how Ceph stores large files. If I store a large file in CephFS, is it automatically split up into chunks, with the various chunks stored and replicated automatically across the whole cluster, or does it store the whole file as one object on one individual OSD and then keep replicas of the whole file on a small number of other OSDs? What is the typical block size used if files are split up, and can this be configured?

Regards,

Emyr


Hi Emyr,

If you are using CephFS (i.e., the POSIX file system component), large files are by default broken up into 4MB objects. The object size is configurable. Each object is distributed pseudo-randomly across the different OSDs (and if you have a replication level > 1, object replicas obey the rules defined in your CRUSH map). Replication is set on a per-pool basis and happens automatically. If an OSD goes down and replication is in use, Ceph will attempt to heal itself by re-replicating the objects that were on the down OSD onto the remaining ones.

For a more in-depth explanation about objects and striping, see:
http://ceph.com/docs/master/dev/file-striping/
http://ceph.com/docs/master/architecture/
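
To make the layout concrete, here is a rough Python sketch (not Ceph code; the parameter names and the mapping follow the file-striping doc above, and the defaults are just the common 4MB layout) of how a byte offset in a CephFS file maps to an object index:

# Sketch of CephFS-style striping: which object a file offset lands in.
# Defaults assume stripe_unit == object_size == 4 MB, stripe_count == 1.

def object_for_offset(offset, object_size=4 << 20, stripe_unit=4 << 20,
                      stripe_count=1):
    """Return (object_index, offset_within_object) for a byte offset."""
    stripe_units_per_object = object_size // stripe_unit
    block_no = offset // stripe_unit          # which stripe unit overall
    stripe_no = block_no // stripe_count      # which "row" of the stripe
    stripe_pos = block_no % stripe_count      # which object within the set
    object_set = stripe_no // stripe_units_per_object
    object_index = object_set * stripe_count + stripe_pos
    offset_in_object = ((stripe_no % stripe_units_per_object) * stripe_unit
                        + offset % stripe_unit)
    return object_index, offset_in_object

# With the defaults, a 10 MB file spans three objects:
print(object_for_offset(0))            # (0, 0)
print(object_for_offset(5 << 20))      # (1, 1048576)
print(object_for_offset(9 << 20))      # (2, 1048576)

With the defaults this reduces to offset // 4MB, but non-default stripe units and counts let you spread a single file's writes across several objects at once.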

One thing you should know is that Ceph's journal is similar to ext4's "data=journal" mode in that data is always written to the journal before it goes to disk. If I remember correctly, ldiskfs by default uses ext3/4's "data=ordered" mode, which only writes the data once. The upshot is that Ceph needs to do more writes for the same amount of data than Lustre, but there is a lower chance of data corruption while the data is being written. When not network bound, Ceph will likely be slower for long sequential writes on the same hardware than a highly tuned Lustre system, but in theory it is faster for short, bursty traffic, since writes can be acknowledged as soon as they hit the journal (which requires fewer seeks than writing the data out to the underlying filesystem). By putting OSD journals on high-throughput SSDs you can mitigate the sequential-write penalty and get the best of both worlds, though you need more PCIe and controller throughput, and you potentially lose a bit of capacity and read throughput if you have to reduce your OSD count to add the SSD journals. PCIe SSDs may become a very interesting option for journals as the price comes down.
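
As a back-of-envelope illustration (the throughput figures below are made up for the example, not benchmarks), here is roughly why journal placement matters for sequential writes:

# Hypothetical per-OSD sequential write ceilings.
disk_seq_write = 120.0   # MB/s, assumed spinning data disk
ssd_seq_write = 400.0    # MB/s, assumed SSD used for the journal

# Journal co-located on the data disk: every byte is written twice to the
# same spindle (journal + filestore), so sequential throughput roughly halves.
colocated = disk_seq_write / 2

# Journal on a separate SSD: the data disk sees only one copy, so the
# limit is whichever device saturates first.
ssd_journal = min(disk_seq_write, ssd_seq_write)

print("journal on data disk: ~%.0f MB/s per OSD" % colocated)
print("journal on SSD:       ~%.0f MB/s per OSD" % ssd_journal)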

Thanks,
Mark

