On Sun, Aug 1, 2010 at 5:08 AM, Roland Rabben <roland@xxxxxxxx> wrote:
> I know Ceph is not production ready yet, but from the activity on this
> mailing list things look promising.

As you note, Ceph is definitely not production-ready yet. Part of what
that means is that its testing in large-scale environments is limited,
so there may be bugs or unexpected behaviors. That said:

> I am researching alternatives to GlusterFS that I am currently using.
> My need is to store billions of files (big and small), and I am trying
> to find out if there are any considerations I should make when
> planning folder structure and server config using Ceph.
>
> On my GlusterFS system things seem to slow down dramatically as I
> grow the number of files. A simple ls takes forever. So I am looking
> for alternatives.
>
> Right now my folder structure looks like this:
>
> Users are grouped into folders, named /000, /001, ... /999, using a hash.
> Each user has its own folder inside the numbered folders.
> Inside each user-folder the user's files are stored in folders named
> /000, /001, ... /999, also using a hash.
>
> Would this folder structure or the amount of files become a problem using Ceph?

This structure *should* definitely be okay for Ceph -- it stores
dentries as part of the containing inode, and the metadata servers make
extensive use of an in-memory cache, so an ls will generally require
either zero or one on-disk lookups.

> I generally use 4U storage nodes with 36 x 1.5 TB or 2 TB SATA drives,
> 8 core CPU and 6 GB RAM. My application is write once and read many.
> What recommendations would you give with regards to setting up the
> filesystem on the storage nodes? ext3? ext4? lvm? RAID?

Again, because of the limited testing, I don't think anybody knows yet
what the best configuration is for such disk-heavy nodes. But you'll
definitely want to run btrfs on the storage nodes: it supports a number
of features that let Ceph run faster in some circumstances and recover
more reliably.

Speculating on the best-performing configuration is hard without
knowing your usage patterns, but: given the limited memory, I would
probably create 2 or 3 btrfs volumes across all the disks (setting
aside one extra disk per volume to use as a journal), and then run one
OSD per volume, with an appropriately configured CRUSH map to prevent
replicating data onto the same node! (Rough sketches of both are at the
end of this message.) If you can expand the memory above 12 GB, I'd
stuff the machine full and then run one OSD per 2 GB (or maybe 1 GB,
though your caching will be weaker), partitioning the drives with btrfs
accordingly.

> Today I am mounting all disks as individual ext3 partitions and tying
> them together with GlusterFS. Would this work with Ceph or would you
> recommend making one large LVM volume on each storage node that you
> expose to Ceph?

This would be a bad idea with Ceph; you'll need to combine the disks
into larger volumes (but as I said, btrfs can do this). The reason is
that each OSD instance manages a single data directory, and you don't
want to stuff 36 OSDs into 8 cores and 6 GB of RAM. :)

-Greg
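
P.S. To make the multi-OSD layout a little more concrete, here is a
rough, untested sketch of the relevant ceph.conf sections for one of
those nodes, assuming three btrfs volumes mounted at /data/osd0 through
/data/osd2 and one spare disk per volume set aside as a journal -- the
hostname, mount points, and device names below are just placeholders,
not recommendations:

    [osd]
        ; each OSD stores its data on one btrfs volume,
        ; mounted at /data/osd<id> before the daemon starts
        osd data = /data/osd$id

    [osd.0]
        host = node01
        ; dedicated journal disk for this OSD
        osd journal = /dev/sdj

    [osd.1]
        host = node01
        osd journal = /dev/sdk

    [osd.2]
        host = node01
        osd journal = /dev/sdl

Each of the three btrfs volumes would then span its share (roughly 11)
of the remaining data disks.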
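
The CRUSH side is just a rule whose placement step separates replicas
by host rather than by individual OSD. In the decompiled CRUSH map it
could look something like the sketch below -- the rule and bucket names
are placeholders, and the exact step syntax varies a bit between Ceph
versions:

    rule data {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        # start from the root bucket of the hierarchy
        step take default
        # pick each replica under a different host bucket,
        # then descend to one OSD inside it
        step chooseleaf firstn 0 type host
        step emit
    }

With replicas chosen per host, two copies of the same object can never
end up on OSDs inside the same 4U box.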