Thanks Michael.
So, a quick correction based on Michael's response: in question 4, I should not have made any reference to Ceph objects, since objects are not striped (per Michael's response). Instead, I should simply have said "Ceph VM image" instead of "Ceph objects". A Ceph VM image would consist of thousands of objects, and those different objects are spread across multiple OSDs from multiple servers. In that situation, what's the answer to #4?
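To make the corrected question concrete, here's the picture in my head as a toy Python sketch. This is NOT real CRUSH or librbd -- the placement is just a stand-in hash, and the image id "deadbeef" and 40 GB size are made up:

import hashlib

OBJECT_SIZE = 4 * 1024 * 1024      # default RBD object size (4 MB)
IMAGE_SIZE = 40 * 1024 ** 3        # hypothetical 40 GB VM image
SERVERS = ["srv1", "srv2", "srv3", "srv4", "srv5", "srv6"]

def placement(object_name):
    # stand-in for CRUSH: a deterministic hash of the object name
    h = int(hashlib.md5(object_name.encode()).hexdigest(), 16)
    return SERVERS[h % len(SERVERS)]

counts = dict.fromkeys(SERVERS, 0)
for obj_num in range(IMAGE_SIZE // OBJECT_SIZE):   # ~10,000 objects
    # format-2 data objects are named rbd_data.<image id>.<16-hex object number>
    counts[placement("rbd_data.deadbeef.%016x" % obj_num)] += 1

print(counts)   # the single image ends up spread across all servers

If that's the right mental model, then question #4 is really about how those thousands of per-object requests land on the network ports and the unevenly sized servers.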
For question #2, a quick question to clarify Michael's response: if the underlying filesystem is xfs (and not btrfs), is it still more-or-less instantaneous because the snapshotting still uses some sort of copy-on-write technology?
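For reference, the copy-on-write cloning I'm referring to is the snapshot/protect/clone sequence below. This is only a minimal python-rbd sketch: the pool name 'rbd', the image names, and the size are placeholders, and I'm assuming a standard /etc/ceph/ceph.conf on the client.

import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('rbd')                  # placeholder pool name
try:
    r = rbd.RBD()
    # a format-2 image with layering enabled (required for clones)
    r.create(ioctx, 'golden-image', 10 * 1024 ** 3,
             old_format=False, features=rbd.RBD_FEATURE_LAYERING)
    img = rbd.Image(ioctx, 'golden-image')
    try:
        img.create_snap('base')        # snapshot the golden image
        img.protect_snap('base')       # clones require a protected snapshot
    finally:
        img.close()
    # the clone itself only writes metadata; object data is copied on first write
    r.clone(ioctx, 'golden-image', 'base', ioctx, 'vm-disk-1',
            features=rbd.RBD_FEATURE_LAYERING)
finally:
    ioctx.close()
    cluster.shutdown()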
On Tue, Nov 19, 2013 at 8:29 PM, Michael Lowe <j.michael.lowe@xxxxxxxxx> wrote:
1a. I believe it's dependent on format 2 images, not btrfs.
1b. Snapshots work independent of the backing file system.
2. All data goes through the journals.
4a. RBD image objects are not striped; they come in default 4 MB chunks, so consecutive sectors will come from the same object and OSD. I don't know what the result of the convolution of CRUSH, VM filesystem sector allocation, and Ethernet bonding would be.
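Regarding answer 2: just to check I'm drawing the right conclusion for my question 2 quoted below, here's the back-of-the-envelope math for journal wear during the initial 10 TB migration. The SSD count and endurance rating are made-up placeholders, and I'm assuming a replication level of 2 with every replica write passing through an OSD journal:

TB = 1000 ** 4                     # decimal terabytes

migration_data = 10 * TB           # initial data to move into Ceph
replication = 2                    # planned replica count
journal_ssds = 6                   # placeholder: one journal SSD per server
ssd_endurance = 150 * TB           # placeholder: rated lifetime writes per SSD

total_journal_writes = migration_data * replication        # 20 TB cluster-wide
writes_per_ssd = total_journal_writes / journal_ssds       # ~3.3 TB per SSD

print("per-SSD journal writes: %.1f TB" % (writes_per_ssd / float(TB)))
print("fraction of rated endurance used: %.1f%%"
      % (100.0 * writes_per_ssd / ssd_endurance))

(Obviously the real numbers depend on the SSD model and on how many journals share each SSD.)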
Sent from my iPad

1a) The Ceph documentation on OpenStack integration makes a big (and valuable) point that cloning images should be instantaneous/quick due to the copy-on-write functionality. See "Boot from volume" at the bottom of http://ceph.com/docs/master/rbd/rbd-openstack/. Here's the excerpt: "When Glance and Cinder are both using Ceph block devices, the image is a copy-on-write clone, so volume creation is very fast."
However, is this true *only* if we are using btrfs as the underlying file system for the OSDs? If so, then I don't think we can get this nice "quick" cloning, since the Ceph documentation states all over the place that btrfs is not yet production ready.

1b) Ceph also describes snapshotting/layering as being super quick due to "copy on write": http://ceph.com/docs/master/rbd/rbd-snapshot/ Does this feature also depend on btrfs being used as the underlying filesystem for the OSDs?

2) If we have about 10 TB of data to transfer to Ceph (initial migration), would all 10 TB pass through the journals? If so, would it make sense to initially put the journals on a separate partition of each disk (instead of an SSD), and then, once the 10 TB have been copied, change the Ceph configuration to use SSDs for journaling instead of a partition on each disk? That way we don't "kill" (or significantly reduce) the SSDs' life expectancy on day 1. (It's OK if the initial migration takes longer while we're not using SSDs -- and I'm not sure it would take more than twice as long anyway....)

3) The Ceph documentation recommends multiple networks (front-side and back-side). I was wondering, though, which is "better": one large bonded interface of 6 x 1 Gb/s = 6 Gb/s, or two or three interfaces, each of which would only be 2 or 3 Gb/s (after bonding)? My initial instinct is to just go for the nice fat 6 Gb/s one, since I'm not worried about denial-of-service attacks (DoS) on my internal network, and I figure this way I'll get excellent performance *most* of the time, with some (minor?) risk that occasionally a client request may (or may not?) experience latency due to network traffic from back-end activities like replication. (My replication level will most likely be 2.)

4a) Regarding bonding: if I understood the Ceph architecture correctly, any client request will automatically be routed to the individual OSDs that contain a piece (a stripe) of the overall object that is being sought. So a single client request for an object could generate "n" requests to "n" OSDs. Since the OSDs (in a perfect world) will reside equally on all servers, the normal hashing algorithm that Linux + LACP switches use should balance these "n" requests across "m" physical Ethernet ports. So if I have 6 Ethernet ports per server and, say, 6 servers, then in a perfect world my "n" requests would use all 6 Ethernet ports. (In the real world, I imagine the hashing is not perfect, so maybe only 4 Ethernet ports get used and the other two do nothing....) Is this understanding correct? If so, normal LACP hashing should suffice for my needs.

4b) A variation of the above question: if the 6 servers I have are NOT of equal size, such that the storage distributions are 24 TB, 16 TB, 12 TB, 6 TB, 4 TB and 4 TB (for a total of 68 TB of hard disks across all servers) -- would it be reasonable to assume that Ceph would balance any object data roughly proportionally to the size of each server? (You can assume that the CRUSH setup is just the default that comes with ceph-deploy, and that each server typically has 6 to 8 disks.) So a 1 TB VM, for example, would be split 24/68 on server 1, 16/68 on server 2, 12/68 on server 3, 6/68 on server 4, and 4/68 on servers 5 and 6?

--
Gautam Saxena
President & CEO
Integrated Analysis Inc.
Making Sense of Data.™
Biomarker Discovery Software | Bioinformatics Services | Data Warehouse Consulting | Data Migration Consulting
(301) 760-3077 office
(240) 479-4272 direct
(301) 560-3463 fax
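Coming back to my question 4a above: this is the rough layer3+4-style hash experiment behind my assumption that LACP would spread the many per-object OSD connections across the bonded ports. It's a toy sketch only, not the actual Linux bonding hash, and the addresses, ports, and slave count are made up:

import random

NUM_PORTS = 6                      # 6 x 1 Gb/s slaves in the bond
random.seed(1)

def bond_hash(src_ip, dst_ip, sport, dport):
    # crude stand-in for xmit_hash_policy=layer3+4: XOR of address and
    # port bits, folded down to a slave index
    return (src_ip ^ dst_ip ^ sport ^ dport) % NUM_PORTS

# one client host (last octet 11) opening a connection to an OSD on each of
# 6 servers, with kernel-chosen ephemeral source ports
flows = [(11, 21 + server, random.randrange(32768, 61000), 6800 + server)
         for server in range(6)]
slaves_used = {bond_hash(*f) for f in flows}
print("flows: %d  distinct bond slaves used: %d" % (len(flows), len(slaves_used)))

The real spread depends on the actual IPs and ephemeral ports in play, which is exactly the "maybe only 4 of the 6 ports get used" caveat above.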
Gautam Saxena
President & CEO
Integrated Analysis Inc.
Making Sense of Data.™
Biomarker Discovery Software | Bioinformatics Services | Data Warehouse Consulting | Data Migration Consulting
(301) 760-3077 office
(240) 479-4272 direct
(301) 560-3463 fax
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com