Hi Christian,

Thanks for your response.

>> We are using 0.45 in production. Recent ceph versions are quite stable
>> (although we had some troubles with excessive logging and a full log
>> partition lately which caused our cluster to halt).

Was the excessive logging caused by a configuration error?

>> For the moment I would definitely recommend using XFS as the
>> underlying filesystem, at least until there is a fix for the
>> orphan_commit_root problem. XFS comes with a slight performance
>> impact, but it seems to be the only filesystem that is able to handle
>> heavy ceph workloads for the moment.

What is the benefit of using btrfs? Snapshots? (I would like to be able
to do snapshots, and maybe clones.)

>> We are running a small ceph cluster (4 servers with 4 OSDs each) on a
>> 10GE network. Servers are spread across two datacenters with a 5km
>> (3 mile) long 10GE fibre-link for data replication. Our servers are
>> equipped with 80GB Fusion-IO drives (for the journal) and traditional
>> 3.5'' SAS drives in a RAID5 configuration (but I would not recommend
>> this setup). From a guest we can get a throughput of ~500MB/s.

Great! (And from multiple guests, do you get more total throughput?)

Also, about latencies: do you get good latencies with your Fusion-IO
journal? I currently use ZFS storage, where writes go to a fast NVRAM
journal first and are then flushed to 15K disks. Is it the same
behaviour with ceph?

>> This is probably the best hardware for a ceph cluster money can buy.
>> Are you planning a single SAS drive per OSD?

Yes, one OSD per drive. That way, if something goes wrong with btrfs or
XFS, I'll only lose one disk and not the whole RAID. Is that the right
way to run OSDs?

>> I still don't know the cause exactly, but we are not able to saturate
>> 10GE (maybe it's the latency on the WAN link or some network
>> configuration problem).

Yes, maybe. (I wish I had the money for this kind of setup ;)

>> I did some artificial tests with btrfs with large metadata enabled
>> (e.g. mkfs.btrfs -l 64k -n 64k) and the performance degradation seems
>> to be gone.

Great! (I'm quite scared of this kind of bug.)

>> We are using bonding. The rados client does a failover to another OSD
>> node after a few seconds, when there is no response from the OSD.
>> (You should read about CRUSH in the ceph docs.)

Thanks again for all your answers. (The ceph community seems to be
great :)

Regards,

Alexandre
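PS: to make the snapshot question a bit more concrete, this is roughly
what I hope to be able to do from a client. It is only a sketch: the
image and snapshot names are made up, and the syntax is taken from my
reading of the rbd man page, so please correct me if I got it wrong.

    # create a 10GB image in the default rbd pool
    rbd create --size 10240 vm-disk-1

    # take a snapshot before touching the guest, then list snapshots
    rbd snap create --snap before-upgrade vm-disk-1
    rbd snap ls vm-disk-1

    # roll the image back if something goes wrong
    rbd snap rollback --snap before-upgrade vm-disk-1

Cloning a snapshot into a new writable image is the part I am least
sure is supported today, which is why I ask.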
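PS2: for the one-OSD-per-drive layout with the ZeusRAM journal, I
imagine a ceph.conf along the lines of the snippet below. Again only a
sketch with made-up paths and hostnames, and I don't know yet whether a
raw partition or a file on the NVRAM device is the better choice for
the journal:

    [osd]
        ; one data directory per OSD, each on its own SAS/SSD drive
        osd data = /srv/ceph/osd.$id
        ; per-OSD journal file on the NVRAM device mounted at /nvram
        osd journal = /nvram/journal.$id
        osd journal size = 1024

    [osd.0]
        host = node1

    [osd.1]
        host = node1

Does that roughly match what you are doing with the Fusion-IO drives,
or do you give each OSD its own journal partition?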
----- Original Message -----
From: "Christian Brunner" <christian@xxxxxxxxxxxxxx>
To: "Alexandre DERUMIER" <aderumier@xxxxxxxxx>
Cc: ceph-devel@xxxxxxxxxxxxxxx
Sent: Friday, 18 May 2012, 10:45:48
Subject: Re: is rados block cluster production ready ?

2012/5/18 Alexandre DERUMIER <aderumier@xxxxxxxxx>:
> Hi,
> I'm going to build a rados block cluster for my kvm hypervisors.
>
> Is it already production ready? (stable, no crashes)

We are using 0.45 in production. Recent ceph versions are quite stable
(although we had some troubles with excessive logging and a full log
partition lately which caused our cluster to halt).

> I have read about some btrfs bugs on this mailing list, so I'm a bit scared...

For the moment I would definitely recommend using XFS as the
underlying filesystem, at least until there is a fix for the
orphan_commit_root problem. XFS comes with a slight performance
impact, but it seems to be the only filesystem that is able to handle
heavy ceph workloads for the moment.

> Also, what performance could I expect?

We are running a small ceph cluster (4 servers with 4 OSDs each) on a
10GE network. Servers are spread across two datacenters with a 5km
(3 mile) long 10GE fibre-link for data replication. Our servers are
equipped with 80GB Fusion-IO drives (for the journal) and traditional
3.5'' SAS drives in a RAID5 configuration (but I would not recommend
this setup). From a guest we can get a throughput of ~500MB/s.

> I'm trying to build a fast cluster, with fast SSD disks.
> Each node: 8 OSDs with "OCZ Talos" SAS drives + a STEC ZeusRAM drive (8GB NVRAM) for the journal + 10GbE networking.
> Do you think I can saturate the 10GbE?

This is probably the best hardware for a ceph cluster money can buy.
Are you planning a single SAS drive per OSD?

I still don't know the cause exactly, but we are not able to saturate
10GE (maybe it's the latency on the WAN link or some network
configuration problem).

> I also have some questions about performance over time.
> I have had some problems with my ZFS SAN, with fragmentation and metaslab problems.
> How does btrfs perform over time?

I did some artificial tests with btrfs with large metadata enabled
(e.g. mkfs.btrfs -l 64k -n 64k) and the performance degradation seems
to be gone.

> About the network, does the rados protocol support some kind of multipathing? Or do I need to use bonding/LACP?

We are using bonding. The rados client does a failover to another OSD
node after a few seconds, when there is no response from the OSD.
(You should read about CRUSH in the ceph docs.)

Regards,
Christian

--
Alexandre Derumier
Systems Engineer
Phone: 03 20 68 88 90
Fax: 03 20 68 90 81
45 Bvd du Général Leclerc, 59100 Roubaix - France
12 rue Marivaux, 75002 Paris - France
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html