Hello,

On Wed, 07 Oct 2015 07:34:16 +0200 Loic Dachary wrote:

> Hi Christian,
>
> Interesting use case :-) How many OSDs / hosts do you have ? And how are
> they connected together ?
>

If you look far back in the archives you'll find that design. And of course
there will be a lot of "I told you so" comments, but it worked just as
planned while staying within the design specifications.

For example, one of the first things I did was to have 64 VMs install
themselves automatically from a virtual CD-ROM in parallel. This Ceph
cluster handled that w/o any slow requests and in decent time.

To answer your question: just 2 nodes with 2 OSDs (RAID6 behind an Areca
controller with a 4GB cache) each, replication of 2 obviously. Initially 3,
now 6 compute nodes. All interconnected via redundant 40Gb/s Infiniband
(IPoIB), 2 ports per server and 2 switches.

While the low number of OSDs is obviously part of the problem here, it is
masked by the journal SSDs and the large HW cache for the steady state.

My revised design is 6 RAID10 OSDs per node; the change to RAID10 is mostly
to accommodate the type of VMs this cluster wasn't designed for in the
first place.

My main suspects for the excessive slowness are actually the Toshiba DT
type drives used. We only found out after deployment that these can go
into a zombie mode (20% of their usual performance, for ~8 hours if not
permanently until power cycled) after a week of uptime. Again, the HW
cache is likely masking this for the steady state, but asking a sick DT
drive to seek (for reads) is just asking for trouble.

To illustrate this:
---
DSK | sdd | busy  86% | read 0 | write  99 | avio 43.6 ms |
DSK | sda | busy  12% | read 0 | write 151 | avio 4.13 ms |
DSK | sdc | busy   8% | read 0 | write 139 | avio 2.82 ms |
DSK | sdb | busy   7% | read 0 | write 132 | avio 2.70 ms |
---
The above is a snippet from atop on another machine here; the 4 disks are
in a RAID10. I'm sure you can guess which one is the DT01ACA200 drive;
sdb and sdc are Hitachi HDS723020BLA642 and sda is a Toshiba MG03ACA200.

I have another production cluster that originally had just 3 nodes with 8
OSDs each. It performed much better with MG drives. So the new node I'm
trying to phase in has these MG HDDs and the older ones will be replaced
eventually.

Christian

[snip]

--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Fusion Communications
http://www.gol.com/
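
P.S.: In case it's useful to anybody, here is a rough sketch (illustrative
only, not something we run in production) of how one could watch for a drive
dropping into that state without sitting in front of atop: sample
/proc/diskstats twice and compare per-disk busy time and average time per
request, roughly the "busy" and "avio" columns above. The sd* filter and the
10 ms warning threshold are arbitrary picks, adjust to taste.

---
#!/usr/bin/env python3
# Rough sketch: sample /proc/diskstats twice and print per-disk busy %
# and average ms per request, roughly the "busy" and "avio" columns of
# the atop snippet above.
# Field layout per the kernel's Documentation/iostats.txt:
#   [3] reads completed, [7] writes completed, [12] ms spent doing I/O
# The sd* filter and the 10 ms threshold are illustrative only.

import time

INTERVAL = 10          # sampling window in seconds
AVIO_WARN_MS = 10.0    # flag drives averaging more than this per request

def sample():
    disks = {}
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            name = fields[2]
            # whole sd? disks only; skip partitions (sda1), dm-*, loop*, ...
            if not name.startswith("sd") or name[-1].isdigit():
                continue
            ios = int(fields[3]) + int(fields[7])  # reads + writes completed
            busy_ms = int(fields[12])              # total time spent doing I/O
            disks[name] = (ios, busy_ms)
    return disks

before = sample()
time.sleep(INTERVAL)
after = sample()

for name in sorted(after):
    if name not in before:
        continue
    d_ios = after[name][0] - before[name][0]
    d_busy = after[name][1] - before[name][1]
    busy_pct = 100.0 * d_busy / (INTERVAL * 1000.0)
    avio_ms = d_busy / d_ios if d_ios else 0.0
    flag = "  <-- suspect" if avio_ms > AVIO_WARN_MS else ""
    print("DSK | %-4s | busy %3.0f%% | ios %6d | avio %5.2f ms |%s"
          % (name, busy_pct, d_ios, avio_ms, flag))
---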