Hi Dan, hi Robert,

On 04.09.2014 21:09, Dan van der Ster wrote:

> Thanks again for all of your input. I agree with your assessment -- in our cluster we avg <3ms for a random (hot) 4k read already, but >40ms for a 4k write. That's why we're adding the SSDs -- you just can't run a proportioned RBD service without them.

How did you measure these latencies?

> I'll definitely give bcache a try in my test setup, but more reading has kinda tempered my expectations -- the rate of oopses and hangs on the bcache ML seems a bit high. And a 3.14 kernel would indeed still be a challenge on our RHEL6 boxes.

bcache works fine with 3.10 and a bunch of patches ;-) Not sure if you can upgrade to RHEL7, and also not sure if RHEL already has some of them ready. We've been using bcache on one of our Ceph clusters for more than a year, based on kernel 3.10 + 15 patches, and I never saw a crash or hang since applying them ;-) But yes, with a vanilla kernel it's not that stable.

Greets,
Stefan

> Cheers, Dan
>
> September 4 2014 8:47 PM, "Robert LeBlanc" <robert at leblancnet.us> wrote:
>
> You should be able to use any block device in a bcache device. Right now, we are OK losing one SSD and it takes out 5 OSDs. We would rather have twice the cache. Our opinion may change in the future. We wanted to keep overhead as low as possible. I think we may spend the extra on heavier-duty SSDs; less overhead from not having a mirror, and fewer SSD drives allow us to put more high-capacity spindles in each host (we have a density need). We still have to test how much cache is optimal. I'm not sure how much 1% SSD will help and whether 2% will make any difference, etc. That's why we need to test.
>
> My reasoning is that if we can buffer the writes, hopefully we can write to the spindles in a more linear manner and give reads a better chance of getting serviced faster. My theory is that with all the caching already happening at KVM, etc., 1% writeback will be much more useful than 1% read cache, because the reads at the OSD level will be cache misses anyway since they will be cold pages. Write-behind is really our target; reads can be serviced from cache a good portion of the time, but writes always have to hit a disk (our current disk system is about 33% reads and 66% writes) -- in the Ceph case (and in our config), three disks. When you add the latency from all the levels (network, kernel, buffers, disk, etc.), it adds up to a lot. If you are always doing large transfers, then it won't be too noticeable because bandwidth helps outweigh the latency. But when we are dealing with thousands of VMs all doing very small things like writing a couple of lines to a log, reading/writing some database pages, etc., the latency just kills the performance in a big way. Reads aren't too bad because only the primary OSD has to service the request, but on writes all three OSDs have to acknowledge the write. So I'm trying to get the absolute best write performance I can. If I can reduce a 10 millisecond write to disk to 1 ms, then I've saved about 18 ms in the transaction (primary OSD write plus parallel write to the secondary OSDs, presumably).
>
> With bcache, I'd even like to get rid of the journal writes in a test case. Since it is hitting SSD and then being serialized to the backing disk anyway by bcache, it seems that the journal is just a double write penalty.
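For anyone who wants to reproduce latency numbers like the ones quoted above, one straightforward way is fio at queue depth 1 against a mapped RBD image. A minimal sketch (pool, image and device names are just examples -- adjust for your setup):

  # create and map a throwaway test image
  rbd create test/latcheck --size 10240
  rbd map test/latcheck        # shows up as e.g. /dev/rbd0

  # random 4k reads, direct I/O, single outstanding request
  fio --name=4k-randread --filename=/dev/rbd0 --rw=randread --bs=4k \
      --iodepth=1 --direct=1 --ioengine=libaio --runtime=60 --time_based

  # same for random 4k writes (destroys data on the test image)
  fio --name=4k-randwrite --filename=/dev/rbd0 --rw=randwrite --bs=4k \
      --iodepth=1 --direct=1 --ioengine=libaio --runtime=60 --time_based

The "clat" numbers in the fio output are the per-op latencies being discussed here.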
> Then it will be time to tackle the network latency; can't wait to get our Infiniband gear to test AccelIO with Ceph.
>
> I'm interested to see what you decide to do and what your results are.
>
> On Thu, Sep 4, 2014 at 12:12 PM, Dan Van Der Ster <daniel.vanderster at cern.ch> wrote:
>
> I've just been reading the bcache docs. It's a pity the mirrored writes aren't implemented yet. Do you know if you can use an md RAID1 as a cache dev? And is the graceful failover from wb to writethrough actually working without data loss?
>
> Also, write-behind sure would help the filestore, since I'm pretty sure the same 4k blocks are being overwritten many times (from our RBD clients).
>
> Cheers, Dan
>
> On Sep 4, 2014 7:44 PM, Robert LeBlanc <robert at leblancnet.us> wrote:
>
> So far it has worked really well; we can raise/lower/disable/enable the cache in realtime and watch how the load and traffic changes. There have been some positive subjective results, but definitive results are still forthcoming. bcache on CentOS 7 was not easy, which makes me wish we were running Debian or Ubuntu. If there are enough reasons to train our admins on Debian/Ubuntu in addition to learning CentOS 7 for customer-facing boxes, we may move that way for Ceph and OpenStack, but I'm not sure how Red Hat purchasing Inktank will shift the development from Debian/Ubuntu, so we don't want to make any big changes until we have a better idea of what the future looks like. I think the Enterprise versions of Ceph (n-1 or n-2) will be a bit too old for where we want to be; I'm sure they will work wonderfully on Red Hat, but how will n.1, n.2 or n.3 run?
>
> Robert LeBlanc
>
> On Thu, Sep 4, 2014 at 11:22 AM, Dan Van Der Ster <daniel.vanderster at cern.ch> wrote:
>
> Hi Robert,
>
> That's actually a pretty good idea, since bcache would also accelerate the filestore flushes and leveldb. I actually wonder if an SSD-only pool would even be faster than such a setup... probably not.
>
> We're using an ancient enterprise Linux distro, so it will be a bit of a headache to get the right kernel, etc. But my colleague is planning to use bcache to accelerate our hypervisors' ephemeral storage, so I guess that's a solved problem.
>
> Hmm...
>
> Thanks!
>
> Dan
>
> On Sep 4, 2014 6:42 PM, Robert LeBlanc <robert at leblancnet.us> wrote:
>
> We are still pretty early on in our testing of how to best use SSDs as well. What we are trying right now, for some of the reasons you mentioned already, is to use bcache as a cache for both journal and data. We have 10 spindles in our boxes with 2 SSDs. We created two bcaches (one for each SSD) and put five spindles behind each, with the journals as just files on the spindle (because the journal is hot, it should stay in the SSD cache). This should have the advantage that if the SSD fails, it could automatically fail over to write-through mode (although I don't think it will help if the SSD suddenly fails). However, it seems that if any part of the journal is lost, the OSD is toast and needs to be rebuilt. bcache was appealing to us because one SSD can front multiple backend disks and make the most efficient use of the SSD; it also has write-around for large sequential writes, so the cache is not evicted by large sequential writes, which spindles are good at.
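Since that setup keeps coming up in the thread, here is a rough sketch of one SSD fronting several spindles with bcache-tools, in case it helps anyone following along (device names, the cache-set UUID and the cutoff value are placeholders; a bcache-enabled kernel is assumed):

  # format the SSD as a cache device, and each spindle as a backing device
  make-bcache -C /dev/sdk
  make-bcache -B /dev/sda
  make-bcache -B /dev/sdb

  # register the devices (udev normally does this on boot) and attach the
  # backing devices to the cache set; the UUID comes from bcache-super-show /dev/sdk
  echo /dev/sda > /sys/fs/bcache/register
  echo <cset-uuid> > /sys/block/bcache0/bcache/attach
  echo <cset-uuid> > /sys/block/bcache1/bcache/attach

  # caching mode can be flipped at runtime: writethrough, writeback,
  # writearound or none
  echo writeback > /sys/block/bcache0/bcache/cache_mode

  # I/O more sequential than this bypasses the cache entirely
  echo 4M > /sys/block/bcache0/bcache/sequential_cutoff

  # then mkfs and mount /dev/bcache0, /dev/bcache1, ... and build the OSDs on top

Detaching the cache (echo 1 > /sys/block/bcache0/bcache/detach) or writing none to cache_mode is what makes the runtime enable/disable described above possible.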
> Since we have a high read cache hit rate from KVM and other layers, this is primarily intended to help accelerate writes more than reads (we are also more write-heavy in our environment). So far it seems to help, but we are going to start more in-depth testing soon. One drawback is that bcache devices don't seem to like partitions, so we have created the OSDs manually instead of using ceph-deploy.
>
> I too am interested in others' experience with SSDs and trying to cache/accelerate Ceph. I think the caching pool will be the best option in the long run, but it still needs some performance tweaking for small reads before it will be really viable for us.
>
> Robert LeBlanc
>
> On Thu, Sep 4, 2014 at 10:21 AM, Dan Van Der Ster <daniel.vanderster at cern.ch> wrote:
>
> Dear Cephalopods,
>
> In a few weeks we will receive a batch of 200GB Intel DC S3700s to augment our cluster, and I'd like to hear your practical experience and discuss options for how best to deploy these.
>
> We'll be able to equip each of our 24-disk OSD servers with 4 SSDs, so they will become 20 OSDs + 4 SSDs per server. Until recently I've been planning to use the traditional deployment: 5 journal partitions per SSD. But as SSD-day approaches, I'm growing less comfortable with the idea of 5 OSDs going down every time an SSD fails, so perhaps there are better options out there.
>
> Before getting into options, I'm curious about the real reliability of these drives:
>
> 1) How often are DC S3700s failing in your deployments?
>
> 2) If you have SSD journals at a ratio of 1 to 4 or 5, how painful is the backfilling which results from an SSD failure? Have you considered tricks like increasing the down out interval so backfilling doesn't happen in this case (leaving time for the SSD to be replaced)?
>
> Beyond the usual 5-partition deployment, is anyone running a RAID1 or RAID10 for the journals? If so, are you using the raw block devices or formatting them and storing the journals as files on the SSD array(s)? Recent discussions seem to indicate that XFS is just as fast as the block dev, since these drives are so fast.
>
> Next, I wonder how people with puppet/chef/... are handling the creation/re-creation of the SSD devices. Are you just wiping and rebuilding all the dependent OSDs completely when the journal dev fails? I'm not keen on puppetizing the re-creation of journals for OSDs...
>
> We also have this crazy idea of failing over to a local journal file in case an SSD fails. In this model, when an SSD fails we'd quickly create a new journal either on another SSD or on the local OSD filesystem, then restart the OSDs before backfilling starts. Thoughts?
>
> Lastly, I would also consider using 2 of the SSDs in a data pool (with the other 2 SSDs holding 20 journals -- probably in a RAID1 to avoid backfilling 10 OSDs when an SSD fails). If the 10-to-1 SSD ratio would perform adequately, that'd give us quite a few SSDs to build a dedicated high-IOPS pool.
>
> I'd also appreciate any other suggestions/experiences which might be relevant.
>
> Thanks!
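On the down/out interval and journal-failover questions above, a sketch of the knobs usually involved (OSD id, timeout and journal path are only examples, and recreating a journal is only safe if the old one was flushed, i.e. the OSD was stopped cleanly -- a journal lost mid-flight still means rebuilding the OSD, as Robert notes):

  # ceph.conf, [mon] section: allow more time to replace a failed journal
  # SSD before the dead OSDs are marked out and backfilling starts
  mon osd down out interval = 1800

  # or, for a planned SSD swap, suppress out-marking entirely for the duration
  ceph osd set noout
  #   ... replace the SSD, recreate the journals, restart the OSDs ...
  ceph osd unset noout

  # moving one OSD to a new journal device or file (repeat per OSD):
  service ceph stop osd.12
  ceph-osd -i 12 --flush-journal      # only works while the old journal is still readable
  ln -sf /dev/disk/by-partlabel/journal-12 /var/lib/ceph/osd/ceph-12/journal
  ceph-osd -i 12 --mkjournal
  service ceph start osd.12

For the "fail over to a local journal file" idea, the symlink target would simply point at a file on the OSD's own filesystem instead of a new SSD partition.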
> Dan
>
> --
> Dan van der Ster || Data & Storage Services || CERN IT Department
>
> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com