Hello,

A bit of back story first; it may prove educational for others and future generations.

As some may recall, I have a firefly production cluster with a storage node design that was both optimized for the use case at the time and estimated to have the capacity to support 140 VMs (all running the same OS/application, thus the predictable usage pattern). Alas, people started running different VMs, and my request for new HW was delayed as well. So now there are 280 VMs doing nearly exclusively writes (8MB/s, 1000 ceph ops), and the ceph cluster handles this steady state w/o breaking a sweat (avio is less than 0.01ms and "disks" are less than 5% busy), basically nicely validating my design for this use case. ^o^

It becomes slightly irritated when asked to do reads (like VM reboots). Those will drive utilization up to 100% at times, but avio is still reasonable at less than 5ms. This is also why I disabled scrubbing 9 months ago, when the cluster did hit my expected capacity limit (and I asked for more HW).

However, when trying to add the new node (a much faster design in several ways) to the cluster, the resulting backfilling (triggered by merely adding the first OSD to the CRUSH map, not even starting it or anything) totally kills things, with avio frequently over 100ms and thus VMs croaking left and right. This was of course with all the recently discussed backfill and recovery parameters tuned all the way down. There simply is no maintenance window long enough to phase in that 3rd node.

This finally got the attention of the people who approve HW orders, and now the tack seems to be "fix it, whatever it takes". ^o^

So the least invasive plan I've come up with so far is to create an SSD-backed cache tier pool, wait until most (hot) objects have made it in there and the old (now backing) pool has gone mostly quiescent, and then add that additional node and re-build the older ones as planned. (A rough sketch of the commands I have in mind is at the very end of this mail.)

The size of that SSD cache pool would be at least 80% of the total current data (which of course isn't all hot). So, do people who have actual experience with cache tiers under firefly that aren't under constant pressure to evict things think this is feasible?

Again, I think that based on the cache size I can tune things to avoid evictions and flushes, but if it should start flushing things, for example, is that an asynchronous operation or will it impede performance of the cache tier? As in, does the flushing of an object have to be finished before it can be written to again?

Obviously I can't do anything about slow reads from the backing pool for objects that somehow didn't make it into the cache yet. But while slow reads are not nice, it is slow WRITES that really upset the VMs and the application they run. Clearly what I'm worried about here is that the old pool, while backfilling/recovering, will be quite comatose (as mentioned above) during that time.

Regards,

Christian
--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Fusion Communications
http://www.gol.com/
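
P.S. For concreteness, here is roughly the command sequence I have in mind (firefly syntax). This is only a sketch: the pool names ("rbd" as the backing pool, "cache-pool"), the pg count, the ruleset number and all the thresholds are placeholders I'd still have to adapt to the actual SSD capacity, so treat it as illustration rather than a recipe.

    # Backfill/recovery throttles already in place (the usual suspects):
    ceph tell osd.\* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1 --osd_recovery_op_priority 1'

    # Create the cache pool on the SSD CRUSH ruleset (name, pg count and
    # ruleset number are made up):
    ceph osd pool create cache-pool 512 512
    ceph osd pool set cache-pool crush_ruleset 1

    # Put it in front of the existing pool as a writeback tier:
    ceph osd tier add rbd cache-pool
    ceph osd tier cache-mode cache-pool writeback
    ceph osd tier set-overlay rbd cache-pool

    # The tiering agent needs a hit set to make flush/evict decisions:
    ceph osd pool set cache-pool hit_set_type bloom
    ceph osd pool set cache-pool hit_set_count 1
    ceph osd pool set cache-pool hit_set_period 3600

    # Sizing and flush/evict knobs: target_max_bytes a bit below the raw SSD
    # capacity, ratios and min ages set high so the agent stays idle:
    ceph osd pool set cache-pool target_max_bytes 4398046511104   # ~4TB, placeholder
    ceph osd pool set cache-pool cache_target_dirty_ratio 0.9
    ceph osd pool set cache-pool cache_target_full_ratio 0.95
    ceph osd pool set cache-pool cache_min_flush_age 86400
    ceph osd pool set cache-pool cache_min_evict_age 86400

The idea behind the high dirty/full ratios and the long min flush/evict ages is to keep the tiering agent from touching anything while the backing pool is busy backfilling, so the only traffic hitting the old pool would be reads for objects that haven't been promoted yet.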