OK, thanks for the clarifications!

-Wyllys

On Wed, Feb 18, 2015 at 10:52 AM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> On Wed, 18 Feb 2015, Wyllys Ingersoll wrote:
>> Thanks! More below inline...
>>
>> On Wed, Feb 18, 2015 at 10:05 AM, Wido den Hollander <wido@xxxxxxxx> wrote:
>> > On 18-02-15 15:39, Wyllys Ingersoll wrote:
>> >> Can someone explain the interaction and effects of all of these
>> >> "full_ratio" parameters? I haven't found any good explanation of how
>> >> they affect the distribution of data once the cluster gets above the
>> >> "nearfull" and close to the "full" ratios.
>> >>
>> >
>> > When only ONE (1) OSD goes over the mon_osd_nearfull_ratio, the cluster
>> > goes from HEALTH_OK into HEALTH_WARN state.
>> >
>> >>
>> >> mon_osd_full_ratio
>> >> mon_osd_nearfull_ratio
>> >>
>> >> osd_backfill_full_ratio
>> >> osd_failsafe_full_ratio
>> >> osd_failsafe_nearfull_ratio
>> >>
>> >> We have a cluster with about 144 OSDs (518 TB) and are trying to get
>> >> it to 90% full for testing purposes.
>> >>
>> >> We've found that when some of the OSDs get above the mon_osd_full_ratio
>> >> value (.95 in our system), the cluster stops accepting any new data,
>> >> even though there is plenty of space left on other OSDs that are not
>> >> yet even up to 90%. Tweaking the osd_failsafe ratios enabled data to
>> >> move again for a bit, but eventually it becomes unbalanced and stops
>> >> working again.
>> >>
>> >
>> > Yes, that is because with Ceph safety comes first. When only one OSD
>> > goes over the full ratio, the whole cluster stops I/O.
>>
>>
>> Which full_ratio? The problem is that there are at least 3
>> "full_ratios" - mon_osd_full_ratio, osd_failsafe_full_ratio, and
>> osd_backfill_full_ratio - how do they interact? What is the
>> consequence of having one be higher than the others?
>
> mon_osd_full_ratio (.95) ... when any OSD reaches this threshold, the
> monitor marks the cluster as 'full' and client writes are not accepted.
>
> mon_osd_nearfull_ratio (.85) ... when any OSD reaches this threshold, the
> cluster goes HEALTH_WARN and calls out near-full OSDs.
>
> osd_backfill_full_ratio (.85) ... when an OSD locally reaches this
> threshold, it will refuse to migrate a PG to itself. This prevents
> rebalancing or repair from overfilling an OSD. It should be lower than
> the full ratio.
>
> The osd_failsafe_full_ratio (.97) is a final sanity check that makes the
> OSD throw out writes if it is really close to full.
>
> It's bad news if an OSD fills up completely, so we do what we can to
> prevent it.
>
>> It seems extreme that 1 full OSD out of potentially hundreds would
>> cause all I/O into the cluster to stop when there are literally tens or
>> hundreds of terabytes of space left on other, less-full OSDs.
>
> Yes, but the nature of hash-based distribution is that you don't know
> where a write will go, so you don't want to let the cluster fill up. 85%
> is pretty conservative; you could increase it if you're comfortable. Just
> be aware that file systems over 80% full start to get very slow, so it is
> a bad idea to run them that full anyway.
>
>> The confusion for me (and probably for others) is the proliferation of
>> "full_ratio" parameters and a lack of clarity about how they all affect
>> the cluster's health and its ability to rebalance when things start to
>> fill up.
>>
>>
>> > CRUSH does not take OSD utilization into account when placing data, so
>> > it's almost impossible to predict which I/O can continue.
>> >
>> > Data safety and integrity are priority number 1. Full disks are a
>> > danger to those priorities, so I/O is stopped.
>>
>>
>> Understood, but 1 full disk out of hundreds should not cause the
>> entire system to stop accepting new data or even balancing out the
>> data that it already has, especially when there is still room to grow
>> on other OSDs.
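The thresholds described above form a ladder of per-OSD checks. The following is an illustrative Python sketch, not Ceph source; the cutoffs are the default values quoted in the thread, and the real checks live in the monitor and OSD daemons:

```python
# Illustrative sketch only -- not Ceph code. Thresholds are the defaults
# quoted above.
NEARFULL = 0.85       # mon_osd_nearfull_ratio: cluster goes HEALTH_WARN
BACKFILL_FULL = 0.85  # osd_backfill_full_ratio: OSD refuses backfill
FULL = 0.95           # mon_osd_full_ratio: cluster marked full, writes stop
FAILSAFE = 0.97       # osd_failsafe_full_ratio: OSD drops writes outright

def osd_state(utilization):
    """Return the effects that kick in at a given OSD utilization."""
    effects = []
    if utilization >= NEARFULL:
        effects.append("HEALTH_WARN (nearfull)")
    if utilization >= BACKFILL_FULL:
        effects.append("refuses backfill")
    if utilization >= FULL:
        effects.append("cluster marked full; client writes blocked")
    if utilization >= FAILSAFE:
        effects.append("failsafe: OSD throws out writes")
    return effects

# A single OSD crossing a threshold is enough to change cluster state:
print(osd_state(0.80))  # []
print(osd_state(0.90))  # nearfull warning + backfill refusal
print(osd_state(0.96))  # ...plus cluster-wide write block
```

Note the key point from the thread: the first check that matters cluster-wide (nearfull) fires well before the full cutoff, which is what gives the admin time to react.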
> > The "proper" response to this currently is that if an OSD reaches the > lower nearfull threshold the admin gets a warning and triggers some > rebalancing. That's why it's 10% lower then the actual full cutoff--there > is plenty of time to adjust weights and/or expand the cluster. > > It's not an ideal approach, perhaps, but it's simple and works well > enough. And it's not clear that there's is anything better we can do that > isn't also very complicated... > > sage -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html