Excellent, thanks for the detailed breakdown.
Take care,
Bill
From: Michael J. Kidd [michael.kidd@xxxxxxxxxxx]
Sent: Wednesday, January 07, 2015 4:50 PM
To: Sanders, Bill
Cc: Loic Dachary; ceph-users@xxxxxxxx
Subject: Re: PG num calculator live on Ceph.com
Hello Bill,
Either 2048 or 4096 should be acceptable. 4096 gives about a 300 PG per OSD ratio, which would leave room for tripling the OSD count without needing to increase the PG number, while 2048 gives about 150 PGs per OSD, which only leaves room for about a 50% OSD count expansion.
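(For reference, the ratio is just (pg_num * replica count) / OSD count. Assuming a replica size of 3 and roughly 40 OSDs -- I'm guessing at your exact numbers here -- that works out to:
    4096 * 3 / 40 ~= 307 PGs per OSD
    2048 * 3 / 40 ~= 154 PGs per OSD
)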
The high PG count per OSD issue really doesn't manifest aggressively until you get to around 1000 PGs per OSD and beyond. At those levels, steady-state operation continues without issue, but recovery within the cluster will see the memory utilization of the OSDs climb and can push into out-of-memory conditions on the OSD host (or at a minimum, heavy swap usage if enabled). Whether you actually experience issues depends, of course, on the number of OSDs per node and the amount of memory on each node.
As an example though, I worked on a cluster that was running about 5500 PGs per OSD. The cluster experienced a network config issue in the switchgear which isolated 2/3 of the OSD nodes from each other and from the other 1/3 of the cluster. When the network issue was cleared, the OSDs started dropping like flies... They'd start up, spool up the memory they needed for map update parsing, and get killed before making any real headway. We were finally able to get the cluster online by limiting what the OSDs were doing to a small slice of the normal start-up, waiting for the OSDs to calm down, then opening up a bit more for them to do (noup, noin, norecover, nobackfill, pause, noscrub and nodeep-scrub were all set, and then unset one at a time until all OSDs were up/in and able to handle the recovery).
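For reference, the flag juggling was done with the usual cluster flag commands, roughly along these lines (the order and pacing will vary by cluster):
    ceph osd set noup
    ceph osd set noin
    ceph osd set norecover
    ceph osd set nobackfill
    ceph osd set pause
    ceph osd set noscrub
    ceph osd set nodeep-scrub
and then, once things settled, 'ceph osd unset <flag>' for each one in turn, watching the OSDs stabilize between each step.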
Six weeks later, that same cluster lost about 40% of its OSDs during a power outage due to corruption from an HBA bug (it didn't flush the write cache to disk). This pushed the PG per OSD count over 9000!! It simply couldn't recover with the available memory at that PG count: each OSD, started by itself, would consume > 60 GB of RAM and get killed (the nodes only had 64 GB total).
While this is an extreme example, we see cases with > 1000 PGs per OSD on a regular basis. This is the type of thing we're trying to head off.
It should be noted that you can increase the PG num of a pool, but you cannot decrease it! The only way to reduce your cluster's PG count is to create new pools with a smaller PG num, migrate the data, and then delete the old, high PG count pools. You could also simply add more OSDs to reduce the PG per OSD ratio.
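If you do end up bumping a pool later, it's the usual two-step ('data' here is just a placeholder pool name):
    ceph osd pool set data pg_num 4096
    ceph osd pool set data pgp_num 4096
pgp_num is what actually triggers the data movement, so it's worth raising it in steps if the cluster is busy.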
The issue with too few PGs is poor data distribution. So it's all about having enough PGs to get good data distribution without going so high that you hit resource exhaustion during recovery.
Hope this helps put things into perspective.