First I'm addressing increasing your PG counts, since that is what you specifically asked about; however, I do not believe that is actually your problem, and I'll explain why further down.
There are a few recent threads on the ML about increasing the pg_num and pgp_num on a cluster (but if you learn how to search the archives, let me know... I always get an error). The gist is to set nobackfill, norecover, noout, and nodown on your cluster, then increase your pg_num and then pgp_num in small increments, waiting for all peering, creating, inactive, etc. pgs to clear before doing the next batch of pgs. We generally do 256 at a time, but we're seeing poor performance with that number, and since we have it mostly automated we're starting to increment by 64 to mitigate the cluster impact. A rough sketch of the command sequence is at the end of this message.

What percentage of your data is in each of your pools? Based on your PG counts, 2/3 of your data should be in volumes and 1/3 in images. If that is correct and will continue to be true, then you want to keep that ratio similar. Let me know if you have any questions on this.

This, however, is where I believe your slow and blocked requests come from. Every time we have persistent, but seemingly random, slow/blocked requests, it is always PG sub-folders splitting. The threshold for this is calculated from a constant and 2 settings in your config (filestore merge threshold, filestore split multiple). "filestore split multiple" is not a value used directly by the cluster; it is a variable used to calculate the value the cluster uses (the equation is shown below). "filestore merge threshold" is how many objects can be in sub-folders before they are merged back together into 1 directory; this is a sum of all objects in the sub-folders. If you set it to a negative value, sub-folders will never merge, but the value is still used in the equation with "filestore split multiple" (notice the abs() in the equation, which ignores the sign). The equation for how many objects you can have in a folder before it splits into sub-folders is:

    16 * {filestore split multiple} * abs({filestore merge threshold})

These settings cannot be injected; you must change your cluster config and restart your osds for them to take effect.

The way you can tell if this is happening on your cluster is to check what your values are, plug them into the equation, and then check a pg on one of your osds with a command similar to this to see if you are in the middle of splitting sub-folders, or recently split them:

    cd /var/lib/ceph/osd/ceph-$osd/current/
    for folder in *_head; do
        echo $folder
        ls -1R $folder | cut -d. -f1 | uniq -c | grep -Ev '^\s+1 '
    done

That assumes you are in a valid osd "current" folder, and it gives you a count of all objects inside the sub-folders of each PG. If you are in the middle of splitting sub-folders, you will see that the smallest counts are about 1/16 of the largest counts; that is because each splitting folder divides its objects into 16 new sub-folders.

When this happens on our clusters, we don't only see slow/blocked requests, we also see osds being marked down for a bit. We combat this by injecting --osd_heartbeat_grace set high enough to allow the osd to finish splitting its sub-folders before the cluster decides it has stopped responding to requests. This value is how long an osd will wait for a response from another osd before telling the mons that it's not responding. We used to use 180 (3 minutes), but that is no longer high enough, and we're now using 240 when we see that a cluster is splitting sub-folders and errantly marking osds down.
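For reference, injecting that at runtime looks roughly like this on our clusters (a sketch only; the value is the one we currently use, and injected values revert whenever an osd restarts, so also put it in ceph.conf if you want it to survive):

    ceph tell osd.* injectargs '--osd_heartbeat_grace=240'

Depending on your version, the mons may also need to see the same grace value before they stop marking osds down, so check whether you need to set it there as well.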
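To put numbers on the splitting threshold above: assuming the stock defaults of filestore split multiple = 2 and filestore merge threshold = 10 (check what your osds are actually running before trusting my arithmetic, e.g. via the admin socket), the threshold works out like this:

    # values the osds are actually using (run on the osd host)
    ceph daemon osd.0 config get filestore_split_multiple
    ceph daemon osd.0 config get filestore_merge_threshold

    16 * 2 * abs(10) = 320 objects in a folder before it splits

Right after a split, each of the 16 new sub-folders holds roughly 320 / 16 = 20 objects, which is where the 1/16 ratio mentioned above comes from.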
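Finally, coming back to your actual question: here is a bare-bones sketch of the pg_num/pgp_num procedure described at the top of this message, using your volumes pool and 64-pg increments as an example (the numbers are placeholders for illustration, and you should be watching ceph -s between every step rather than trusting a script):

    ceph osd set nobackfill
    ceph osd set norecover
    ceph osd set noout
    ceph osd set nodown

    # one increment: 2048 -> 2112; repeat until you reach your target
    ceph osd pool set volumes pg_num 2112
    # wait for all creating/peering/inactive pgs to clear (ceph -s)
    ceph osd pool set volumes pgp_num 2112
    # wait for the cluster to settle again before the next increment

    # once all increments are done
    ceph osd unset nobackfill
    ceph osd unset norecover
    ceph osd unset noout
    ceph osd unset nodown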
From: ceph-users [ceph-users-bounces@xxxxxxxxxxxxxx] on behalf of Emilio Moreno Fernandez [emilio.moreno@xxxxxxx]
Sent: Tuesday, October 11, 2016 5:29 AM
To: 'ceph-users@xxxxxxxxxxxxxx'
Subject: [ceph-users] Modify placement group pg and pgp in production environment

Hi,
We have a production Ceph platform in our OpenStack farm. This platform has the following specs:

1 admin node
3 monitors
7 Ceph nodes with 160 OSDs on 1.2TB 10K SAS HDDs. Maybe 30 OSDs have SSD journals... we are in the middle of updating... ;-)

The whole network is on 10GbE links, and we now have some problems with slow and blocked requests... no problem, we are diagnosing the platform.

One of our problems is the placement groups; by mistake, this number has not been changed for a long time... our pools:
GLOBAL:
    SIZE     AVAIL      RAW USED     %RAW USED     OBJECTS
    173T     56319G     118T         68.36         10945k
POOLS:
    NAME        ID     CATEGORY     USED       %USED     MAX AVAIL     OBJECTS     DIRTY     READ       WRITE
    rbd         0      -            0          0         14992G        0           0         1          66120
    volumes     6      -            42281G     23.75     14992G        8871636     8663k     47690M     55474M
    images      7      -            18151G     10.20     14992G        2324108     2269k     1456M      1622k
    backups     8      -            0          0         14992G        1           1         18578      104k
    vms         9      -            91575M     0.05      14992G        12827       12827     2526k      6863k
And the PG counts on our pools are (only the used pools):

volumes    2048
images     1024
We think that our performance problem, after verifying network, servers, hardware, disks, software, bugs, logs, etc., is the number of PGs on the volumes pool...
Our Question:
How can we update the pg number, and then the pgp number, in a production environment without interrupting service, hurting performance, or taking down the virtual instances...??? The last update was from 512 to 1024 on the images pool, and we had a 2-hour service outage because the platform could not handle the data traffic... we are scared :-( Can we make this change in small increments over two weeks? How?
Thanks Thanks Thanks
_________________________________________________________________________
Emilio Moreno Fernández
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com