First I'm addressing increasing your PG counts, since that is what you specifically asked about; however, I do not believe that is actually your problem, and I'll explain why further down.
There are a few recent threads on the ML about increasing the pg_num and pgp_num on a cluster (but if you learn how to search the archives, let me know... I always get an error). The gist is to set nobackfill, norecover, noout, and nodown on your cluster, then increase your pg_num and then pgp_num in small increments, waiting for all peering, creating, inactive, etc. pgs to clear before doing the next batch of pgs. We generally do 256 at a time, but we're seeing poor performance with that number, and since we have it mostly automated we're starting to increment by 64 to mitigate the cluster impact. A rough sketch of the command sequence is at the end of this message.

What percentage of your data is in each of your pools? Based on your PG counts, 2/3 of your data should be in volumes and 1/3 in images. If that is correct and will continue to be true, then you want to keep that ratio similar. Let me know if you have any questions on this.

This, however, is where I believe your slow and blocked requests come from. Every time we have persistent, but seemingly random, slow/blocked requests, it is always PG sub-folders splitting. The threshold for this is calculated from a constant and 2 settings in your config (filestore merge threshold, filestore split multiple). "filestore split multiple" is not a value used directly by the cluster; it is a variable used to calculate the value the cluster uses (the equation is shown below). "filestore merge threshold" is how many objects can be in sub-folders before they are merged back together into 1 directory; this is a sum of all objects in the sub-folders. If you set it to a negative value, sub-folders will never merge, but the value is still used in the equation with "filestore split multiple" (notice the abs() in the equation, which ignores the sign). The equation for how many objects you can have in a folder before it splits into sub-folders is:

    16 * {filestore split multiple} * abs({filestore merge threshold})

These settings cannot be injected; you must change your cluster config and restart your osds for them to take effect.

The way you can tell if this is happening on your cluster is to check what your values are, plug them into the equation, and then check a pg on one of your osds with a command similar to this to see if you are in the middle of splitting sub-folders, or recently split them:

    cd /var/lib/ceph/osd/ceph-$osd/current/
    for folder in *_head; do
        echo $folder
        ls -1R $folder | cut -d. -f1 | uniq -c | grep -Ev '^\s+1 '
    done

That assumes you are in a valid osd "current" folder, and it gives you a count of all objects inside the sub-folders of each PG. If you are in the middle of splitting sub-folders, you will see that the smallest counts are about 1/16 of the largest counts; that is because each splitting folder divides its objects into 16 new sub-folders.

When this happens on our clusters, we don't only see slow/blocked requests, we also see osds being marked down for a bit. We combat this by injecting --osd_heartbeat_grace set high enough to allow the osd to finish splitting its sub-folders before the cluster decides it has stopped responding to requests. This value is how long an osd will wait for a response from another osd before telling the mons that it's not responding. We used to use 180 (3 minutes), but that is no longer high enough, and we're now using 240 when we see that a cluster is splitting sub-folders and errantly marking osds down.
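For reference, injecting that at runtime looks roughly like this on our clusters (a sketch only; the value is the one we currently use, and injected values revert whenever an osd restarts, so also put it in ceph.conf if you want it to survive):

    ceph tell osd.* injectargs '--osd_heartbeat_grace=240'

Depending on your version, the mons may also need to see the same grace value before they stop marking osds down, so check whether you need to set it there as well.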
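To put numbers on the splitting threshold above: assuming the stock defaults of filestore split multiple = 2 and filestore merge threshold = 10 (check what your osds are actually running before trusting my arithmetic, e.g. via the admin socket), the threshold works out like this:

    # values the osds are actually using (run on the osd host)
    ceph daemon osd.0 config get filestore_split_multiple
    ceph daemon osd.0 config get filestore_merge_threshold

    16 * 2 * abs(10) = 320 objects in a folder before it splits

Right after a split, each of the 16 new sub-folders holds roughly 320 / 16 = 20 objects, which is where the 1/16 ratio mentioned above comes from.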
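Finally, coming back to your actual question: here is a bare-bones sketch of the pg_num/pgp_num procedure described at the top of this message, using your volumes pool and 64-pg increments as an example (the numbers are placeholders for illustration, and you should be watching ceph -s between every step rather than trusting a script):

    ceph osd set nobackfill
    ceph osd set norecover
    ceph osd set noout
    ceph osd set nodown

    # one increment: 2048 -> 2112; repeat until you reach your target
    ceph osd pool set volumes pg_num 2112
    # wait for all creating/peering/inactive pgs to clear (ceph -s)
    ceph osd pool set volumes pgp_num 2112
    # wait for the cluster to settle again before the next increment

    # once all increments are done
    ceph osd unset nobackfill
    ceph osd unset norecover
    ceph osd unset noout
    ceph osd unset nodown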
From: ceph-users [ceph-users-bounces@xxxxxxxxxxxxxx] on behalf of Emilio Moreno Fernandez [emilio.moreno@xxxxxxx]
Sent: Tuesday, October 11, 2016 5:29 AM
To: 'ceph-users@xxxxxxxxxxxxxx'
Subject: [ceph-users] Modify placement group pg and pgp in production environment

Hi,
We have a production Ceph platform in our OpenStack farm. This platform has the following specs:

1 admin node
3 monitors
7 Ceph nodes with 160 OSDs on 1.2TB 10K SAS HDDs. Maybe 30 OSDs have SSD journals... we are in the middle of updating... ;-)

The whole network is on 10GbE links, and we now have some problems with slow and blocked requests... no problem, we are diagnosing the platform.

One of our problems is the placement groups; by mistake, this number has not been changed for a long time... our pools:
GLOBAL:
    SIZE     AVAIL      RAW USED     %RAW USED     OBJECTS
    173T     56319G     118T         68.36         10945k
POOLS:
    NAME        ID     CATEGORY     USED       %USED     MAX AVAIL     OBJECTS     DIRTY     READ       WRITE
    rbd         0      -            0          0         14992G        0           0         1          66120
    volumes     6      -            42281G     23.75     14992G        8871636     8663k     47690M     55474M
    images      7      -            18151G     10.20     14992G        2324108     2269k     1456M      1622k
    backups     8      -            0          0         14992G        1           1         18578      104k
    vms         9      -            91575M     0.05      14992G        12827       12827     2526k      6863k
And the PG counts on our pools are (only the used pools):

volumes    2048
images     1024
We think that our performance problem, after verifying network, servers, hardware, disks, software, bugs, logs, etc., is the number of PGs on the volumes pool...
Our Question:
How can we update the pg number, and then the pgp number, in a production environment without interrupting service, hurting performance, or taking down the virtual instances...??? The last update was from 512 to 1024 on the images pool, and we had a 2-hour service outage because the platform could not handle the data traffic... we are scared :-( Can we make this change in small increments over two weeks? How?
Thanks Thanks Thanks
_________________________________________________________________________
Emilio Moreno Fernández
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com