Re: OSD is near full and slow in accessing storage from client

David Turner <drakonstein@xxxxxxxxx> · Tue, 21 Nov 2017 16:13:13 +0000

Your rbd pool can be removed (unless you're planning to use it), which will delete those PGs from your cluster/OSDs.  Also, all of your backfilling has finished and the cluster has settled.  Now you just need to work on balancing the weights of the OSDs in your cluster.
There are multiple ways to balance usage across the cluster: changing the crush weight of an OSD, changing the reweight of an OSD, doing the latter with `ceph osd reweight-by-utilization`, using CERN's modified version of that, which can weight things up as well as down, etc.  I use a method that changes the crush weight of the OSDs, but does so by downloading the crush map and using crushtool to generate a balanced map, applying it in one go.  A very popular method on the list is to create a cron job that makes very small modifications in the background and keeps things balanced by utilization.

You should be able to find a lot of references on the ML or in blog posts about these various options.  The takeaway is that the CRUSH algorithm is putting too much data on osd.4 and not enough data on osd.2 (those are the extremes, but there are others not quite as extreme), and you need to modify the weight and/or reweight of the OSDs to help the algorithm balance that out.
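
As a rough illustration of the reweight approach, the sketch below takes the %USE figures from the `ceph osd df` output quoted further down this thread (20 Nov) and proposes a lower reweight for any OSD well above the average. The function name `propose_reweights`, the 10% threshold and the 0.05 step are assumptions made for this example, not Ceph defaults, and it assumes every OSD currently has a reweight of 1.0, as in that output.

```python
# %USE per OSD, copied from the `ceph osd df` output of 20 Nov in this thread.
PCT_USED = {0: 68.71, 1: 55.92, 2: 50.69, 3: 56.22,
            4: 85.57, 5: 66.04, 6: 57.33, 7: 59.42}


def propose_reweights(pct_used, threshold=1.10, max_change=0.05):
    """Return {osd_id: new_reweight} for OSDs above threshold * average.

    Hypothetical helper: threshold and max_change are illustrative values.
    Assumes the current reweight of every OSD is 1.0.
    """
    avg = sum(pct_used.values()) / len(pct_used)
    proposals = {}
    for osd, used in sorted(pct_used.items()):
        if used > avg * threshold:
            # Ideal scaling would be avg / used, but move at most
            # max_change per step so data migrates in small batches.
            target = avg / used
            proposals[osd] = round(max(target, 1.0 - max_change), 5)
    return proposals


if __name__ == "__main__":
    for osd, weight in propose_reweights(PCT_USED).items():
        # Print the command rather than running it, so it can be reviewed.
        print(f"ceph osd reweight {osd} {weight}")
```

With these figures only osd.4 is flagged, with a proposed reweight of 0.95. Repeating small steps like this after each rebalance settles is the same idea as the cron-based approach mentioned above; `ceph osd test-reweight-by-utilization` offers a dry run of the built-in equivalent.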

On Tue, Nov 21, 2017 at 12:11 AM gjprabu <gjprabu@xxxxxxxxxxxx> wrote:
Hi David,

           This is our current status.

~]# ceph status
    cluster b466e09c-f7ae-4e89-99a7-99d30eba0a13
     health HEALTH_WARN
            mds0: Client integ-hm3 failing to respond to cache pressure
            mds0: Client integ-hm9-bkp failing to respond to cache pressure
            mds0: Client me-build1-bkp failing to respond to cache pressure
     monmap e2: 3 mons at {intcfs-mon1=192.168.113.113:6789/0,intcfs-mon2=192.168.113.114:6789/0,intcfs-mon3=192.168.113.72:6789/0}
            election epoch 16, quorum 0,1,2 intcfs-mon3,intcfs-mon1,intcfs-mon2
      fsmap e177798: 1/1/1 up {0=intcfs-osd1=up:active}, 1 up:standby
     osdmap e4388: 8 osds: 8 up, 8 in
            flags sortbitwise
      pgmap v24129785: 564 pgs, 3 pools, 6885 GB data, 17138 kobjects
            14023 GB used, 12734 GB / 26757 GB avail
                 560 active+clean
                   3 active+clean+scrubbing
                   1 active+clean+scrubbing+deep
  client io 47187 kB/s rd, 965 kB/s wr, 125 op/s rd, 525 op/s wr

]# ceph df
GLOBAL:
    SIZE       AVAIL      RAW USED     %RAW USED
    26757G     12735G       14022G         52.41
POOLS:
    NAME                   ID     USED       %USED     MAX AVAIL     OBJECTS 
    rbd                    0           0         0         3787G            0
    downloads_data         3       6885G     51.46         3787G     16047944
    downloads_metadata     4      84773k         0         3787G      1501805

Regards
Prabu GJ

---- On Mon, 20 Nov 2017 21:35:17 +0530 David Turner <drakonstein@xxxxxxxxx> wrote ----

What is your current `ceph status` and `ceph df`? The status of your cluster has likely changed a bit in the last week.

On Mon, Nov 20, 2017 at 6:00 AM gjprabu <gjprabu@xxxxxxxxxxxx> wrote:

Hi David,

Sorry for the late reply. The OSD sync has completed, but the available space on the fourth OSD still keeps shrinking. Is there any way to check or fix this?

ID WEIGHT  REWEIGHT SIZE   USE    AVAIL  %USE  VAR  PGS
0 3.29749  1.00000  3376G  2320G  1056G 68.71 1.10 144
1 3.26869  1.00000  3347G  1871G  1475G 55.92 0.89 134
2 3.27339  1.00000  3351G  1699G  1652G 50.69 0.81 134
3 3.24089  1.00000  3318G  1865G  1452G 56.22 0.90 142
4 3.24089  1.00000  3318G  2839G   478G 85.57 1.37 158
5 3.32669  1.00000  3406G  2249G  1156G 66.04 1.06 136
6 3.27800  1.00000  3356G  1924G  1432G 57.33 0.92 139
7 3.20470  1.00000  3281G  1949G  1331G 59.42 0.95 141
              TOTAL 26757G 16720G 10037G 62.49         
MIN/MAX VAR: 0.81/1.37  STDDEV: 10.26

Regards
Prabu GJ

---- On Mon, 13 Nov 2017 00:27:47 +0530 David Turner <drakonstein@xxxxxxxxx> wrote ----

You cannot reduce the PG count for a pool, so there isn't really anything you can do about this unless you create a new FS with better PG counts and migrate your data into it.
The problem with having more PGs than you need is the memory footprint of the OSD daemon. There are warning thresholds for having too many PGs per OSD.  Also, in future expansions, if you need to add pools, you might not be able to create them with the proper number of PGs because older pools have far too many.
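
To put a number on the PGs-per-OSD concern, the average can be estimated from the pool definitions. The sketch below uses the `pg_num` and `size` values from the `ceph osd pool ls detail` output quoted later in this thread. The helper name `pg_replicas_per_osd` is made up for this example, and the warning threshold itself depends on the Ceph release, so it is left out.

```python
# (pg_num, replica size) per pool, from `ceph osd pool ls detail` in this thread.
POOLS = {
    "rbd": (64, 2),
    "downloads_data": (250, 2),
    "downloads_metadata": (250, 2),
}


def pg_replicas_per_osd(pools, num_osds):
    """Average number of PG replicas each OSD hosts (hypothetical helper)."""
    total = sum(pg_num * size for pg_num, size in pools.values())
    return total / num_osds


if __name__ == "__main__":
    print(f"7 OSDs: {pg_replicas_per_osd(POOLS, 7):.1f}")
    print(f"8 OSDs: {pg_replicas_per_osd(POOLS, 8):.1f}")
    # Same calculation with the empty rbd pool removed.
    without_rbd = {name: v for name, v in POOLS.items() if name != "rbd"}
    print(f"8 OSDs, no rbd pool: {pg_replicas_per_osd(without_rbd, 8):.1f}")
```

With 8 OSDs this gives 141, which matches the average of the PGS column in the 20 Nov `ceph osd df` output.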
It would still be nice to see the output from those commands I asked about.
The built-in reweighting command, `ceph osd reweight-by-utilization`, might help your data distribution.

On Sun, Nov 12, 2017, 11:41 AM gjprabu <gjprabu@xxxxxxxxxxxx> wrote:

Hi David,

Thanks for your valuable reply. Once backfilling to the new OSD completes, we will consider increasing the replica count as soon as possible. Is it possible to decrease the metadata PG count? If the metadata pool has the same PG count as the data pool, what kind of issues may occur?

Regards
PrabuGJ

---- On Sun, 12 Nov 2017 21:25:05 +0530 David Turner <drakonstein@xxxxxxxxx> wrote ----

What's the output of `ceph df`? That will show whether your PG counts are good or not.  As everyone else has said, the space on the original OSDs can't be expected to free up until the backfill from adding the new OSD has finished.
Nothing in your cluster health indicates that your cluster will be unable to finish this backfilling operation on its own.
You might find this URL helpful in calculating your PG counts: http://ceph.com/pgcalc/  As a side note, it is generally better to keep your PG counts at powers of 2 (16, 64, 256, etc.). When the count is not a power of 2, some of your PGs will take up twice as much space as others. In your case with 250, you have 244 PGs that are the same size and 6 PGs that are twice the size of those 244.  Bumping that up to 256 will even things out.
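
The 244/6 split can be reproduced with a small Python port of Ceph's `ceph_stable_mod` helper, which folds an object's hash onto a PG count that is not a power of 2. This is a sketch under the assumption that object hashes are spread uniformly; `pg_shares` is a hypothetical helper written for this example.

```python
from collections import Counter


def ceph_stable_mod(x, b, bmask):
    """Map hash x onto b bins; bmask is (next power of 2 >= b) - 1."""
    if (x & bmask) < b:
        return x & bmask
    # Hash slots beyond b fold back onto the lower half of the range.
    return x & (bmask >> 1)


def pg_shares(pg_num):
    """Count how many of the power-of-2 hash slots land in each PG."""
    bmask = (1 << (pg_num - 1).bit_length()) - 1
    return Counter(ceph_stable_mod(x, pg_num, bmask)
                   for x in range(bmask + 1))


if __name__ == "__main__":
    shares = pg_shares(250)
    double = sorted(pg for pg, n in shares.items() if n == 2)
    single = [pg for pg, n in shares.items() if n == 1]
    print(f"{len(single)} PGs get one slot, {len(double)} PGs get two")
    print("double-size PGs:", double)
```

For 250 PGs the hash space has 256 slots, so the six slots numbered 250 to 255 fold back onto PGs 122 to 127, which therefore receive twice the data. With 256 PGs every PG gets exactly one slot.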
Assuming that the metadata pool is for a CephFS volume, you do not need nearly so many PGs for that pool. Also, I would recommend changing at least the metadata pool to a replica size of 3. If we can talk you into 3 replicas for everything else, great! But if not, at least do the metadata pool. If you lose an object in the data pool, you just lose that file. If you lose an object in the metadata pool, you might lose access to the entire CephFS volume.
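
One thing to weigh before raising the replica size is raw capacity. The following is a back-of-the-envelope estimate using the 6885 GB of data and 26757 GB of raw capacity from the `ceph df` output at the top of this thread. Those figures are from a later date than this message, so treat them as illustrative; the estimate also ignores metadata and filesystem overhead, and `raw_usage_pct` is a hypothetical helper.

```python
def raw_usage_pct(data_gb, replica_size, raw_capacity_gb):
    """Percent of raw capacity consumed by data_gb stored replica_size times."""
    return 100.0 * data_gb * replica_size / raw_capacity_gb


if __name__ == "__main__":
    DATA_GB = 6885           # downloads_data USED, from `ceph df`
    RAW_CAPACITY_GB = 26757  # GLOBAL SIZE, from `ceph df`
    for size in (2, 3):
        pct = raw_usage_pct(DATA_GB, size, RAW_CAPACITY_GB)
        print(f"replica size {size}: {pct:.1f}% of raw capacity")
```

Roughly 51% of raw capacity at 2 replicas becomes roughly 77% at 3, which leaves little headroom given the uneven distribution discussed in this thread. The metadata pool, at well under 1 GB, costs almost nothing to convert.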

On Sun, Nov 12, 2017, 9:39 AM gjprabu <gjprabu@xxxxxxxxxxxx> wrote:

Hi Cassiano,

Thanks for your valuable feedback. We will wait for some time until the new OSD sync completes. Also, will increasing the PG count solve the issue? In our setup, the PG number for both the data and metadata pools is 250. Is this correct for 7 OSDs with 2 replicas? The currently stored data size is 17TB.

ceph osd df
ID WEIGHT  REWEIGHT SIZE   USE    AVAIL %USE  VAR  PGS
0 3.29749  1.00000  3376G  2814G  562G 83.35 1.23 165
1 3.26869  1.00000  3347G  1923G 1423G 57.48 0.85 152
2 3.27339  1.00000  3351G  1980G 1371G 59.10 0.88 161
3 3.24089  1.00000  3318G  2131G 1187G 64.23 0.95 168
4 3.24089  1.00000  3318G  2998G  319G 90.36 1.34 176
5 3.32669  1.00000  3406G  2476G  930G 72.68 1.08 165
6 3.27800  1.00000  3356G  1518G 1838G 45.24 0.67 166
              TOTAL 23476G 15843G 7632G 67.49         
MIN/MAX VAR: 0.67/1.34  STDDEV: 14.53

ceph osd tree
ID WEIGHT   TYPE NAME            UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 22.92604 root default                                          
-2  3.29749     host intcfs-osd1                                  
0  3.29749         osd.0             up  1.00000          1.00000
-3  3.26869     host intcfs-osd2                                  
1  3.26869         osd.1             up  1.00000          1.00000
-4  3.27339     host intcfs-osd3                                  
2  3.27339         osd.2             up  1.00000          1.00000
-5  3.24089     host intcfs-osd4                                  
3  3.24089         osd.3             up  1.00000          1.00000
-6  3.24089     host intcfs-osd5                                  
4  3.24089         osd.4             up  1.00000          1.00000
-7  3.32669     host intcfs-osd6                                  
5  3.32669         osd.5             up  1.00000          1.00000
-8  3.27800     host intcfs-osd7                                  
6  3.27800         osd.6             up  1.00000          1.00000

ceph osd pool ls detail

pool 0 'rbd' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 flags hashpspool stripe_width 0
pool 3 'downloads_data' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 250 pgp_num 250 last_change 39 flags hashpspool crash_replay_interval 45 stripe_width 0
pool 4 'downloads_metadata' replicated size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 250 pgp_num 250 last_change 36 flags hashpspool stripe_width 0

Regards
Prabu GJ

---- On Sun, 12 Nov 2017 19:20:34 +0530 Cassiano Pilipavicius <cassiano@xxxxxxxxxxx> wrote ----

I am also not an expert, but it looks like you have large data volumes on few PGs. From what I've seen, a PG's data is only deleted from the old OSD once it has been completely copied to the new OSD.
So if one PG holds 100G, for example, the space on the old OSD will be released only when that PG has been fully copied to the new OSD.
If you have a busy cluster/network, it may take a good while. Maybe just wait a little and check from time to time, and the space will eventually be released.

On 11/12/2017 11:44 AM, Sébastien VIGNERON wrote:

_______________________________________________
ceph-users mailing list 
ceph-users@xxxxxxxxxxxxxx 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
I’m not an expert either, so if someone on the list has some ideas about this problem, don’t be shy, share them with us.

For now, my only hypothesis is that the OSD space will be recovered as soon as the recovery process is complete.
Hope everything will get back in order soon (before reaching 95% or above).

I saw some messages on the list about the fstrim tool, which can help reclaim unused free space, but I don’t know if it applies to your case.

Cordialement / Best regards,

Sébastien VIGNERON 
CRIANN, 
Ingénieur / Engineer
Technopôle du Madrillet 
745, avenue de l'Université 
76800 Saint-Etienne du Rouvray - France 
tél. +33 2 32 91 42 91 
fax. +33 2 32 91 42 92 
http://www.criann.fr 
mailto:sebastien.vigneron@xxxxxxxxx
support: support@xxxxxxxxx

On 12 Nov 2017, at 13:29, gjprabu <gjprabu@xxxxxxxxxxxx> wrote:

Hi Sebastien,

Below are the query details. I am not much of an expert and am still learning. The PGs were not in a stuck state before adding the OSD, and they are slowly clearing to active+clean. This morning there were around 53 PGs in active+undersized+degraded+remapped+wait_backfill and now there are only 21, so I hope it is progressing. I am also seeing the space used on the newly added OSD (osd.6) keep increasing.

ID WEIGHT  REWEIGHT SIZE   USE    AVAIL %USE  VAR  PGS
0 3.29749  1.00000  3376G  2814G  562G 83.35 1.23 165  (Available Spaces not reduced after adding new OSD)
1 3.26869  1.00000  3347G  1923G 1423G 57.48 0.85 152
2 3.27339  1.00000  3351G  1980G 1371G 59.10 0.88 161
3 3.24089  1.00000  3318G  2131G 1187G 64.23 0.95 168
4 3.24089  1.00000  3318G  2998G  319G 90.36 1.34 176  (Available Spaces not reduced after adding new OSD)
5 3.32669  1.00000  3406G  2476G  930G 72.68 1.08 165  (Available Spaces not reduced after adding new OSD)
6 3.27800  1.00000  3356G  1518G 1838G 45.24 0.67 166
              TOTAL 23476G 15843G 7632G 67.49
MIN/MAX VAR: 0.67/1.34  STDDEV: 14.53
...
