Re: How to recover from OSDs full in small cluster

Lukáš Kubín <lukas.kubin@xxxxxxxxx> · Thu, 18 Feb 2016 21:39:39 +0000

Hi, we've managed to free up some space in our cluster. Now I would like to restart the 2 full OSDs. As they're completely full, I probably need to delete some data from them.

I would like to ask: is it OK to delete all PG directories (e.g. all subdirectories in /var/lib/ceph/osd/ceph-5/current/) and then start the stopped OSD daemon? This seems to be the simplest approach, but I'm not sure whether it is correct, i.e. whether Ceph can handle such a situation. (I've noticed similar advice here: http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/ )

Another option, as suggested by Jan, is to remove the OSDs from the cluster and recreate them. That involves more steps, though, and perhaps some additional safety prerequisites (nobackfill?) to prevent further data movement or more full disks while removing and re-adding them.
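For reference, the removal half of that second option could look roughly like the sketch below. It only prints commands and executes nothing. The ordering follows the documented manual OSD removal steps (out, crush remove, auth del, rm); wrapping them in the nobackfill flag is the open question from the paragraph above, not a confirmed recommendation. The helper name osd_removal_commands is hypothetical, and the recreate step is left as a placeholder because it depends on the deployment tooling.

```python
def osd_removal_commands(osd_id, pause_backfill=True):
    """Return the shell commands for removing one OSD (dry run, nothing is executed).

    pause_backfill is an assumption taken from the "(nobackfill?)" question:
    it brackets the removal with the nobackfill cluster flag.
    """
    commands = []
    if pause_backfill:
        commands.append("ceph osd set nobackfill")
    commands += [
        f"ceph osd out {osd_id}",               # mark the OSD out of the data distribution
        f"ceph osd crush remove osd.{osd_id}",  # drop it from the CRUSH map
        f"ceph auth del osd.{osd_id}",          # delete its authentication key
        f"ceph osd rm {osd_id}",                # remove the OSD id from the cluster
        "# recreate and start the OSD here (tool-specific, not shown)",
    ]
    if pause_backfill:
        commands.append("ceph osd unset nobackfill")
    return commands


# The two stopped OSDs from the status output are osd.4 and osd.5.
for osd in (4, 5):
    print("\n".join(osd_removal_commands(osd)))
    print()
```

Printing the sequence first makes it easy to review the order before typing anything on a cluster that is already short on space.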

Thanks!

Lukas

Current status:

[root@ceph1 ~]# ceph osd stat
     osdmap e1107: 12 osds: 10 up, 10 in; 29 remapped pgs
[root@ceph1 ~]# ceph pg stat
v21691144: 640 pgs: 503 active+clean, 29 active+remapped, 108 active+undersized+degraded; 1892 GB data, 3476 GB used, 1780 GB / 5256 GB avail; 0 B/s rd, 323 kB/s wr, 49 op/s; 42998/504482 objects degraded (8.523%); 10304/504482 objects misplaced (2.042%)
[root@ceph1 ~]# df -h|grep osd
/dev/sdg1                554G  383G  172G  70% /var/lib/ceph/osd/ceph-3
/dev/sdf1                554G  401G  154G  73% /var/lib/ceph/osd/ceph-2
/dev/sde1                554G  381G  174G  69% /var/lib/ceph/osd/ceph-0
/dev/sdb1                275G  275G   20K 100% /var/lib/ceph/osd/ceph-5
/dev/sdd1                554G  554G   20K 100% /var/lib/ceph/osd/ceph-4
/dev/sdc1                554G  359G  196G  65% /var/lib/ceph/osd/ceph-1
[root@ceph1 ~]# ceph osd tree
ID WEIGHT  TYPE NAME      UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 5.93991 root default
-2 2.96996     host ceph1
 0 0.53999         osd.0       up  1.00000          1.00000
 1 0.53999         osd.1       up  1.00000          1.00000
 2 0.53999         osd.2       up  1.00000          1.00000
 3 0.53999         osd.3       up  1.00000          1.00000
 4 0.53999         osd.4     down        0          1.00000
 5 0.26999         osd.5     down        0          1.00000
-3 2.96996     host ceph2
 6 0.53999         osd.6       up  1.00000          1.00000
 7 0.53999         osd.7       up  1.00000          1.00000
 8 0.53999         osd.8       up  1.00000          1.00000
 9 0.53999         osd.9       up  1.00000          1.00000
10 0.53999         osd.10      up  1.00000          1.00000
11 0.26999         osd.11      up  1.00000          1.00000
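A rough headroom check on the numbers above, as a sketch only: with size=2, fully replicated data needs about twice the logical data volume in raw space. The figures are copied from the ceph pg stat output in this message and the ceph -w output quoted below. The 554G and 275G sizes for the two stopped OSDs come from the df -h listing and are treated as if they were in the same units as the Ceph totals, which is an approximation. The function name raw_utilization is hypothetical.

```python
def raw_utilization(data_gb, replica_size, raw_capacity_gb):
    """Fraction of raw capacity needed once all replicas are in place."""
    return data_gb * replica_size / raw_capacity_gb


REPLICA_SIZE = 2        # both pools use size=2
CAPACITY_10_OSDS = 5256  # GB, the 10 OSDs that are currently "in"
CAPACITY_12_OSDS = 5256 + 554 + 275  # approximate, adds osd.4 and osd.5 back

# Before freeing space: 2390 GB of data on 10 OSDs.
earlier = raw_utilization(2390, REPLICA_SIZE, CAPACITY_10_OSDS)
# After freeing space: 1892 GB of data on the same 10 OSDs.
current = raw_utilization(1892, REPLICA_SIZE, CAPACITY_10_OSDS)
# After freeing space, with both stopped OSDs back in the cluster.
restored = raw_utilization(1892, REPLICA_SIZE, CAPACITY_12_OSDS)

for label, value in [("earlier", earlier), ("current", current), ("restored", restored)]:
    print(f"{label:9s} {value:.1%} of raw capacity needed")
```

These are cluster-wide averages. Individual OSDs can sit well above the average when placement is uneven, which is what filled osd.4 and osd.5 in the first place.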

On Wed, Feb 17, 2016 at 9:43 PM Lukáš Kubín <lukas.kubin@xxxxxxxxx> wrote:
Hi, I'm running a very small setup of 2 nodes with 6 OSDs each. There are 2 pools, each of size=2. Today, one of our OSDs got full and another 2 near full. The cluster turned into ERR state. I noticed uneven space distribution among the OSD drives, between 70 and 100 percent. I realized there was a low number of PGs in those 2 pools (128 each) and increased one of them to 512, expecting magic to happen and the space to be redistributed evenly.
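To put the PG counts in that paragraph into perspective, here is a small arithmetic sketch of the average number of PG replicas each OSD hosts before and after the pg_num increase. The comparison target of roughly 100 PG replicas per OSD is a commonly cited rule of thumb and an assumption here, not something stated in this thread. The function name pg_replicas_per_osd is hypothetical.

```python
def pg_replicas_per_osd(pg_nums, replica_size, osd_count):
    """Average number of PG replicas hosted by each OSD."""
    return sum(pg_nums) * replica_size / osd_count


OSD_COUNT = 12    # 2 nodes with 6 OSDs each
REPLICA_SIZE = 2  # both pools use size=2

# Two pools with 128 PGs each, as originally configured.
before = pg_replicas_per_osd([128, 128], REPLICA_SIZE, OSD_COUNT)
# One pool raised to 512 PGs, the other left at 128.
after = pg_replicas_per_osd([512, 128], REPLICA_SIZE, OSD_COUNT)

print(f"before: {before:.1f} PG replicas per OSD")
print(f"after:  {after:.1f} PG replicas per OSD")
```

Fewer PGs per OSD means each PG carries a larger share of the data, so placement is coarser and per-disk usage varies more. Raising pg_num evens this out only after the data has been moved, and that movement itself needs free space on the target OSDs.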

Well, something happened: another OSD became full during the redistribution, and the cluster stopped both OSDs and marked them down. After some hours the remaining drives partially rebalanced and the cluster got to WARN state.

I've deleted 3 placement group directories from the filesystem of one of the full OSDs, which allowed me to start it up again. Soon, however, this drive became full again.

So now, 2 of 12 OSDs are down, the cluster is in WARN state and I have no drives to add.

Is there a way to get out of this situation without adding OSDs? I will attempt to release some space; I'm just waiting for a colleague to identify RBD volumes (OpenStack images and volumes) which can be deleted.

Thank you.

Lukas

This is my cluster state now:

[root@compute1 ~]# ceph -w
    cluster d35174e9-4d17-4b5e-80f2-02440e0980d5
     health HEALTH_WARN
            10 pgs backfill_toofull
            114 pgs degraded
            114 pgs stuck degraded
            147 pgs stuck unclean
            114 pgs stuck undersized
            114 pgs undersized
            1 requests are blocked > 32 sec
            recovery 56923/640724 objects degraded (8.884%)
            recovery 29122/640724 objects misplaced (4.545%)
            3 near full osd(s)
     monmap e3: 3 mons at {compute1=10.255.242.14:6789/0,compute2=10.255.242.15:6789/0,compute3=10.255.242.16:6789/0}
            election epoch 128, quorum 0,1,2 compute1,compute2,compute3
     osdmap e1073: 12 osds: 10 up, 10 in; 39 remapped pgs
      pgmap v21609066: 640 pgs, 2 pools, 2390 GB data, 309 kobjects
            4365 GB used, 890 GB / 5256 GB avail
            56923/640724 objects degraded (8.884%)
            29122/640724 objects misplaced (4.545%)
                 493 active+clean
                 108 active+undersized+degraded
                  29 active+remapped
                   6 active+undersized+degraded+remapped+backfill_toofull
                   4 active+remapped+backfill_toofull

[root@ceph1 ~]# df|grep osd
/dev/sdg1               580496384 500066812  80429572  87% /var/lib/ceph/osd/ceph-3
/dev/sdf1               580496384 502131428  78364956  87% /var/lib/ceph/osd/ceph-2
/dev/sde1               580496384 506927100  73569284  88% /var/lib/ceph/osd/ceph-0
/dev/sdb1               287550208 287550188        20 100% /var/lib/ceph/osd/ceph-5
/dev/sdd1               580496384 580496364        20 100% /var/lib/ceph/osd/ceph-4
/dev/sdc1               580496384 478675672 101820712  83% /var/lib/ceph/osd/ceph-1

[root@ceph2 ~]# df|grep osd
/dev/sdf1               580496384 448689872 131806512  78% /var/lib/ceph/osd/ceph-7
/dev/sdb1               287550208 227054336  60495872  79% /var/lib/ceph/osd/ceph-11
/dev/sdd1               580496384 464175196 116321188  80% /var/lib/ceph/osd/ceph-10
/dev/sdc1               580496384 489451300  91045084  85% /var/lib/ceph/osd/ceph-6
/dev/sdg1               580496384 470559020 109937364  82% /var/lib/ceph/osd/ceph-9
/dev/sde1               580496384 490289388  90206996  85% /var/lib/ceph/osd/ceph-8
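The df listings above can be checked against the full and near-full thresholds with a short script. This is a sketch under assumptions: the ratios 0.85 and 0.95 are the usual defaults for mon_osd_nearfull_ratio and mon_osd_full_ratio and may differ on this cluster. Ceph also evaluates its own per-OSD statistics rather than df, so the result is only an approximation. The function name classify_osds is hypothetical.

```python
DF_OUTPUT = """\
/dev/sdg1               580496384 500066812  80429572  87% /var/lib/ceph/osd/ceph-3
/dev/sdf1               580496384 502131428  78364956  87% /var/lib/ceph/osd/ceph-2
/dev/sde1               580496384 506927100  73569284  88% /var/lib/ceph/osd/ceph-0
/dev/sdb1               287550208 287550188        20 100% /var/lib/ceph/osd/ceph-5
/dev/sdd1               580496384 580496364        20 100% /var/lib/ceph/osd/ceph-4
/dev/sdc1               580496384 478675672 101820712  83% /var/lib/ceph/osd/ceph-1
/dev/sdf1               580496384 448689872 131806512  78% /var/lib/ceph/osd/ceph-7
/dev/sdb1               287550208 227054336  60495872  79% /var/lib/ceph/osd/ceph-11
/dev/sdd1               580496384 464175196 116321188  80% /var/lib/ceph/osd/ceph-10
/dev/sdc1               580496384 489451300  91045084  85% /var/lib/ceph/osd/ceph-6
/dev/sdg1               580496384 470559020 109937364  82% /var/lib/ceph/osd/ceph-9
/dev/sde1               580496384 490289388  90206996  85% /var/lib/ceph/osd/ceph-8
"""

NEARFULL_RATIO = 0.85  # assumed default mon_osd_nearfull_ratio
FULL_RATIO = 0.95      # assumed default mon_osd_full_ratio


def classify_osds(df_text):
    """Map each OSD mount name to (used ratio, state) from plain df output."""
    result = {}
    for line in df_text.splitlines():
        fields = line.split()
        if len(fields) != 6:
            continue  # skip anything that is not a six-column df row
        total_kb, used_kb = int(fields[1]), int(fields[2])
        name = fields[5].rsplit("/", 1)[-1]  # e.g. "ceph-3"
        ratio = used_kb / total_kb
        if ratio >= FULL_RATIO:
            state = "full"
        elif ratio >= NEARFULL_RATIO:
            state = "nearfull"
        else:
            state = "ok"
        result[name] = (ratio, state)
    return result


states = classify_osds(DF_OUTPUT)
for name, (ratio, state) in sorted(states.items(), key=lambda kv: -kv[1][0]):
    print(f"{name:8s} {ratio:6.1%}  {state}")
```

Computing the ratio from the raw block counts avoids relying on the Use% column, which df rounds up. That matters for drives sitting just under a threshold, such as ceph-6 and ceph-8 here.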

[root@ceph2 ~]# ceph df
GLOBAL:
    SIZE      AVAIL     RAW USED     %RAW USED
    5256G      890G        4365G         83.06
POOLS:
    NAME       ID     USED      %USED     MAX AVAIL     OBJECTS
    glance     6      1714G     32.61          385G      219579
    cinder     7       676G     12.86          385G       97488

[root@ceph2 ~]# ceph osd pool get glance pg_num
pg_num: 512
[root@ceph2 ~]# ceph osd pool get cinder pg_num
pg_num: 128
