So just a little update... after replacing the original failed drive, things seem to be progressing a little better. However, I noticed something else odd: looking at 'rados df', the system thinks the data pool holds 32 TB of data, but this is only an 18 TB raw system (quick arithmetic below the output).
pool name      category   KB            objects   clones   degraded   unfound   rd      rd KB     wr         wr KB
data           -          32811540110   894927    0        240445     0         1       0         2720415    4223435021
media_video    -          1             1         0        0          0         2       1         2611361    1177389479
metadata       -          210246        18482     0        4592       1         6970    561296    1253955    19500149
rbd            -          330731965     82018     0        19584      0         26295   1612689   54606042   2127030019
  total used              10915771968   995428
  total avail              6657285104
  total space             17573057072
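For what it's worth, here's the back-of-the-envelope arithmetic I'm going by (assuming the KB column above is kilobytes of logical data before replication, and that total used/space are raw kilobytes on disk):

echo '32811540110 / 1024^3' | bc -l   # data pool   -> ~30.6 TiB reported
echo '10915771968 / 1024^3' | bc -l   # total used  -> ~10.2 TiB actually consumed
echo '17573057072 / 1024^3' | bc -l   # total space -> ~16.4 TiB raw

So the data pool alone is being reported at roughly twice the raw capacity of the whole cluster.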
Any recommendations on how I can sort out why it thinks it has way more data in that pool than it actually does?
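In case it helps frame suggestions, these are the cross-checks I was planning to try next (with clients still disconnected); I'm guessing at the right approach here, so corrections welcome:

rados -p data ls | wc -l    # count the objects actually in the data pool vs. the 894927 reported above
ceph osd dump | grep pool   # double-check pg_num and rep size on each pool
ceph pg dump > pg_dump.new  # fresh per-PG stats to compare against the earlier pastebin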
Thanks in advance.
Berant
On Mon, May 6, 2013 at 4:43 PM, Berant Lemmenes <berant@xxxxxxxxxxxx> wrote:
TL;DR: bobtail Ceph cluster unable to finish rebalance after drive failure, usage increasing even with no clients connected.

I've been running a test bobtail cluster for a couple of months and it's been working great. Last week I had a drive die and rebalance; during that time another OSD crashed. All was still well, but as the second OSD had just crashed I restarted it, made sure it re-entered properly and that rebalancing continued, and then I went to bed.

Waking up in the morning I found 2 OSDs were 100% full and two more were almost full. To get out of the situation I decreased the replication size from 3 to 2, and then also carefully (I believe carefully enough) removed some PGs in order to start things up again (rough commands are below, after the osd tree output).

I got things going again and things appeared to be rebalancing correctly; however, it got to the point where it stopped at 1420 PGs active+clean and the rest were stuck backfilling.

Looking at the PG dump, all of the PGs that were having issues were on osd.1. So I stopped it, verified things were continuing to rebalance after it was down/out, and then formatted osd.1's disk and put it back in.

Since then I've not been able to get the cluster back to HEALTHY, due to a combination of OSDs dying while recovering (not due to disk failure, just crashes) as well as the used space in the cluster increasing abnormally.

Right now I have all the clients disconnected and just the cluster rebalancing, and the usage is increasing to the point where I have 12 TB used when I have only < 3 TB in cephfs and 2 TB in a single RBD image (replication size 2). I've since shut down the cluster so I don't fill it up.

My crushmap is the default; here are the usual suspects. I'm happy to provide additional information.

pg dump: http://pastebin.com/LUyu6Z09

ceph osd tree (osd.8 is the failed drive, which I will be replacing tonight; the reweight on osd.1 and osd.6 was done via reweight-by-utilization):
# id    weight  type name       up/down reweight
-1      19.5    root default
-3      19.5            rack unknownrack
-2      19.5                    host ceph-test
0       1.5                             osd.0   up      1
1       1.5                             osd.1   up      0.6027
2       1.5                             osd.2   up      1
3       1.5                             osd.3   up      1
4       1.5                             osd.4   up      1
5       2                               osd.5   up      1
6       2                               osd.6   up      0.6676
7       2                               osd.7   up      1
8       2                               osd.8   down    0
9       2                               osd.9   up      1
10      2                               osd.10  up      1
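For reference, the replication size change and the reweighting mentioned above were done roughly like this (reconstructed from memory, so the exact invocations may have differed slightly):

ceph osd pool set data size 2
ceph osd pool set metadata size 2
ceph osd pool set rbd size 2
ceph osd pool set media_video size 2
ceph osd reweight-by-utilization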
ceph -s:

health HEALTH_WARN 24 pgs backfill; 85 pgs backfill_toofull; 29 pgs backfilling; 40 pgs degraded; 1 pgs recovery_wait; 121 pgs stuck unclean; recovery 109306/2091318 degraded (5.227%); recovering 3 o/s, 43344KB/s; 2 near full osd(s); noout flag(s) set
monmap e2: 1 mons at {a=10.200.200.21:6789/0}, election epoch 1, quorum 0 a
osdmap e16251: 11 osds: 10 up, 10 in
pgmap v3145187: 1536 pgs: 1414 active+clean, 6 active+remapped+wait_backfill, 10 active+remapped+wait_backfill+backfill_toofull, 4 active+degraded+wait_backfill+backfill_toofull, 22 active+remapped+backfilling, 42 active+remapped+backfill_toofull, 7 active+degraded+backfilling, 17 active+degraded+backfill_toofull, 1 active+recovery_wait+remapped, 4 active+degraded+remapped+wait_backfill+backfill_toofull, 8 active+degraded+remapped+backfill_toofull, 1 active+clean+scrubbing+deep; 31607 GB data, 12251 GB used, 4042 GB / 16293 GB avail; 109306/2091318 degraded (5.227%); recovering 3 o/s, 43344KB/s
mdsmap e3363: 1/1/1 up {0=a=up:active}

rep size:

pool 0 'data' rep size 2 crush_ruleset 0 object_hash rjenkins pg_num 384 pgp_num 384 last_change 897 owner 0 crash_replay_interval 45
pool 1 'metadata' rep size 2 crush_ruleset 1 object_hash rjenkins pg_num 384 pgp_num 384 last_change 13364 owner 0
pool 2 'rbd' rep size 2 crush_ruleset 2 object_hash rjenkins pg_num 384 pgp_num 384 last_change 13208 owner 0
pool 4 'media_video' rep size 2 crush_ruleset 0 object_hash rjenkins pg_num 384 pgp_num 384 last_change 890 owner 0

ceph.conf:

[global]
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
osd pool default size = 3
osd pool default min size = 1
osd pool default pg num = 366
osd pool default pgp num = 366

[osd]
osd journal size = 1000
journal_aio = true
#osd recovery max active = 10
osd mkfs type = xfs
osd mkfs options xfs = -f -i size=2048
osd mount options xfs = inode64,noatime

[mon.a]
host = ceph01
mon addr = 10.200.200.21:6789

[osd.0]
# 1.5 TB SATA
host = ceph01
devs = /dev/sdc
weight = 1.5

[osd.1]
# 1.5 TB SATA
host = ceph01
devs = /dev/sdd
weight = 1.5

[osd.2]
# 1.5 TB SATA
host = ceph01
devs = /dev/sdg
weight = 1.5

[osd.3]
# 1.5 TB SATA
host = ceph01
devs = /dev/sdj
weight = 1.5

[osd.4]
# 1.5 TB SATA
host = ceph01
devs = /dev/sdk
weight = 1.5

[osd.5]
# 2 TB SAS
host = ceph01
devs = /dev/sdf
weight = 2

[osd.6]
# 2 TB SAS
host = ceph01
devs = /dev/sdh
weight = 2

[osd.7]
# 2 TB SAS
host = ceph01
devs = /dev/sda
weight = 2

[osd.8]
# 2 TB SAS
host = ceph01
devs = /dev/sdb
weight = 2

[osd.9]
# 2 TB SAS
host = ceph01
devs = /dev/sdi
weight = 2

[osd.10]
# 2 TB SAS
host = ceph01
devs = /dev/sde
weight = 2

[mds.a]
host = ceph01