Excessive OSD memory use on adding new OSD's, cluster will not start.

Mark Dignam <mark.dignam@xxxxxxxxxxxx> · Tue, 5 Jan 2016 16:11:34 +0000

Hello all. I'm a medium-term user of Ceph and an avid reader of this list for its hints and tricks, but a first-time poster…

I have a working cluster that has been operating for around 18 months across various versions, and we decided we knew enough about how it worked to depend on it for long-term storage.

The basic storage nodes are not high-end but quite capable: Supermicro-based, dual-AMD boards with 32 GB of RAM and 24 OSDs attached to each. There are 9 nodes with a mix of 2 TB and 3 TB drives and a single OSD per drive, 216 OSDs in total. All run Ubuntu 14.04 LTS with ceph version 0.80.10, which is the default for Ubuntu's "Trusty Tahr". All nodes boot from a 120 GB SSD (80 GB OS, 40 GB swap), all journals are stored on the spinning storage, and everything Ceph-related is installed via ceph-deploy.

I have a different CRUSH ruleset for the production OSDs, so that new (or flapping) OSDs don't automatically get added to the pool.

Normal running memory use was 24 GB of RAM (out of 32 GB) in userland, so it wasn't quick, but it worked well enough for our storage needs, with a plan for a RAM upgrade when needed.

Six of the OSDs went “near full” over the Christmas break, so the plan was to add two nodes on our return to work on 4 January.

That turned out… badly.

On adding the new nodes, "ceph -w" reported the normal backfilling when I moved the OSDs into the correct place in the CRUSH map... and then three of the nodes started gobbling RAM like mad, hitting swap, and eventually dropping all their OSDs, which then cascaded to the rest of the nodes. The end result was all the OSDs being marked out and moving back into the default CRUSH map location.

Now, 24 hours later, despite my best efforts and reading all the hints and tricks for reducing memory usage, as soon as I move more than three OSDs into the production HDD ruleset, RAM is gobbled up and the machine(s) die with load averages in the thousands, producing funky messages in syslog and generally being unhappy.

We upgraded 7 of the nodes to 48 GB of RAM (what we had lying around) and reduced the last nodes to 12 OSDs each, moving the remaining 24 OSDs to the two new nodes. Still no joy.
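
For a rough sense of scale, the RAM available per OSD in each node layout can be worked out from the figures above. This is only a back-of-the-envelope sketch: it divides total RAM evenly and ignores the OS and page cache, and the "reduced" row assumes the 12-OSD nodes kept their original 32 GB, which the description above does not state explicitly.

```python
# RAM available per OSD for the node layouts described in the post.
# The "reduced" row ASSUMES the 12-OSD nodes still have 32 GB of RAM;
# that is not stated explicitly and is only an illustration.
layouts = [
    ("original node", 32, 24),
    ("upgraded node", 48, 24),
    ("reduced node (assumed 32 GB)", 32, 12),
]


def ram_per_osd_gb(ram_gb, osd_count):
    """Total RAM divided evenly across the OSD daemons on one node."""
    return ram_gb / osd_count


for name, ram_gb, osd_count in layouts:
    print(f"{name}: {ram_per_osd_gb(ram_gb, osd_count):.2f} GB per OSD")
```

On these numbers the original layout leaves roughly 1.33 GB per OSD and the upgraded nodes 2 GB per OSD, before anything else on the machine is accounted for.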

What I ended up with in ceph.conf:

[global]
fsid = xxxxxxxxxxxxxxxxxxxxxxxx
mon_initial_members = ceph-iscsi01, ceph-iscsi02
mon_host = 10.201.4.198,10.201.4.199
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
public network = 10.201.4.0/24
cluster network = 10.201.248.0/24
osd map message max = 5
[osd]
osd target transaction size = 50
osd recovery max active = 1
osd max backfills = 1
osd op threads = 1
osd disk threads = 1
osd map cache size = 4
osd map cache bl size = 1
osd map max advance = 1
osd map share max epochs = 1
osd pg epoch persisted max stale = 1
osd backfill scan min = 2
osd backfill scan max = 4
osd_min_pg_log_entries = 100
osd_max_pg_log_entries = 500
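
The settings above mix two spellings of option names (spaces and underscores). As far as I know Ceph treats the two as interchangeable, but that is an assumption here, so check it against the documentation for your version. A minimal Python sketch for normalising an excerpt of the [osd] section to one spelling, so the tuned values are easier to compare across nodes (the helper name normalized_options is made up for illustration):

```python
import configparser

# Excerpt of the [osd] section from the post, mixing both key spellings.
CONF_TEXT = """
[osd]
osd recovery max active = 1
osd max backfills = 1
osd map cache size = 4
osd_min_pg_log_entries = 100
osd_max_pg_log_entries = 500
"""


def normalized_options(text, section):
    """Return {option_name_with_underscores: value} for one section.

    Hypothetical helper: it only rewrites spaces in key names to
    underscores and does not validate the options against Ceph.
    """
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {key.replace(" ", "_"): value for key, value in parser.items(section)}


osd_options = normalized_options(CONF_TEXT, "osd")
for name, value in sorted(osd_options.items()):
    print(f"{name} = {value}")
```

Values come back as strings, so convert them with int() before comparing numerically.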

It's not a PID/thread issue: threads generally sit around 28k-29k, and limits.conf has a 64k open file limit set for both root and ceph-admin.

I can see all the data on the OSDs, and I'm prepared to do a recovery (there were five RBD images on there, all XFS, so hopefully it is just a matter of gluing the images back together). That could take a while, which is OK.

What's concerning me is: what have I done wrong, and once I rebuild the cluster after doing the recovery, will it happen again when some of the OSDs get over 85% used?

Thanks in advance for the help.

Mark.

Mark Dignam
Technical Director

t: +61 8 6141 1011
f: +61 8 9499 4083
e: mark.dignam@xxxxxxxxxxxx
w: www.dctwo.com.au

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com