Re: CephFS very unstable with many small files

Sounds like you just need more RAM on your MDS. Ours have 256 GB each, and the OSD nodes have 128 GB each. Networking is 2x25GbE.


We are on Luminous 12.2.1 with BlueStore, and use CephFS for HPC with roughly 500 compute nodes. As part of our acceptance testing we stress-tested small files, with up to 2M per directory, and encountered no problems.
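
For anyone wanting to generate a comparable load, the gist of that kind of test is simply creating huge numbers of tiny files in one directory; a minimal sketch of the idea (placeholder path and count, the actual acceptance tests were more involved):

    # minimal sketch: fill one directory with N tiny files
    DIR=/cephfs/stress/dir01   # placeholder path
    N=2000000                  # files per directory
    mkdir -p "$DIR"
    i=0
    while [ "$i" -lt "$N" ]; do
        : > "$DIR/f$i"         # an empty file is enough for metadata load; use dd for real payloads
        i=$((i+1))
    done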


From: ceph-users <ceph-users-bounces@xxxxxxxxxxxxxx> on behalf of Oliver Freyermuth <freyermuth@xxxxxxxxxxxxxxxxxx>
Sent: Monday, 26 February 2018 3:45:59 AM
To: ceph-users@xxxxxxxxxxxxxx
Subject: CephFS very unstable with many small files
 
Dear Cephalopodians,

In preparation for production, we have run very successful tests with large sequential data,
and have just now run a stress test creating many small files on CephFS.

We use a replicated metadata pool (4 SSDs, 4 replicas) and a data pool spread over 6 hosts with 32 OSDs each, running EC k=4 m=2.
Compression is activated (aggressive, snappy). All BlueStore, LVM, Luminous 12.2.3.
There are (at the moment) only two MDSs: one active, the other standby.

For the test, we had 1120 client processes on 40 client machines (all ceph-fuse!) each extract a tarball with 150k small files
( http://distfiles.gentoo.org/snapshots/portage-latest.tar.xz ) into a separate subdirectory.
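
(Per client machine, the test boils down to something like the following - the 28-way split is just 1120/40, and the paths here are illustrative, not our exact ones:)

    # rough per-machine sketch of the stress test
    TARBALL=/scratch/portage-latest.tar.xz
    BASE=/cephfs/stresstest/$(hostname)
    for n in $(seq 1 28); do                    # 28 processes x 40 machines = 1120
        mkdir -p "$BASE/job$n"
        tar -xJf "$TARBALL" -C "$BASE/job$n" &  # ~150k small files per extraction
    done
    wait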

Things started out rather well (though expectedly slow). Due to https://github.com/ceph/ceph/pull/18624 we had to increase
mds_log_max_segments => 240
mds_log_max_expiring => 160
and we adjusted mds_cache_memory_limit to 4 GB.

The MDS machine has 32 GB of RAM, but it also runs 2 OSDs (for the metadata pool), so we have been careful with the cache
(e.g. due to http://tracker.ceph.com/issues/22599 ).
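
(Expressed as a ceph.conf snippet, our changes amount to roughly the following - note that mds_cache_memory_limit takes bytes:)

    [mds]
        mds_log_max_segments   = 240
        mds_log_max_expiring   = 160
        mds_cache_memory_limit = 4294967296   # 4 GiB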

After a while, we tested MDS failover and realized we entered a flip-flop situation between the two MDS nodes we have.
Increasing mds_beacon_grace to 240 helped with that.
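
(For reference, the change amounts to something like the lines below; as far as I understand, the monitors also consult this value when deciding whether to mark an MDS as laggy, so it should be set for them as well and persisted in ceph.conf:)

    # runtime change, repeated per daemon, in addition to the ceph.conf entry
    ceph tell mds.<id> injectargs '--mds_beacon_grace=240'
    ceph tell mon.<id> injectargs '--mds_beacon_grace=240'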

Now, with about 100,000,000 objects written, we are in a disaster situation.
First off, the MDS could not restart anymore - it required >40 GB of memory, which (together with the 2 OSDs on the MDS host) exceeded RAM and swap.
It tried to recover and was OOM-killed shortly after. Replay was reasonably fast, but rejoin took many minutes:
2018-02-25 04:16:02.299107 7fe20ce1f700  1 mds.0.17657 rejoin_start
2018-02-25 04:19:00.618514 7fe20ce1f700  1 mds.0.17657 rejoin_joint_start
and finally, 5 minutes later, OOM.

I stopped half of the stress-test tars, which did not help - then I rebooted half of the clients, which did help and let the MDS recover just fine.
So it seems there were too many client caps for the MDS to handle. I'm unsure why "tar" would cause so many open file handles.
Is there anything that can be configured to prevent this from happening?
This time I only lost some "stress test data", but later it might be users' data...
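
(In case it helps others: the per-client cap counts can be checked on the active MDS via the admin socket, e.g.:)

    # on the active MDS host: list client sessions and their cap counts
    ceph daemon mds.<name> session ls | grep -E '"id"|"num_caps"'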


In parallel, I had reinstalled one OSD host.
It was backfilling well, but now, <24 hours later and before backfill has finished, several OSD hosts are entering an OOM condition.
Our OSD hosts have 64 GB of RAM for 32 OSDs, which should be fine with the default BlueStore cache size of 1 GB. However, the processes seem to use much more,
up to several GB each, until memory is exhausted. They then become sluggish, get kicked out of the cluster, come back, and finally at some point they are OOM-killed.
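
(If someone else runs into this: the BlueStore cache can be capped explicitly per OSD, although as far as I understand the cache limit does not bound the total RSS of an OSD process; the values below are only an illustration:)

    [osd]
        # defaults in 12.2.x: 1 GB for HDD-backed, 3 GB for SSD-backed OSDs
        bluestore_cache_size_hdd = 536870912    # 512 MiB
        bluestore_cache_size_ssd = 1073741824   # 1 GiB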

Now, I have restarted some OSD processes and hosts, which helped to reduce the memory usage - but now I have some OSDs crashing continuously,
leading to PG unavailability and preventing recovery from completing.
I have reported a ticket about that, with stacktrace and log:
http://tracker.ceph.com/issues/23120
This might well be a consequence of a previous OOM killer condition.

However, my final questions after these ugly experiences are:
Has anybody ever stress-tested CephFS with many small files?
Are these issues known? Can special configuration help?
Are the memory issues known? Are there solutions?

We don't plan to use Ceph for many small files, but we don't have full control over our users, which is why we wanted to test this "worst case" scenario.
It would be really bad if we lost a production filesystem due to such a situation, so the plan was to test now and learn what happens before we enter production.
As of now, this looks really bad, and I'm not sure the cluster will ever recover.
I'll give it some more time, but we'll likely kill off all remaining clients next week and see what happens, and in the worst case recreate the Ceph cluster.

Cheers,
        Oliver

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
