I second Stijn's question for more details, also on the stress testing. Did you "only" have each node write 2M files per directory, or each "job", i.e. nodes*(number of cores per node) processes? Do you have monitoring of the memory usage? Is the large amount of RAM on the MDS actually used? Did you increase the mds_cache_memory_limit setting?

On 26.02.2018 at 08:15, Linh Vu wrote:
> Sounds like you just need more RAM on your MDS. Ours have 256GB each, and the OSD nodes have 128GB each. Networking is 2x25GbE.
>
>
> We are on Luminous 12.2.1, BlueStore, and use CephFS for HPC, with about 500-ish compute nodes. We have done stress testing with small files up to 2M per directory as part of our acceptance testing, and encountered no problem.
>
> ----------------------------------------------------------------------------
> *From:* ceph-users <ceph-users-bounces@xxxxxxxxxxxxxx> on behalf of Oliver Freyermuth <freyermuth@xxxxxxxxxxxxxxxxxx>
> *Sent:* Monday, 26 February 2018 3:45:59 AM
> *To:* ceph-users@xxxxxxxxxxxxxx
> *Subject:* CephFS very unstable with many small files
>
> Dear Cephalopodians,
>
> in preparation for production, we have run very successful tests with large sequential data,
> and just now a stress test creating many small files on CephFS.
>
> We use a replicated metadata pool (4 SSDs, 4 replicas) and a data pool spanning 6 hosts with 32 OSDs each, running with EC k=4 m=2.
> Compression is activated (aggressive, snappy). All BlueStore, LVM, Luminous 12.2.3.
> There are (at the moment) only two MDS's, one active, the other standby.
>
> For the test, we had 1120 client processes on 40 client machines (all cephfs-fuse!) each extract a tarball with 150k small files
> ( http://distfiles.gentoo.org/snapshots/portage-latest.tar.xz ) into a separate subdirectory.
>
> Things started out rather well (but expectedly slow); we had to increase
> mds_log_max_segments => 240
> mds_log_max_expiring => 160
> due to https://github.com/ceph/ceph/pull/18624
> and adjusted mds_cache_memory_limit to 4 GB.
>
> Even though the MDS machine has 32 GB, it is also running 2 OSDs (for metadata), so we have been careful with the cache
> (e.g. due to http://tracker.ceph.com/issues/22599 ).
>
> After a while, we tested MDS failover and realized we had entered a flip-flop situation between the two MDS nodes we have.
> Increasing mds_beacon_grace to 240 helped with that.
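
As a side note, collected in one place, the MDS-related settings mentioned above would look roughly like this in ceph.conf. The values are simply the ones quoted in this thread (not a recommendation), and since the monitors also consult mds_beacon_grace when deciding whether an MDS is laggy, putting that one in [global] is probably safer:

  [mds]
  mds_log_max_segments  = 240
  mds_log_max_expiring  = 160
  mds_cache_memory_limit = 4294967296  # 4 GB; this limits the cache only, actual RSS will be higher

  [global]
  mds_beacon_grace = 240

The cache limit can also be changed at runtime, e.g. (daemon name is a placeholder):

  ceph tell mds.<name> injectargs '--mds_cache_memory_limit=4294967296'
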
>
> Now, with about 100,000,000 objects written, we are in a disaster situation.
> First off, the MDS could not restart anymore - it required >40 GB of memory, which (together with the 2 OSDs on the MDS host) exceeded RAM and swap.
> So it tried to recover and OOMed quickly after. Replay was reasonably fast, but rejoin took many minutes:
> 2018-02-25 04:16:02.299107 7fe20ce1f700 1 mds.0.17657 rejoin_start
> 2018-02-25 04:19:00.618514 7fe20ce1f700 1 mds.0.17657 rejoin_joint_start
> and finally, 5 minutes later, OOM.
>
> I stopped half of the stress-test tars, which did not help - then I rebooted half of the clients, which did help and let the MDS recover just fine.
> So it seems there were too many client caps for the MDS to handle. I'm unsure why "tar" would cause so many open file handles.
> Is there anything that can be configured to prevent this from happening?
> Now, I only lost some "stress test data", but later, it might be users' data...
>
>
> In parallel, I had reinstalled one OSD host.
> It was backfilling well, but now, <24 hours later, before backfill has finished, several OSD hosts enter OOM condition.
> Our OSD hosts have 64 GB of RAM for 32 OSDs, which should be fine with the default BlueStore cache size of 1 GB. However, it seems the processes are using much more,
> up to several GBs until memory is exhausted. They then become sluggish, are kicked out of the cluster, come back, and finally at some point they are OOMed.
>
> Now, I have restarted some OSD processes and hosts, which helped to reduce the memory usage - but now I have some OSDs crashing continuously,
> leading to PG unavailability and preventing recovery from completing.
> I have reported a ticket about that, with stacktrace and log:
> http://tracker.ceph.com/issues/23120
> This might well be a consequence of a previous OOM-killer condition.
>
> However, my final question after these ugly experiences is:
> Did somebody ever stress-test CephFS with many small files?
> Are those issues known? Can special configuration help?
> Are the memory issues known? Are there solutions?
>
> We don't plan to use Ceph for many small files, but we don't have full control of our users, which is why we wanted to test this "worst case" scenario.
> It would be really bad if we lost a production filesystem due to such a situation, so the plan was to test now and learn what happens before we enter production.
> As of now, this looks really bad, and I'm not sure the cluster will ever recover.
> I'll give it some more time, but we'll likely kill off all remaining clients next week and see what happens, and in the worst case recreate the Ceph cluster.
>
> Cheers,
> Oliver
>
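
P.S.: For the memory questions at the top, a few admin-socket queries might help — just a sketch, daemon names are placeholders:

  # per-client sessions, including how many caps each client currently holds
  ceph daemon mds.<name> session ls

  # MDS cache memory accounting vs. the configured mds_cache_memory_limit
  ceph daemon mds.<name> cache status

  # MDS process memory counters (rss, heap)
  ceph daemon mds.<name> perf dump mds_mem

And on the OSD side: with 32 OSDs in 64 GB of RAM, the default BlueStore cache of 1 GB per OSD leaves little headroom for recovery/backfill overhead. Shrinking it in ceph.conf and restarting the OSDs is one knob to try — the value below is only an example, not a recommendation:

  [osd]
  bluestore_cache_size_hdd = 536870912  # 512 MB instead of the 1 GB default
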