Re: ceph fs crashes on simple fio test

On Tue, Sep 10, 2019 at 1:11 PM Frank Schilder <frans@xxxxxx> wrote:
Hi Robert,

I have metadata on SSD (3x replicated) and data on 8+2 EC on spinning disks, so the speed difference is orders of magnitude. Our usage is quite metadata-heavy, so this suits us well, particularly since EC pools deliver high throughput with large IO sizes.
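
As an aside, a layout like that is typically created along the following lines; this is only a sketch with made-up pool names, PG counts and profile name, not the actual setup on this cluster:

    # hypothetical 8+2 EC data pool for CephFS (metadata stays on a 3x replicated SSD pool)
    ceph osd erasure-code-profile set ec-8-2 k=8 m=2 crush-failure-domain=host
    ceph osd pool create cephfs_data_ec 1024 1024 erasure ec-8-2
    ceph osd pool set cephfs_data_ec allow_ec_overwrites true
    ceph fs add_data_pool cephfs cephfs_data_ec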

As long as one uses fio with direct=1 (probably also with sync=1 and/or fsync=1), everything is fine and behaves as you describe. IOPS fluctuate but adjust to media speed. No problems at all.

As mentioned in my last update (I cut it out below), the destructive fio command runs with direct=0 and neither sync=1 nor fsync=1. This test just writes as fast as it can (to buffers) without waiting for acks. I would expect a ceph client to translate that into synced or direct IO, which would be fine.

But it doesn't. Instead, it pushes the IO to the cluster as fast as possible. I have seen 40k write ops/s on the EC pool (on 100+ HDDs), which can handle maybe 1k write ops/s in total. The queues were growing constantly at an incredible rate (several hundred ops per second). I hope that with the change to cut_off=high heartbeats will no longer get lost, but this will still destabilize our ceph cluster quite dramatically.
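
For reference, the buffered, un-synced pattern described above corresponds to an fio invocation roughly like the one below. This is illustrative only; the exact command was cut from the quoted mail, and the target directory is made up:

    # hypothetical CephFS mount point; buffered writes, no syncing
    fio --name=buffered-write --directory=/mnt/cephfs/fio-test \
        --rw=randwrite --bs=4k --size=4G --numjobs=4 \
        --direct=0 --sync=0 --fsync=0 --time_based --runtime=60

With direct=0 and no sync/fsync, the client's page cache simply absorbs the writes and flushes them to the cluster in large bursts, so fio reports whatever rate memory allows rather than what the pool can sustain.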

Changing the cut_off to high will not keep heartbeats from getting lost (heartbeats have a priority far above the high mark). What cut_off = high does is put replication ops into the main queue instead of the strict priority queue, so an OSD doesn't get DDoSed by its peers to the point where it can never service its own clients.
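
For completeness, the option in question is osd_op_queue_cut_off. A minimal sketch of setting it (the option is read at OSD start, so a restart is most likely needed for it to take effect):

    # with cut_off = high, replication ops go through the main op queue
    # instead of the strict priority queue
    ceph config set osd osd_op_queue_cut_off high
    # or, in ceph.conf under [osd]:
    #   osd_op_queue_cut_off = high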

When I did my fio testing, it was on Firefly/Hammer and on RBD, so I can't speak specifically to newer versions or CephFS. We haven't had time to set up our test cluster, so I can't run benchmarks at the moment.
 
My problem is not so much that such an IO pattern could occur in reasonable software, but
- that someone might try it just for fun, and
- that our 500+ clients might occasionally produce such a workload in aggregate.

I find it somewhat alarming that a storage system which promises data integrity and reliability can be taken down by ordinary users, within a few dozen seconds, using a publicly available benchmark tool, potentially with damaging effects. I guess something similar could be achieved with a modified rogue client.

I would expect a storage cluster to have basic self-defence mechanisms that prevent this kind of overload or DoS attack by throttling clients that issue crazy IO requests. Are there any settings that can be enabled to prevent this from happening?

----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1 
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
