Dear all,

we finally found the reason for the unexpected growth in our cluster: the data was created by a collectd plugin [1] that measures latency by running rados bench once a minute. Since our cluster was under stress for a while, removing the objects created by rados bench failed. We completely overlooked the log messages that should have given us the hint a lot earlier, e.g.:

Jan 18 23:26:09 ceph1 ceph-osd: 2018-01-18 23:26:09.931638 7f963389f700 0 -- IP:6802/1986 submit_message osd_op_reply(374 benchmark_data_ceph3_31746_object158 [delete] v21240'22867646 uv22867646 ack = 0) v7 remote, IP:0/3091801967, failed lossy con, dropping message 0x7f96672a6680

Over time we "collected" some 1.5 TB of benchmark data :(

Furthermore, due to a misunderstanding we had the collectd plugin running the benchmarks on two machines, which doubled the stress on the cluster. And finally, we had the benchmarks write into our main production pool, which was also a bad idea.

Hope this info will be useful for someone :)

[1] https://github.com/rochaporto/collectd-ceph

Cheers,
--
J.Hofmüller

We are all idiots with deadlines. - Mike West
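PS: In case someone else needs to get rid of leftover rados bench objects, a rough sketch (untested here; <pool> is a placeholder, and "benchmark_data" is the default object name prefix rados bench uses):

  # see what is lying around
  rados -p <pool> ls | grep '^benchmark_data'

  # recent rados versions can remove everything matching a prefix
  rados -p <pool> cleanup --prefix benchmark_data

  # otherwise, delete the objects one by one
  rados -p <pool> ls | grep '^benchmark_data' | \
      while read obj; do rados -p <pool> rm "$obj"; done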