Dear all,

we finally found the reason for the unexpected growth in our cluster: the data was created by a collectd plugin [1] that measures latency by running rados bench once a minute. Since our cluster was under stress for a while, removing the objects created by rados bench failed. We completely overlooked the log messages that should have given us the hint a lot earlier, e.g.:

Jan 18 23:26:09 ceph1 ceph-osd: 2018-01-18 23:26:09.931638 7f963389f700 0 -- IP:6802/1986 submit_message osd_op_reply(374 benchmark_data_ceph3_31746_object158 [delete] v21240'22867646 uv22867646 ack = 0) v7 remote, IP:0/3091801967, failed lossy con, dropping message 0x7f96672a6680

Over time we "collected" some 1.5 TB of benchmark data :(

Furthermore, due to a misunderstanding we had the collectd plugin running the benchmarks on two machines, which doubled the stress on the cluster. And finally, we had the benchmarks write into our main production pool, which was also a bad idea.

Hope this info will be useful for someone :)

[1] https://github.com/rochaporto/collectd-ceph

Cheers,
--
J.Hofmüller

We are all idiots with deadlines. - Mike West
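PS: In case someone else needs to get rid of leftover rados bench objects, a rough sketch (untested here; <pool> is a placeholder, and "benchmark_data" is the default object name prefix rados bench uses):

  # see what is lying around
  rados -p <pool> ls | grep '^benchmark_data'

  # recent rados versions can remove everything matching a prefix
  rados -p <pool> cleanup --prefix benchmark_data

  # otherwise, delete the objects one by one
  rados -p <pool> ls | grep '^benchmark_data' | \
      while read obj; do rados -p <pool> rm "$obj"; done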