resolved - unusual growth in cluster after replacing journal SSDs

Dear all,

We finally found the reason for the unexpected growth in our cluster:
the data was created by a collectd plugin [1] that measures latency by
running rados bench once a minute.  Since our cluster was stressed out
for a while, removing the objects created by rados bench failed.  We
completely overlooked the log messages that would have given us the
hint a lot earlier, e.g.:

Jan 18 23:26:09 ceph1 ceph-osd: 2018-01-18 23:26:09.931638
7f963389f700  0 -- IP:6802/1986 submit_message osd_op_reply(374
benchmark_data_ceph3_31746_object158 [delete] v21240'22867646
uv22867646 ack = 0) v7 remote, IP:0/3091801967, failed lossy con,
dropping message 0x7f96672a6680
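
For context, the probe boils down to roughly the following (a minimal
sketch, not the plugin's actual code; the pool name, runtime and thread
count are my own placeholders):

#!/usr/bin/env python
# Minimal sketch of such a latency probe: run rados bench and parse
# the average latency.  Not the plugin's actual code; pool name,
# runtime and thread count are placeholder assumptions.
import re
import subprocess

POOL = "bench-test"  # use a dedicated pool, not production!

def measure_write_latency():
    # By default "rados bench ... write" deletes its benchmark_data_*
    # objects afterwards - unless the deletes fail, as in our case.
    out = subprocess.check_output(
        ["rados", "-p", POOL, "bench", "10", "write", "-t", "1"],
        universal_newlines=True)
    m = re.search(r"Average Latency(?:\(s\))?:\s*([0-9.]+)", out)
    return float(m.group(1)) if m else None

if __name__ == "__main__":
    print("avg write latency: %s s" % measure_write_latency())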

Over time we "collected" some 1.5TB of benchmark data :(
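
In case someone needs to clean up after a similar accident: the
leftovers are easy to spot, since rados bench names its objects with a
benchmark_data_<host>_<pid> prefix (see the log line above).  Here is a
rough sketch using the python-rados bindings - the pool name is a
placeholder, and it only counts the objects unless you flip dry_run:

#!/usr/bin/env python
# Sketch: find (and optionally delete) leftover rados bench objects.
# Assumes the python-rados bindings; POOL is a placeholder.
# Note: listing all objects can take a while on a large pool.
import rados

POOL = "your-pool"          # placeholder - the affected pool
PREFIX = "benchmark_data_"  # rados bench object name prefix

def cleanup(dry_run=True):
    cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx(POOL)
        try:
            leftovers = [o.key for o in ioctx.list_objects()
                         if o.key.startswith(PREFIX)]
            print("found %d leftover benchmark objects" % len(leftovers))
            if not dry_run:
                for name in leftovers:
                    ioctx.remove_object(name)
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()

if __name__ == "__main__":
    cleanup(dry_run=True)  # set dry_run=False to actually delete

If I remember correctly, newer rados binaries can also do this directly
with something like "rados -p <pool> cleanup --prefix benchmark_data",
but I have not checked on which releases that is available.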

Furthermore, due to a misunderstanding we had the collectd plugin that
runs the benchmarks running on two machines at once, doubling the
stress on the cluster.

And finally, we created the benchmark data in our main production
pool, which was also a bad idea.

Hope this info will be useful for someone :)

[1]  https://github.com/rochaporto/collectd-ceph

Cheers,
-- 
J.Hofmüller
                We are all idiots with deadlines.
                - Mike West


