Hello ceph-devel,

My lab develops data mining applications for detecting and 'deanonymizing' spamming botnets from high-volume spam feeds.

Currently, we just store everything in large mbox files in directories separated by day (think: /data/feed1/2015/09/13.mbox) on a single NFS server. We have ad-hoc scripts to extract features from these mboxes and pass them to our analysis pipelines (written in a mixture of R/matlab/python/etc). This system is reaching its limits.

We already have a small Ceph installation with which we've had good luck for storing other data, and would like to see how we can use it to solve our mail problem. Our basic requirements are that:

- We need to be able to access each message by its extracted features. These features include simple information found in its header (for example From: and To:) as well as more complex information like signatures from attachments and network information (for example, presence in blacklists).
- We will frequently add/remove features.
- Faster access to recent data is more important than to older data.
- Maintaining strict ordering of incoming messages is not necessary. In other words, if we receive two spam messages on our feeds, it doesn't matter too much whether they are stored in that order, as long as we keep coarse-grained temporal accuracy (say, 5 minutes). So we don't need anything as sophisticated as Zlog.
- We need to be able to remove messages older than some specific age, due to storage constraints.

Any advice on how to use Ceph and librados to accomplish this? Here are my initial thoughts:

- Each message is an object with some unique ID. Use omap to store all its features in the same object.
- For each time period (which will have to be pre-specified to, say, an hour), we have an object which contains a list of IDs, as a bytestring of concatenated IDs. This should make expiring old messages trivial.
- For each feature, we have a timestamped index (like 20150930-from-foo@xxxxxxx or 20150813-has-attachment-with-hash-123abddeadbeef) which contains a list of IDs.
- Hopefully use RADOS classes to index/feature-extract on the OSDs.

How does this sound? One glaring omission is that I do not know how to create indices which would support querying by inequality/ranges ('find all messages between 1000 and 2000 bytes').

Thank you,
Tom N.
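
P.S. To make the proposed layout a bit more concrete, here is a rough sketch in Python. It only models the object naming and bookkeeping with in-memory dicts standing in for the cluster; it does not call librados at all. The function names, the "bucket-" prefix, and the foo@example.com address are all made up for illustration.

```python
from collections import defaultdict
from datetime import datetime

# In-memory stand-ins for RADOS objects (assumption: a real version
# would read and write these through librados instead).
objects = {}                  # message ID -> {"body": bytes, "omap": {feature: value}}
buckets = defaultdict(list)   # hourly bucket name -> list of message IDs
indices = defaultdict(list)   # dated feature index name -> list of message IDs


def bucket_name(ts):
    """Name of the hourly bucket object (hypothetical naming scheme)."""
    return ts.strftime("bucket-%Y%m%d%H")


def index_name(ts, feature, value):
    """Name of a per-day feature index, e.g. '20150930-from-foo@example.com'."""
    return "%s-%s-%s" % (ts.strftime("%Y%m%d"), feature, value)


def store(msg_id, body, features, ts):
    """Store one message, record it in its hourly bucket, and index its features."""
    objects[msg_id] = {"body": body, "omap": dict(features)}
    buckets[bucket_name(ts)].append(msg_id)
    for feature, value in features.items():
        indices[index_name(ts, feature, value)].append(msg_id)


def expire(ts):
    """Drop every message listed in the hourly bucket that contains ts.

    Feature indices are not touched here; since their names carry the
    date, they could be deleted a whole day at a time in a separate pass.
    """
    for msg_id in buckets.pop(bucket_name(ts), []):
        objects.pop(msg_id, None)
```

For example, store("m1", b"...", {"from": "foo@example.com"}, datetime(2015, 9, 30, 14, 5)) would put m1 in the bucket for 14:00 on 2015-09-30 and in the index 20150930-from-foo@example.com, and expire() on any time within that hour would remove it again.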