Re: advice on indexing sequential data?

On Thu, Oct 1, 2015 at 11:44 AM, Tom Nakamura <tnakamura@xxxxxx> wrote:
> Hello ceph-devel,
>
> My lab is developing a data-mining application for detecting and
> 'deanonymizing' spamming botnets from high-volume spam feeds.
>
> Currently, we just store everything in large mbox files in directories
> separated by day (think: /data/feed1/2015/09/13.mbox) on a single NFS
> server. We have ad-hoc scripts that extract features from these mboxes and
> pass them to our analysis pipelines (written in a mixture of
> R/MATLAB/Python/etc.). This system is reaching its limits.
>
> We already have a small Ceph installation with which we've had good luck
> storing other data, and would like to see how we can use it to solve
> our mail problem. Our basic requirements are:
>
> - We need to be able to access each message by its extracted features.
> These features include simple information found in its header (for
> example From: and To:) as well as more complex information like
> signatures from attachments and network information (for example,
> presence in blacklists).
> - We will frequently add/remove features.
> - Faster access to recent data is more important than to older data.
> - Maintaining strict ordering of incoming messages is not necessary. In
> other words, if we received two spam messages on our feeds, it doesn't
> matter too much if they are stored in that order, as long as we can have
> coarse-grained temporal accuracy (say, 5 minutes). So we don't need
> anything as sophisticated as Zlog.
> - We need to be able to remove messages older than some specific age,
> due to storage constraints.
>
> Any advice on how to use Ceph and librados to accomplish this?  Here are
> my initial thoughts:
>
> - Each message is an object with some unique ID. Use omap to store all
> its features in the same object.
> - For each time period (which will have to be pre-specified to, say, an
> hour), we have an object which contains a list of ID's, as a bytestring
> of concatenated ID's. This should make expiring old messages trivial.
> - For each feature, we have a timestamped index (like
> 20150930-from-foo@xxxxxxx or
> 20150813-has-attachment-with-hash-123abddeadbeef) which contains a
> list of ID's.
> - Hopefully use Rados classes to index/feature-extract on the OSD's.
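The scheme sketched in the bullets above could look roughly like this in Python. To be clear, every helper name here, the hour granularity, and the fixed ID width are assumptions for illustration, not anything specified in the thread:

```python
import hashlib
from datetime import datetime, timezone

# Fixed-width hex IDs, so a bucket's bytestring can be split without delimiters.
ID_LEN = 32

def message_id(raw_message: bytes) -> str:
    """Derive a unique, fixed-width object ID from the raw message bytes."""
    return hashlib.sha256(raw_message).hexdigest()[:ID_LEN]

def bucket_name(ts: datetime) -> str:
    """Name of the time-period object holding concatenated message IDs."""
    return ts.strftime("bucket-%Y%m%d-%H")

def feature_index_name(ts: datetime, feature: str, value: str) -> str:
    """Name of a timestamped per-feature index object."""
    return "%s-%s-%s" % (ts.strftime("%Y%m%d"), feature, value)

def pack_ids(ids):
    """Concatenate fixed-width IDs into the bytestring appended to a bucket."""
    return "".join(ids).encode()

def unpack_ids(blob: bytes):
    """Split a bucket object's bytestring back into individual IDs."""
    s = blob.decode()
    return [s[i:i + ID_LEN] for i in range(0, len(s), ID_LEN)]
```

With naming like this, ingest would presumably write each message with something like python-rados' `ioctx.write_full()`, store its features via an omap write op on the same object, and append its ID to the current bucket object; expiry then reduces to deleting whole bucket objects (and same-dated index objects) older than the cutoff.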
>
> How does this sound? One glaring omission is that I do not know how to
> create indices which would support querying by inequality/ranges ('find
> all messages between 1000 and 2000 bytes').

I would suggest some sort of hybrid approach, where you store your
messages and your time index in Ceph (so that you can insert data and
expire data all within Ceph), then use an external database for the
queries your application layer is interested in.  That way the
external database becomes somewhat disposable (you can always rebuild
it efficiently for any given time period by consulting your time index
in Ceph), but you don't have to implement any multi-axis querying
inside Ceph.
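That hybrid split could be sketched as follows, with SQLite standing in for the external database and a plain dict standing in for the Ceph time index (in reality you would read the bucket objects and each message's omap features back via librados; the table layout and feature names are assumptions):

```python
import sqlite3

# Stand-in for the Ceph side: time-period object name -> (id, size, sender)
# tuples recovered from the bucket object and per-message omap features.
ceph_time_index = {
    "bucket-20150913-11": [("id1", 1500, "foo@example.com"),
                           ("id2", 4000, "bar@example.com")],
    "bucket-20150913-12": [("id3", 1200, "foo@example.com")],
}

def rebuild(db, periods):
    """Rebuild the disposable external index for the given time periods."""
    db.execute("CREATE TABLE IF NOT EXISTS messages "
               "(id TEXT PRIMARY KEY, period TEXT, size INTEGER, sender TEXT)")
    for period in periods:
        # Idempotent: re-running after a failure just overwrites the same rows.
        for mid, size, sender in ceph_time_index.get(period, []):
            db.execute("INSERT OR REPLACE INTO messages VALUES (?, ?, ?, ?)",
                       (mid, period, size, sender))
    db.commit()

db = sqlite3.connect(":memory:")
rebuild(db, ceph_time_index.keys())

# The range query the original post didn't know how to index inside Ceph:
rows = db.execute(
    "SELECT id FROM messages WHERE size BETWEEN 1000 AND 2000").fetchall()
```

Note that the inequality/range queries Tom asked about fall out for free here: the database indexes `size` (or any other extracted feature column) and Ceph never has to.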

With that kind of approach, you don't have to worry about implementing
indices (let the existing database do it), but you do still have to
worry about recovery from failure, i.e. keeping the Ceph store and the
database index in sync.  You might need a "regenerate data for this
time period" call that re-inserts the last 5 minutes' emails into
the database after a failure of whatever is injecting the data.
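Such a "regenerate this time period" call could be as simple as wiping the period's rows and re-inserting from the Ceph time index. A minimal sketch, again with a stub function standing in for the librados reads and SQLite for the database (all names hypothetical):

```python
import sqlite3

def fetch_period(period):
    """Stand-in for reading a bucket object plus omap features out of Ceph."""
    return {"bucket-20150913-11": [("id1", 1500), ("id2", 4000)]}.get(period, [])

def regenerate(db, period):
    """Recover from an injector failure: wipe and re-insert one time period."""
    db.execute("DELETE FROM messages WHERE period = ?", (period,))
    for mid, size in fetch_period(period):
        db.execute("INSERT INTO messages VALUES (?, ?, ?)", (mid, period, size))
    db.commit()

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE messages (id TEXT PRIMARY KEY, period TEXT, size INTEGER)")
# Simulate an injector that died halfway through the period:
db.execute("INSERT INTO messages VALUES ('id1', 'bucket-20150913-11', 1500)")
regenerate(db, "bucket-20150913-11")
```

Because the whole period is deleted before re-insertion, the call is safe to repeat, which is the property you want when you don't know exactly where the injector stopped.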

John
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


