On Thu, Jan 7, 2010 at 8:23 AM, Gary Warner <gar@xxxxxxxxxxx> wrote:
> Hello,
>
> I've been lurking on this list a couple weeks now, and have asked some "side questions" to some of the list members, who have been gracious, and helpful, and encouraged me to just dive in and participate on the list.
>
> I'll not tell you the whole story right off the bat, but let me give you a basic outline.
>
> I dual report into two academic departments at the University of Alabama at Birmingham - Computer & Information Sciences and Justice Sciences. Although my background is on the CS side, I specialize in cybercrime investigations and our research focuses on things that help to train or equip cybercrime investigators.
>
> One of our research projects is called the "UAB Spam Data Mine". Basically, we collect spam, use it to detect emerging malware threats, phishing sites, or spam campaigns, and share our findings with law enforcement and our corporate partners.
>
> We started off small, with only around 10,000 to 20,000 emails per day running on a smallish server. Once we had our basic workflow down, we moved to nicer hardware, and opened the floodgates a bit. We're currently receiving about 1.2 million emails per day, and hope to very quickly grow that to more than 5 million emails per day.
>
> I've got very nice hardware - many TB of very fast disks, and several servers with 12GB of RAM and 8 pentium cores each.

Are you running 8.3.x? I'd go with that as a minimum for now. You'll want to make sure those fast disks are under a fast RAID setup like RAID-10, perhaps with a high quality RAID controller with battery backed cache as well. I/O is going to be your real issue here, not CPU, most likely. Also look at increasing RAM to 48 or 96Gig if you can afford it. I assume your pentium cores are Nehalem since you've got 12GB of RAM (a multiple of 3). Those are a good choice here; they're fast and have plenty of memory bandwidth.

> For the types of investigative support we do, some of our queries are of the 'can you tell me what this botnet was spamming for the past six months', but most of them are more "real time", of the "what is the top spam campaign today?" or "what domains are being spammed by this botnet RIGHT NOW".

Then you'll probably want to look at partitioning your data.

> We currently use 15 minute batch queues, where we parse between 10,000 to 20,000 emails every 15 minutes. Each message is assigned a unique message_id, which is a combination of what date and time "batch" it is in, followed by a sequential number, so the most recent batch processed this morning starts with "10Jan07.0" and goes through "10Jan07.13800".
>
> I don't know what this list considers "large databases", but I'm going to go ahead and call 170 million records a "large" table.
>
> - if you have 8 pentium cores, 12GB of RAM and "infinite" diskspace, what sorts of memory settings would you have in your start up tables?

Crank up shared_buffers and effective_cache_size. You could probably also use a bump for work_mem, something in the 16 to 32 Meg range, especially since you're only looking at a dozen or so, not hundreds, of concurrent users. work_mem is per sort, so it can get out of hand fast if you crank it up too high, and for most users higher settings won't help anyway.
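As a very rough sketch, assuming a dedicated box with 12GB of RAM and the ~25 connections you describe (all of these numbers are starting-point guesses, not measurements from your workload), something along these lines in postgresql.conf:

  shared_buffers = 2GB             # ~1/4 of RAM is a common starting point
  effective_cache_size = 8GB       # rough estimate of what the OS will cache for you
  work_mem = 32MB                  # per sort/hash, per backend, so don't go wild
  maintenance_work_mem = 512MB     # speeds up index builds and vacuum/analyze
  checkpoint_segments = 32         # spreads out checkpoint I/O on an insert-heavy load
  wal_buffers = 16MB

The two that matter most here are shared_buffers and effective_cache_size; the rest you adjust once you've watched the parser run under real load for a while.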
> My biggest question mark there really has to do with how many users I have and how that might alter the results. My research team has about 12 folks who might be using the UAB Spam Data Mine at any given time, plus we have the "parser" running pretty much constantly, and routines that are fetching IP addresses for all spammed URLs and nameservers for all spammed domains and constantly updating the databases with that information. In the very near future, we'll be accepting queries directly from law enforcement through a web interface and may have as many as another 12 simultaneous users, so maybe 25 max users. We plan to limit "web users" to a "recent" subset of the data, probably allowing "today" "previous 24 hour" and "previous 7 days" as query options within the web interface. The onsite researchers will bang the heck out of much larger datasets.

You might want to look at pre-rolling common requests if that isn't what you're already doing with the routines you're mentioning up there.

> (I'll let this thread run a bit, and then come back to ask questions about "vacuum analyze" and "partitioned tables" as a second and third round of questions.)

You shouldn't have a lot of vacuum problems since you won't be deleting anything. You might wanna crank up autovacuum aggressiveness as regards analyze though. Partitioning is in your future.
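To give you an idea of what that looks like on 8.3/8.4: partitioning there is done with table inheritance plus CHECK constraints, and you turn on constraint_exclusion so the planner can skip the children a query can't match. A minimal sketch (table and column names are made up for illustration, not your actual schema):

  -- parent table stays empty; queries go against it
  CREATE TABLE spam_messages (
      message_id  text NOT NULL,
      batch_time  timestamp NOT NULL,
      subject     text
  );

  -- one child per day, with a CHECK constraint the planner can use to prune
  CREATE TABLE spam_messages_2010_01_07 (
      CHECK (batch_time >= '2010-01-07' AND batch_time < '2010-01-08')
  ) INHERITS (spam_messages);

  CREATE INDEX spam_messages_2010_01_07_batch_idx
      ON spam_messages_2010_01_07 (batch_time);

  -- with constraint_exclusion = on (8.4: 'partition'), this only scans today's child
  SELECT count(*)
    FROM spam_messages
   WHERE batch_time >= '2010-01-07' AND batch_time < '2010-01-08';

Your 15-minute parser can insert straight into the current day's child table, so you don't even need the usual redirect trigger on the parent, and the "previous 24 hours" web queries never have to touch the older 170-million-row history. On the analyze side, lowering autovacuum_analyze_scale_factor (default 0.1) makes analyze kick in more often, which is what you want on an insert-mostly workload.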