Hello,
We're a digital archive that stores images of old records and books.
We're about to evaluate glusterfs as a solution for our main storage
needs. I'm soliciting advice from both the glusterfs crew and other
users with similar needs.
Today we've got about 30 million images, each in two versions: the
high-quality original and a batch-processed, highly compressed copy
that's used by our customers.
So this gives 30 million large files (3-12MB) plus 30 million converted
copies that land at about 500KB per image.
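As a rough back-of-the-envelope sketch of what these numbers add up to
(assuming decimal units, and using the 3x replication we're aiming for;
the function name is just made up for illustration):

```python
# Back-of-the-envelope capacity estimate from the figures in this mail.
# Assumption: decimal units (1 KB = 10**3 bytes, 1 MB = 10**6, 1 TB = 10**12).
KB, MB, TB = 10**3, 10**6, 10**12

IMAGES = 30_000_000                      # number of originals (and of copies)
ORIGINAL_MIN, ORIGINAL_MAX = 3 * MB, 12 * MB
COPY_SIZE = 500 * KB                     # approximate size of a converted copy
REPLICAS = 3                             # desired number of physical replicas


def estimate_tb(replicas=REPLICAS):
    """Return (low, high) total size in TB for originals plus copies."""
    copies = IMAGES * COPY_SIZE
    low = (IMAGES * ORIGINAL_MIN + copies) * replicas / TB
    high = (IMAGES * ORIGINAL_MAX + copies) * replicas / TB
    return low, high


if __name__ == "__main__":
    print("one copy of everything (TB):", estimate_tb(replicas=1))
    print("with 3 replicas (TB):", estimate_tb())
```

The real figure depends on how the original sizes are distributed within
the 3-12MB range, so this only gives the bounds.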
The use cases are a bit different: the big images will be written once
and then batch-read once or twice a year.
The small images will be written once or twice a year, but are read
24/7 and are more latency sensitive.
We want the data replicated at least 3 times physically (box-wise), so
we've ordered 3 test servers, each with 24x3TB "enterprise" SATA disks
and an Areca card + BBU. We'll probably run the tests by feeding RAID
volumes to glusterfs, which from what I've seen seems to be the
standard setup.
Possible future:
Since our storage system will be in use for a really long time, we're
weighing the total economics of the solution against the data safety
concerns.
We've seen suggestions to let glusterfs manage the disks directly.
The way I see it, this would be a win in that:
1) We would be using all disks, no RAID/spare storage overhead
2) No RAID-rebuilds
3) ...
4) Profit
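To put a rough number on point 1, here is a small sketch comparing the
usable capacity of one 24x3TB box with and without RAID. The RAID layout
(two RAID-6 sets, no hot spares) is purely an assumption for
illustration, as is the helper name:

```python
# Usable capacity of one box: direct disks vs. a hypothetical RAID layout.
DISKS_PER_BOX = 24   # from the test servers described in this mail
DISK_TB = 3          # 3TB disks


def usable_tb(disks, disk_tb, raid_sets=0, parity_per_set=0, hot_spares=0):
    """Usable TB of one box; raid_sets=0 means every disk holds data."""
    data_disks = disks - hot_spares - raid_sets * parity_per_set
    return data_disks * disk_tb


if __name__ == "__main__":
    direct = usable_tb(DISKS_PER_BOX, DISK_TB)
    # Assumed layout: two RAID-6 sets (2 parity disks each), no hot spares.
    raid6 = usable_tb(DISKS_PER_BOX, DISK_TB, raid_sets=2, parity_per_set=2)
    print("direct disks:", direct, "TB")
    print("assumed RAID-6 layout:", raid6, "TB")
    print("capacity given up to RAID:", direct - raid6, "TB")
```

Other RAID levels or added hot spares would change the overhead, so the
difference is only indicative.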
Also, we know that any long-lived system we build should be planned
around replacing disks continuously.
So in my mind we could buy quality boxes with 24-36 disks run by 3-4
SATA controller cards (Marvell?), using cheap and large desktop disks
(maybe not the "green" variety). We could have a reporting system on top
of glusterfs that reports defective disks, which would then be replaced
as part of our on-duty maintenance. Since the storage is replicated over
3+ boxes, the failure of a single disk would not compromise data safety
as long as failed disks are replaced in a timely manner.
I would be very interested to hear other people's experiences or ideas
about storing this kind of data, and in particular the pros and cons of
the pass-through/direct disk model.
Any constructive input is welcome!
Regards,
Magnus Näslund