Re: [Linux-cluster] GFS limits?

Ken Preslan wrote:




Our current allocation methods try to allocate from areas of the disk
where there isn't much contention for the allocation bitmap locks.  They
don't know anything about spreading load on the basis of actual disk
load.  (That would be an interesting thing to add, but we don't have any
plans to do so in the short term.)
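
If I'm reading that right, the decision is shaped roughly like the sketch below (purely illustrative Python, nothing like the real GFS code; the resource-group fields and lock calls are made up):

import threading

# Purely illustrative, not GFS code: prefer a resource group whose
# allocation-bitmap lock can be taken without waiting, ignoring how busy
# the underlying disk actually is.

class ResourceGroup:
    def __init__(self, name, free_blocks):
        self.name = name
        self.free_blocks = free_blocks
        self.bitmap_lock = threading.Lock()   # stand-in for a clustered bitmap lock
        self.waiters = 0                      # hypothetical contention counter

def pick_resource_group(resource_groups):
    """Pick a group to allocate from, preferring uncontended bitmap locks.

    The caller allocates out of the group's bitmap and then releases the lock.
    """
    # First pass: any group with free blocks whose lock we can take right now.
    for rg in resource_groups:
        if rg.free_blocks > 0 and rg.bitmap_lock.acquire(blocking=False):
            return rg
    # Otherwise, wait on the least-contended group that still has space.
    # (Handling a completely full filesystem is omitted here.)
    candidates = [rg for rg in resource_groups if rg.free_blocks > 0]
    rg = min(candidates, key=lambda g: g.waiters)
    rg.bitmap_lock.acquire()
    return rg

In other words, the choice is driven by who's fighting over the bitmap locks, not by which spindles are busiest, which matches what you're describing.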



My use case isn't very standard. Rather than needing tons of random read/write access all over the disk, our workload is almost entirely linear: each file is written once and then read many times.


We do photo sharing and storage, so lots and lots of photos get uploaded and are stored serially on disk. Once they're on disk, though, they're rarely modified; they're just read.

It's foreseeable, though, that at some point we won't be able to push these linear writes to disk fast enough as people upload photos, whether because the interface (GigE, iSCSI, Fibre Channel) isn't fast enough or for some other reason. That's way out in the future, but it'll come faster than I like to think about.

In that case, we need a nice way to spread those writes across multiple disks/servers/whatever. GigE bonding might solve it temporarily, but that only goes so far.

Ideally, I want to scale horizontally (tons of cheap Linux boxes attached to big disks) and have the writes "passed out" among those boxes. If I have to write my own stuff to do that, fine. But if GFS can potentially provide something along those lines down the road, great.
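
For illustration, the kind of "passing out" I have in mind could be as dumb as hashing each photo onto a storage box. A rough sketch (the box names are made up, and the transport behind it, whether NFS, HTTP, or GNBD, is left out entirely):

import hashlib

# Rough sketch only: deterministically map each photo to one of N storage
# boxes so uploads spread across them. Box names are hypothetical.

STORAGE_BOXES = ["store01", "store02", "store03", "store04"]

def pick_box(photo_id: str) -> str:
    """Hash a photo id onto one storage box."""
    digest = hashlib.md5(photo_id.encode()).hexdigest()
    return STORAGE_BOXES[int(digest, 16) % len(STORAGE_BOXES)]

if __name__ == "__main__":
    for photo in ("IMG_0001.jpg", "IMG_0002.jpg", "beach-42.jpg"):
        print(photo, "->", pick_box(photo))

The obvious catch is that adding a box reshuffles almost everything, so in practice it would want something more like consistent hashing or a small lookup table, but you get the idea.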


In the event of some multiple-catastrophe failure (where some data isn't online at all, let alone redundant), how graceful is GFS? Does it "rope off" the data that's not available and still allow full access to the data that is? Or does the whole cluster go down?


Right now, a malfunctioning or missing disk can cause the whole
cluster to go down.  That's assuming the error isn't masked by hardware
RAID or CLVM mirroring (when we get there).

One of the next projects on my plate is fixing the filesystem so that a
node will gracefully withdraw itself from the cluster when it sees a
malfunctioning storage device.  Each node will stay up and could
potentially continue accessing other GFS filesystems on other storage
devices.

I/We haven't thought much about trying to get GFS to continue to function
when only part of a filesystem is present.
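
Just to check my understanding of the withdrawal plan, is it shaped roughly like this (illustrative Python only, nothing like the real kernel code): a fault on one filesystem's storage stops I/O to that filesystem, and the node and its other mounts stay up?

# Illustrative only, not GFS code: per-filesystem withdrawal on storage
# faults, leaving the node and its other GFS mounts running.

class GFSMount:
    def __init__(self, name):
        self.name = name
        self.withdrawn = False

    def withdraw(self):
        # Stop issuing I/O and give up cluster locks for this mount only.
        self.withdrawn = True

class Node:
    def __init__(self, mounts):
        self.mounts = mounts

    def handle_storage_fault(self, mount_name):
        for m in self.mounts:
            if m.name == mount_name:
                m.withdraw()
        # The node itself stays up; the remaining mounts keep working.
        return [m.name for m in self.mounts if not m.withdrawn]

if __name__ == "__main__":
    node = Node([GFSMount("gfs_photos"), GFSMount("gfs_logs")])
    print("still accessible:", node.handle_storage_fault("gfs_photos"))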


When I'm talking about petabytes, this weighs on my mind heavily. I can't have a power outage that takes out a couple of nodes holding both copies of the "redundant" data for, say, 10TB bring down a 20PB cluster.


I realize 20PB sounds fairly ridiculous at the moment, but I can see it coming. And it's a management nightmare when that capacity is spread across small 1TB block devices all over the place instead of one aggregate volume. I'm sure the aggregate volume is a software nightmare to implement, but that's not my problem. :)



I notice the pricing for GFS is $2200. Is that per seat? And if so, what's a "seat"? Each client? Each server with storage participating in the cluster? Both? Some other distinction?


I'm not a marketing/sales person, just a code monkey, so take this with
a grain of salt:  It's per node running the filesystem.  I don't think
machines running GULM lock servers or GNBD block servers count as
machines that need to be paid for.


Looks like I have more reading to do, since apparently I don't totally get what a GNBD block server is. Or a GULM lock server, for that matter.




Is AS a prereq for clients? Servers? Both? Or will ES and WS boxes be able to participate as well?


According to the web page, you should be able to add a GFS entitlement to
all RHEL trimlines (WS, ES, and AS).

http://www.redhat.com/apps/commerce/rha/gfs/


Thanks!

Don


