Re: weighted distributed processing.

On Wednesday, May 2, 2012 at 3:42 PM, Clint Byrum wrote:

> Excerpts from Joseph Perry's message of Wed May 02 15:05:23 -0700 2012:
> > Hello All,
> > First off, I'm sending this email to three discussion groups:
> > gearman@xxxxxxxxxxxxxxxx - distributed processing library
> > ceph-devel@xxxxxxxxxxxxxxx - distributed file system
> > archivematica@xxxxxxxxxxxxxxxx - my project's discussion list, a
> > distributed processing system.
> >  
> > I'd like to start a discussion about something I'll refer to as weighted
> > distributed task-based processing.
> > Presently, we are using gearman's libraries to meet our distributed
> > processing needs. The majority of our processing is file based, and our
> > processing stations access the files over an nfs share. We are looking
> > at replacing the nfs share with a distributed file system, like ceph.
> >  
> > It occurs to me that our processing times could theoretically be reduced
> > by assigning tasks to processing clients where the file already resides,
> > rather than to clients that would need to copy it over the network. In
> > order for this to happen, the gearman server would need to get file
> > location information from the ceph system.
>  
>  
>  
> If I understand the design of CEPH completely, it spreads I/O at the
> block level, not the file level.
>  
> So there is little point in weighting since it seeks to spread the whole
> file across all the machines/block devices in the cluster. Even if you
> do ask ceph "which servers is file X on", which I'm sure it could tell
> you, you will end up with high weights for most of the servers, and no
> real benefit.
>  
> In this scenario, you're just better off having a really powerful network
> and CEPH will balance the I/O enough that you can scale out the I/O
> independently of the compute resources. This seems like a huge win, as
> I don't believe most workloads scale at a 1:1 I/O:CPU ratio. 10Gigabit
> switches are still not super cheap, but they are probably cheaper than
> software engineer hours.
>  
> If your network is not up to the task of transferring all those blocks
> around, you probably need to focus instead on something that keeps whole
> files in a certain place. One such system would be MogileFS. This has a
> database with a list of keys that say where the data lives, and in fact
> the protocol the MogileFS tracker uses will tell you all the places a
> key lives. You could then place a hint in the payload and have two levels
> of workers. The pseudocode becomes:
>  
> -workers register two queues. 'dispatch_foo', and 'do_foo_$hostname'
> -client sends task w/ filename to 'dispatch_foo'  
> -dispatcher looks at filename, asks mogile where the file is, looks at
> recent queue lengths in gearman, and decides whether or not it is enough
> of a win to direct the job to the host where the file is, or to farm it
> out to somewhere that is less busy.
>  
> This will take a lot of poking at to get tuned right, but it should be
> tunable to a single number, the ratio of localized queue length versus
> non-localized queue length.
>  
> > pseudo:
> > gearman client creates a task & includes a weight, of type ceph file
> > gearman server identifies the file & polls the ceph system for clients
> > that have this file
> > ceph system returns a list of clients that have the file locally
> > gearman assigns the task:
> >     if there is a client available for processing that has the file locally
> >         assign it there
> >         (that client has local access to the file, still on the ceph system)
> >     else
> >         assign to another client
> >         (that processing client will pull the file from the ceph system
> >         over the network)
> >  
> >  
> > I call it a weighted distributed processing system because it reminds
> > me of a weighted die: the outcome is influenced in a particular
> > direction (here, in the task assignment).
> >  
> > I wanted to start this as a discussion, rather than filing feature
> > requests, because of the complex nature of the requests and because a
> > discussion is a nicer medium for feedback, clarification and refinement.
> >  
> > I'd be very interested to hear feedback on the idea,
> > Joseph Perry
>  
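
To make Clint's two-level dispatch a bit more concrete, here is a minimal
sketch, assuming the python-gearman 2.x API (GearmanWorker/GearmanClient).
The where_is() and queue_depth() helpers are hypothetical stand-ins for a
MogileFS tracker lookup (or a Ceph location query) and gearmand's admin
"status" output, and the generic 'do_foo_any' queue is something I've added
as the non-localized fallback to keep the sketch short:

# Rough sketch of the two-level dispatch described above.
# where_is() and queue_depth() are hypothetical helpers; wire them up to
# your MogileFS tracker (or Ceph) and gearmand's admin interface.
import socket
import gearman

GEARMAN_SERVERS = ['gearmand.example.com:4730']  # placeholder address
LOCAL_WIN_RATIO = 3.0  # tunable: how much longer a localized queue may be

def where_is(filename):
    """Hypothetical: return hostnames that hold this file locally."""
    raise NotImplementedError

def queue_depth(queue_name):
    """Hypothetical: return the current length of a gearman queue."""
    raise NotImplementedError

def dispatch_foo(worker, job):
    filename = job.data
    client = gearman.GearmanClient(GEARMAN_SERVERS)
    for host in where_is(filename):
        local_queue = 'do_foo_%s' % host
        # Only localize if that host's queue isn't badly backed up
        # compared to the generic queue.
        if queue_depth(local_queue) <= LOCAL_WIN_RATIO * queue_depth('do_foo_any'):
            client.submit_job(local_queue, filename, background=True)
            return 'localized:%s' % host
    client.submit_job('do_foo_any', filename, background=True)
    return 'farmed_out'

def do_foo(worker, job):
    # ... process the file named in job.data ...
    return 'done'

if __name__ == '__main__':
    worker = gearman.GearmanWorker(GEARMAN_SERVERS)
    worker.register_task('dispatch_foo', dispatch_foo)
    worker.register_task('do_foo_%s' % socket.gethostname(), do_foo)
    worker.register_task('do_foo_any', do_foo)
    worker.work()

The single tunable Clint mentions shows up here as LOCAL_WIN_RATIO.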

https://groups.google.com/group/gearman/browse_thread/thread/12a1b3aa64f103d1
^ is the Google Groups link for this (ceph-devel doesn't seem to have gotten the original email — at least I didn't!).

Clint is mostly correct: Ceph does not store a file in a single location. It isn't block-based in the sense of 4K disk blocks, though; instead it breaks files up into chunks of (by default) 4MB. That default can be changed to something larger; our Hadoop bindings, for example, break files into 64MB chunks. And it is possible to retrieve this location data using the cephfs tool:
./cephfs  
not enough parameters!
usage: cephfs path command [options]*
Commands:
show_layout -- view the layout information on a file or dir
set_layout -- set the layout on an empty file,
or the default layout on a directory
show_location -- view the location information on a file
Options:
Useful for setting layouts:
--stripe_unit, -u: set the size of each stripe
--stripe_count, -c: set the number of objects to stripe across
--object_size, -s: set the size of the objects to stripe across
--pool, -p: set the pool to use
Useful for getting location data:
--offset, -l: the offset to retrieve location data for
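
If you do want the per-host weighting Joseph describes, one approach is to
walk a file in object-size steps, ask show_location about each offset, and
count how often each OSD turns up. Keep Clint's point in mind, though: at
the default 4MB object size a 1GB file is split into 256 objects, so on a
modest cluster almost every OSD will hold a piece of it and the weights
flatten out; raising the object size via set_layout (as our Hadoop bindings
do, with 64MB objects) makes locality more meaningful. A rough sketch in
Python follows; the exact show_location output format is from memory, so
treat the parsing (and the "location.osd" field name) as an assumption to
check against your build:

# Tally which OSDs hold pieces of a file by calling
#   cephfs <path> show_location --offset N
# at each object boundary. The parsing below assumes show_location prints
# a line like "location.osd: 3"; verify against your cephfs build.
import collections
import re
import subprocess

DEFAULT_OBJECT_SIZE = 4 * 1024 * 1024  # 4MB default layout

def osd_weights(path, file_size, object_size=DEFAULT_OBJECT_SIZE):
    counts = collections.Counter()
    offset = 0
    while offset < file_size:
        out = subprocess.check_output(
            ['cephfs', path, 'show_location', '--offset', str(offset)])
        match = re.search(r'osd[:\s]+(\d+)', out.decode('utf-8', 'replace'))
        if match:
            counts[int(match.group(1))] += 1
        offset += object_size
    return counts  # maps osd id -> number of objects of this file it holds

You would still need to map OSD ids back to hostnames (via your cluster's
OSD/CRUSH configuration) before feeding anything like this into gearman's
task assignment.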



I suspect this provides the information you're looking for?

-Greg
--

