(Trimmed CC:) Apparently neither the Gearman nor the Archivematica list allows posting from non-members, which leads to some wonderful spam from Google and is going to make holding a cross-list conversation…difficult.

On Wednesday, May 2, 2012 at 4:26 PM, Greg Farnum wrote:
> On Wednesday, May 2, 2012 at 3:42 PM, Clint Byrum wrote:
> > Excerpts from Joseph Perry's message of Wed May 02 15:05:23 -0700 2012:
> > > Hello All,
> > > First off, I'm sending this email to three discussion groups:
> > > gearman@xxxxxxxxxxxxxxxx - distributed processing library
> > > ceph-devel@xxxxxxxxxxxxxxx - distributed file system
> > > archivematica@xxxxxxxxxxxxxxxx - my project's discussion list, a
> > > distributed processing system.
> > >
> > > I'd like to start a discussion about something I'll refer to as weighted
> > > distributed task-based processing.
> > > Presently, we are using Gearman's libraries to meet our distributed
> > > processing needs. The majority of our processing is file based, and our
> > > processing stations access the files over an NFS share. We are looking
> > > at replacing the NFS share with a distributed file system, like Ceph.
> > >
> > > It occurs to me that our processing times could theoretically be reduced
> > > by assigning tasks to processing clients where the file resides, rather
> > > than to clients that would need to copy it over the network. For this to
> > > happen, the gearman server would need to get file location information
> > > from the Ceph system.
> >
> > If I understand the design of Ceph correctly, it spreads I/O at the
> > block level, not the file level.
> >
> > So there is little point in weighting, since Ceph seeks to spread the
> > whole file across all the machines/block devices in the cluster.
> > Even if you do ask Ceph which servers file X is on (which I'm sure it
> > could tell you), you will end up with high weights for most of the
> > servers, and no real benefit.
> >
> > In this scenario, you're just better off having a really powerful
> > network, and Ceph will balance the I/O enough that you can scale out the
> > I/O independently of the compute resources. This seems like a huge win,
> > as I don't believe most workloads scale at a 1:1 I/O:CPU ratio.
> > 10-gigabit switches are still not super cheap, but they are probably
> > cheaper than software engineer hours.
> >
> > If your network is not up to the task of transferring all those blocks
> > around, you probably need to focus instead on something that keeps whole
> > files in a certain place. One such system is MogileFS. It has a database
> > with a list of keys that says where the data lives, and in fact the
> > protocol the MogileFS tracker uses will tell you all the places a key
> > lives. You could then place a hint in the payload and have two levels of
> > workers. The pseudocode becomes:
> >
> > - workers register two queues: 'dispatch_foo' and 'do_foo_$hostname'
> > - client sends a task with the filename to 'dispatch_foo'
> > - dispatcher looks at the filename, asks MogileFS where the file is,
> >   looks at recent queue lengths in gearman, and decides whether it is
> >   enough of a win to direct the job to the host where the file is, or to
> >   farm it out to somewhere that is less busy
> >
> > This will take a lot of poking at to get tuned right, but it should be
> > tunable to a single number: the ratio of localized queue length versus
> > non-localized queue length.
> >
> > > pseudo:
> > > gearman client creates a task & includes a weight, of type ceph file
> > > gearman server identifies the file & polls the ceph system for
> > > clients that have this file
> > > ceph system returns a list of clients that have the file locally
> > > gearman assigns the task
> > > .
> > > if there is a client available for processing that has the file locally
> > > .   assign it there
> > > .   (that client has local access to the file, still on the ceph
> > > .    system)
> > > . else
> > > .   assign to another client
> > > .   (that processing client will pull the file from the ceph system
> > > .    over the network)
> > >
> > > I call it a weighted distributed processing system because it reminds
> > > me of a weighted die: the outcome is influenced in a certain direction
> > > (in the task assignment).
> > >
> > > I wanted to start this as a discussion, rather than filing feature
> > > requests, because of the complex nature of the requests and the nicer
> > > medium for feedback, clarification, and refinement.
> > >
> > > I'd be very interested to hear feedback on the idea,
> > > Joseph Perry
>
> https://groups.google.com/group/gearman/browse_thread/thread/12a1b3aa64f103d1
> ^ is the Google Groups link for this (ceph-devel doesn't seem to have
> gotten the original email; at least I didn't!).
>
> Clint is mostly correct: Ceph does not store files in a single location.
> It's not block-based in the sense of 4K disk blocks, though; instead it
> breaks files up into (by default) 4MB chunks. It's possible to change
> this default to a larger number; our Hadoop bindings break files into
> 64MB chunks. And it is possible to retrieve this location data using the
> cephfs tool:
> ./cephfs
> not enough parameters!
> usage: cephfs path command [options]*
> Commands:
>    show_layout    -- view the layout information on a file or dir
>    set_layout     -- set the layout on an empty file,
>                      or the default layout on a directory
>    show_location  -- view the location information on a file
> Options:
>    Useful for setting layouts:
>    --stripe_unit, -u:  set the size of each stripe
>    --stripe_count, -c: set the number of objects to stripe across
>    --object_size, -s:  set the size of the objects to stripe across
>    --pool, -p:         set the pool to use
>    Useful for getting location data:
>    --offset, -l:       the offset to retrieve location data for
>
> I suspect this provides the information you're looking for?
> -Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
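To make Greg's chunking point concrete: under the default layout each 4MB extent of a file lives in its own object, so the object a given byte offset falls in (the thing `show_location --offset` reports on) reduces to integer division. A minimal sketch, assuming the simple default case only (stripe count 1, one stripe unit per object); real layouts set with `set_layout` can stripe data across several objects:

```python
DEFAULT_OBJECT_SIZE = 4 * 1024 * 1024  # Ceph's default 4MB chunk size

def object_index(offset, object_size=DEFAULT_OBJECT_SIZE):
    """Return which object (chunk) of a file a byte offset falls in.

    This models only the simple default layout (stripe_count=1, one
    stripe unit per object); it is an illustration, not Ceph's actual
    striping code.
    """
    return offset // object_size
```

So with defaults, byte 0 and byte 4194303 land in object 0, while byte 4194304 starts object 1; with the 64MB object size Greg mentions for the Hadoop bindings, a 70MB offset is still only in object 1.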
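Clint's two-level dispatch heuristic can be sketched as a small decision function. Everything here is hypothetical scaffolding: the queue names follow his `do_foo_$hostname` convention, `file_hosts` and `queue_lengths` stand in for real MogileFS tracker and Gearman status queries, and `MAX_LOCAL_RATIO` is the single tuning knob he describes (localized versus non-localized queue length):

```python
MAX_LOCAL_RATIO = 2.0  # the single tuning knob Clint describes

def choose_queue(file_hosts, queue_lengths):
    """Pick a Gearman queue name for a file-processing task.

    file_hosts    -- hosts holding the file (as the MogileFS tracker
                     would report them); hypothetical input
    queue_lengths -- map of 'do_foo_<hostname>' queue -> pending job
                     count (as Gearman's status output would provide)
    """
    # Least-busy queue overall; a job sent here may have to pull the
    # file over the network.
    anywhere = min(queue_lengths, key=queue_lengths.get)
    local_queues = [q for q in ("do_foo_%s" % h for h in file_hosts)
                    if q in queue_lengths]
    if not local_queues:
        return anywhere
    # Least-busy queue among hosts that already hold the file locally.
    local = min(local_queues, key=queue_lengths.get)
    # Keep the job local unless that queue is too long relative to the
    # least-busy queue anywhere (the +1 avoids a zero threshold).
    if queue_lengths[local] <= MAX_LOCAL_RATIO * (queue_lengths[anywhere] + 1):
        return local
    return anywhere
```

For example, with `queue_lengths = {"do_foo_a": 1, "do_foo_b": 0}` and the file on host `a`, the job stays on `a`; if `a`'s backlog grows to 9, it gets farmed out to `b` instead. Tuning reduces to adjusting `MAX_LOCAL_RATIO`, as Clint suggests.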