(Trimmed CC:) Apparently neither the Gearman nor the Archivematica list allows posting from non-members, which leads to some wonderful spam from Google and is going to make holding a cross-list conversation…difficult.

On Wednesday, May 2, 2012 at 4:26 PM, Greg Farnum wrote:
> On Wednesday, May 2, 2012 at 3:42 PM, Clint Byrum wrote:
> > Excerpts from Joseph Perry's message of Wed May 02 15:05:23 -0700 2012:
> > > Hello All,
> > > First off, I'm sending this email to three discussion groups:
> > > gearman@xxxxxxxxxxxxxxxx - distributed processing library
> > > ceph-devel@xxxxxxxxxxxxxxx - distributed file system
> > > archivematica@xxxxxxxxxxxxxxxx - my project's discussion list, a
> > > distributed processing system.
> > >
> > > I'd like to start a discussion about something I'll refer to as weighted
> > > distributed task-based processing.
> > > Presently, we are using Gearman's libraries to meet our distributed
> > > processing needs. The majority of our processing is file based, and our
> > > processing stations access the files over an NFS share. We are looking
> > > at replacing the NFS share with a distributed file system, like Ceph.
> > >
> > > It occurs to me that our processing times could theoretically be reduced
> > > by assigning tasks to processing clients where the file resides, rather
> > > than to clients that would need to copy it over the network. For this to
> > > happen, the gearman server would need to get file location information
> > > from the Ceph system.
> >
> > If I understand the design of Ceph correctly, it spreads I/O at the
> > block level, not the file level.
> >
> > So there is little point in weighting, since Ceph seeks to spread the
> > whole file across all the machines/block devices in the cluster.
> > Even if you do ask Ceph which servers file X is on (which I'm sure it
> > could tell you), you will end up with high weights for most of the
> > servers, and no real benefit.
> >
> > In this scenario, you're just better off having a really powerful
> > network, and Ceph will balance the I/O enough that you can scale out the
> > I/O independently of the compute resources. This seems like a huge win,
> > as I don't believe most workloads scale at a 1:1 I/O:CPU ratio.
> > 10-gigabit switches are still not super cheap, but they are probably
> > cheaper than software engineer hours.
> >
> > If your network is not up to the task of transferring all those blocks
> > around, you probably need to focus instead on something that keeps whole
> > files in a certain place. One such system is MogileFS. It has a database
> > with a list of keys that says where the data lives, and in fact the
> > protocol the MogileFS tracker uses will tell you all the places a key
> > lives. You could then place a hint in the payload and have two levels of
> > workers. The pseudocode becomes:
> >
> > - workers register two queues: 'dispatch_foo' and 'do_foo_$hostname'
> > - client sends a task with the filename to 'dispatch_foo'
> > - dispatcher looks at the filename, asks MogileFS where the file is,
> >   looks at recent queue lengths in gearman, and decides whether it is
> >   enough of a win to direct the job to the host where the file is, or to
> >   farm it out to somewhere that is less busy
> >
> > This will take a lot of poking at to get tuned right, but it should be
> > tunable to a single number: the ratio of localized queue length versus
> > non-localized queue length.
> >
> > > pseudo:
> > > gearman client creates a task & includes a weight, of type ceph file
> > > gearman server identifies the file & polls the ceph system for
> > > clients that have this file
> > > ceph system returns a list of clients that have the file locally
> > > gearman assigns the task
> > > .
> > > if there is a client available for processing that has the file locally
> > > .   assign it there
> > > .   (that client has local access to the file, still on the ceph
> > > .    system)
> > > . else
> > > .   assign to another client
> > > .   (that processing client will pull the file from the ceph system
> > > .    over the network)
> > >
> > > I call it a weighted distributed processing system because it reminds
> > > me of a weighted die: the outcome is influenced in a certain direction
> > > (in the task assignment).
> > >
> > > I wanted to start this as a discussion, rather than filing feature
> > > requests, because of the complex nature of the requests and the nicer
> > > medium for feedback, clarification, and refinement.
> > >
> > > I'd be very interested to hear feedback on the idea,
> > > Joseph Perry
>
> https://groups.google.com/group/gearman/browse_thread/thread/12a1b3aa64f103d1
> ^ is the Google Groups link for this (ceph-devel doesn't seem to have
> gotten the original email; at least I didn't!).
>
> Clint is mostly correct: Ceph does not store files in a single location.
> It's not block-based in the sense of 4K disk blocks, though; instead it
> breaks files up into (by default) 4MB chunks. It's possible to change
> this default to a larger number; our Hadoop bindings break files into
> 64MB chunks. And it is possible to retrieve this location data using the
> cephfs tool:
> ./cephfs
> not enough parameters!
> usage: cephfs path command [options]*
> Commands:
>    show_layout    -- view the layout information on a file or dir
>    set_layout     -- set the layout on an empty file,
>                      or the default layout on a directory
>    show_location  -- view the location information on a file
> Options:
>    Useful for setting layouts:
>    --stripe_unit, -u:  set the size of each stripe
>    --stripe_count, -c: set the number of objects to stripe across
>    --object_size, -s:  set the size of the objects to stripe across
>    --pool, -p:         set the pool to use
>    Useful for getting location data:
>    --offset, -l:       the offset to retrieve location data for
>
> I suspect this provides the information you're looking for?
> -Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
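To make Greg's chunking point concrete: under the default layout each 4MB extent of a file lives in its own object, so the object a given byte offset falls in (the thing `show_location --offset` reports on) reduces to integer division. A minimal sketch, assuming the simple default case only (stripe count 1, one stripe unit per object); real layouts set with `set_layout` can stripe data across several objects:

```python
DEFAULT_OBJECT_SIZE = 4 * 1024 * 1024  # Ceph's default 4MB chunk size

def object_index(offset, object_size=DEFAULT_OBJECT_SIZE):
    """Return which object (chunk) of a file a byte offset falls in.

    This models only the simple default layout (stripe_count=1, one
    stripe unit per object); it is an illustration, not Ceph's actual
    striping code.
    """
    return offset // object_size
```

So with defaults, byte 0 and byte 4194303 land in object 0, while byte 4194304 starts object 1; with the 64MB object size Greg mentions for the Hadoop bindings, a 70MB offset is still only in object 1.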
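Clint's two-level dispatch heuristic can be sketched as a small decision function. Everything here is hypothetical scaffolding: the queue names follow his `do_foo_$hostname` convention, `file_hosts` and `queue_lengths` stand in for real MogileFS tracker and Gearman status queries, and `MAX_LOCAL_RATIO` is the single tuning knob he describes (localized versus non-localized queue length):

```python
MAX_LOCAL_RATIO = 2.0  # the single tuning knob Clint describes

def choose_queue(file_hosts, queue_lengths):
    """Pick a Gearman queue name for a file-processing task.

    file_hosts    -- hosts holding the file (as the MogileFS tracker
                     would report them); hypothetical input
    queue_lengths -- map of 'do_foo_<hostname>' queue -> pending job
                     count (as Gearman's status output would provide)
    """
    # Least-busy queue overall; a job sent here may have to pull the
    # file over the network.
    anywhere = min(queue_lengths, key=queue_lengths.get)
    local_queues = [q for q in ("do_foo_%s" % h for h in file_hosts)
                    if q in queue_lengths]
    if not local_queues:
        return anywhere
    # Least-busy queue among hosts that already hold the file locally.
    local = min(local_queues, key=queue_lengths.get)
    # Keep the job local unless that queue is too long relative to the
    # least-busy queue anywhere (the +1 avoids a zero threshold).
    if queue_lengths[local] <= MAX_LOCAL_RATIO * (queue_lengths[anywhere] + 1):
        return local
    return anywhere
```

For example, with `queue_lengths = {"do_foo_a": 1, "do_foo_b": 0}` and the file on host `a`, the job stays on `a`; if `a`'s backlog grows to 9, it gets farmed out to `b` instead. Tuning reduces to adjusting `MAX_LOCAL_RATIO`, as Clint suggests.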