Since I was talking about this in #fedora-mktg as I made this, I thought I'd share. Basically, Diana was talking about how it's hard for her to figure out who's an active contributor (for her research), since there are so many ways, means, and places (git, wiki, lists, etc.) to contribute to Fedora. So I said "well, fire up twill, scrape 'em all down, do some text processing, and you'll have a per-user portfolio you can analyze to get an 'activity count.'"

After several hours of being too distracted to actually implement a quick-and-dirty proof of concept, I sat down and spent (according to IRC timestamps) 8 minutes actually looking up twill's Python API syntax and writing the 11 lines of code that do the job, then 29 minutes commenting it, perhaps a little too exhaustively.

http://mchua.fedorapeople.org/FAS_scraper

When run, this takes a list of FAS usernames and spits out a series of <username>.html files containing multiple-service "portfolios" for each user (currently: wiki edits and packages maintained, but easily extensible). I've pasted the README below to give folks an idea of what this does. It's a proof of concept looking for someone who can architect and implement it better, as I don't really have the time to do it properly.

--- README.txt ---

# FAS_scraper.py
# v.1.0 (March 1, 2010)
# Mel Chua <mchua@xxxxxxxxxxxxxxxxx>
#
# This is a quick proof-of-concept scraper inspired by Diana Martin's research
# on the Fedora community; she's trying to get a gauge on who in Fedora
# is an "active contributor," so I suggested making a tiny scraper to gather
# all the FAS-authenticated activity of a user from existing webpages.
# I'm pretty sure most of these services have APIs that would do the job
# better and less kludgily, but this is just to see if it's a useful thing.

== Caveat ==

This isn't actually a proper README.txt - rather, a quick hack taken from the opening code comments.
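To give a feel for the approach without downloading the script: here's a rough sketch of what a scraper like this does. This is NOT the actual FAS_scraper.py code - the service URL templates are illustrative stand-ins, and plain urllib takes the place of twill - but it shows the shape of the loop, including the `{user}`-in-a-template trick that would also cover the <start_of_url>/<username>/<end_of_url> case mentioned under "Further developments" below.

```python
# Illustrative sketch only -- not the real FAS_scraper.py. The URL
# templates below are examples, and urllib stands in for twill.
from urllib.request import urlopen

# Map service name -> URL template. '{user}' marks where the FAS
# username goes, so templates with a trailing path also work.
SERVICES = {
    "wiki": "https://fedoraproject.org/wiki/Special:Contributions/{user}",
    "pkgdb": "https://admin.fedoraproject.org/pkgdb/users/packages/{user}",
}

def fetch(url):
    """Download one page as text; twill's go()/show() would do the same."""
    with urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")

def scrape_user(username, services=SERVICES, fetch=fetch):
    """Concatenate the raw HTML of every service page for one user."""
    chunks = []
    for name, template in services.items():
        url = template.format(user=username)
        chunks.append("<!-- %s: %s -->\n" % (name, url))  # label each dump
        chunks.append(fetch(url))
    return "".join(chunks)

def main(usernames):
    # One <username>.html "portfolio" per user, as the script does.
    for user in usernames:
        with open("%s.html" % user, "w") as f:
            f.write(scrape_user(user))

if __name__ == "__main__":
    main(["mchua"])  # hardcoded list, as in the proof of concept
```

Because the fetch step is a plain function argument, you can swap in twill (or a stub for testing) without touching the loop.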
The Python code itself is extensively commented (there are 11 lines of actual code in the 46-line file).

== Installation ==

You will need Python and twill installed to run this script. On Fedora:

yum install python python-twill

Then download FAS_scraper.py into a directory and run it:

python FAS_scraper.py

You'll see a lot of output (the HTML of the pages being scraped) dumped to your terminal; I'm leaving it verbose for now on purpose so people can see what's going on. You'll end up with a series of <username>.html files in the directory FAS_scraper.py is in. These contain the raw HTML dumps of the profile pages for that FAS user for each specified service.

== Sample output ==

http://mchua.fedorapeople.org/FAS_scraper/sample_output

== Further developments ==

Some quick suggestions for further work - what actually needs to happen is for this to be re-architected into a good general-purpose Python library for getting data from FAS-authenticated services.

* Instead of manually defining the list of FAS usernames in the code, grab the list of usernames from the actual FAS system.
* Check the validity of the FAS users you're looking for - right now, if you enter a username that doesn't exist, the program will try to download the pages for that user anyway. (It won't stop the program; you'll just get output for that user consisting of webpages saying the user doesn't exist.)
* Add more services.
* Check the validity of services.
* Create a class for services so that we can handle cases that aren't reachable in the format <start_of_url>/<username>. (For instance, what if it's <start_of_url>/<username>/<end_of_url>?)
* Create a class for users that can parse and spit out statistics for each of the services you're looking at. For instance, can you automatically get the value of username.pkgdb.number_maintained()?

-- 
marketing mailing list
marketing@xxxxxxxxxxxxxxxxxxxxxxx
https://admin.fedoraproject.org/mailman/listinfo/marketing