FAS scraper

Mel Chua <mel@xxxxxxxxxx> · Mon, 01 Mar 2010 21:53:10 -0500

Since I was talking about this in #fedora-mktg as I made this, I thought 
I'd share. Basically, Diana was talking about how it's hard for her to 
figure out who's an active contributor (for her research) since there 
are so many ways and means and places (git, wiki, lists, etc) to 
contribute to Fedora, so I said "well, fire up twill, scrape 'em all 
down, do some text processing, and you'll have a per-user portfolio you 
can analyze to get an 'activity count.'"

After several hours of being too distracted to actually implement a 
quick-and-dirty proof of concept, I sat down and spent (according to IRC 
timestamps) 8 minutes actually looking up twill python API syntax and 
writing 11 lines of code to do the job, then 29 minutes to comment it, 
perhaps a little too exhaustively.

http://mchua.fedorapeople.org/FAS_scraper

When run, this will take a list of FAS usernames and spit out a series 
of <username>.html files containing multiple-service "portfolios" for 
that user (currently: wiki edits and packages maintained, but easily 
extensible).

I've pasted the README below to give folks an idea of what this does. 
It's a proof-of-concept looking for someone who can architect and 
implement it better, as I don't really have the time to do it properly.

--- README.txt ---

# FAS_scraper.py
# v.1.0 (March 1, 2010)
# Mel Chua <mchua@xxxxxxxxxxxxxxxxx>

# This is a quick proof-of-concept scraper inspired by Diana Martin's research
# on the Fedora community; she's trying to get a gauge on who in Fedora
# is an "active contributor," so I suggested making a tiny scraper to gather
# all the FAS-authenticated activity of a user from existing webpages.
# I'm pretty sure most of these services have APIs that would do the job
# better and less kludgily, but this is just to see if it's a useful thing.

== Caveat ==

This isn't actually a proper README.txt - rather, a quick hack taken 
from the opening code comments. The python code itself is extensively 
commented (there are 11 lines of actual code in the 46-line file).

== Installation ==

You will need python and twill installed to run this script. On Fedora:

         yum install python python-twill

Then download FAS_scraper.py into a directory and run it:

         python FAS_scraper.py

You'll see a lot of output (the html of the pages being scraped) being 
dumped into your terminal; I'm leaving it verbose for now on purpose so 
people can see what's going on.

You'll end up with a series of <username>.html files in the directory 
that FAS_scraper.py is in. Each file contains the raw html dumps of the 
profile pages for that FAS user, one dump per specified service.
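For readers who want the shape of the thing without opening the file, here is a minimal sketch of the loop described above. It is not the actual FAS_scraper.py: it swaps twill for the standard library's urllib so that it has no dependencies, and the function names (fetch_url, build_portfolio, scrape) as well as the example.org URLs in the usage note are placeholders I made up for illustration. The only assumption carried over from the script is that each service is reachable at <start_of_url>/<username>.

```python
import os
import urllib.request


def fetch_url(url):
    """Download a page and return its body as text (stdlib stand-in for twill)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def build_portfolio(username, services, fetch=fetch_url):
    """Concatenate the raw page for `username` from every service.

    `services` maps a service name to the start of its per-user URL;
    the username is simply appended to it.
    """
    parts = []
    for name, start_of_url in services.items():
        url = start_of_url + username
        # Mark where each service's dump begins so the file can be split later.
        parts.append("<!-- %s: %s -->\n%s" % (name, url, fetch(url)))
    return "\n".join(parts)


def scrape(usernames, services, out_dir=".", fetch=fetch_url):
    """Write one <username>.html portfolio per user; return the file paths."""
    paths = []
    for username in usernames:
        path = os.path.join(out_dir, username + ".html")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(build_portfolio(username, services, fetch))
        paths.append(path)
    return paths
```

Something like scrape(["alice", "bob"], {"wiki": "https://example.org/wiki/User:"}) would then leave alice.html and bob.html in the current directory. The fetch argument is there so the loop can be tried out with a fake fetcher instead of hitting real servers.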

== Sample output ==

http://mchua.fedorapeople.org/FAS_scraper/sample_output

== Further developments ==

Some quick suggestions for further work - what actually needs to happen 
is for this to be re-architected into a good general-purpose python 
library for getting data from FAS-authenticated services.

* Instead of manually defining the list of FAS usernames in the code, 
grab the list of usernames from the actual FAS system.

* Check for validity of FAS users you're looking for - right now, if you 
enter a username that doesn't exist, the program will try to download 
the pages for that user anyway. (It won't stop the program; you'll just 
get output for that user consisting of webpages saying that the user 
doesn't exist.)

* Add more services.

* Check for validity of services.

* Create a class for services so that we can handle cases that aren't 
reachable by the format <start_of_url>/<username>. (For instance, what 
if it's <start_of_url>/<username>/<end_of_url>?)

* Create a class for users that can parse and spit out statistics for 
each of the services you're looking at. For instance, can you 
automatically get the value of username.pkgdb.number_maintained()?
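
The last two suggestions in the list above could look roughly like the sketch below. Everything in it is hypothetical: the Service and User classes, the "{username}" URL template, and the idea of counting regex matches in the page as a crude statistic are my own illustration, not part of FAS_scraper.py or of any FAS API. A real version would need a proper per-service parser (or the service's own API) rather than a regex.

```python
import re


class Service:
    """A service whose per-user URL follows a template.

    `url_template` must contain "{username}", which may sit anywhere in
    the URL, so <start_of_url>/<username>/<end_of_url> works too.
    `count_pattern` is an optional regex; each match in the page counts
    as one item (a package, an edit, ...). Both are illustrative only.
    """

    def __init__(self, name, url_template, count_pattern=None):
        self.name = name
        self.url_template = url_template
        self.count_pattern = count_pattern

    def url_for(self, username):
        return self.url_template.format(username=username)

    def count_items(self, html):
        # Without a pattern we have no way to derive a number from the page.
        if self.count_pattern is None:
            return None
        return len(re.findall(self.count_pattern, html))


class User:
    """A FAS username plus a callable that fetches a URL and returns its text."""

    def __init__(self, username, fetch):
        self.username = username
        self.fetch = fetch

    def activity_counts(self, services):
        """Return {service name: item count} for this user."""
        return {
            service.name: service.count_items(
                self.fetch(service.url_for(self.username))
            )
            for service in services
        }
```

With a service defined as Service("pkgdb", "https://example.org/pkgdb/{username}/packages", r'class="package"'), calling User("alice", fetch).activity_counts([pkgdb]) would give a dictionary with one count per service, which is about as close to username.pkgdb.number_maintained() as a scraper can get without a real parser.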