Re: Poll: Should mhonarc.org mail archives hide mail addresses

Chuq Von Rospach <chuqui@xxxxxxxxxxxxxx> · Thu, 1 Jan 2004 23:26:43 -0800

Finally, Chuq had a good point about requirements changing over
time. In the future, MHonArc may want to move towards encouraging more
semantic markup

The problem with this approach is that it won't work with text-based
browsers.  Accessibility is something I try to maintain,

Sure it will. Jeffrey Zeldman has a lot of useful information on how to 
be accessible and compliant by degrading gracefully. You can start 
here: http://www.happycog.com/lectures/access/ to get a first cut on 
this. The idea is to build things that use XHTML/CSS such that if 
certain features aren't supported by a browser, the site does the 
"right thing" instead of simply breaking, and does it without building 
multiple versions with browser sniffing. And accessible means more than 
sight-limited; it means alternative browsing tools, like my phone's 
mini-browser, and search engines like Google.
So accessibility is good. CSS/XHTML is good. And since mHonarc gets 
used in so many sites where people have to skin an interface onto it, I 
think moving to those models is a great idea (and basically a 
no-brainer), once you get past a bunch of the myths about those tools.
I first thought of using libgd to have addresses changed into CGI
links that generate an image of the email address on the fly.
I.e., harvesters would have to use OCR to get the address.

And there's evidence that some harvesters are experimenting in that 
direction. After all, it's only CPU time, and they're infinitely 
patient. Even if they only get a 10-15% hit rate on OCR conversions, 
that merely means they have to hit the site 10 times to get everything. 
That was the ultimate failure of the slashdot "random" obfuscation 
tool: spammers didn't have to break all of them, just enough of them to 
get useful data, and then cycle through the site enough times to get 
around the versions they didn't crack. Took about a week.
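
To put rough numbers on the "cycle through the site enough times" 
argument, here is a minimal sketch. It assumes each pass cracks a given 
address independently with a fixed probability, which is an assumption 
on top of the 10-15% figure above, and the function name is made up for 
illustration.

```python
def fraction_recovered(hit_rate: float, passes: int) -> float:
    """Expected fraction of addresses read at least once.

    Assumes every pass over the site succeeds on a given address
    independently with probability `hit_rate` (an illustrative model,
    not a measurement of any real harvester).
    """
    # An address stays hidden only if every single pass fails on it.
    return 1.0 - (1.0 - hit_rate) ** passes


if __name__ == "__main__":
    for passes in (1, 10, 20, 50):
        print(passes, round(fraction_recovered(0.10, passes), 3))
```

Under that assumption, ten passes at a 10% hit rate recover roughly two 
thirds of the addresses and fifty passes recover nearly all of them. A 
low per-pass hit rate only costs the harvester more passes.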
Another alternative is to remove linking of addresses, and then
using an obfuscation technique like:
  earl<!--
  -->&#64;<!--
  -->example.com
This way the address renders like "earl@example.com" (and can be
copy-n-pasted by readers to their MUA), but a harvester may not
catch it.  Of course, a smart harvester that expands entity references
and deletes comment declarations would.
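
To show how little work the "smart harvester" described above needs, 
here is a minimal sketch using only the Python standard library. The 
address pattern is deliberately loose and the function name is 
hypothetical; it is an illustration, not a description of any real 
harvester.

```python
import html
import re
from typing import List

# Matches an HTML comment, including ones that span line breaks.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
# Loose address pattern, good enough for illustration only.
ADDRESS_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def harvest(markup: str) -> List[str]:
    """Return addresses found after undoing comment/entity obfuscation."""
    without_comments = COMMENT_RE.sub("", markup)  # delete comment declarations
    decoded = html.unescape(without_comments)      # expand &#64; and friends
    return ADDRESS_RE.findall(decoded)


if __name__ == "__main__":
    obfuscated = "earl<!--\n  -->&#64;<!--\n  -->example.com"
    print(harvest(obfuscated))
```

Two standard-library calls undo the whole scheme, which is the point: 
the obfuscation holds only until someone bothers to write them.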

Be very wary of "fixes" that merely make the problem more difficult. As 
soon as they have a financial incentive to crack them, they'll be 
cracked. You're basically looking to try to implement the "I don't have 
to outrun the bear, I just have to outrun you" solution, meaning you 
make it tough enough to crack that they go harvest someone else's site.
In the case of mHonarc especially, that's a bad design choice. Since so 
many sites use mHonarc, any change you make to mHonarc will be a focus 
of the spammers to crack. mHonarc doesn't have the option of making it 
tough enough for the spammers to go elsewhere. So you risk putting 
energy into things that won't fix the problem long (if at all), and 
worse, might create a false sense of security for developers and users 
of the tools.
My suggestion: don't get involved in any "solution" that merely makes 
it "harder" or "causes more work", because they only solve things as 
long as the spammers don't feel it's worth it. And if you get into an 
arms race with them, you'll lose. So you have to fix things in ways 
they can't crack, or you probably shouldn't fix them at all. 
Half-measures waste time and energy and give people a sense of comfort 
that is worse than doing nothing.
I don't believe any obfuscation setup is safe. Period. They may work 
today, but if they ever get adopted widely enough to annoy the 
spammers, they'll be broken. The spammers keep building huge farms of 
zombied machines for delivery (which is what's hosed over the RBLs: 
the spammers have figured out how to hack around them by changing 
their delivery methods and using stolen system access). If they can use 
a machine for zombie delivery of spam, they can use that machine for 
computational work, too, so you should assume the spammers have a 
roughly infinitely large cluster of machines they can use to throw 
cycles at whatever you build. Because they do.
I read a study dated March 2003 that showed that simple obfuscation
techniques actually work, but I think (and the study even states)
that it is likely only a matter of time before spammers adapt.

Most of them are broken now. Basically useless.
 Mail-archive.com
uses a POST form to obfuscate addresses, but it is straightforward
to customize a harvester to defeat it.

Anything with a large enough data-set to warrant the spammers' 
attention will get it. mHonarc, sort of by definition, will be high on 
their lists.
Obfuscation is a waste of energy. It works only as long as the spammers 
don't bother worrying about it. Graphic representations are 
non-accessible, crackable (via OCR) and not easily used by end-users, 
so they not only don't solve the problem, they create new ones. 
JavaScript-based and POST-based stuff, ditto -- they break on all sorts 
of systems today (like phone browsers) where people want access to that 
data, and they only hold off the spammers as long as they don't bother 
implementing it. Those aren't solutions, just delaying tactics. Bad use 
of time.
 Since text-only
browsers can still read the messages in the archives, is it okay that
they will not have the ability to determine the author's address if
an image-based solution is adopted?  Is this an acceptable limitation
weighed against the problem of spam?

I think a "guest" has no claim on access to sensitive data. I don't 
allow "guests" open access to private mail lists, for instance, and I 
see no reason why they should assume they should have access to it.
I think it's safe to extend that to data I consider sensitive or 
private. Just because we've always been open and that data is 
accessible doesn't mean there's any requirement it remain so. After 
all, there was a time in life when few houses had locks on them, too. 
Times change. Not only do we lock doors and windows, we build gated 
communities.
I think the only safe way to do this is to make sure that this 
sensitive data is simply never in the data stream -- it's edited out 
before a user can get to it. If it's not there, it can't be 
de-obfuscated, it can't be reconstructed, it can't be 
reverse-engineered, because it's not there.
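
As an illustration of "never in the data stream", here is a minimal 
sketch that removes addresses from message text before it is written to 
an archive page. The pattern, the placeholder string and the function 
name are all assumptions for the example; this is not how mHonarc 
itself does anything.

```python
import re

# Loose address pattern; a real filter would need to cover headers,
# quoted-printable bodies and other encodings as well.
ADDRESS_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PLACEHOLDER = "[address removed]"


def strip_addresses(text: str) -> str:
    """Replace every address-like token with a fixed placeholder.

    Nothing derived from the address survives, so there is nothing
    left in the output to de-obfuscate or reconstruct.
    """
    return ADDRESS_RE.sub(PLACEHOLDER, text)


if __name__ == "__main__":
    print(strip_addresses("From: Earl <earl@example.com>"))
```

The design choice that matters is that the replacement is a constant: 
unlike an encoded or image-rendered address, it carries no information 
a harvester could recover.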
If people want more access, including that restricted data, then build 
a system to let them authenticate in and be granted access. I think 
that's more or less beyond the scope of mHonarc, but strongly related 
to it. In a perfect world, however you authenticate yourself to the 
mailing list to prove that "you are you" for purposes of posting or 
accepting list mail is how you'd authenticate into the archives, too, 
which implies this is probably a list-server operation which pulls data 
out of mHonarc, not a mHonarc operation, unless you want to start 
tightly coupling all of these different pieces together. Which has 
advantages and disadvantages...
I'd probably argue against building data-stripping into mHonarc, 
but perhaps a group of mHonarc folks would be interested in building a 
separate-but-equal project (similar to mharc) to handle the 
delivery/stripping/authentication piece, with hooks that allow it to 
interface into other systems for authentication data, so it could, 
perhaps, use Mailman email addresses and passwords, or Sympa user data 
to simplify things for the users a bit.
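
A rough sketch of what such a separate delivery piece with an 
authentication hook might look like. Everything here is hypothetical: 
the `Authenticator` interface, the in-memory backend and the `deliver` 
function are invented for illustration, and no real Mailman or Sympa 
API is used or implied. A real backend would sit behind the same 
interface and would not keep plain-text passwords.

```python
import hmac
import re
from typing import Dict, Optional, Protocol

ADDRESS_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PLACEHOLDER = "[address removed]"


class Authenticator(Protocol):
    """Hook that a list-server backend would implement (hypothetical)."""

    def verify(self, address: str, password: str) -> bool: ...


class InMemoryAuthenticator:
    """Toy backend for the example; stores plain-text passwords."""

    def __init__(self, credentials: Dict[str, str]) -> None:
        self._credentials = dict(credentials)

    def verify(self, address: str, password: str) -> bool:
        expected = self._credentials.get(address)
        if expected is None:
            return False
        # Constant-time comparison to avoid leaking match length.
        return hmac.compare_digest(expected.encode(), password.encode())


def deliver(message: str, auth: Authenticator,
            address: Optional[str] = None,
            password: Optional[str] = None) -> str:
    """Return the full message to authenticated users, stripped otherwise."""
    if address is not None and password is not None:
        if auth.verify(address, password):
            return message
    return ADDRESS_RE.sub(PLACEHOLDER, message)


if __name__ == "__main__":
    backend = InMemoryAuthenticator({"earl@example.com": "s3cret"})
    print(deliver("From: chuq@example.org", backend))
```

Guests fall through to the stripped version by default, so a failure in 
the authentication backend errs on the side of hiding the data.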