Simple Anti-Spam

I read my non-Sun email through GMail these days. It’s amusing to watch the ebb and flow of spam, as the bad guys figure out a way through the defenses, then Google patches that hole; repeat forever. But there could be a lot less. If I could pick the character sets I can read, then it’d be automatic that anything in Chinese or Cyrillic or Hebrew or Arabic or Farsi or Devanagari is spam. Plus, anything that mentions lottery in the title or the body, even once. Plus, anything where the title is of the form “From XXX XXX”. That’d catch at least 75% of what’s still getting through.
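For illustration only, here is a rough Python sketch of what those three rules might look like. The function names, the list of scripts, and the exact reading of “From XXX XXX” (taken here to mean a subject that is just "From" plus two words) are all assumptions, not anything GMail actually offers.

```python
import re
import unicodedata

# Scripts the post treats as unreadable, matched against Unicode character names.
# Farsi is written in the Arabic script, and Chinese characters are named "CJK ...".
BLOCKED_SCRIPTS = ("CJK", "CYRILLIC", "HEBREW", "ARABIC", "DEVANAGARI")

# Assumed reading of "From XXX XXX": the whole subject is "From" plus two words.
FROM_SUBJECT = re.compile(r"^From \S+ \S+$")


def has_blocked_script(text: str) -> bool:
    """Return True if any character's Unicode name starts with a blocked script."""
    for ch in text:
        name = unicodedata.name(ch, "")
        if name.startswith(BLOCKED_SCRIPTS):
            return True
    return False


def looks_like_spam(subject: str, body: str) -> bool:
    """Apply the three rules to one message (hypothetical helper)."""
    if has_blocked_script(subject) or has_blocked_script(body):
        return True
    if "lottery" in (subject + " " + body).lower():
        return True
    return bool(FROM_SUBJECT.match(subject.strip()))
```

Rules this blunt will also flag legitimate mail that happens to mention a lottery or quote a foreign-language phrase.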

Contributions

From: Antone Roundy (Oct 04 2007, at 18:57)

Yeah, I've wanted to be able to block all emails containing Cyrillic characters before. I haven't had much trouble with any of the others you named.

The one thing you'd need to be sure of is that you're blocking Cyrillic characters, and not just messages that use a Cyrillic-specific charset, because many (if not all?) non-Roman character sets also contain Roman characters.
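A minimal Python sketch of that distinction, assuming the body bytes and the declared charset have already been pulled out of the message (the helper names are made up for illustration): it decodes the body and then looks at the characters themselves rather than at the charset label.

```python
import unicodedata


def contains_cyrillic(text: str) -> bool:
    """True if at least one character is named as Cyrillic in Unicode."""
    return any(unicodedata.name(ch, "").startswith("CYRILLIC") for ch in text)


def body_has_cyrillic(body: bytes, declared_charset: str) -> bool:
    """Decode the body with its declared charset, then inspect the characters."""
    text = body.decode(declared_charset, errors="replace")
    return contains_cyrillic(text)


# Both bodies are labelled koi8-r, but only the second contains Cyrillic letters.
english_body = b"See you on Monday."
russian_body = "Привет".encode("koi8-r")

print(body_has_cyrillic(english_body, "koi8-r"))
print(body_has_cyrillic(russian_body, "koi8-r"))
```

A filter keyed on the charset label alone would treat both messages the same way.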

From: Graham Hughes (Oct 04 2007, at 19:44)

I'm always leery of bouncing things based on character sets that I don't necessarily understand. If I asked a friend of mine who spoke Russian what some phrase meant, even though the dialogue would be entirely in English I would still want to use Cyrillic codepoints for the characters in dispute. I haven't had this with Russian but I have been known to ask friends of mine about Japanese characters. A more moderate version could be proposed, but now you're into the realm of individually personalizing the spam filter in question, and the character set it's using (information already in the message and available to, e.g., Bayesian filter systems) is just another datum.

From: John Cowan (Oct 04 2007, at 19:49)

I think you don't want to block particular charsets, you want to block particular characters or sequences of them. And GMail filters do provide that. Take some spam of each of the various kinds and look for common short words (okay, won't work so well for Chinese, sorry) and set filters to block messages containing them. Similarly, create a filter to block "lottery".

You can do the same sort of thing with filtering on the Subject: header.

From: Perry Lorier (Oct 04 2007, at 20:13)

I've been on the receiving end of some of these filters ("I don't talk to anyone in Japan/China/Korea so I'll filter out all of Asia Pacific..." "uhh, Australia/New Zealand? Hello? Oh you've filtered us and can't hear us anymore..."), so you do have to be careful.

On the flipside, I've done really well with filtering everything that includes "http://" as being spam. This requires a bit of thought as to when to apply it (obviously being involved heavily in Atom and XML things makes it much more likely to occur) but in areas such as a blog where you can say "anyone posting URLs will have their comment auto discarded" it can significantly reduce the amount of spam seen.

From: Michael H. (Oct 04 2007, at 21:18)

One of the things I like about pobox.com is that they have about three dozen different spam detection methods which you can enable or disable. Since I don't expect mail from Korea, Russia, &c., I can just bounce those, and just check the bounce logs once in a while.

From: Eric Meyer (Oct 04 2007, at 22:04)

I just recently started filtering mail in Windows character sets for languages I can't read. Usually the charset designator is in the subject or body, so I check both. So far I've had no false positives but I'm taking things slowly.

I'd list them here, but said filters are on my desktop computer, which is currently 2,150+ miles from my current position...

From: Ethan Stock (Oct 04 2007, at 23:02)

Tim,

I couldn't agree more. Leaving aside the significant benefit to the individual user, major email providers can use simple categorization to help identify spam by cross-correlation. For instance, two similar mass emailings written in Chinese, with A going only to self-identified Chinese speakers, while B goes to a lot of non-Chinese speakers -- guess which one is much more likely to be spam? I first wrote a post about this 18 months ago.

From: Bruno (Oct 05 2007, at 01:02)

You can check the "Content-type" header for a charset. Spamoracle additionally includes a feature to describe the attachments of a message (charset, type, etc.) in an additional header "X-Attachments". You can use this to filter out charsets you do not understand. For example, its README file suggests the following procmail recipe (i.e. regular expression):

:0
* ^(Content-type:.*|X-Attachments:.*cset="|^Subject:.*=\?)(ks_c|gb2312|iso-2|euc-|big5|windows-1251)
spambox

See http://pauillac.inria.fr/~xleroy/software.html#spamoracle
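For anyone without procmail, roughly the same check can be sketched with Python's standard `email` package. This is only an approximation of the recipe: it reads the charsets declared in Content-Type headers and in encoded Subject words, ignores the Spamoracle-specific X-Attachments header, and matches the prefixes at the start of the charset name. The function names are made up for illustration.

```python
from email import message_from_string
from email.header import decode_header

# Charset prefixes copied from the procmail recipe.
BLOCKED_PREFIXES = ("ks_c", "gb2312", "iso-2", "euc-", "big5", "windows-1251")


def declared_charsets(raw_message: str) -> set:
    """Collect charsets named in Content-Type headers and in the Subject."""
    msg = message_from_string(raw_message)
    # get_charsets() yields one entry per MIME part, None where no charset is declared.
    found = {cs for cs in msg.get_charsets() if cs}
    # Encoded words in the Subject look like =?charset?B?...?= and carry a charset too.
    for _, charset in decode_header(msg.get("Subject", "")):
        if charset:
            found.add(charset.lower())
    return found


def goes_to_spambox(raw_message: str) -> bool:
    """True if any declared charset starts with one of the blocked prefixes."""
    return any(cs.startswith(BLOCKED_PREFIXES) for cs in declared_charsets(raw_message))
```

Like the recipe, this only looks at what the message declares, not at the characters it actually contains.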

From: Devdas Bhagat (Oct 05 2007, at 01:50)

What about people who write in multiple character sets (possibly in the same message)? I have done it before, for people who don't necessarily speak good English.

From: Graham Parks (Oct 05 2007, at 02:36)

"Plus, anything that mentions lottery in the title or the body, even once."

And then someone tries to email you, quoting this blog entry...

From: Toby DiPasquale (Oct 05 2007, at 05:52)

You guys should all go to work for an anti-spam vendor for about 6 months to a year. I keep seeing this pop up now and again: every so often, some blogger gets the idea that "hey, I can catch ~XX% of my spam with 5 simple rules" and blogs it. The problem is not that XX%. The problem is twofold:

1. The other YY%, and

2. Those pesky other users on your mail server who are also getting spam

And the most subtle issue of all is that whatever percentage you can get today doesn't matter so much as how you get it. It doesn't matter because you have to increase it over time due to dramatically rising message volumes.

99% spam catch rate on 10,000 messages yields 100 spams making it into your inbox. Manageable. However, when the message volume rises to 100,000 messages, now you're getting 1,000 spams in your inbox. But, your filter never changed: it is still 99% effective, which sounds a lot more impressive than it is once you take the above into account.
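The arithmetic can be checked with a few lines of Python. The function names are made up for illustration, and the second function is an inference from the numbers above: it shows the catch rate needed to keep the inbox at the same 100 spams once volume grows.

```python
def spam_reaching_inbox(spam_volume: int, catch_rate_percent: int) -> int:
    """Number of spam messages that slip past a filter with the given catch rate."""
    return spam_volume * (100 - catch_rate_percent) // 100


def required_catch_rate(spam_volume: int, tolerated: int) -> float:
    """Catch rate (as a fraction) needed so that only `tolerated` spams get through."""
    return 1 - tolerated / spam_volume


for volume in (10_000, 100_000):
    print(volume, spam_reaching_inbox(volume, 99))
# prints:
# 10000 100
# 100000 1000
```

Holding the inbox at 100 spams out of 100,000 would take a catch rate of about 99.9%, which is why the rate has to keep rising along with the volume.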

On the other hand, the Tim Brays of the world are always free to open up their .procmailrc files and drop in those 5 simple rules to block 75% of their inbound spam. This is valuable, in that you know best what your legit mailstream looks like and in some cases doing this can really take a load off of your spam filter. However, be aware that maintenance of a really good spam filter is very much a full-time job, so don't be too surprised if those same 5 rules keep changing and blow up into 100 to 1,000 rules before long.

From: Walter Underwood (Oct 05 2007, at 08:59)

It would block some spam for a day, or maybe a few hours.

E-mail spammers have accounts on Hotmail, GMail, whatever, and use genetic programming to figure out what variation is getting through today, then they'll jump on it until it stops working.

At least that was the state of things a few years ago. By now, I expect that the spam fighters are detecting mailboxes that look like spambot behavior and randomizing the filters, or some such countermeasure.

Spam is an extremely expensive hassle for an ad-supported business. Ten years ago at Infoseek, over 10% of our search staff was fighting spam. That's a big tax.

From: Justin Mason (Oct 06 2007, at 12:45)

Toby's comment is fantastic.

For what it's worth, SpamAssassin offers 'ok_locales' to specify which charsets are acceptable.

From: Erik (Oct 09 2007, at 13:03)

Since all the charsets I can think of include basic Latin as a subset, I think that by blocking charsets you'll be doing a great disservice to people who aren't monolingual English speakers and who wish to communicate with you. For example, I get a lot of email on lists from people who respond in English but whose mailer prefaces the quotation in Chinese/Russian/whatever. In this case the email is going to be in a non-ISO-8859-x charset, so if I'd set my mailer to filter it, this would be a bad thing.

Besides, if we all go to Unicode this isn't going to help, is it?

October 04, 2007

By Tim Bray.

The opinions expressed here
are my own, and no other party
necessarily agrees with them.
