Bayesian Filtering with C#

Introducing nBayes, a new open source project, which can be found here:

http://nbayes.codeplex.com/

nBayes is a simple implementation of the naive Bayesian spam filter described by Paul Graham in his essay "A Plan for Spam" (http://www.paulgraham.com/spam.html). The API is very simple, with just three classes that you need to be familiar with: Index, Entry, and Analyzer.

You train an Index by adding entries to it, and then use an Analyzer to categorize a new entry as belonging to one index or the other. In the spam filtering example, one index holds the spam, while the other holds the "not-spam".

Sample Code

Index spam = Index.CreateMemoryIndex();
Index notspam = Index.CreateMemoryIndex();

// train the indexes
spam.Add(Entry.FromString("want some viagra?"));
spam.Add(Entry.FromString("cialis can make you larger"));
notspam.Add(Entry.FromString("Hello, how are you?"));
notspam.Add(Entry.FromString("Did you go to the park today?"));

Analyzer analyzer = new Analyzer();
CategorizationResult result = analyzer.Categorize(
     Entry.FromString("cialis viagra"),
     spam,
     notspam);

switch (result)
{
    case CategorizationResult.First:
        Console.WriteLine("Spam");
        break;
    case CategorizationResult.Undetermined:
        Console.WriteLine("Undecided");
        break;
    case CategorizationResult.Second:
        Console.WriteLine("Not Spam");
        break;
}

The example above uses an extremely small index of words; even so, it does categorize the entry as spam. Larger indexes are required to get better results. The sample project provided in the source code shows how to create two indexes by searching Twitter for two different terms. The top 100 results of each Twitter API query are trained into the respective index, and you are then asked to type in a sample phrase, which is categorized into one of the two indexes.
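For a sense of why even a four-entry index can label "cialis viagra" as spam, here is a minimal, self-contained sketch of the scoring idea from "A Plan for Spam". It is not the nBayes source: TinyIndex, GrahamSketch, and the tokenizer are hypothetical names made up for this illustration. It is also simplified, skipping the essay's doubling of "good" counts, its minimum occurrence threshold, and its limit to the 15 most interesting tokens.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration: TinyIndex and GrahamSketch are NOT nBayes types.
public class TinyIndex
{
    private readonly Dictionary<string, int> counts = new Dictionary<string, int>();

    // Number of entries trained into this index.
    public int EntryCount { get; private set; }

    public void Add(string text)
    {
        EntryCount++;
        foreach (string token in Tokenize(text))
        {
            int seen;
            counts.TryGetValue(token, out seen); // seen stays 0 for a new token
            counts[token] = seen + 1;
        }
    }

    public int CountOf(string token)
    {
        int seen;
        return counts.TryGetValue(token, out seen) ? seen : 0;
    }

    // Naive tokenizer (an assumption): lower-case, split on spaces and basic punctuation.
    public static string[] Tokenize(string text)
    {
        return text.ToLowerInvariant().Split(
            new[] { ' ', ',', '.', '?', '!' },
            StringSplitOptions.RemoveEmptyEntries);
    }
}

public static class GrahamSketch
{
    // Probability that an entry containing this token belongs to the first index.
    public static double TokenProbability(string token, TinyIndex first, TinyIndex second)
    {
        double f = Math.Min(1.0, first.CountOf(token) / (double)Math.Max(1, first.EntryCount));
        double s = Math.Min(1.0, second.CountOf(token) / (double)Math.Max(1, second.EntryCount));
        if (f + s == 0.0) return 0.4; // never-seen token: the essay's default for unknown words
        return Math.Max(0.01, Math.Min(0.99, f / (f + s))); // clamp as in the essay
    }

    // Combine the per-token probabilities: prod(p) / (prod(p) + prod(1 - p)).
    public static double Score(string text, TinyIndex first, TinyIndex second)
    {
        double product = 1.0, inverse = 1.0;
        foreach (string token in TinyIndex.Tokenize(text))
        {
            double p = TokenProbability(token, first, second);
            product *= p;
            inverse *= 1.0 - p;
        }
        return product / (product + inverse);
    }
}

public static class Program
{
    public static void Main()
    {
        var spam = new TinyIndex();
        var notSpam = new TinyIndex();
        spam.Add("want some viagra?");
        spam.Add("cialis can make you larger");
        notSpam.Add("Hello, how are you?");
        notSpam.Add("Did you go to the park today?");

        double score = GrahamSketch.Score("cialis viagra", spam, notSpam);
        // The essay treats a combined probability above 0.9 as spam.
        Console.WriteLine(score > 0.9 ? "Spam" : "Not Spam");
    }
}
```

Both words in the test phrase appear only in the spam entries, so each gets the clamped maximum of 0.99 and the combined score lands well above the 0.9 cutoff. With so little training data, almost every token is either "only seen in spam" or "only seen in not-spam", which is why larger indexes give more trustworthy results.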

3 Comments

  1. dylan Said,

    February 9, 2010 @ 7:03 pm

    thanks for sharing this – i really needed to get my head around this concept quickly, and your source is excellent. just stepping through the debugger now… so thanks again.

  2. scott Said,

    May 14, 2010 @ 11:20 am

    Hi,

    I am looking for a spam filter as well. Do you know where I could get a pretrained index file?

    Thanks

  3. Joel Martinez Said,

    May 18, 2010 @ 8:08 pm

    Well, here’s the thing about Bayesian filters. They are actually very tuned to each individual user (over time). To use an extreme example, content about Viagra may be spam to some, but for a pharmacist may be perfectly valid. So I’d suggest just starting your own index, and it will eventually tune itself :)
