What the heck is this?
This is a Java program that predicts the class of an email.
The predicted classes are {Spam, Sports, Business, Technology, Entertainment}.
For more details read :
How it works
Step 1: Read the training emails.
Step 2: Text Pre-processing.
- Tokenization using Stanford Parser.
- Stop word removal: removes words such as “a”, “the”, “I”, “he”, “she”, “is”, “are”, etc.
- Word normalization: lemmatization using the Stanford lemmatizer.
- Build each instance from the top K terms by term frequency.
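The pre-processing steps above can be sketched roughly as follows. This is a minimal illustration, not the project's code: it swaps the Stanford tokenizer and lemmatizer for a plain lowercase split on non-letters so that it runs without extra jars, and the class name `TopKTerms`, its methods, and the tiny stop word list are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration; not part of the project's source.
class TopKTerms {

    // Tiny illustrative stop word list; a real list would be much longer.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "the", "i", "he", "she", "is", "are"));

    // Lowercase, split on non-letters, and drop stop words.
    // Assumption: this stands in for Stanford tokenization and lemmatization.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!raw.isEmpty() && !STOP_WORDS.contains(raw)) {
                tokens.add(raw);
            }
        }
        return tokens;
    }

    // Count term frequencies over all documents and return the k most frequent terms.
    // Ties are broken alphabetically so the result is deterministic.
    static List<String> topK(List<String> documents, int k) {
        Map<String, Integer> counts = new HashMap<>();
        for (String document : documents) {
            for (String token : tokenize(document)) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(k, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }

    // Turn one document into a vector of counts over the selected vocabulary.
    static int[] toInstance(String document, List<String> vocabulary) {
        int[] features = new int[vocabulary.size()];
        for (String token : tokenize(document)) {
            int index = vocabulary.indexOf(token);
            if (index >= 0) {
                features[index]++;
            }
        }
        return features;
    }

    public static void main(String[] args) {
        List<String> training = Arrays.asList(
                "The team won the match",
                "The match is a big match for the team");
        List<String> vocabulary = topK(training, 2);
        System.out.println(vocabulary);
        System.out.println(Arrays.toString(
                toInstance("She is at the match, match day", vocabulary)));
    }
}
```

Each email then becomes a fixed-length count vector over the same K terms, which is the kind of instance a classifier can be trained on.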
Step 3: Training
- Build the NBTree, a hybrid Decision Tree / Naive Bayes classifier.
- Train it on the training instances, then pass it a test instance.
- Predict the category.
The source code also contains custom implementations of NBTree and multinomial Naive Bayes.
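For readers unfamiliar with multinomial Naive Bayes, here is a minimal sketch of the general technique with add-one (Laplace) smoothing. It is not the project's implementation: the class name `SimpleMultinomialNB` and its methods are hypothetical, and the smoothing choice is an assumption.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical class for illustration; not the project's own classifier.
class SimpleMultinomialNB {
    private final Map<String, Integer> docsPerClass = new HashMap<>();
    private final Map<String, Map<String, Integer>> termCounts = new HashMap<>();
    private final Map<String, Integer> tokensPerClass = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    // Record one labeled, already tokenized email.
    void train(String label, List<String> tokens) {
        totalDocs++;
        docsPerClass.merge(label, 1, Integer::sum);
        Map<String, Integer> counts =
                termCounts.computeIfAbsent(label, key -> new HashMap<>());
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
            tokensPerClass.merge(label, 1, Integer::sum);
            vocabulary.add(token);
        }
    }

    // Return the label with the highest log posterior, or null if untrained.
    String predict(List<String> tokens) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docsPerClass.keySet()) {
            // Log prior: fraction of training emails with this label.
            double score = Math.log(docsPerClass.get(label) / (double) totalDocs);
            Map<String, Integer> counts = termCounts.get(label);
            double denominator =
                    tokensPerClass.getOrDefault(label, 0) + vocabulary.size();
            for (String token : tokens) {
                if (!vocabulary.contains(token)) {
                    continue; // terms never seen in training are ignored
                }
                // Add-one smoothing keeps unseen (term, class) pairs above zero.
                score += Math.log((counts.getOrDefault(token, 0) + 1) / denominator);
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        SimpleMultinomialNB model = new SimpleMultinomialNB();
        model.train("Sports", Arrays.asList("match", "team", "goal"));
        model.train("Spam", Arrays.asList("free", "money", "offer"));
        System.out.println(model.predict(Arrays.asList("team", "match")));
    }
}
```

Working in log space avoids numeric underflow when an email contains many terms.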
Why NBTree?
The model used is NBTree.
Problem: How can we generate a classifier from an arbitrarily sized database of labeled instances, where attributes
are not necessarily independent?
Solutions
1. Naive-Bayes Classifiers (Cons: assumes independence of attributes)
2. Decision-Trees (C4.5) (Cons: fragmentation as the number of splits becomes large)
3. NBTree, which is appropriate when:
* Many attributes are relevant for classification
* Attributes are not necessarily independent
* Database is large
* Interpretability of classifier is important
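The hybrid structure can be pictured with a small sketch: internal nodes split on a single attribute as in a decision tree, and each leaf holds a classifier trained only on the instances that reach it. All type names below are hypothetical and not taken from the project, and the leaf classifiers are hand-written stand-ins where a real NBTree would place trained Naive Bayes models.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical types for illustration; names are not taken from the project.
interface NBTreeNode {
    String classify(Map<String, Integer> termCounts);
}

// Internal node: routes an instance by comparing one term's count to a threshold.
class SplitNode implements NBTreeNode {
    private final String term;
    private final int threshold;
    private final NBTreeNode atMost;
    private final NBTreeNode above;

    SplitNode(String term, int threshold, NBTreeNode atMost, NBTreeNode above) {
        this.term = term;
        this.threshold = threshold;
        this.atMost = atMost;
        this.above = above;
    }

    @Override
    public String classify(Map<String, Integer> termCounts) {
        int count = termCounts.getOrDefault(term, 0);
        return (count <= threshold ? atMost : above).classify(termCounts);
    }
}

// Leaf node: delegates to the classifier attached to this leaf.
// Assumption: in a real NBTree this would be a trained Naive Bayes model.
class LeafNode implements NBTreeNode {
    private final Function<Map<String, Integer>, String> leafClassifier;

    LeafNode(Function<Map<String, Integer>, String> leafClassifier) {
        this.leafClassifier = leafClassifier;
    }

    @Override
    public String classify(Map<String, Integer> termCounts) {
        return leafClassifier.apply(termCounts);
    }
}

class NBTreeSketch {
    public static void main(String[] args) {
        // Stand-in leaf classifiers; real leaves would be trained models.
        NBTreeNode tree = new SplitNode("free", 0,
                new LeafNode(c -> c.getOrDefault("match", 0) > 0 ? "Sports" : "Business"),
                new LeafNode(c -> "Spam"));
        Map<String, Integer> email = new HashMap<>();
        email.put("match", 1);
        System.out.println(tree.classify(email));
    }
}
```

The splits capture interactions between attributes that plain Naive Bayes ignores, while each leaf only needs the independence assumption to hold within its own subset of the data.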
References:
R. Kohavi.
"Scaling Up the Accuracy of Naive-Bayes Classifiers: a Decision-Tree Hybrid"
Build Information
Very important: please download "stanford-corenlp-3.3.1-models.jar" (http://nlp.stanford.edu/software/stanford-corenlp-full-2014-01-04.zip). I was not able to upload this jar because of the size limit. Download it and copy it into the lib folder. You should then be able to build this project.
There are two ways to run:
1. In Eclipse
a). Import the folder "HybridEmailClassifier" into Eclipse.
b). To do this, right-click inside the Package Explorer in Eclipse and select Import.
c). Under General, select "Existing Projects into Workspace".
d). Check the option to copy projects into the workspace.
e). Once the project is imported, right-click the project and choose "Run Configurations" under the "Run As" options.
f). In the Arguments tab, make sure you are passing "production" as the program argument.
g). Finally, run the com.anupam.hybrid.HybridMain class. This has the main method.
2. Using Ant
a). You should have Ant installed and configured.
b). Open a command prompt.
c). cd to the folder "HybridEmailClassifier".
d). Compile the source code by typing ant at the command prompt.
e). This compiles the files and generates a "build" folder.
f). Now run the program by typing ant HybridMain.