Decision Forests - Revision history

Wikiuser: /* A parallel Decision Forest */

2016-03-19T10:17:15Z

A parallel Decision Forest

← Older revision		Revision as of 10:17, 19 March 2016
Line 8:		Line 8:
	<li>extractwords.py - contains the ExtractWords class, which is used for automatic feature generation.</li>		<li>extractwords.py - contains the ExtractWords class, which is used for automatic feature generation.</li>
	<ul><li>Reads a large number of emails and finds how frequently each word appears in spam and in non-spam emails. Those two frequencies are then subtracted from each other, and the words whose differences have the largest magnitudes are used as features. (Differences with large magnitudes should mark words that are particularly spammy, or particularly not.) A new file, message.fts, is written containing the features.</li></ul>		<ul><li>Reads a large number of emails and finds how frequently each word appears in spam and in non-spam emails. Those two frequencies are then subtracted from each other, and the words whose differences have the largest magnitudes are used as features. (Differences with large magnitudes should mark words that are particularly spammy, or particularly not.) A new file, message.fts, is written containing the features.</li></ul>
			<li>id3.py - My implementation of the ID3 decision tree algorithm.</li>
			<ul>
			<li>The code is a bit cryptic and hard to read, but there are good explanations of the algorithm on the web. [https://www.youtube.com/watch?v=_XhOdSLlE5c This video] is part of a series on decision trees and gives a nice walkthrough of the ID3 algorithm.</li>
			</ul>
			<li>parallelpredict.py - Uses id3.py and serialpredict.py to learn and classify email. Splits the data into a separate set for each core in the cluster and sends each set out. Each core learns a tree on its own data. When making a prediction, each core receives the sample being classified and makes its own prediction. The class with the most votes wins. For example, if we have 25 trees where 15 think the sample is spam and 10 do not, then we classify the sample as spam.</li>
			<li>serialpredict.py - for use as a time comparison to parallelpredict.py. Learns a single decision tree and classifies samples consecutively.</li>
	</ul>		</ul>
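
The split-and-vote scheme described for parallelpredict.py can be sketched roughly as follows. This is only an illustration, not the code from the repository: the function names <code>split_into_chunks</code> and <code>majority_vote</code> are hypothetical, the round-robin split is an assumption about how the data might be divided, and the actual distribution of work across cores is left out.

```python
from collections import Counter


def split_into_chunks(samples, n_workers):
    """Deal samples round-robin into n_workers roughly equal chunks.

    Hypothetical helper: the repository may split the data differently.
    """
    return [samples[i::n_workers] for i in range(n_workers)]


def majority_vote(predictions):
    """Return the label predicted by the largest number of trees.

    On a tie, Counter.most_common keeps the label that was seen first.
    """
    return Counter(predictions).most_common(1)[0][0]


# The example from the text: 25 trees, 15 vote spam and 10 vote ham.
votes = ["spam"] * 15 + ["ham"] * 10
print(majority_vote(votes))  # prints "spam"
```

Using an odd number of trees avoids ties when there are only two classes.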

Wikiuser: /* Code on Github */

2016-03-19T10:00:49Z

Code on Github

← Older revision		Revision as of 10:00, 19 March 2016
Line 1:		Line 1:

	== ~~Code on Github~~ ==		== A parallel Decision Forest ==
	Here is a link to all of the code that I used for my email spam/ham classification algorithm - [https://github.com/brianjp93/email-classification/ github repo]		Here is a link to all of the code that I used for my email spam/ham classification algorithm - [https://github.com/brianjp93/email-classification/ github repo]

			Explanation of some files
			<ul>
			<li>emaildata.py - contains the EmailData class, whose methods are used to extract data from each individual email.</li>
			<li>extractwords.py - contains the ExtractWords class, which is used for automatic feature generation.</li>
			<ul><li>Reads a large number of emails and finds how frequently each word appears in spam and in non-spam emails. Those two frequencies are then subtracted from each other, and the words whose differences have the largest magnitudes are used as features. (Differences with large magnitudes should mark words that are particularly spammy, or particularly not.) A new file, message.fts, is written containing the features.</li></ul>
			</ul>
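
The feature generation described for extractwords.py can be sketched as below. This is a minimal illustration under stated assumptions, not the ExtractWords class itself: the function names are hypothetical, "frequency" is assumed to mean the fraction of emails that contain a word at least once, and words are split on whitespace only. Writing the result to message.fts is omitted.

```python
from collections import Counter


def word_frequencies(emails):
    """Fraction of emails in which each word appears at least once.

    Assumption: the original may count total occurrences instead.
    """
    if not emails:
        return {}
    counts = Counter()
    for text in emails:
        # set() so a word is counted at most once per email
        counts.update(set(text.lower().split()))
    return {word: n / len(emails) for word, n in counts.items()}


def select_features(spam_emails, ham_emails, n_features):
    """Pick the words whose spam/ham frequency difference is largest in magnitude."""
    spam_freq = word_frequencies(spam_emails)
    ham_freq = word_frequencies(ham_emails)
    vocabulary = set(spam_freq) | set(ham_freq)
    diff = {w: spam_freq.get(w, 0.0) - ham_freq.get(w, 0.0) for w in vocabulary}
    # Sort by descending magnitude; break ties alphabetically for stable output.
    return sorted(diff, key=lambda w: (-abs(diff[w]), w))[:n_features]


spam = ["win money now", "win a prize now"]
ham = ["meeting at noon", "lunch at noon now"]
print(select_features(spam, ham, 3))  # prints ['at', 'noon', 'win']
```

Note that "now" appears in both classes, so its difference is small and it is ranked below words that appear in only one class.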

Wikiuser: Created page with " == Code on Github == Here is a link to all of the code that I used for my email spam/ham classification algorithm - [https://github.com/brianjp93/email-classification/ github..."

2016-03-19T09:50:38Z

New page

== Code on Github ==
Here is a link to all of the code that I used for my email spam/ham classification algorithm - [https://github.com/brianjp93/email-classification/ github repo]