Data Mining
Aplstudent (talk | contribs) No edit summary |
Latest revision as of 06:13, 23 April 2016
For an example of taking a dataset and applying machine learning techniques, see https://github.com/aplstudent/Data_Anylasis/blob/master/example.R
Intro
Data science is a buzzword that means different things to different people. It can mean the use of any amount of data and data analysis in a scientific process. It is also a growing field at the intersection of machine learning, statistics, information science and business.
History as I know it
Machine learning is a subfield of artificial intelligence. It focuses on building algorithms that predict or classify outputs from a data set. The field draws on the theory of discrete math and graphs to provide a rigorous understanding and process for knowledge. Machine learning itself is part of data mining, which is finding trends and patterns in large sets of data.
Applications
As the amount of data generated by the internet has grown, machine learning has expanded enormously. Many applications that we use every day rely on these algorithms: face recognition, voice recognition, word prediction, spam detection, cancer detection, genetics, handwritten digit recognition, and text analytics.
There are three types of learning: supervised, unsupervised and reinforcement. Supervised learning tasks are regression and classification, where each example comes with a known answer. Reinforcement learning is used for problems such as trading on the stock market, where an algorithm learns from the results of its own decisions. Unsupervised learning, which looks for structure in data that has no labels, is a more advanced topic.
Overview of the process
1. Get the data
2. Explore the data
3. Create model/score model
4. Tweak things
Collecting data: There are many interesting data sets available for exploration: the UCI Machine Learning Repository, Kaggle, reddit/r/datasets, CKAN (government documents), the Library of Congress, and website APIs.
Clean the data: Decide what to do with missing values, and explore how complete the data is. Create graphs and statistics that are relevant to the variables. Find the type of each variable and convert categorical variables to numeric ones.
The model: There are lots of models out there, and they work with different degrees of success depending on the problem. The main families are linear models, clustering, networks (such as neural networks), and trees.
Evaluation: Set aside a separate portion of the data for validation (or use cross-validation) so that you can tell when the model is overfitting the data it was trained on. There are many metrics, such as accuracy, recall, and F1. It's best to choose one and stick with it.
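The cleaning steps above can be sketched in plain Python. This is only a minimal illustration, not the method used in the linked example.R: the records, the column names (`age`, `fare`, `embarked`) and the helper functions are all hypothetical, and filling gaps with the median is just one of several reasonable choices for missing values.

```python
from statistics import median

# Hypothetical toy records; None marks a missing value.
rows = [
    {"age": 22.0, "fare": 7.25, "embarked": "S"},
    {"age": None, "fare": 71.28, "embarked": "C"},
    {"age": 26.0, "fare": None, "embarked": "S"},
    {"age": 35.0, "fare": 53.10, "embarked": "Q"},
]

def completeness(rows, column):
    """Fraction of rows in which the column has a value (is not None)."""
    present = sum(1 for r in rows if r[column] is not None)
    return present / len(rows)

def fill_with_median(rows, column):
    """Return new rows with missing numeric values replaced by the column median."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = median(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]

def one_hot(rows, column):
    """Replace a categorical column with one 0/1 column per category."""
    categories = sorted({r[column] for r in rows})
    encoded = []
    for r in rows:
        new = {k: v for k, v in r.items() if k != column}
        for c in categories:
            new[f"{column}_{c}"] = 1 if r[column] == c else 0
        encoded.append(new)
    return encoded

# How complete is each column before cleaning?
age_completeness = completeness(rows, "age")

# Fill the gaps, then turn the categorical column into numeric columns.
cleaned = fill_with_median(rows, "age")
cleaned = fill_with_median(cleaned, "fare")
cleaned = one_hot(cleaned, "embarked")
```

In practice a library such as pandas does the same work in fewer lines, but the steps are the same: measure completeness, handle the gaps, then make every column numeric.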
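As a small illustration of the simplest family mentioned above, here is a sketch of a linear model with one input, fitted by ordinary least squares. The function names `fit_line` and `predict` and the toy data are made up for this example; real projects would normally use a library implementation.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both without the 1/n factor,
    # which cancels in the ratio).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(x, slope, intercept):
    """Apply the fitted line to a new input."""
    return slope * x + intercept

# Toy data that lies exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
prediction = predict(5.0, slope, intercept)
```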
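The evaluation step can be sketched as follows. The split function holds out a random fraction of the data for validation, and the metrics function computes accuracy, recall and F1 for a binary classifier. The function names, the 25% validation fraction and the example labels are assumptions chosen for illustration.

```python
import random

def train_validation_split(rows, validation_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and hold out a fraction for validation."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * validation_fraction)
    return shuffled[n_val:], shuffled[:n_val]

def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hold out a quarter of some hypothetical row ids for validation.
train_rows, validation_rows = train_validation_split(list(range(8)))

# Hypothetical true labels and model predictions on a validation set.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
```

The model should only ever be scored on rows it did not see during training; a score computed on the training rows says little about overfitting.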
What to do when struggling:
R and Python both have complete documentation, so you can find information about any function in either language. Also, if you press Tab after a dot, IPython will show you the options you can write next. In RStudio this is done automatically. There is also a very active community of users and developers, so information on everything is plentiful.
Python tutorials and blogs. If you've never touched Python, start by doing the first 20 exercises of this: [[http://learnpythonthehardway.org/book/ex1.html LPTHW]]
Then follow the SciPy primer on this wiki.
To practice Python skills: [Subreddit]
To learn machine learning: This progression of tutorials takes you from a little knowledge of NumPy to a modern neural net with all the bells and whistles. It'll make you feel amazing about what you can do with Python. First tutorial: [NeuralNet] All: [[1]]
Intro to working with datasets: https://www.kaggle.com/c/titanic
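The same information is available from inside a plain Python session. As a small sketch, `dir()` lists what can follow the dot (roughly what tab completion shows), and `inspect.getdoc()` returns the documentation text of a function. The choice of `str` and `str.split` here is only an example.

```python
import inspect

# Names that can follow the dot on a string, similar to what tab completion offers.
string_methods = [name for name in dir(str) if not name.startswith("_")]

# The documentation text for one of those methods.
split_doc = inspect.getdoc(str.split)
print(split_doc)
```

In an interactive session, `help(str.split)` shows the same text in a pager.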