Data Mining

Every field is generating huge amounts of data. Data mining has been applied in artificial intelligence, the humanities, biology, physics, marketing, and operations. Techniques from statistics, machine learning, programming, problem solving, information science, communication, and visualization have combined to form the field of data science.


The goal of this project is to explore the data mining research field and to build skills in Python and R.
Tutorials I have found helpful will be linked here, along with a series of tutorials on a variety of useful packages, hosted on GitHub.


Current research project:
Clustering and data visualization in the Chronicling America archives.
One can use the API (follow my code)
or get a bulk download of the files.
I am working on getting access to the entire collection.


Teaching modules to be added:
using APL resources to run an IPython 2 notebook
RStudio on Windows
- packages
setting up IEP and Anaconda on Windows
- packages and pip
Python and R
loading data in various formats
web scraping (HTML/XML/CSS)
the power of pandas
starting with sklearn
introduction to theano and
progression of models to make






Programming is a skill that becomes easier the longer you practice it. When I am stuck, the process I follow to build an understanding of the problem is as follows:
Start researching!
find an example of the function being used correctly by googling: "(name of package/function) example"
read a tutorial on how to use it by googling: "(name of package/function) tutorial"
read the documentation for the package by googling: "(name of package/function) documentation"
use Python help() and R ? to read the documentation from inside the environment.
Start programming!
simplify your problem until the smallest possible piece of code works
repeat the research step if new problems come up.
build the problem back up and keep checking the code until finished.
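
The step of reading documentation inside the environment can be tried with a short sketch like the one below. It is only an illustration: `summarize` is a hypothetical helper name, and `json.dumps` is just a convenient standard-library function to inspect, not something this project depends on.

```python
import inspect
import json
import pydoc


def summarize(func):
    """Return the name, call signature, and first docstring line of a function."""
    signature = inspect.signature(func)
    doc = inspect.getdoc(func) or ""
    first_line = doc.splitlines()[0] if doc else "(no docstring)"
    return f"{func.__name__}{signature}: {first_line}"


# A one-line reminder of how the function is called.
summary = summarize(json.dumps)
print(summary)

# pydoc.render_doc returns, as a string, the text that help() would display.
full_text = pydoc.render_doc(json.dumps, renderer=pydoc.plaintext)
print(full_text[:300])
```

In an interactive session, calling `help(json.dumps)` directly is usually enough; the string form is handy when you want to search the documentation text.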







Latest revision as of 06:13, 23 April 2016

To see an example of taking a dataset and applying machine learning techniques, go to https://github.com/aplstudent/Data_Anylasis/blob/master/example.R

Intro

Data science is a buzzword that means different things to different people. It can mean the use of any amount of data and data analysis in a scientific process. It is also a growing field at the intersection of machine learning, statistics, information science, and business.

History as I know it

Machine learning is a subfield of artificial intelligence. It focuses on building algorithms that predict or classify outputs from a data set. The field draws on the theory of discrete math and graphs to provide a rigorous foundation for its methods. Machine learning itself is part of data mining, which is the search for trends and patterns in large sets of data.

Applications

As the amount of data generated by the internet has grown, machine learning has undergone an explosion of activity. Many applications we use every day rely on these algorithms: face recognition, voice recognition, word prediction, spam detection, cancer detection, genetics, handwritten digit recognition, and text analytics.

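There are three types of learning: supervised, unsupervised, and reinforcement. Supervised learning tasks are regression and classification, where the model learns from labeled examples. Reinforcement learning learns from the rewards its actions receive and is used for problems like trading in the stock market. Unsupervised learning, which looks for structure such as clusters in unlabeled data, is a more advanced topic.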
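
To make the two supervised tasks concrete, here is a minimal sketch in plain Python. The function names `fit_line` and `nearest_centroid` and all of the numbers are made up for illustration; a real project would more likely reach for a library such as sklearn.

```python
from statistics import mean


def fit_line(xs, ys):
    """Regression: ordinary least squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    return slope, y_bar - slope * x_bar


def nearest_centroid(train, labels, point):
    """Classification: pick the class whose mean feature value is closest."""
    centroids = {}
    for label in set(labels):
        centroids[label] = mean(x for x, l in zip(train, labels) if l == label)
    return min(centroids, key=lambda label: abs(centroids[label] - point))


# Regression predicts a number (toy data lying on y = 2x + 1).
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)

# Classification predicts a category (toy one-feature data).
features = [1.0, 1.2, 0.8, 5.0, 5.5, 4.5]
classes = ["small", "small", "small", "large", "large", "large"]
predicted = nearest_centroid(features, classes, 4.0)
print(predicted)
```

Both functions learn from examples that come with the right answer attached, which is what makes them supervised.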

Overview of the process

Get the data
Explore the data
Create model/score model
Tweak things

Collecting data: Many interesting data sets are available for exploration: the UCI Machine Learning Repository, Kaggle, reddit/r/datasets, CKAN (government documents), the Library of Congress, and website APIs.

Clean the data: Decide what to do with missing values and explore how complete the data is. Create graphs and statistics that are relevant to the variables. Find the type of each variable and convert categorical variables to numeric ones.
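
The cleaning steps can be sketched with nothing but the standard library. The table below is toy data invented for this example, and filling gaps with the column mean is only one of several reasonable choices; pandas offers the same operations with less code.

```python
import csv
import io
from statistics import mean

# Toy data, made up for illustration: two cells are missing.
RAW = """age,city,income
34,Austin,52000
,Boston,61000
29,Austin,
41,Chicago,58000
"""

rows = list(csv.DictReader(io.StringIO(RAW)))

# 1. How complete is the data? Count empty cells per column.
missing = {col: sum(1 for r in rows if r[col] == "") for col in rows[0]}
print(missing)


def fill_with_mean(rows, col):
    """Convert a column to floats, replacing empty cells with the column mean."""
    known = [float(r[col]) for r in rows if r[col] != ""]
    fill = mean(known)
    return [float(r[col]) if r[col] != "" else fill for r in rows]


# 2. Decide what to do with missing values (here: impute the mean).
ages = fill_with_mean(rows, "age")
incomes = fill_with_mean(rows, "income")

# 3. Convert the categorical column to numeric with one-hot encoding.
cities = sorted({r["city"] for r in rows})
one_hot = [[1 if r["city"] == c else 0 for c in cities] for r in rows]
print(cities, one_hot)
```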

The model: Many models exist, and how well each works depends on the problem. Common families include linear models, clustering, neural networks, and trees.
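
As one example from the clustering family, here is a rough sketch of k-means (Lloyd's algorithm) on one-dimensional data. The name `kmeans_1d`, the points, and the hand-picked starting centers are all assumptions for illustration; library implementations handle many dimensions and choose starting centers more carefully.

```python
from statistics import mean


def kmeans_1d(points, centers, iterations=10):
    """Cluster 1-D points around caller-supplied starting centers."""
    centers = list(centers)
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        # An empty cluster keeps its old center.
        new_centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers
    return centers, clusters


final_centers, final_clusters = kmeans_1d([1, 2, 3, 10, 11, 12], centers=[0, 5])
print(final_centers, final_clusters)
```

No labels are passed in, so this is an unsupervised model: the groups come only from how the points sit relative to each other.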

Evaluation: Set aside a separate set of data to use for cross validation, so that you can detect when the model is overfitting to the training data. There are many metrics, such as accuracy, recall, and F1. It's best to choose one and stick with it.
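
The sketch below shows a simple hold-out split (a single train/test split, not full k-fold cross validation) and the metrics named above. The helper names, the 25% test fraction, and the labels are assumptions chosen for illustration.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and split off the last part for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


def scores(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for a binary classification task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hold out a quarter of the rows; the model must never see them while training.
train, test = train_test_split(list(range(8)))
print(len(train), len(test))

# Made-up true labels and predictions for the held-out rows of some model.
result = scores([1, 1, 1, 0, 0, 0, 1, 0], [1, 0, 1, 0, 1, 0, 1, 0])
print(result)
```

Accuracy can look good on unbalanced data even when the rare class is mostly missed, which is one reason to look at recall or F1 before settling on a metric.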

What to do when struggling:

R and Python both have complete documentation, so you can find information about any function in them. Also, if you press "tab" after a dot, IPython will show you the options you can write next. In RStudio this is done automatically. There is also a very active community of users and developers, so information on almost everything is plentiful.
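
Outside IPython, the built-in `dir()` gives a rough equivalent of what tab completion shows after a dot. This is a small sketch rather than a description of how IPython's completer works internally; the variable names are only for illustration.

```python
text = "data mining"

# Every attribute of the object, minus the underscore-prefixed internals.
public = [name for name in dir(text) if not name.startswith("_")]
print(public)

# Narrow the list the way typing "text.s" and pressing tab would.
matches = [name for name in public if name.startswith("s")]
print(matches)
```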


Python tutorials and blogs

If you've never touched Python, start by doing the first 20 exercises of this: [[http://learnpythonthehardway.org/book/ex1.html LPTHW]]

Then follow the SciPy primer on this wiki.

To practice Python skills: [Subreddit]

To learn machine learning: This progression of tutorials takes you from a little knowledge of NumPy to a modern neural net with all the bells and whistles. It'll make you feel amazing about what you can do with Python. First tutorial: [NeuralNet] All: [[1]]

Intro to working with datasets: https://www.kaggle.com/c/titanic