Data Mining: Difference between revisions

Revision as of 06:30, 14 April 2016

Intro

Data science is a buzzword that means different things to different people. In the broad sense it is any use of data and data analysis in a scientific process. It is also a growing field at the intersection of machine learning, statistics, information science and business.

History as I know it

Machine learning is a subfield of artificial intelligence. It focuses on building algorithms that predict or classify outputs from a data set. The field draws on the theory of discrete math and graphs to give a rigorous footing to the process of extracting knowledge. Machine learning is itself part of data mining, which is the search for trends and patterns in large sets of data.

Applications

As the amount of data generated by the internet has grown, machine learning has grown explosively with it. Many applications we use every day rely on these algorithms: face recognition, voice recognition, word prediction, spam detection, cancer detection, genetics, handwritten digit recognition and text analytics.

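There are three types of learning: supervised, unsupervised and reinforcement. Supervised learning works from labeled examples, and its tasks are regression and classification. Unsupervised learning looks for structure in data that has no labels, for example by clustering. Reinforcement learning learns from rewards through trial and error, and is used for problems like the stock market.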
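To make the difference between supervised and unsupervised learning concrete, here is a minimal sketch in plain Python. The height data, the labels and the helper names (`fit_centroids`, `predict`, `two_means`) are all made up for illustration; real projects would normally use a library instead of hand-written helpers.

```python
# Toy, made-up data: (height_cm, label) pairs for the supervised case.
labeled = [(150, "short"), (155, "short"), (160, "short"),
           (180, "tall"), (185, "tall"), (190, "tall")]

def fit_centroids(samples):
    """Supervised: use the labels to compute one mean value per class."""
    sums, counts = {}, {}
    for value, label in samples:
        sums[label] = sums.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def predict(centroids, value):
    """Classify a new value by the closest class mean."""
    return min(centroids, key=lambda label: abs(centroids[label] - value))

def two_means(values, iterations=10):
    """Unsupervised: split unlabeled values into two groups (1-D k-means, k=2).

    Assumes the values are not all identical, so neither group ends up empty.
    """
    low, high = min(values), max(values)  # initial cluster centers
    for _ in range(iterations):
        group_low = [v for v in values if abs(v - low) <= abs(v - high)]
        group_high = [v for v in values if abs(v - low) > abs(v - high)]
        low = sum(group_low) / len(group_low)
        high = sum(group_high) / len(group_high)
    return sorted(group_low), sorted(group_high)

# Supervised: learn from the labels, then label a new value.
centroids = fit_centroids(labeled)
prediction = predict(centroids, 178)

# Unsupervised: same numbers, but the labels are thrown away.
unlabeled = [value for value, _ in labeled]
clusters = two_means(unlabeled)

print(prediction, clusters)
```

The supervised half needs the labels to learn anything, while the unsupervised half recovers two groups from the numbers alone but cannot name them.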

Overview of the process

Get the data
Explore the data
Create model/score model
Tweak things
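The four steps can be sketched end to end with nothing but the standard library. This is only an illustration under stated assumptions: the data is synthetic, the model is a hand-written k-nearest-neighbours classifier, and the names `make_points`, `knn_predict` and `accuracy` are hypothetical helpers, not part of any package.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# 1. Get the data: synthetic 2-D points in two classes (made up for this sketch).
def make_points(center, label, n=50):
    return [((random.gauss(center[0], 1.0), random.gauss(center[1], 1.0)), label)
            for _ in range(n)]

data = make_points((0.0, 0.0), "a") + make_points((3.0, 3.0), "b")
random.shuffle(data)

# 2. Explore the data: count the examples per class and look at one row.
counts = {}
for _, label in data:
    counts[label] = counts.get(label, 0) + 1
print("class counts:", counts, "first row:", data[0])

# Hold out part of the data so the score reflects unseen examples.
train, test = data[:70], data[70:]

# 3. Create and score a model: k-nearest neighbours by majority vote.
def knn_predict(train_set, point, k):
    by_distance = sorted(
        train_set,
        key=lambda row: (row[0][0] - point[0]) ** 2 + (row[0][1] - point[1]) ** 2,
    )
    votes = [label for _, label in by_distance[:k]]
    return max(set(votes), key=votes.count)

def accuracy(train_set, test_set, k):
    hits = sum(knn_predict(train_set, p, k) == label for p, label in test_set)
    return hits / len(test_set)

# 4. Tweak things: try several values of k and compare the scores.
scores = {k: accuracy(train, test, k) for k in (1, 3, 5)}
print("accuracy by k:", scores)
```

In a real project the tweaking step would usually be scored on a separate validation set, so that the final test score is not influenced by the tuning.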

Collecting Data:

There are many interesting data sets available for exploration: the UCI Machine Learning Repository, Kaggle, reddit's /r/datasets, CKAN (government documents), the Library of Congress and website APIs.
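Data sets from sources like these often arrive as CSV files. As a minimal sketch of the first loading step, the example below parses CSV text with Python's standard `csv` module. The in-memory string stands in for a downloaded file, and its column names and values are made up.

```python
import csv
import io

# Stand-in for a downloaded file; the columns and values are made up.
raw_text = """name,age,survived
Alice,29,1
Bob,41,0
Carol,35,1
"""

# csv.DictReader maps each row to a dict keyed by the header line.
rows = list(csv.DictReader(io.StringIO(raw_text)))

# Every field arrives as a string, so convert numeric columns explicitly.
for row in rows:
    row["age"] = int(row["age"])
    row["survived"] = int(row["survived"])

mean_age = sum(row["age"] for row in rows) / len(rows)
print(len(rows), "rows; mean age:", mean_age)
```

For a file on disk, the same reader works on the object returned by `open(path, newline="")` in place of `io.StringIO`.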

What to do when struggling:

R and Python both have thorough documentation, so you can find information about nearly any function. Also, if you press "tab" after a dot, IPython will show you the options you can write next. In RStudio this happens automatically. There is also a very active community of users and developers, so information on almost everything is plentiful.
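The same information is available from inside Python without leaving the session. The sketch below uses `str` and `str.split` only as convenient examples; any object or function can be inspected the same way.

```python
# In an interactive session, help(str.split) prints the documentation.
# The docstring it is built from is also available as a plain attribute.
doc = str.split.__doc__ or ""
first_line = doc.strip().splitlines()[0]

# dir() lists the attribute names of an object. These are the names that
# tab completion offers after a dot; leading-underscore names are filtered out.
public_names = [name for name in dir(str) if not name.startswith("_")]

print(first_line)
print(public_names[:5])
```

In R, the rough equivalent of `help()` is typing `?` followed by the function name at the console.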


Python Tutorials and blogs

If you've never touched Python, start by doing the first 20 exercises of this: [LPTHW]

Then follow the SciPy primer on this wiki.

To practice Python skills: [Subreddit]

To learn machine learning: this progression of tutorials takes you from a little knowledge of NumPy to a modern neural net with all the bells and whistles. It will make you feel amazing about what you can do with Python. First tutorial: [NeuralNet] All: [[1]]

Intro to working with data sets: https://www.kaggle.com/c/titanic