What are some good class projects for machine learning using MapReduce?

We are looking for a (not necessarily academic) project for a class in which we are learning to implement various machine learning algorithms on the MapReduce framework using AWS. The project should meet the following criteria:
1. Low time spent on cleaning the data
2. Parallelizable across ~3 people (e.g. each person could try a different method and the results could then be combined into an ensemble)
3. Duration: ~2-3 hours per student per week, with 2-3 students per group, over 4 weeks (7 groups in total)
4. Has one or more verifiable sub-results (so that students have a way of knowing they are on the right track)
5. (Optional) Has an open-ended question that the more enthusiastic students could pursue further.

Try implementing some ML algorithms not yet covered in Apache Mahout. See "What are some important algorithms not yet covered in Mahout?" and "What are the top 10 data mining or machine learning algorithms?"

See the open items at https://cwiki.apache.org/conflue... ; you can also ask on the Mahout mailing list.

1) Matrix decomposition routines (QR, Cholesky, etc.)

2) Decision trees with ID3, C4.5, or another heuristic ( https://issues.apache.org/jira/b... ). This is one of the most popular algorithms in data mining, with countless applications.
Tutorials: "Decision Trees: What are some good resources for learning about decision trees?"

Note: It looks like Mahout has a partial implementation of random decision forests, which you may be able to use to test your code. If questions arise, ask on the Mahout mailing list; the community there is very helpful.
https://cwiki.apache.org/MAHOUT/...
https://cwiki.apache.org/MAHOUT/...
https://cwiki.apache.org/MAHOUT/...

3) Linear regression ( https://cwiki.apache.org/conflue... ), ordinary least squares, or other linear least squares methods: http://en.wikipedia.org/wiki/Ord... . Also see the MATLAB Statistics Toolbox for ideas: http://www.mathworks.com/help/to...

4) Gradient descent and other optimization and linear programming algorithms. See "Convex Optimization: What are some good resources for learning about distributed optimization?", "What are some fast gradient descent algorithms?", the MATLAB Optimization Toolbox ( http://www.mathworks.com/help/to... ), and "Convex Optimization: Which optimization algorithms are good candidates for parallelization with MapReduce?"

5) AdaBoost and other meta-algorithms: http://en.wikipedia.org/wiki/Ada...

6) SVM: https://issues.apache.org/jira/b... , https://issues.apache.org/jira/b... , https://issues.apache.org/jira/b... . See also "Support Vector Machines: What is the best way to implement an SVM using Hadoop?"

7) Vector space models: http://en.wikipedia.org/wiki/Vec...

8) Hidden Markov Models, an extremely popular method in NLP and bioinformatics. See "Hidden Markov Models: What are some good resources for learning about Hidden Markov Models?" and https://issues.apache.org/jira/b... , https://issues.apache.org/jira/b... , http://www.mendeley.com/c/424264...

9) Slope One by Daniel Lemire ( http://en.wikipedia.org/wiki/Slo... ) or other collaborative filtering algorithms. See Mahout in Action by Sean Owen: http://www.manning.com/owen/

10) DFT/FFT, wavelets, the z-transform, and other popular signal and image processing transforms. See the MATLAB Signal Processing Toolbox: http://www.mathworks.com/help/to... , Image Processing Toolbox: http://www.mathworks.com/help/to... , and Wavelet Toolbox: http://www.mathworks.com/help/to... . Also see the OpenCV catalog: http://opencv.willowgarage.com/w...

11) PageRank. Here is a good tutorial: http://michaelnielsen.org/blog/u...

12) Build an eigensolver: http://www.cs.cmu.edu/~ukang/pap...

13) For a wealth of open-ended problems, see "Programming Challenges: What are some good "toy problems" in data science?"
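
To give a feel for item 3, here is a minimal single-machine sketch of how ordinary least squares might be phrased as a map step and a reduce step. It is an illustration under stated assumptions, not Mahout code: the names map_shard, combine, solve and fit are hypothetical, the data is a made-up toy set generated from y = 1 + 2x, and on Hadoop each shard would be handled by a separate mapper. The idea is that every mapper emits partial sums of X^T X and X^T y for its shard, the reducer adds them up, and the small final system is solved on one machine.

```python
from functools import reduce


def map_shard(rows):
    """Mapper (hypothetical name): rows are (features, target) pairs.
    Emits this shard's partial X^T X and X^T y."""
    d = len(rows[0][0])
    xtx = [[0.0] * d for _ in range(d)]
    xty = [0.0] * d
    for x, y in rows:
        for i in range(d):
            xty[i] += x[i] * y
            for j in range(d):
                xtx[i][j] += x[i] * x[j]
    return xtx, xty


def combine(a, b):
    """Reducer (hypothetical name): element-wise sum of two partial results."""
    (xtx_a, xty_a), (xtx_b, xty_b) = a, b
    xtx = [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(xtx_a, xtx_b)]
    xty = [p + q for p, q in zip(xty_a, xty_b)]
    return xtx, xty


def solve(a, b):
    """Solve a w = b by Gauss-Jordan elimination with partial pivoting.
    Assumes the matrix is non-singular."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit(shards):
    """Run the map step on every shard, reduce, then solve the normal equations."""
    xtx, xty = reduce(combine, map(map_shard, shards))
    return solve(xtx, xty)


# Toy data from y = 1 + 2x, split into two shards.
# The first feature is a constant 1.0 so the first weight is the intercept.
shards = [
    [([1.0, 0.0], 1.0), ([1.0, 1.0], 3.0)],
    [([1.0, 2.0], 5.0), ([1.0, 3.0], 7.0)],
]
weights = fit(shards)
print(weights)
```

Because the partial sums simply add, the result does not depend on how the rows are sharded, which gives students an easy sub-result to verify against a single-machine fit.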

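For item 11, the sketch below shows one way the PageRank iteration could be split into a map step and a reduce step. It runs in a single process, with a dictionary standing in for the shuffle phase. The function names, the three-page graph, the damping factor of 0.85 and the iteration count are all illustrative assumptions, and the code assumes every page has at least one outgoing link (no dangling nodes).

```python
from collections import defaultdict


def map_page(page, rank, out_links):
    """Mapper (hypothetical name): split a page's rank evenly among its out-links."""
    yield page, 0.0  # make sure every page reaches the reducer
    for target in out_links:
        yield target, rank / len(out_links)


def reduce_page(contributions, n_pages, damping):
    """Reducer (hypothetical name): incoming contributions plus the teleport term."""
    return (1.0 - damping) / n_pages + damping * sum(contributions)


def pagerank(graph, damping=0.85, iterations=50):
    """Repeat the map/reduce pass a fixed number of times.
    Assumes every page in graph has at least one out-link."""
    n = len(graph)
    ranks = {page: 1.0 / n for page in graph}
    for _ in range(iterations):
        grouped = defaultdict(list)  # stands in for the shuffle phase
        for page, links in graph.items():
            for key, value in map_page(page, ranks[page], links):
                grouped[key].append(value)
        ranks = {p: reduce_page(grouped[p], n, damping) for p in graph}
    return ranks


# Made-up three-page graph: a links to b and c, b links to c, c links to a.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
print(ranks)
```

Under these assumptions the ranks sum to 1 after every iteration, which is the kind of verifiable sub-result the question asks for.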