What are some good class projects for machine learning using MapReduce?
2. Parallelizable to ~3 people (e.g., each person could try a different method and the results could then be combined into an ensemble)
3. Duration: ~2-3 hours/(week * student) * 2-3 students/group * 4 weeks * 7 groups
4. Have verifiable sub-results (so that students have a way of knowing they are on the right track)
5. (Optional) Have an open-ended question that could be pursued by the more enthusiastic folks.
Try implementing some ML algorithms not yet covered in Apache Mahout. See "What are some important algorithms not yet covered in Mahout?" and "What are the top 10 data mining or machine learning algorithms?"
See open items: https://cwiki.apache.org/conflue... ; you can also ask on the Mahout mailing list.
1) Matrix decomposition routines (QR, Cholesky, etc.)
- Numerical Recipes: http://www.nr.com/
- Matrix factorization algorithms: http://bickson.blogspot.com/2011...
2) Decision Trees with ID3, C4.5 or another heuristic (https://issues.apache.org/jira/b... ). This is one of the most popular algorithms in data mining, with countless applications.
Tutorials: "Decision Trees: What are some good resources for learning about decision trees?"
Note: It looks like Mahout has a partial implementation of random decision forests; you may be able to use it to test your code (if questions arise, please ask on the Mahout mailing list; the community there is very helpful):
https://cwiki.apache.org/MAHOUT/...
https://cwiki.apache.org/MAHOUT/...
https://cwiki.apache.org/MAHOUT/...
3) Linear Regression ( https://cwiki.apache.org/conflue... ), Ordinary Least Squares or other linear least squares methods: http://en.wikipedia.org/wiki/Ord... ; also see the Matlab Statistics Toolbox for ideas: http://www.mathworks.com/help/to...
4) Gradient Descent and other optimization and linear programming algorithms. See "Convex Optimization: What are some good resources for learning about distributed optimization?", "What are some fast gradient descent algorithms?", the Matlab Optimization Toolbox: http://www.mathworks.com/help/to... , and "Convex Optimization: Which optimization algorithms are good candidates for parallelization with MapReduce?"
5) AdaBoost and other meta-algorithms: http://en.wikipedia.org/wiki/Ada...
6) SVM: https://issues.apache.org/jira/b... , https://issues.apache.org/jira/b... , https://issues.apache.org/jira/b... , and "Support Vector Machines: What is the best way to implement an SVM using Hadoop?"
7) Vector space models: http://en.wikipedia.org/wiki/Vec...
8) Hidden Markov Models, an extremely popular method in NLP and bioinformatics. See "Hidden Markov Models: What are some good resources for learning about Hidden Markov Models?" and https://issues.apache.org/jira/b... , https://issues.apache.org/jira/b... , http://www.mendeley.com/c/424264...
9) Slope One by Daniel Lemire: http://en.wikipedia.org/wiki/Slo... or other Collaborative Filtering algorithms. See Mahout in Action by Sean Owen: http://www.manning.com/owen/
10) DFT/FFT, wavelets, z-transform, and other popular signal and image processing transforms. See the Matlab Signal Processing Toolbox: http://www.mathworks.com/help/to... , Image Processing Toolbox: http://www.mathworks.com/help/to... , and Wavelet Toolbox: http://www.mathworks.com/help/to... ; also see the OpenCV catalog: http://opencv.willowgarage.com/w...
11) PageRank. Here is a good tutorial: http://michaelnielsen.org/blog/u...
12) Build an eigensolver: http://www.cs.cmu.edu/~ukang/pap...
13) For a wealth of open-ended problems see "Programming Challenges: What are some good "toy problems" in data science?"
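To give a feel for how one of these maps onto map/reduce, here is a rough sketch of item 3 (ordinary least squares). The idea is that each mapper emits the partial sums X^T X and X^T y for its shard of the data, and the reducer adds them up and solves the normal equations. This is plain in-memory Python, not Mahout or Hadoop code; the function names (`map_shard`, `reduce_partials`, `solve`, `fit`) and the toy data are made up for illustration, and the small Gaussian elimination routine is only there to keep the example self-contained.

```python
from functools import reduce

def map_shard(rows):
    """Mapper (hypothetical name): rows is a list of (features, target) pairs.
    Returns this shard's partial X^T X and X^T y."""
    d = len(rows[0][0])
    xtx = [[0.0] * d for _ in range(d)]
    xty = [0.0] * d
    for x, y in rows:
        for i in range(d):
            xty[i] += x[i] * y
            for j in range(d):
                xtx[i][j] += x[i] * x[j]
    return xtx, xty

def reduce_partials(a, b):
    """Reducer: element-wise sum of two partial (X^T X, X^T y) results."""
    xtx = [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a[0], b[0])]
    xty = [p + q for p, q in zip(a[1], b[1])]
    return xtx, xty

def solve(a, b):
    """Solve a * w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (m[i][n] - s) / m[i][i]
    return w

def fit(shards):
    """Map over shards, reduce the partial sums, then solve for the weights."""
    xtx, xty = reduce(reduce_partials, map(map_shard, shards))
    return solve(xtx, xty)

# Toy data generated from y = 1 + 2*x; the leading 1.0 is the bias feature.
shards = [
    [([1.0, 0.0], 1.0), ([1.0, 1.0], 3.0)],
    [([1.0, 2.0], 5.0), ([1.0, 3.0], 7.0)],
]
weights = fit(shards)  # approximately [1.0, 2.0]
```

Because the partial sums simply add, the result does not depend on how the rows are split across shards, which makes a handy verifiable sub-result: the sharded fit should match the sequential one on the same data.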
Notes:
- See Jimmy Lin's book Data-Intensive Text Processing with MapReduce for some good tips: http://www.umiacs.umd.edu/~jimmy... and Tom White's great book on Hadoop: http://www.hadoopbook.com/
- Map-Reduce for Machine Learning on Multicore by Chu et al.: www-cs.stanford.edu/~ang/papers/...
- Muthu Muthukrishnan's MapReduce resources: http://www.cs.rutgers.edu/~muthu...
- Top 10 algorithms in data mining: http://www.mendeley.com/research...
- Large Data Logistic Regression (with example Hadoop code): http://www.win-vector.com/blog/2...
- A Comparison of Eight MapReduce Languages: http://www.dataspora.com/2011/04...
- Seven data-mining algorithms which are 200-400x faster on GPUs: http://www.smedirector.com/2010/... via Michael E Driscoll
- RecLab Core by Darren Erik Vengroff: http://code.richrelevance.com/re...
- Amund Tveit's links: http://atbrox.com/2011/05/16/map...
- Jeff Hammerbacher's links: http://www.mendeley.com/groups/1...
- An MR bibliography I compiled a while back: http://www.columbia.edu/~ak2834/...
- Scaling up machine learning: http://www.cs.umass.edu/~ronb/sc...
- Machine Learning: What are some good learning projects to teach oneself about machine learning?
- Implement the sequential version first, then parallelize with Hadoop, one of the alternatives (see "What are some promising open-source alternatives to Hadoop MapReduce for map/reduce?"), or a self-made runtime; always abstract the MR logic away from the DFS. One of your teams could build a simple MapReduce engine; we did this for a term project (using an experimental language called X10) and it was fun.
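As a starting point for the "simple MapReduce engine" idea, here is a minimal single-process sketch in Python. It is not how Hadoop works internally and not the X10 project mentioned above; `run_mapreduce` and the word-count functions are hypothetical names chosen for illustration. The point is only that the mapper and reducer never touch storage: they receive records and emit key/value pairs, so the same logic could later be handed to a real runtime.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Run one map/shuffle/reduce pass over an in-memory iterable of records."""
    groups = defaultdict(list)
    for record in records:                  # map phase
        for key, value in mapper(record):
            groups[key].append(value)       # shuffle: group values by key
    # reduce phase: one reducer call per distinct key
    return {key: reducer(key, values) for key, values in groups.items()}

def word_count_mapper(line):
    """Emit (word, 1) for every whitespace-separated word in a line."""
    for word in line.lower().split():
        yield word, 1

def word_count_reducer(word, counts):
    """Sum the counts collected for one word."""
    return sum(counts)

lines = ["map reduce map", "reduce shuffle"]
counts = run_mapreduce(lines, word_count_mapper, word_count_reducer)
print(counts)
```

Here `records` is just a list, but it could equally be lines read from a local file or a distributed file system, since the engine only needs something it can iterate over.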