wiki:SummerOfCodeIdeas

Ideas for Google Summer of Code 2014

These are a list of ideas compiled by mlpack developers; they range from simpler code maintenance tasks to difficult machine learning algorithm implementation, which means that there are suitable ideas for a wide range of student abilities and interests. The necessary knowledge: sections can often be replaced with "willing to learn" for the easier projects, and for some of the more difficult problems, a full understanding of the description statement and some coding knowledge is sufficient.

In addition, these are not all of the possible projects which could be undertaken; new, interesting ideas are also suitable.

At the bottom are a list of projects from 2013 that aren't really applicable to this year, or were done.

For more information on any of these projects, visit #mlpack in freenode (IRC) or email  mlpack@cc.gatech.edu (see also  http://lists.cc.gatech.edu/mailman/listinfo/mlpack ).

If you are looking for how to get started with mlpack, see the  tutorials page and try to compile and run simple mlpack programs. Then, you could look at the  list of bugs and maybe you can find an easy and interesting bug to solve.

For an application template, see the  Melange page.

Implement tree types

description: mlpack focuses heavily on dual-tree algorithms, which each rely on a type of space tree (data structure). These tree types include kd-trees, ball trees, cover trees, vantage point trees, R trees, R* trees, metric trees, principal axis trees, k-means trees, random projection trees, Bregman ball trees, UB trees, R+ trees, Hilbert trees, X trees, segment trees, interval trees, range trees, and others. However mlpack only really implements kd-trees and cover trees (and arguably ball trees although support isn't very stable). This project involves implementing other tree types for mlpack. A good project will select a handful of tree types and implement them (with tests and documentation) over the course of the summer. The trees must be implemented according to  this example interface so that they work with any type of dual-tree algorithm. In addition, this project could possibly contain a research component -- benchmarking runtimes of dual-tree algorithms when different types of trees are used.

difficulty: 3/10

deliverable: Implemented tree types and proof (via tests) that the tree types work with mlpack's dual-tree algorithms

necessary knowledge: Data structures, C++, knowledge of memory management

relevant tickets: #190, #285, #282, #234, #235, #301

Improvement of automatic benchmarking system (collaboration with Shogun)

description: Last year, the  automatic benchmarking system was created as a Summer of Code project and there was sufficient interest from other libraries (including Shogun) to move it to  its own Github repository. However, in some ways, the system could be improved: it only runs on one system, it does not restrict CPU resources (so that different benchmarking systems can perform the same), and it does not compare the accuracy of implementations. Any of these three problems could be approached and the system could be improved, so that a centralized benchmarking system that worked across many libraries could be set up. This would be useful not just to mlpack but the greater machine learning community as a whole. In some ways this project idea is open-ended, and potential students can suggest their own improvements to the system. As mentioned in the title this is a joint project with the shogun guys, so take a look at their  description as well.

difficulty: 5/10

deliverable: an improved benchmarking system (improved in the ways laid out in the student's proposal)

necessary knowledge: familiarity with Python, a basic knowledge of machine learning techniques and a willingness to learn about other machine learning techniques.

relevant tickets: none are open at this time

More diverse build slaves (OS X and Windows)

description: mlpack has an automatic build server (Jenkins) running at  http://big.cc.gt.atl.ga.us:8080/ and also has two Windows 7 build slaves (x32 and x64). However, Jenkins is not configured properly to build mlpack on Windows. Therefore, the correct Jenkins build rules for Windows systems need to be determined and put in place, so that each revision of mlpack can be built and tested on Windows. In addition, a build server for OS X needs to be set up. The hardware can be obtained but the installation of necessary packages and setup will need to be done.

difficulty: 5/10

deliverable: Windows and OS X build slaves integrated into Jenkins builds

necessary knowledge: Windows familiarity, Visual Studio, some C++ (for debugging compilation errors)

relevant tickets: none are open at this time

Profiling for further optimization

description: mlpack could run even faster if it used profiling information during compilation. This entails adding extra build steps to the build process: first, to run a subset of mlpack programs with profiling information, and second, to rebuild mlpack using that profiling information. The difficulty here is choosing datasets which reflect general characteristics of the datasets which users will run mlpack methods with. If an unusual dataset is chosen, then mlpack will be compiled to run very quickly on that type of dataset -- but it will not run as well on other datasets. Another issue here is that the profiling process should not take extremely long (that is, longer than an hour or so), so we cannot choose a *huge* variety of large datasets.

deliverable: a 'make profile' build option which performs the profiling as described above

difficulty: 6/10

necessary knowledge: some machine learning familiarity, some CMake scripting

relevant tickets: #20

Automatic bindings

description: A better alternative to writing bindings by hand would be to have some script which processed machine learning method information and automatically generated bindings for MATLAB, Python, R, and perhaps other languages. SWIG, unfortunately, is not really an option for this. This project is somewhat exploratory and may not end up producing results if nothing is available, so this project will be research-intensive. In all likelihood the project will involve building a binding generator which can take in mlpack executable code (such as src/mlpack/methods/neighbor_search/allknn_main.cpp) and can produce a binding to another language that gives the same options as the command-line interface.

deliverable: automatic binding generators for mlpack methods which are easy to maintain

difficulty: 6/10

necessary knowledge: a working knowledge of what bindings are and a good ability to do research on software

relevant tickets: #208

Simulated annealing optimizer

description: mlpack implements a couple of optimizers: L-BFGS, the augmented Lagrangian algorithm, and LRSDP (these are found in src/mlpack/core/optimizers/). Often these optimizers can be plugged into various machine learning algorithms. Simulated annealing, or the Metropolis-Hastings algorithm, could be used as an optimizer for these tasks (although it may be slow); it can avoid local minima, which is useful.

deliverable: implemented simulated annealing optimizer which fits with all existing mlpack algorithms that use optimizers

difficulty: 7/10

necessary knowledge: a working knowledge of optimization techniques and simulated annealing (see the paper in Nature and maybe Hajek's convergence proof, `Cooling schedules for optimal annealing'), a good knowledge of C++ (and templates), and familiarity with the general API design of mlpack

relevant tickets: none open at this time

Fast k-centers Algorithm & Implementation

description: The k-centers problem is a fundamental part of many machine learning algorithms (See Cortes & Scott, "Scalable Sparse Approximation of a Sample Mean", 2013 for a recent use of it). The basic problem is: given a set of points and integer k, choose k points that minimize the maximum distance between a point not in the set and it's nearest neighbor in the set. I think the ideas underlying the Dual-Tree Boruvka MST algorithm in MLPACK could be extended to provide an efficient implementation for this problem. This one is more research / algorithm oriented, but would definitely require some implementation to evaluate the idea.

deliverable: working, tested k-centers implementation; paper comparing it to basic Gonzales algorithm and other implementations

difficulty: 7/10, but biased by being really related to my (Bill's) thesis work

necessary knowledge: some understanding of geometry and spatial data structures, willingness to dive into some literature on the topic, basic C++

relevant tickets: none open at this time

AdaBoost implementation

description: AdaBoost is a well-known boosting algorithm (or ensemble algorithm) that uses an ensemble of weak classifiers to build a strong classifier. This would be useful to have in mlpack, but mlpack does not have very many weak classifiers. So, this project will entail implementing some weak classifiers, then implementing the AdaBoost algorithm in a way that is able to use any weak classifier (or set of weak classifiers) in mlpack. This means there will be some API design: the exact API of an mlpack weak classifier will need to be developed and documented.

deliverable: implemented AdaBoost algorithm with a few weak classifiers, and tests to verify their correctness

difficulty: 8/10

necessary knowledge: a knowledge of the AdaBoost algorithm, familiarity with the mlpack codebase and design decisions, familiarity with C++ and templates

relevant tickets: none open at this time

Collaborative filtering package improvements

description: Collaborative filtering is a popular technique used in recommender systems. As of the end of GSoC 2013, mlpack now has a collaborative filtering implementation by Mudit Raj Gupta; however, it only uses NMF as a decomposition method. This project would entail researching other fast algorithms for collaborative filtering (such as QUIC-SVD) and implementing them in mlpack according to existing mlpack coding standards, and also in such a way that they can be used with the existing collaborative filtering framework. See also the cf page, which details Mudit's work from last year.

deliverable: working, tested collaborative filtering implementation with extensive documentation

difficulty: 8/10

necessary knowledge: in-depth C++ knowledge, machine learning familiarity, familiarity with reading research whitepapers

relevant tickets: #306

Improvement of tree traversers

description: Many of mlpack's machine learning methods are dual-tree algorithms. These dual-tree algorithms are abstracted in such a way that they all use the same tree traversal methods. If these traversers could be improved, this would result in runtime gains for all tree-based mlpack methods. This requires a lot of abstract thought in very weird ways and often debugging and profiling tree traversers is fraught with sadness.

deliverable: demonstrably improved runtime for tree-based methods

difficulty: 9/10

necessary knowledge: very in-depth C++ knowledge and understanding of tree structures; familiarity with machine learning methods is helpful

relevant tickets: #243

LMNN with LRSDP implementation

description: mlpack has a working LRSDP (low-rank semidefinite program) implementation, which gives faster solutions to SDPs. This could give good speedups for existing SDP-based machine learning methods, such as LMNN (large margin nearest neighbor). This project should only entail the expression of LMNN as a low-rank SDP and then the implementation should be straightforward because the mlpack LRSDP API is straightforward. The difficulty is still quite high, though, because debugging LRSDPs is complex.

deliverable: working, tested LMNN implementation with extensive documentation

difficulty: 9/10

necessary knowledge: in-depth C++ knowledge, understanding of semidefinite programs; familiarity with LMNN is helpful

relevant tickets: none open at this time

Fixes to MVU and low-rank semidefinite programs

description: This project is not for the faint of heart. For some time now, MVU (maximum variance unfolding), a dimensionality reduction technique, has not been converging even on simple datasets. mlpack's implementation of MVU uses LRSDP (low-rank semidefinite programs); there is not another existing implementation of MVU+LRSDP. Many tears will be shed trying to debug this. A good approach will probably be to compare mlpack MVU results on exceedingly simple problems with other MVU implementation results. The final outcome of this project may not even be a successful converging algorithm but instead more information on what is going wrong.

deliverable: working MVU implementation, or, further details and information on the problem

difficulty: 10/10

necessary knowledge: understanding of convexity and duality, knowledge of semidefinite programs, incredible determination and perseverance

Old ideas for GSoC 2013

Refinement of MATLAB bindings

description: Currently, there are MATLAB bindings contributed by Patrick Mason, but they are not comprehensive; there are some mlpack methods which do not have MATLAB bindings. This project entails developing the rest of the mlpack MATLAB bindings and standardizing them. Detailed knowledge of the machine learning methods is not necessary for this project -- only a willingness to read some documentation. Regardless, someone working on this project can expect to, over the course of the project, become at least somewhat familiar with state-of-the-art machine learning methods and how they work. Each binding consists of a MATLAB script which provides documentation and handles input, and a MEX file which passes the input to mlpack methods.

deliverable: MATLAB bindings for all mlpack methods (about twenty) with standardized APIs

difficulty: 2/10

necessary knowledge: MATLAB familiarity and some C++

relevant tickets: #252, #208

why not 2014? Realistically, the automatic bindings project is far more useful and interesting, and reduces the maintenance load of bindings significantly.

Python bindings and R bindings for mlpack

description: Similar to the MATLAB bindings, Python and R bindings would be useful for mlpack users who don't know C++ or would prefer to use Python or R to do large-scale machine learning. This project entails developing bindings for Python and R which have consistent API and documentation. Detailed knowledge of the machine learning methods is not necessary -- only a willingness to read some documentation. Regardless, someone working on this project can expect to, over the course of the project, become at least somewhat familiar with state-of-the-art machine learning methods and how they work. Because no previous work has been done on these bindings, an investigation into options for bindings and the necessary CMake configuration to build the scripts.

deliverable: Python bindings or R bindings (or both) for all mlpack methods (about twenty) with standardized APIs

difficulty: 3/10

necessary knowledge: Python knowledge, R knowledge, and some C++; CMake knowledge would be helpful but is not necessary

relevant tickets: #208

why not 2014? Like the MATLAB bindings, this project is superseded by automatic bindings.

Automatic benchmarking of mlpack methods

description: For widespread adoption of mlpack to happen, it is very important that relevant and up-to-date benchmarks are available. This project entails writing support scripts which will run mlpack methods on a variety of datasets (from small to large) and produce runtime numbers. The benchmarking scripts will also run the same machine learning methods from other machine learning libraries (see CompetingLibraries) and then produce runtime graphs. This can be integrated into Jenkins so that benchmarks are auto-generated nightly, which could be very helpful in informing developers which of their changesets have caused speedups or slowdowns.

deliverable: an automated benchmarking system built into Jenkins

difficulty: 7/10

necessary knowledge: bash scripting (or other scripting languages; Python could work), some machine learning familiarity

relevant tickets: #19, #94

why not 2014? This project was completed as part of GSoC 2013 bv Marcus Edel. It's very cool, see  http://www.mlpack.org/benchmark.html .

Packaging in Debian (and Ubuntu)

description: mlpack is packaged in Fedora and RHEL, but currently not in Debian due to the stringent packaging requirements. To complete this project, mlpack would need to be properly packaged for Debian and Ubuntu and submitted in the proper manner. This would involve a package review process and probably would result in the project's undertaker becoming a Debian contributor or Debian developer. Because this process is generally somewhat slow and depends on other people's input, it may be best combined with another simple, low-difficulty project.

deliverable: mlpack in Debian or Ubuntu repositories (or at least an in-motion process)

difficulty: 2/10 but includes a lot of reading documentation

necessary knowledge: Linux familiarity, knowledge of how the open-source community works, and shell scripting

relevant tickets: #100, #101, #102

why not 2014? This is far too little work for a good GSoC project. It could be part of another project, though.