mlpack IRC logs, 2018-06-09

Logs for the day 2018-06-09 (starts at 0:00 UTC) are shown below.

June 2018
--- Log opened Sat Jun 09 00:00:55 2018
02:24 -!- vivekp [~vivek@unaffiliated/vivekp] has quit [Ping timeout: 245 seconds]
04:45 -!- vivekp [~vivek@unaffiliated/vivekp] has joined #mlpack
05:49 -!- ImQ009 [~ImQ009@unaffiliated/imq009] has joined #mlpack
06:56 -!- ImQ009 [~ImQ009@unaffiliated/imq009] has quit [Ping timeout: 240 seconds]
07:25 -!- ImQ009 [~ImQ009@unaffiliated/imq009] has joined #mlpack
08:10 -!- manish7294 [8ba78995@gateway/web/freenode/ip.] has joined #mlpack
08:21 -!- travis-ci [] has joined #mlpack
08:21 < travis-ci> manish7294/mlpack#22 (lmnn - 7c6c8a4 : Manish): The build has errored.
08:21 < travis-ci> Change view :
08:21 < travis-ci> Build details :
08:21 -!- travis-ci [] has left #mlpack []
08:31 < manish7294> rcurtin: I have a bit different observation here. I agree that, the recalculation of arma::find and arma::unique is inefficient but I don't think it is the major troublemaker. I have implemented the precalculation as you advised, but it didn't have much affect. I have made quite a number of run by imposing timer on almost each and every part. Here is the observation:
09:22 -!- manish7294 [8ba78995@gateway/web/freenode/ip.] has quit [Quit: Page closed]
09:31 -!- travis-ci [] has joined #mlpack
09:31 < travis-ci> manish7294/mlpack#23 (lmnn - 65920ee : Manish): The build passed.
09:31 < travis-ci> Change view :
09:31 < travis-ci> Build details :
09:31 -!- travis-ci [] has left #mlpack []
09:55 < jenkins-mlpack> Project docker mlpack nightly build build #344: STILL UNSTABLE in 2 hr 41 min:
11:27 -!- wenhao [731bc65e@gateway/web/freenode/ip.] has joined #mlpack
13:20 -!- sulan_ [] has joined #mlpack
16:15 -!- witness_ [uid10044@gateway/web/] has joined #mlpack
16:19 -!- travis-ci [] has joined #mlpack
16:19 < travis-ci> mlpack/mlpack#5041 (mlpack-3.0.2 - 0ace41c : Ryan Curtin): The build passed.
16:19 < travis-ci> Change view :
16:19 < travis-ci> Build details :
16:19 -!- travis-ci [] has left #mlpack []
16:36 < rcurtin> manish7294: right, ok. what were the results of each of those timers?
16:36 < rcurtin> note also, in my runs, I used 5k points randomly selected from covertype, not the whole dataset
16:55 -!- manish7294 [8ba78995@gateway/web/freenode/ip.] has joined #mlpack
16:58 < manish7294> rcurtin: Sorry, I don't have accurate data I got on full covertype but approximately it was - computing_neighbors - 5 hrs 30 mins, whole impostors() - 7 hrs 10 mins, the cost and gradient calculation part - 7 hrs 10 mins, total - 13 hrs 30 mins, optimizer - sgd, k =12, batch size = 500
16:59 < manish7294> I also made run on first 5k points of covertype-small dataset and timings were more or less similar.
17:16 < rcurtin> ah, I did it with k=3, let's see if k=12 produces something more similar to yours
17:18 < rcurtin> actually, I realize I did it with default k, which is 1
17:20 < rcurtin> right, now I see results similar to yours
17:20 < rcurtin> but even so, with my timings, 21.6s is spent in Impostors() but only 5s is spent in the computing_neighbors timer
17:21 < rcurtin> I know we are on different systems so the optimizations could be somewhat different. also I am using the results before precalculation... so let's update and see what it does
17:21 < manish7294> Please try putting a timer over knn train and search.
17:21 < manish7294> and a timer over whole impostors
17:21 < rcurtin> that should be equivalent to tree_building + computing_neighbors, which was 0.126s + 5.33s, but yes, let me do that
17:25 < rcurtin> oh, interesting, now that timer is basically the same as the whole impostors timer
17:25 < manish7294> Ya, that's what I am hung up on
17:25 < rcurtin> hmm. it seems like maybe there is something that is not being timed by NeighborSearch, let me take a quick glance
17:25 < manish7294> sure
17:26 < rcurtin> ah! I see what it is
17:26 < rcurtin> if you look in neighbor_search_impl.hpp, the 'tree_building' timer is started _only_ for the query tree building in the Search() method
17:26 < rcurtin> so the building of the reference tree in Train() is not counted
17:27 < rcurtin> so then, it seems like a big bottleneck is the continual re-training of the reference tree
17:27 < rcurtin> do you remember this idea of using a tree with a MahalanobisDistance metric that I wrote a bit about? it seems like that could be useful here
17:28 < rcurtin> either that, or simply reducing the number of times we call Impostors() could be helpful
17:28 < manish7294> maybe we can do both
17:29 < rcurtin> right, I think that could also be helpful
17:29 < rcurtin> for acceleration of the rest of EvaluateWithGradient(), I think that by caching the results from Impostors(), we could avoid the calls to metric.Evaluate()
17:30 < rcurtin> I suspect that will be a lot of the runtime
17:30 < rcurtin> and I think that would be a fairly simple optimization
17:30 < manish7294> for reducing impostors call, does the earlier bounding idea sounds reasonable --- mentioned on PR
17:30 < manish7294> We can try that
17:31 < rcurtin> I believe that your idea will work, my only concern is that the overhead of calculating || (M_{t + 1} - M_t) p_i ||^2 for each p_i might cost more than just performing the search
17:32 < rcurtin> I guess if we keep oldCoordinates, it is just || p_i(t - 1) - p_i(t) ||^2 (sorry for the confusing notation, I wanted to use both i and t)
17:32 < rcurtin> but I do think it could provide some speedup, and maybe would not be too hard to implement. if you want to try and see what happens, I believe it would be worthwhile
17:32 < manish7294> correct
17:32 < rcurtin> I did not have time yesterday to finish writing my bound, unfortunately
17:33 < rcurtin> but I think that the bound I am coming up with could be used in tandem with your idea
17:33 < rcurtin> basically I want to calculate || M_{t + 1} - M_t ||^2 (or something like this that does not depend on the points), and if that quantity is small enough we skip the entire Impostors() step
17:33 < rcurtin> however, it would definitely be possible to do this check (once I figure out the bound), and if the check fails, then your idea can be done to reduce the number of points we need to find neighbors for
17:33 < rcurtin> so both could work in tandem
17:34 < rcurtin> do you think that would sound reasonable?
17:34 < manish7294> Right, I got that feeling from your idea but couldn't see beyond that
17:34 < manish7294> ya sure
17:35 < manish7294> and did you find something about divergence?
17:37 < rcurtin> not yet, I don't have any solution there. we could try adding a regularization parameter, but I haven't seen any issues with the implementation itself
17:37 < rcurtin> I'd like to take a look at the MATLAB solution and see how they solved the problem (or avoided it)
17:38 < rcurtin> I see, they have an adaptive learning rate:
17:38 < rcurtin> % Update learning rate
17:38 < rcurtin> if prev_C > C
17:38 < rcurtin> eta = eta * 1.01;
17:38 < rcurtin> else
17:38 < rcurtin> eta = eta * .5;
17:38 < rcurtin> end
17:38 < manish7294> Here sgd fails
17:39 < rcurtin> so it's a little like momentum... a little. if the step gives an improvement, increase step size by a factor of 1.01; if the step makes things bad, halve the step size
17:39 < rcurtin> I wonder if some of the cool optimizers Marcus has implemented could be used in place of SGD
17:39 < rcurtin> there are a couple in there with adaptive learning rates if I remember right
17:40 < manish7294> zoq: Please help out with optimizers having adaptive learning rate.
17:41 < manish7294> as you said earlier, we can try penalty term too.
17:41 < rcurtin> I think BigBatchSGD could be one way to try
17:41 < rcurtin> agreed, I think a penalty term is no problem
17:42 < rcurtin> if you'd like to try that, go ahead---basically just add || L ||^2 to the objective
17:42 < rcurtin> that or b * || L ||^2 for some user-specified parameter b, then you could play around with it and see if you find a good value
17:42 < manish7294> I think gradient will also be having some change
17:42 < rcurtin> right, that will be just + b*L
17:43 < manish7294> Okay I will try with these ideas.
17:45 < rcurtin> yeah, maybe it will help, let me know how it turns out :)
17:46 < manish7294> In one word - awesome :)
17:47 < manish7294> great results without any divergence at step size 50
17:47 < manish7294> on iris
17:47 < manish7294> which was earlier diverging on even 0.1
17:47 < rcurtin> ah, that is with the regularization penalty?
17:48 < manish7294> yes
17:48 < manish7294> tried with vc2 too
17:49 < rcurtin> great, that is nice to hear. I expect maybe with some larger datasets there may need to be some amount of tuning of how much regularization to apply
17:49 < rcurtin> hmm, I just realized, another thing we could do is not apply a regularization but force a normalization
17:50 < rcurtin> it gives us identical KNN results if we use some transformation matrix L, or if we use 2*L
17:50 < rcurtin> thus, it would be reasonable to, after each step, simply normalize the matrix L so that ||L||^2 = 1
17:50 < rcurtin> but I think the regularization you are doing now is just fine, it accomplishes roughly the same thing
17:50 < manish7294> that sounds promising
17:51 < manish7294> will try that too
17:54 < rcurtin> the MATLAB implementation does not seem to normalize the learned matrix
17:55 < rcurtin> but like I said, I think it is a minor issue. the penalty should be another completely sufficient way of doing it, and the runtime cost of computing either b*|| L ||^2 or b*L should be pretty insignificant
17:55 < rcurtin> (as compared to the rest of the calculation)
18:04 < manish7294> This may sound stupid but I made a deadly mistake, I didn't notice lbfgs flag was set to true this whole time in above runs and got our hopes high. Sorry for that, I should be more careful :(
18:04 < rcurtin> no worries, but it sounds like the regularization works for L-BFGS
18:04 < rcurtin> I would expect that the amount of penalty to apply (the 'b') would need to be tuned
18:05 < manish7294> no, I hust made change in batch evaluatewithgradient
18:05 < manish7294> moreover lbfgs doesn't suffer from this problem
18:06 -!- killer_bee[m] [killerbeem@gateway/shell/] has quit [Ping timeout: 256 seconds]
18:06 < manish7294> *just
18:07 -!- prakhar_code[m] [prakharcod@gateway/shell/] has quit [Ping timeout: 260 seconds]
18:07 < manish7294> maybe I should try normalization.
18:12 -!- vivekp [~vivek@unaffiliated/vivekp] has quit [Ping timeout: 264 seconds]
18:20 -!- manish7294 [8ba78995@gateway/web/freenode/ip.] has quit [Quit: Page closed]
18:41 -!- manish7294 [8ba78995@gateway/web/freenode/ip.] has joined #mlpack
18:46 -!- sulan_ [] has quit [Read error: Connection reset by peer]
18:47 -!- sulan_ [] has joined #mlpack
18:59 < manish7294> rcurtin: Hopefully this time I didn't make any mistake, normalizing coordinates matrix by diving it with its norm helped avoiding the divergence. But it comes up with the issue on how to update coordinates, currently I made a change in sgd itself for testing purpose and I think optimization time has also increased a bit.
19:07 -!- manish7294 [8ba78995@gateway/web/freenode/ip.] has quit [Quit: Page closed]
20:18 -!- sulan_ [] has quit [Quit: Leaving]
20:33 -!- prakhar_code[m] [prakharcod@gateway/shell/] has joined #mlpack
20:47 -!- ImQ009 [~ImQ009@unaffiliated/imq009] has quit [Quit: Leaving]
20:55 -!- witness_ [uid10044@gateway/web/] has quit [Quit: Connection closed for inactivity]
21:08 -!- killer_bee[m] [killerbeem@gateway/shell/] has joined #mlpack
--- Log closed Sun Jun 10 00:00:57 2018