mlpack IRC logs, 2018-07-03

Logs for the day 2018-07-03 (starts at 0:00 UTC) are shown below.

July 2018
--- Log opened Tue Jul 03 00:00:30 2018
01:10 -!- vivekp [~vivek@unaffiliated/vivekp] has joined #mlpack
03:01 -!- vivekp [~vivek@unaffiliated/vivekp] has quit [Ping timeout: 256 seconds]
03:06 -!- vivekp [~vivek@unaffiliated/vivekp] has joined #mlpack
04:02 -!- vivekp [~vivek@unaffiliated/vivekp] has quit [Ping timeout: 260 seconds]
04:05 -!- vivekp [~vivek@unaffiliated/vivekp] has joined #mlpack
07:14 -!- ImQ009 [~ImQ009@unaffiliated/imq009] has joined #mlpack
09:30 -!- witness_ [uid10044@gateway/web/] has joined #mlpack
10:21 < jenkins-mlpack> Project docker mlpack nightly build build #368: SUCCESS in 3 hr 7 min:
13:25 -!- sumedhghaisas [68842d55@gateway/web/freenode/ip.] has joined #mlpack
13:29 < ShikharJ> sumedhghaisas: You were right, TensorFlow took about 4.5 hours on multiple cores (and 11 hours on single core aggregate).
13:30 < sumedhghaisas> ShikharJ: wow...
13:30 < sumedhghaisas> that big a difference?
13:31 < ShikharJ> sumedhghaisas: Because it was utilizing up to 8 threads at a time.
13:31 < sumedhghaisas> Maybe we should parallelize our operations as well
13:32 < rcurtin> I suspect OpenBLAS will do a pretty good job of that, but I guess we have to see the timing to know
13:32 < sumedhghaisas> ShikharJ: How long does our code take currently?
13:33 < ShikharJ> sumedhghaisas: I have only tested on a single core, and that takes about 6.5 ~ 7 hours. I have just tmux'd a build with OpenBLAS. I'll monitor that and let you know.
13:33 < sumedhghaisas> rcurtin: True. Is OpenBLAS a library or a framework? I mean, does it give us a framework to parallelize, or does it take care of everything on its own?
13:33 < ShikharJ> rcurtin: Yeah, OpenBLAS is running well currently; the load average is not as high as with TensorFlow, but still.
13:34 < rcurtin> is this on one of the build systems?
13:34 < rcurtin> ah, yeah, I see it is running on savannah
13:34 < rcurtin> you're right, it looks like it is not parallelizing perfectly
13:35 < ShikharJ> rcurtin: Yeah, the load average that I saw for TensorFlow ranged from 2.5 to 2.9 ~ 3.0, but mostly in the lower range.
13:35 < rcurtin> sumedhghaisas: OpenBLAS is just a BLAS replacement that uses OpenMP internally, so if you are doing big matrix multiplications it will parallelize
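[Editor's note: a minimal sketch of the loop-level parallelism that a BLAS replacement like OpenBLAS applies internally to large matrix multiplications, as rcurtin describes above. The `ParallelGemm` helper is hypothetical and purely illustrative, not mlpack or OpenBLAS code; the `#pragma omp` line is a no-op unless compiled with OpenMP enabled.]

```cpp
#include <cassert>
#include <vector>

// Illustrative sketch (hypothetical helper, not real OpenBLAS code) of how a
// BLAS library can parallelize GEMM: each output row of C = A * B is
// independent, so rows can be computed by different threads.
// A, B, C are row-major n x n matrices.
void ParallelGemm(const std::vector<double>& A,
                  const std::vector<double>& B,
                  std::vector<double>& C,
                  const int n)
{
  #pragma omp parallel for  // Ignored without -fopenmp; then runs serially.
  for (int i = 0; i < n; ++i)
  {
    for (int j = 0; j < n; ++j)
    {
      double sum = 0.0;
      for (int k = 0; k < n; ++k)
        sum += A[i * n + k] * B[k * n + j];
      C[i * n + j] = sum;
    }
  }
}
```

This is why the parallelism comes "for free" when the workload is dominated by big multiplications, and why it helps little when individual matrices are small.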
13:35 < rcurtin> ShikharJ: another note (and I can kill it if you want), I just started the benchmark checkout job on that system so it builds a handful of libraries for the benchmarking system
13:36 < ShikharJ> rcurtin: Which process is that?
13:36 < rcurtin> actually let me kill it
13:37 < ShikharJ> rcurtin: I'll restart the build then?
13:37 < rcurtin> ok, stopped now... I think the effect on your runtime will be small (like 1-5% at most, probably less since it only went on for like a minute)
13:37 < rcurtin> nah, no need, I think
13:37 < rcurtin> another idea for parallelism: last year a different Shikhar (Bhardwaj) implemented a parallel SGD variant; it might be possible to use that here
13:38 < rcurtin> only problem is, that worked best with objective functions where the gradient is sparse, so it might not work here
13:39 < rcurtin> but when I think about it, I guess it should not be too hard to parallelize the FFN class... I wonder if OpenMP tasks could be used for something like this
13:39 < ShikharJ> rcurtin: But we'll need to test on batches of size 50, to make the comparison. I'm not sure if ParallelSGD would take batch sizes into consideration.
13:39 < rcurtin> ShikharJ: you're right, it only supports a batch size of 1 right now
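[Editor's note: a hypothetical sketch of the HOGWILD!-style lock-free update behind the parallel SGD variant mentioned above. The names (`SparseGradient`, `ParallelSgdStep`) are illustrative, not mlpack's actual API. It also shows the two caveats raised in the discussion: each loop iteration applies one sample's gradient (effectively batch size 1), and the lock-free writes are only safe when gradients are sparse so threads rarely touch the same coordinates.]

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// One sample's sparse gradient: only a few coordinates are nonzero.
struct SparseGradient
{
  std::vector<std::size_t> indices;  // Coordinates touched by this sample.
  std::vector<double> values;        // Gradient values at those coordinates.
};

// Lock-free parallel SGD step: threads update shared weights without
// synchronization, relying on gradient sparsity to keep conflicts rare.
void ParallelSgdStep(std::vector<double>& weights,
                     const std::vector<SparseGradient>& grads,
                     const double stepSize)
{
  #pragma omp parallel for  // Ignored without -fopenmp; then runs serially.
  for (long s = 0; s < static_cast<long>(grads.size()); ++s)
  {
    // Effectively batch size 1: one sample's gradient per iteration.
    for (std::size_t k = 0; k < grads[s].indices.size(); ++k)
      weights[grads[s].indices[k]] -= stepSize * grads[s].values[k];
  }
}
```

With dense gradients (as in a GAN's fully-connected layers) every thread would write every coordinate, and the unsynchronized updates would clobber each other, which is why rcurtin notes it "might not work here".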
13:42 < zoq> We could save some time if we implement the EvaluateWithGradient function.
13:43 < zoq> That way we could save another Forward pass in each iteration.
13:44 < ShikharJ> rcurtin: From what I see on htop, big matrix multiplications make up only a small part of the total time, so I don't think using OpenBLAS would be of much benefit here. The load average is barely above 1.
13:45 < rcurtin> zoq: you're right, that could be a big savings
13:45 < rcurtin> ShikharJ: I think you are right; as the batch size / dimensionality gets larger, the matrix multiplications will get larger also, but perhaps at this size you are right, OpenBLAS is only marginally helpful
13:49 < ShikharJ> rcurtin: Yeah, parallelizing the FFN class sounds like a good plan, but I'm not sure how long that would take. I'll look into zoq's idea of implementing the EvaluateWithGradient function today.
13:52 < zoq> ShikharJ: Implementing the EvaluateWithGradient function should be straightforward, we can use the current Gradient function and return the loss.
13:53 < ShikharJ> zoq: Yeah, do we need separate Evaluate and Gradient functions if we have a single EvaluateWithGradient? I think not, but I'm not sure.
13:54 -!- vivekp [~vivek@unaffiliated/vivekp] has quit [Ping timeout: 248 seconds]
13:55 < zoq> ShikharJ: If a class doesn't implement EvaluateWithGradient, the optimizer will combine the Evaluate and Gradient functions; since we call the Evaluate function in the Gradient step anyway, we don't need an extra Evaluate call.
13:55 < zoq> ShikharJ: In the case of the FFN class, we need both functions, since in some cases you just want to perform the forward step.
13:56 < zoq> If EvaluateWithGradient is implemented, it will be prioritized over the Evaluate/Gradient combination.
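[Editor's note: a toy sketch of why EvaluateWithGradient saves a forward pass per iteration, as zoq explains above. `ToyFunction` and its members are illustrative, not mlpack's FFN class or its optimizer API; the objective is simply f(x) = x^2, with a counter standing in for the expensive forward pass through the network.]

```cpp
#include <cassert>

// Toy objective f(x) = x^2, with a counter standing in for the cost of a
// forward pass. Evaluate and Gradient each need their own forward pass, so
// an optimizer calling both per iteration pays twice; the combined
// EvaluateWithGradient pays once.
struct ToyFunction
{
  int forwardPasses = 0;  // Counts forward passes to show the savings.

  double Forward(const double x) { ++forwardPasses; return x * x; }

  // Loss only (e.g. when you just want the forward step).
  double Evaluate(const double x) { return Forward(x); }

  // Gradient only; computing it still requires a forward pass.
  double Gradient(const double x)
  {
    Forward(x);
    return 2.0 * x;
  }

  // One forward pass yields both the loss and the gradient.
  double EvaluateWithGradient(const double x, double& gradient)
  {
    const double loss = Forward(x);
    gradient = 2.0 * x;
    return loss;
  }
};
```

An optimizer that needs both loss and gradient each iteration does two forward passes via separate Evaluate + Gradient calls, but only one via EvaluateWithGradient, which is the saving zoq describes; the separate functions are kept because some optimizers use each of the three independently.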
13:56 -!- vivekp [~vivek@unaffiliated/vivekp] has joined #mlpack
13:57 < ShikharJ> zoq: I see. I'll get back on this later today :)
13:58 < zoq> ShikharJ: Great, excited to see the timings afterwards.
14:36 -!- travis-ci [] has joined #mlpack
14:36 < travis-ci> mlpack/mlpack#5215 (master - c6f9db4 : Ryan Curtin): The build has errored.
14:36 < travis-ci> Change view :
14:36 < travis-ci> Build details :
14:36 -!- travis-ci [] has left #mlpack []
15:05 -!- witness_ [uid10044@gateway/web/] has quit [Quit: Connection closed for inactivity]
15:49 < ShikharJ> zoq: Should I remove Evaluate and Gradient functions if I have implemented EvaluateWithGradient?
15:51 < zoq> ShikharJ: Let us keep both for now.
16:02 < rcurtin> yeah, some optimizers will make use of any of those three functions separately
16:21 < ShikharJ> zoq: This was a great idea, I can already see the reduction in the number of function calls.
16:27 < zoq> ShikharJ: This definitely affects the train time.
16:28 < zoq> ShikharJ: So thanks for taking a closer look into this one as well.
16:48 -!- vivekp [~vivek@unaffiliated/vivekp] has quit [Read error: Connection reset by peer]
16:50 < ShikharJ> zoq: I have pushed the changes as a new commit in WGAN PR. Can you review them? Once I can ascertain they are correct, I'll tmux a build.
16:51 -!- vivekp [~vivek@unaffiliated/vivekp] has joined #mlpack
17:03 < zoq> ShikharJ: Just took a quick look, and this looks good to me. I will have to check the other lines as well, but I guess this is ready to be tested.
17:05 < ShikharJ> zoq: So do you think I should cancel the current OpenBLAS build? Or should I let it run, and then run the build with the current code to compare?
17:07 < zoq> ShikharJ: Would be interesting to get the OpenBLAS results as well, but your choice.
17:08 < ShikharJ> zoq: Also, do you think I can split the gan_impl.hpp file into gan_impl.hpp, wgan_impl.hpp and wgangp_impl.hpp?
17:08 < zoq> ShikharJ: I think that is reasonable, perhaps create another folder?
17:08 < ShikharJ> zoq: The last two would have the Evaluate, Gradient and EvaluateWithGradient functions of their respective classes?
17:09 < ShikharJ> zoq: Yes that would make sense.
17:09 < zoq> ShikharJ: Yeah, I think that would improve the readability.
17:10 < ShikharJ> zoq: I'll let the build run, just to see how much time we can save from using OpenBLAS. Though I don't expect it to be much, maybe 20~30 minutes. But that would also give us a concrete time to improve upon.
17:12 < zoq> ShikharJ: Agreed, even 30 minutes for free (we don't have to change anything) would be good to have.
18:12 < ShikharJ> rcurtin: Did you spawn a benchmark job recently?
18:13 < ShikharJ> rcurtin: There seems to be a jenkins process running, that is taking up all the cores.
18:32 < rcurtin> ShikharJ: I thought I killed it!
18:33 < rcurtin> hmm, the benchmark job is not doing anything
18:33 < rcurtin> the build appears to be taking place in /home/ShikharJ/, are you sure it's not something you were running?
18:34 < ShikharJ> rcurtin: I am running a new build now, but back then there were up to 6 jobs running under the jenkins name.
18:35 < rcurtin> I am not sure what it would have been; I restarted the benchmark checkout job, however I immediately killed it on savannah, so there should not have been any issue
18:36 < rcurtin> let me just mark the system offline so that Jenkins doesn't try to use it
18:36 < rcurtin> let me know when you're done with the runs and I'll bring it back online... hopefully there are no more problems, sorry about that
18:37 < ShikharJ> rcurtin: Yeah, this build should finish in less than 6.5 hours, so after that period.
18:39 -!- travis-ci [] has joined #mlpack
18:39 < travis-ci> mlpack/mlpack#5220 (master - 255bc42 : Ryan Curtin): The build has errored.
18:39 < travis-ci> Change view :
18:39 < travis-ci> Build details :
18:39 -!- travis-ci [] has left #mlpack []
19:01 -!- ImQ009 [~ImQ009@unaffiliated/imq009] has quit [Quit: Leaving]
23:11 -!- witness_ [uid10044@gateway/web/] has joined #mlpack
--- Log closed Wed Jul 04 00:00:31 2018