mlpack IRC logs, 2018-07-02

Logs for the day 2018-07-02 (starts at 0:00 UTC) are shown below.

--- Log opened Mon Jul 02 00:00:28 2018
00:12 -!- witness_ [uid10044@gateway/web/] has quit [Quit: Connection closed for inactivity]
03:03 -!- witness_ [uid10044@gateway/web/] has joined #mlpack
04:41 -!- lozhnikov [] has quit [Ping timeout: 268 seconds]
04:43 -!- lozhnikov [] has joined #mlpack
06:36 -!- ImQ009 [~ImQ009@unaffiliated/imq009] has joined #mlpack
07:02 -!- witness_ [uid10044@gateway/web/] has quit [Quit: Connection closed for inactivity]
09:05 < Atharva> sumedhghaisas: I made a VAE model and tried to train it on MNIST. I had some doubts, when will you be free?
09:13 < ShikharJ> zoq: I have updated the PR again, rcurtin, I would appreciate if you could please review it as well, as we are refactoring the design of the GAN module.
10:19 < jenkins-mlpack> Project docker mlpack nightly build build #367: SUCCESS in 3 hr 5 min:
11:56 < sumedhghaisas> Atharva: Hi Atharva
11:56 < sumedhghaisas> The last meeting I have today will end at 17:00 BST.
11:56 < sumedhghaisas> Will it be possible after that?
11:57 < Atharva> sumedhghaisas: Yeah sure Sumedh.
11:57 < sumedhghaisas> also is the NormalDistribution error solved?
11:58 < Atharva> Yes :), I pushed the latest code.
11:58 < sumedhghaisas> Atharva: Great. Just out of curiosity, what was the problem with SoftPlus?
11:59 < Atharva> It was due to the fact that the approximate Jacobian was calculated w.r.t. the standard deviation and logProbBackward was w.r.t. pre standard deviation. I tried perturbing pre standard deviation and the test passed.
11:59 < sumedhghaisas> ahh.... Yes.
12:00 < Atharva> Although, I had to add some functions.
12:01 < sumedhghaisas> Good catch :)
12:03 < Atharva> You telling me to try it without softplus worked :)
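The mismatch described above can be reproduced with a small numerical gradient check. The following is a minimal sketch in plain Python, not the mlpack test itself: it assumes a scalar normal distribution whose standard deviation is `softplus(pre)`, and all function names here are hypothetical. The finite-difference gradient taken by perturbing `pre` matches the chain-rule gradient, while the one taken by perturbing the standard deviation directly does not.

```python
import math

def softplus(pre):
    # std = log(1 + exp(pre)) keeps the standard deviation positive.
    return math.log1p(math.exp(pre))

def sigmoid(pre):
    # Derivative of softplus with respect to its input.
    return 1.0 / (1.0 + math.exp(-pre))

def log_prob(x, mean, std):
    # Log density of a scalar normal distribution.
    return -0.5 * math.log(2.0 * math.pi) - math.log(std) - 0.5 * ((x - mean) / std) ** 2

def dlogp_dstd(x, mean, std):
    # Analytic gradient with respect to the standard deviation itself.
    return -1.0 / std + (x - mean) ** 2 / std ** 3

def dlogp_dpre(x, mean, pre):
    # Chain rule: gradient with respect to the value fed into softplus.
    return dlogp_dstd(x, mean, softplus(pre)) * sigmoid(pre)

def numeric_grad(f, v, eps=1e-6):
    # Central finite difference, the scalar analogue of an approximate Jacobian.
    return (f(v + eps) - f(v - eps)) / (2.0 * eps)

x, mean, pre = 0.3, 0.0, 0.5
# Perturbing the pre standard deviation agrees with the backward pass.
num_pre = numeric_grad(lambda p: log_prob(x, mean, softplus(p)), pre)
# Perturbing the standard deviation measures a different quantity.
num_std = numeric_grad(lambda s: log_prob(x, mean, s), softplus(pre))

print(num_pre, dlogp_dpre(x, mean, pre), num_std)
```

The two numerical gradients differ by the factor `sigmoid(pre)`, which is why a check that perturbs the wrong variable fails even when the backward pass is correct.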
12:14 < rcurtin> ShikharJ: I'll see if I can make time to take a look
12:28 < ShikharJ> rcurtin: Sure, no hurry :)
14:02 < Atharva> zoq: Are you there?
14:09 < Atharva> zoq: Sorry to bother, I got that figured out. :)
14:15 -!- vivekp [~vivek@unaffiliated/vivekp] has joined #mlpack
14:23 -!- vivekp [~vivek@unaffiliated/vivekp] has quit [Ping timeout: 276 seconds]
14:25 -!- vivekp [~vivek@unaffiliated/vivekp] has joined #mlpack
14:49 -!- wenhao [731bcd84@gateway/web/freenode/ip.] has joined #mlpack
15:40 -!- wenhao [731bcd84@gateway/web/freenode/ip.] has quit [Ping timeout: 260 seconds]
16:32 < sumedhghaisas> Atharva: Hey Atharva
16:32 < sumedhghaisas> you there?
16:32 < Atharva> Yes
16:32 < Atharva> Hi Sumedh
16:32 < sumedhghaisas> great! :)
16:33 < sumedhghaisas> so... what up?
16:33 < sumedhghaisas> hows it going?
16:33 < Atharva> I put everything together in a local branch and started building a model.
16:34 < Atharva> For some reason, the reconstruction loss isn't decreasing at all
16:34 < Atharva> It just keeps fluctuating around some value, which is negative.
16:34 < Atharva> So that leads to another doubt
16:34 < Atharva> We have used log probability as the error, should we be using negative log probability?
16:40 < sumedhghaisas> hmmm... okay for that I have to take a look at ReconstructionLoss :)
16:40 < sumedhghaisas> give me a minute
16:43 < sumedhghaisas> aha
16:43 < sumedhghaisas> Atharva: return dist->LogProbability(std::move(target));
16:43 < sumedhghaisas> yes... here it should be negative
16:44 < sumedhghaisas> loss = Negative Log Likelihood
16:45 < sumedhghaisas> Also is there a PR where I can see the code you are using to train?
16:45 < Atharva> sumedhghaisas: Okaie, and also the corresponding changes in `Backward()` function
16:45 < sumedhghaisas> yup
16:45 < Atharva> No, haven't pushed that yet
16:45 < sumedhghaisas> no worries
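The sign change discussed above can be illustrated with a small sketch. It is written in plain Python rather than against the mlpack `ReconstructionLoss` code, assumes an independent normal distribution per dimension, and uses hypothetical function names. The forward pass returns the negative log probability, and the backward pass returns the gradient of that negated quantity, so a gradient descent step lowers the loss.

```python
import math

def log_probability(target, mean, std):
    # Sum of per-dimension normal log densities.
    return sum(
        -0.5 * math.log(2.0 * math.pi) - math.log(s) - 0.5 * ((t - m) / s) ** 2
        for t, m, s in zip(target, mean, std)
    )

def reconstruction_loss(target, mean, std):
    # Loss = negative log likelihood, so a better fit gives a smaller value.
    return -log_probability(target, mean, std)

def reconstruction_loss_backward(target, mean, std):
    # Gradient of the negative log likelihood with respect to the mean.
    return [(m - t) / s ** 2 for t, m, s in zip(target, mean, std)]

target = [0.0, 1.0]
std = [1.0, 1.0]
mean = [2.0, -1.0]

loss_before = reconstruction_loss(target, mean, std)
grad = reconstruction_loss_backward(target, mean, std)
# One gradient descent step on the mean with a step size of 0.1.
mean_after = [m - 0.1 * g for m, g in zip(mean, grad)]
loss_after = reconstruction_loss(target, mean_after, std)

print(loss_before, loss_after)
```

If the forward pass returned the log probability without the minus sign while the optimizer still minimized it, the same step would push the mean away from the target.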
16:46 < sumedhghaisas> Also could you rebase the ReconstructionLoss PR on the NormalDistribution one?
16:46 < sumedhghaisas> maybe I have already mentioned this... if I have ignore this one :)
16:46 < Atharva> I already did :)
16:47 < sumedhghaisas> ahh... great
16:48 < Atharva> Another thing, first I wasn't using `Sequential<>` layers for the encoder and decoder. Now I am but am facing some issues.
16:48 < rcurtin> ShikharJ: I had a question for you, I've been seeing comments that now we are able to train a GAN in ~7 hours or so, and the impression that I get is that this is pretty fast
16:48 < rcurtin> on the whole MNIST dataset that is
16:49 < rcurtin> do you happen to know how long the same kind of training would take with other toolkits like TensorFlow or CNTK or whatever else?
16:49 < sumedhghaisas> Atharva: ahh sorry, I wanted you to rebase NormalDistribution on latest master and Reconstruction loss on latest NormalDistribution rather than merging master in reconstruction loss
16:49 < rcurtin> if you don't it's ok, of course, but I would love to be able to tell people "mlpack is really fast even when only on the CPU, you can train GANs very quickly", but I want to make sure I am not saying something wrong :)
16:50 < Atharva> Oh sorry, but I did rebase the Reconstruction loss branch on the NormalDistribution PR branch. I haven't merged master in reconstruction loss
16:50 < sumedhghaisas> rcurtin, ShikharJ: 7 hours? huh? what kind of GAN is this?
16:50 < Atharva> Maybe I am doing something wrong with git here.
16:50 < ShikharJ> rcurtin: The O'Reilly blog mentions the network taking about a full day on a desktop CPU using TensorFlow. We can still expect about a 1.5x to 2.0x speedup if we take a desktop environment into consideration.
16:50 < rcurtin> sumedhghaisas: I have no idea, I don't know the details, I've only seen some posts and comments on IRC
16:51 < ShikharJ> sumedhghaisas: Just the basic GAN.
16:51 < ShikharJ> rcurtin: I should mention, that with the Stochastic method, this still takes about 3 days for the model to converge.
16:52 < sumedhghaisas> ShikharJ: How many layers in generative and discriminative modules? Maybe we can estimate how much time tensorflow takes
16:52 < sumedhghaisas> Atharva: yes... git still amazes me sometimes :)
16:52 < Atharva> rcurtin: What additional tags do I need to compile a file with to see debugging symbols? I built mlpack with DEBUG and ARMA_EXTRA_DEBUG on.
16:53 < ShikharJ> rcurtin: The basic implementation and the number of layers is more or less the same as in the O'Reilly implementation.
16:54 < rcurtin> ShikharJ: ok, thanks for the clarification. I am not a GAN expert but that gives me enough to work with to understand the comparison
16:54 < Atharva> sumedhghaisas: I didn't merge master into ReconstructionLoss PR, can you tell me what led you to think the same so that I can correct my mistake. What I did was I rebased the Recon loss PR branch on the Normal Dist PR branch.
16:54 < ShikharJ> rcurtin: A certain reason behind the speedup may also be because we're choosing big hyperparameters for the vanilla GAN implementation, so our network learns faster.
16:54 < rcurtin> Atharva: that should be everything you need
16:55 < sumedhghaisas> Atharva: I saw the commit history here
16:55 < sumedhghaisas>
16:55 < Atharva> rcurtin: So when compiling a sample file, say vae.cpp, I don't need to use any extra tags other than lmlpack and larmadillo?
16:56 < rcurtin> ShikharJ: right, that makes sense. it could be interesting to do an exact comparison at some point, but I don't think it is very high priority
16:56 < rcurtin> TensorFlow uses Eigen internally, whereas Armadillo uses whatever BLAS replacement is available, so on the CPU, I would not be surprised if OpenBLAS can outperform Eigen, and this would be a big part of the speedup
16:56 < ShikharJ> rcurtin: Also, we have little to no copying in our routines, which utilize basic matrices. Tensorflow, as I remember, only takes 4-d tensors, which might be slow, especially if copying is there.
16:56 < rcurtin> plus maybe our implementation has less overhead because it is less complex, but I don't know how much overhead will factor into it
16:56 < rcurtin> right, the copying can be super painful if that's happening
16:57 < rcurtin> Atharva: you'll need to make sure that you compiled mlpack itself with -DDEBUG=ON and -DARMA_EXTRA_DEBUG=ON
16:57 < rcurtin> and then when you run, e.g., g++ -o vae vae.cpp ...
16:57 < rcurtin> you'll want to do it as
16:57 < sumedhghaisas> rcurtin: Also, tensorflow has less lazy evaluation than Armadillo
16:57 < Atharva> sumedhghaisas: Yes, the first 7 commits on that page are from the Normal Dist PR which appear here because I rebased on it. Only the last two commits are from Recon Loss PR
16:57 < rcurtin> g++ -g -DDEBUG -DARMA_EXTRA_DEBUG -o vae vae.cpp -lmlpack -larmadillo
16:58 < Atharva> rcurtin: Thanks!
16:59 < sumedhghaisas> Atharva: Yes, but the last commit is the Merge from master into ReconstructionLoss
16:59 < rcurtin> sure, hope it helps, let me know if there are any issues :)
16:59 < Atharva> sumedhghaisas: Yes, it was because I had to resolve some merge conflicts.
17:01 < Atharva> sumedhghaisas: Sorry for the silly confusion.
17:01 < sumedhghaisas> rcurtin: A quick question only if you have time, Do you think if we use Armadillo internal lazy classes in our forward passes, we would get a speedup? Cause we actually do not need to evaluate the expression until the last layer
17:01 < ShikharJ> sumedhghaisas: How can we estimate the time to be taken by Tensorflow from the layers?
17:02 < rcurtin> sumedhghaisas: I guess it could be possible, it would depend a lot on the network. if the network was just a bunch of chained linear layers I don't think we'd get any speedup
17:02 < rcurtin> since Armadillo would have to do all those multiplications sequentially anyway
17:02 < sumedhghaisas> Atharva: no worries, for linear history prefer rebase where-ever possible.
17:02 < sumedhghaisas> Yeah, especially non-linearities will give huge speedup
17:03 < ShikharJ> rcurtin: I'm practically free from my planned goals, so if you want me to benchmark it, I can try.
17:03 < rcurtin> hmm, are you sure it would? if I write, e.g., arma::log(A * B), there's not much speedup there that Armadillo could give us
17:03 < rcurtin> I guess, internally Armadillo could avoid a temporary matrix C = A * B for that
17:04 < rcurtin> but I'm not sure if it does; I think Armadillo does a good job of avoiding a lot of temporaries, but there are a lot of optimizations that could be done
17:04 < sumedhghaisas> ShikharJ: I train a lot of VAEs for prototyping, and that too on MNIST. So maybe I can estimate if you can give me the number of layers, and what kind of layers. :)
17:04 < rcurtin> ShikharJ: sure, up to you. if we can come up with some post showing that mlpack is faster than other toolkits on the CPU, this is an exciting result (although, admittedly, most people want to use the GPU these days)
17:05 < sumedhghaisas> rcurtin: arma::log(A * B) evaluated lazily would save a copy, for 1000 iterations, 1000 copies
17:06 < ShikharJ> rcurtin: Then there's people like me who are students and only have a decent GPU, who can't afford AWS or GCP :P
17:06 < ShikharJ> sumedhghaisas: Take a look here:
17:06 < rcurtin> right, definitely, but if the cost of the copy is O(N^2) and the multiplication is O(N^3), that limits the speedup we could have, so it may be more minor than we might hope
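The bound described above can be made concrete with rough operation counts. This is a back-of-the-envelope sketch, not a measurement: it assumes square N x N matrices, about 2N^3 floating-point operations for the product, N^2 element moves for the temporary, and (crudely) that a flop and an element move cost the same. The function name is hypothetical.

```python
def lazy_speedup_bound(n):
    # Approximate cost of multiplying two n x n matrices.
    multiply_cost = 2 * n ** 3
    # Approximate cost of copying one n x n temporary.
    copy_cost = n ** 2
    # Best-case speedup if the copy is eliminated entirely.
    return (multiply_cost + copy_cost) / multiply_cost

for n in (10, 100, 1000):
    print(n, lazy_speedup_bound(n))
```

Under these assumptions the bound is 1 + 1/(2N), so it shrinks toward 1 as the matrices grow. Real costs depend on memory bandwidth and allocation, which this count ignores.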
17:06 < Atharva> rcurtin: If we use bandicoot for Armadillo, how fast does it get? Does the speed increase proportionally to the speed increase of say tensorflow from cpu to gpu?
17:07 < ShikharJ> sumedhghaisas: The blog was written in mid 2017, so I don't think a lot of difference in computing power has occurred in that time.
17:07 < rcurtin> Atharva: I would hope. there is still a lot of work to be done on bandicoot, and right now it wraps clblas, not nvidia's cublas
17:07 < rcurtin> so I think that it will not be as fast as cublas, but I am hoping to work with Conrad in the upcoming months on finishing the library and providing support for cublas
17:08 < rcurtin> that said, I am not sure of the speed difference between cublas and clblas
17:08 < sumedhghaisas> ShikharJ: So 4 layer feedforward network? how much time they say tensorflow takes? I think it should take less than 7 hours...
17:09 < ShikharJ> rcurtin: Is there a benchmark utility in mlpack that I can use? I sure have read somewhere of a benchmark architecture in mlpack.
17:09 < ShikharJ> sumedhghaisas: "If you want to run this code yourself, prepare to wait: it takes about 3 hours on a fast GPU, but could take ten times that long on a desktop CPU."
17:09 < sumedhghaisas> ShikharJ: ahh wait... I use CPU optimized tensorflow
17:09 < ShikharJ> Directly from the blog
17:09 < rcurtin> ShikharJ: is the project
17:10 < rcurtin> basically you write a Python script to run whatever you're planning to benchmark and extract certain metrics like runtime, accuracy, etc. from it
17:10 < sumedhghaisas> ShikharJ: no way... For some code, especially feed forward layers I have noticed GPU takes longer than CPU
17:10 < rcurtin> it would take a little while to write a full script for the GANs, so talk with Marcus and decide if it's something you want to do, otherwise I think it would be interesting just to compare with a simple run or two
17:10 < sumedhghaisas> ShikharJ: Thats due to GPU positioning overhead
17:11 < sumedhghaisas> Atharva: Sorry you were asking me some doubt about SequentialLayer?
17:11 < ShikharJ> sumedhghaisas: Hmm, interesting.
17:12 < sumedhghaisas> ShikharJ: My estimate, assuming you build tensorflow on your machine with optimizations, which is fair since MLPack is also built, it shouldn't take way more than 3 hours
17:15 < sumedhghaisas> rcurtin: Surely the speedup from CPU to GPU is much more than lazy evaluation
17:15 < ShikharJ> sumedhghaisas: I think the best way to check would be to run it in the same environment as we do our GAN implementation.
17:16 < ShikharJ> rcurtin: Could we install tensorflow on savannah? Maybe I can tmux a build and compare the runtimes?
17:16 < rcurtin> sure, would you like me to do it with pip?
17:16 < Atharva> sumedhghaisas: It wasn't exactly about the `Sequential` layer, the code is just failing with some matrix size mismatches. I will see where that is coming from and let you know.
17:16 < sumedhghaisas> rcurtin: but the lazy speedup will be there even if we move to GPU
17:17 < sumedhghaisas> Atharva: Sure
17:17 < rcurtin> ShikharJ: actually in that case, probably better to install with pip install --user
17:17 < rcurtin> sumedhghaisas: right, agreed. I'd be interested in seeing how much speedup we could get first, since rearchitecting the whole system could be really difficult
17:18 < ShikharJ> rcurtin: I don't think I have the necessary privileges to install any packages, so please go ahead.
17:18 < Atharva> rcurtin: I will try to use bandicoot even if it's experimental now and see how fast it can get.
17:18 < rcurtin> ShikharJ: I think you could use pip3 though
17:19 < rcurtin> Atharva: sure, give it a shot, but I don't know if it has enough functionality implemented to be a full substitute yet
17:19 < rcurtin> so compilation may fail
17:19 < Atharva> Oh, I will let you know what happens.
17:20 < sumedhghaisas> rcurtin: Agreed. If I use 'auto' that should prefer the internal Armadillo class rather than arma::mat right?
17:20 < rcurtin> sumedhghaisas: usually, but auto can cause very weird things to happen so don't be surprised if there are problems :(
17:21 < sumedhghaisas> rcurtin: yeah :( I just don't wanted a way to infer the classes rather than opening the GLUE architecture of Armadillo again
17:22 < sumedhghaisas> *remove don't :)
17:22 < rcurtin> right, agreed, I see what you mean. well, see if 'auto' works... :)
17:22 < rcurtin> if we can get the correct Armadillo internal type back and manage to pass that through the different layers, maybe heavy modification of the abstractions is not necessary
17:22 < rcurtin> which would be really nice :)
17:23 < sumedhghaisas> rcurtin: Yes thats precisely what I was thinking. We anyway templatize input and output
17:23 < rcurtin> right, I think it could work
17:25 < sumedhghaisas> I always think the same about template substitution until g++ shows me how difficult life can be
17:26 < rcurtin> heh, same...
17:32 < ShikharJ> sumedhghaisas: What did you say the expected time was? 3 hours?
17:37 < sumedhghaisas> ShikharJ: Little more, but will be within 1.5 times
17:37 < sumedhghaisas> ShikharJ: Maybe will need to adjust the batch size accordingly
17:38 < ShikharJ> sumedhghaisas: Let's see, I'll tmux a build to check.
17:38 < sumedhghaisas> although, the best way to estimate would be from single update, if we know how many updates it usually takes to converge
17:38 < ShikharJ> sumedhghaisas: By default, both the implementations take single step updates.
17:39 < sumedhghaisas> try single batch update and measure time
17:39 < sumedhghaisas> yes... but how many iterations of the single step update?
17:41 < ShikharJ> sumedhghaisas: We had a batch size of 50 for 100,000 iterations in the tensorflow implementation. So that's almost 71 full passes, but I guess t converges much earlier.
17:41 < ShikharJ> sumedhghaisas: We also pre-train the discriminator for 300 iterations on 50 sized batches.
17:42 < ShikharJ> *it
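The "almost 71 full passes" figure can be checked with simple arithmetic. This sketch assumes the count was taken over all 70,000 MNIST images (60,000 training plus 10,000 test); that choice of dataset size is an assumption, since the log does not say which split was used.

```python
batch_size = 50
iterations = 100_000
# Total number of samples processed during training.
samples_seen = batch_size * iterations

def full_passes(dataset_size):
    # Number of complete passes over a dataset of the given size.
    return samples_seen / dataset_size

passes_all_images = full_passes(70_000)      # training + test images
passes_training_only = full_passes(60_000)   # training images only

print(samples_seen, passes_all_images, passes_training_only)
```

With 70,000 images this gives a little over 71 passes, matching the figure quoted; with the 60,000 training images alone it would be a little over 83.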
17:45 < sumedhghaisas> 300 iterations pretraining is nothing compared to 100,000 iterations, so we need to know how much time each iteration takes in tensorflow and MLPACK
17:45 < sumedhghaisas> convergence is also a property of optimizer, so lets take that out of the equation, lets only compare single step update
17:46 < sumedhghaisas> would be much faster as well
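The single-step comparison suggested above amounts to timing one update and scaling by the iteration count. The following is a generic sketch with hypothetical helper names; `dummy_update` is only a stand-in for one real training step in either toolkit, and the estimate deliberately ignores convergence behaviour.

```python
import time

def time_per_update(update_fn, repeats=20):
    # Run once untimed so one-off setup cost does not skew the average.
    update_fn()
    start = time.perf_counter()
    for _ in range(repeats):
        update_fn()
    return (time.perf_counter() - start) / repeats

def estimated_hours(seconds_per_update, iterations):
    # Extrapolate total training time from the cost of a single update.
    return seconds_per_update * iterations / 3600.0

def dummy_update():
    # Placeholder workload standing in for one batch update.
    sum(i * i for i in range(10_000))

per_update = time_per_update(dummy_update)
print(estimated_hours(per_update, 100_000))
```

Averaging over several repeats reduces timer noise, and keeping the iteration count fixed for both toolkits takes the optimizer's convergence rate out of the comparison.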
17:47 < ShikharJ> sumedhghaisas: Hmm, probably I can do that on my system, I'll share what I find later today!
17:48 < sumedhghaisas> ShikharJ: Sure thing!
17:49 < ShikharJ> sumedhghaisas: Is tensorflow multi-threaded?
17:50 < ShikharJ> sumedhghaisas: I am seeing almost all the cores occupied :/
17:52 < sumedhghaisas> ShikharJ: hmm... now that I am not sure about.
17:52 < sumedhghaisas> But I think yes,
17:52 < sumedhghaisas> tensorflow prefers parallel execution where-ever the graph lets it
17:53 < sumedhghaisas> I have heard this a lot of times...
17:53 < ShikharJ> sumedhghaisas: If it is then there's no point in the comparison, because mlpack isn't :( We'll have to run tensorflow on a single core to make the comparison fair.
17:54 < rcurtin> ShikharJ: or use OpenBLAS, which will do parallel matrix multiplication when possible
17:55 < ShikharJ> rcurtin: How can I do that? What would I have to change?
17:55 < rcurtin> a way to check (maybe not the best way) is just to do 'ldd'
17:55 < rcurtin> and that will be linked to some BLAS library...,, something like this
17:55 < rcurtin> which one it links to tells you which one it's using
17:56 < rcurtin> if OpenBLAS is installed on the system, then Armadillo should be automatically using it anyway
17:56 < rcurtin> but I always use ldd to check... seeing exactly what it's linked against removes most doubts about what it could be
17:56 < ShikharJ> rcurtin: Where would I find file?
17:57 < rcurtin> in your build directory under lib/
17:57 < rcurtin> or if you are building a standalone program, you can also just use ldd on that
17:58 < rcurtin> looks like on savannah, openblas is not installed. I should install it on all five benchmarking systems, but I won't interrupt if you are doing anything with them now
17:59 < ShikharJ> rcurtin: Hmm, only libblas is installed as far as I can see.
17:59 < rcurtin> right, so if you like I can just go ahead and install now, then you can rebuild and relink
17:59 < rcurtin> (or rather just rebuild, that will do the linking too of course)
17:59 < ShikharJ> rcurtin: Sure, let's even the odds a bit :P
17:59 < sumedhghaisas> rcurtin, ShikharJ: Although I am not sure they are equivalent, OpenBLAS does the same operation on multiple GPU right? Tensorflow I think finds whole operations that can be parallelized
18:01 < rcurtin> ShikharJ: ok, installed on all 5 benchmarking systems
18:01 < rcurtin> sumedhghaisas: I agree, they are not exactly equivalent, but in both cases I'd say each should be making pretty full use of all available CPU cores
18:01 < rcurtin> (I think you meant CPU not GPU there)
18:02 < rcurtin> I would say it is as fair a comparison as we can get for TensorFlow and mlpack in most "real life" use cases... people won't generally want to restrict their usage to one core
18:03 < ShikharJ> rcurtin: Agreed. Currently Tensorflow is taking up all the cores, so let me see how long that takes, and probably, then I can spawn an mlpack build.
18:03 < rcurtin> right
18:04 < ShikharJ> rcurtin: Meanwhile, I'll try and get some statistics on the single iteration timings :)
18:08 -!- vivekp [~vivek@unaffiliated/vivekp] has quit [Ping timeout: 260 seconds]
18:54 -!- witness_ [uid10044@gateway/web/] has joined #mlpack
19:50 < zoq> ShikharJ: I guess, if we like to test out some ideas it might be useful to write a simple script for the benchmark system. That would allow us to easily track what approach worked and which one didn't. Let me know what you think.
20:13 -!- sumedhghaisas [68842d55@gateway/web/freenode/ip.] has quit [Ping timeout: 260 seconds]
20:20 < ShikharJ> zoq: Seems like an interesting idea, but first I'd like to spend the time on our RBM implementation. If this takes too long, I don't wish to keep those goals on hold. Is it fine by you?
20:38 < zoq> ShikharJ: absolutely
20:41 < ShikharJ> rcurtin: Hmm, interesting, for tensorflow it seems that the single iteration is a lot faster, but still, the network takes much longer to reach convergence.
20:43 < ShikharJ> rcurtin: The one beautiful thing about htop is that it aggregates the cpu and time elapsed by a single process, over all its threads. Tensorflow has already crossed the 7 hour mark (as it would on a single core system I believe).
20:44 < ShikharJ> rcurtin: It's been three hours into training.
20:49 < zoq> ShikharJ: Might be interesting to test TensorFlow Lite + TensorFlow Mobile as well, probably not at this point but we should keep that in mind.
20:55 < ShikharJ> zoq: Good point
20:59 < rcurtin> ShikharJ: yeah, that sounds reasonable so far. we should double-check for any copies that are happening during the mlpack training
20:59 < rcurtin> that can be a big source of slowdowns
21:01 < ShikharJ> rcurtin: Agreed. I'll take a thorough look tomorrow and update any changes in the WGAN PR itself.
21:10 < rcurtin> sometimes ARMA_EXTRA_DEBUG can be helpful here
21:42 -!- witness_ [uid10044@gateway/web/] has quit [Quit: Connection closed for inactivity]
21:52 -!- ImQ009 [~ImQ009@unaffiliated/imq009] has quit [Quit: Leaving]
22:28 < ShikharJ> rcurtin: It took about 11 hours (single core aggregate), and about 4.5 hours (real time multi-threaded aggregate) to train the complete model, which corresponds to a 240 - 250 % CPU usage (which is along the lines of what I saw on htop). So the baseline improvement of 1.57x on single core is there.
22:29 < ShikharJ> rcurtin: I couldn't keep my eyes off the htop screen, I was watching it that anxiously :P
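The figures reported above fit together as follows. This is only a restatement of the arithmetic: the 11 hour and 4.5 hour values come from the message above, and the 7 hour value is taken from the "~7 hours" mlpack training time mentioned earlier in the log, which is assumed here to be a single-threaded run.

```python
cpu_hours_tensorflow = 11.0    # CPU time summed over all threads
wall_hours_tensorflow = 4.5    # elapsed real time
hours_mlpack = 7.0             # assumed single-threaded mlpack run

# Average number of cores kept busy, expressed as htop-style CPU usage.
cpu_percent = 100.0 * cpu_hours_tensorflow / wall_hours_tensorflow

# Speedup of mlpack over TensorFlow when both are charged per core.
single_core_speedup = cpu_hours_tensorflow / hours_mlpack

# Wall-clock ratio: below 1 means multi-threaded TensorFlow finished sooner.
wall_clock_ratio = wall_hours_tensorflow / hours_mlpack

print(cpu_percent, single_core_speedup, wall_clock_ratio)
```

The CPU usage comes out at roughly 244%, inside the 240 to 250% range quoted, and the single-core ratio is about 1.57. The wall-clock ratio shows why a multi-threaded mlpack build is the next thing to test.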
22:32 < ShikharJ> rcurtin: I'll tmux a build for openBLAS based mlpack build in the morning. Going to get some sleep now :)
22:34 < ShikharJ> rcurtin: Though, could you please tell me if I will just have to do a make clean (or would I have to clear CMakeCache.txt as well)?
22:36 < zoq> ShikharJ: make clean followed by cmake should work just fine, but you can also remove the build folder to be sure.
22:39 < ShikharJ> zoq: Thanks, let's see of we can atleast match Tensorflow's time with openBLAS for multi-threads!
22:40 < ShikharJ> *if
22:40 < zoq> yeah fingers crossed :)
--- Log closed Tue Jul 03 00:00:30 2018