I first heard about Long Short-Term Memory (LSTM) here at DFKI (Kaiserslautern, Germany), where researchers have been using it with great success in many applications, most notably OCR and the digitization of historic texts and fragments.
I have seen LSTM in action and it seems like a really amazing piece of technology. For those who know about Neural Networks, it helps to think of an LSTM as a Neural Network that is also aware of context and can use “memory” (built from past inputs) to reason about the current input. For example, if the past three inputs are “I”, “Live”, “In”, then it is very likely that the next input is the name of some location. This dramatically reduces the search space and provides impressive sequence learning capabilities. If you have worked with Recurrent Neural Networks (RNNs), which are also used for sequence learning, you might ask how LSTMs are any better.
It is well documented in the research literature (and well known to practitioners) that RNNs suffer from the so-called “Vanishing Gradient” problem and have trouble learning when sequence dependencies span long time lags, i.e. when the output depends on inputs that were seen perhaps 500 time steps ago. LSTMs, on the other hand, can easily learn sequences with dependencies reaching back 1000 time steps or more.
Using an LSTM is easy: you can download any popular Machine Learning library and use its implementation (e.g. Keras, TensorFlow). But that isn’t very satisfying intellectually, is it? And, needless to say, it helps a great deal to know what is going on under the hood.
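To make “easy” concrete, here is a minimal sketch of what a next-token model with an LSTM layer might look like in Keras. The layer sizes, sequence length, and vocabulary size are arbitrary placeholders I chose for illustration, not values from any particular application:

```python
# Minimal sketch of a sequence model with an LSTM layer in Keras.
# vocab_size, seq_len and the layer widths are illustrative placeholders.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense

vocab_size = 10000  # assumed vocabulary size (placeholder)
seq_len = 50        # assumed input sequence length (placeholder)

model = Sequential([
    Input(shape=(seq_len,)),                  # a sequence of token ids
    Embedding(input_dim=vocab_size, output_dim=64),
    LSTM(128),                                # the LSTM carries internal state across the sequence
    Dense(vocab_size, activation="softmax"),  # probabilities over the next token
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```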
If you are like me and want to appreciate the Math behind it, it is natural to start with the paper that introduced LSTM. But you will find that it is not particularly illuminating (it wasn’t for me, at least). I needed a kind of walk-through guide and some background information to make the paper more accessible. I have compiled the resources below in the order that made the most sense to me:
- Head first to this great Quora answer by Debiprasad Ghosh for some intuitive explanation and warm up. (Ghosh’s answer is the second one, just below the highest-voted answer.)
- Then read this excellent blog post by Christopher Olah for a great exposition of LSTM’s internal state and the related equations, with beautifully illustrated diagrams (those equations are reproduced just after this list for quick reference). The post also contains links to other excellent resources.
- Now that you have seen the equations and know a little more about the internal workings of an LSTM, read this blog post by Shi Yan, which dissects the architectural diagrams used in LSTM lore. The post also includes a step-by-step walkthrough of the different gates of an LSTM, how they relate to the present input, and how the internal state is manipulated by the present input and the gates.
- Next, move to the entry on DeepLearning4J for some more context and a nice commentary on the Backpropagation Through Time (BPTT) algorithm.
- And just before you pick the paper up again, visit this page for information and intuition about Neural Memory. The post targets RNNs, but the concepts are largely applicable to LSTMs as well.
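For quick reference, these are the standard LSTM equations with a forget gate, in the notation used in Olah’s post ($\sigma$ is the logistic sigmoid, $\odot$ is element-wise multiplication, and $[h_{t-1}, x_t]$ is the concatenation of the previous hidden state and the current input):

$$
\begin{aligned}
f_t &= \sigma(W_f\,[h_{t-1}, x_t] + b_f) && \text{(forget gate)}\\
i_t &= \sigma(W_i\,[h_{t-1}, x_t] + b_i) && \text{(input gate)}\\
\tilde{C}_t &= \tanh(W_C\,[h_{t-1}, x_t] + b_C) && \text{(candidate cell state)}\\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t && \text{(new cell state)}\\
o_t &= \sigma(W_o\,[h_{t-1}, x_t] + b_o) && \text{(output gate)}\\
h_t &= o_t \odot \tanh(C_t) && \text{(new hidden state)}
\end{aligned}
$$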
After this initial exposition, I felt more confident about tackling the paper. But since nearly all the material I found referred to the Vanishing Gradient problem, I decided to start there. You can read more about it (and I suggest you do) in the paper by Bengio et al., aptly titled Learning Long-Term Dependencies with Gradient Descent is Difficult. After reading this paper you will know the problem LSTM was designed to solve, and you will be able to connect many dots and form a clearer picture of the formal structure of LSTM and how it works.
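To get a feel for the mechanics before diving into the paper, here is a deliberately simplified toy sketch (my own illustration, not the analysis from Bengio et al.): in a scalar RNN with $h_t = \tanh(w\,h_{t-1} + \dots)$, the gradient that flows back $T$ steps is a product of $T$ per-step factors $w \cdot \tanh'(\cdot)$, and when those factors are typically below 1 it shrinks geometrically.

```python
# Toy illustration (not from the paper): the gradient through a scalar RNN
# shrinks as a product of per-step factors w * tanh'(a_t), which are often < 1.
import numpy as np

rng = np.random.default_rng(0)
w = 0.9        # assumed recurrent weight (placeholder)
steps = 500    # how far back the dependency lies

grad = 1.0
for t in range(steps):
    a = rng.normal()                       # stand-in for the pre-activation at step t
    grad *= w * (1.0 - np.tanh(a) ** 2)    # chain rule: d h_t / d h_{t-1}
    if t + 1 in (10, 100, 500):
        print(f"after {t + 1:>3} steps: gradient factor ~ {grad:.3e}")
```

The printed factors collapse toward zero long before 500 steps, which is exactly why dependencies hundreds of steps back are so hard for a plain RNN to learn.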
Keeping all of the above in mind, I finally approached the Grand Master itself: Hochreiter and Schmidhuber’s 1997 paper which introduced LSTM. And much to my surprise, the wording of the paper now made a lot of sense. I was able to connect its terminology and ideas with the intuition I had developed by reading the blog posts.
The next natural step would be to read the paper by Felix A. Gers et al. titled Learning to Forget, which introduced the now-common Forget gate into the LSTM architecture. Interestingly, it seems that both Hochreiter and Gers were supervised by Jürgen Schmidhuber.
Though not directly related to the topic of this post, I’d like to include two links that I think are nice resources for anyone interested in ML.
- A visualization of NN classification. It is a great resource for peeking into the hidden layers of a NN.
- And for the mathematically inclined, this post relates NNs, Manifolds and Topology. (… and has some epic gif visualizations too)