Chapter 6
Natural Language Processing with
Recurrent Neural Networks in
Tensorflow
Objectives
Learn Tensorflow Basics
Learn how to use RNNs, LSTMs, and GRUs in Tensorflow
Use recurrent modules for sequence modeling tasks
6.1 Tensorflow
Tensorflow is a popular library for machine learning developed by Google. Tensorflow includes all
of the components that might be useful for deep learning, including neural network layers, optimizers,
loss functions, and automatic differentiation. Unlike PyTorch, Tensorflow primarily uses a static com-
putation graph. This means that the computations describing a neural network must be "compiled"
into a static graph. PyTorch uses a dynamic computation graph, where operations can be added
without the need for recompiling. The advantage of the static graph is better performance in some
cases; the disadvantage is that it is more difficult to debug. To alleviate the difficulty in
debugging, Tensorflow added eager mode, which uses a dynamic computation graph. However, this
chapter is based on Tensorflow 1, and eager mode does not support parts of Tensorflow 1, so
we will be using the default static computation graph.
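To make the define-then-run pattern concrete, here is a minimal sketch of the static-graph workflow in Tensorflow 1 (the values fed below are purely illustrative): the graph is built first, and nothing is computed until it is executed inside a session.

import tensorflow as tf

# Build the (static) graph: nothing is computed at this point.
a = tf.placeholder(tf.float32)
b = tf.placeholder(tf.float32)
c = a + b

# Run the graph inside a session, feeding concrete values to the placeholders.
with tf.Session() as sess:
    print(sess.run(c, feed_dict={a: 2.0, b: 3.0}))  # prints 5.0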
Hint: Migrating to Tensorflow 2
To learn how to migrate code examples in this chapter to Tensorflow 2, please
refer to https://www.tensorflow.org/guide/migrate and to the Tensorflow API
Documentation. To install the needed Tensorflow 1.x version that would work with code
examples in this chapter, please refer to https://www.tensorflow.org/install/pip.
Tensorflow can be installed with a pip command:
sudo pip install tensorflow
For code examples in this chapter, you need to install a Tensorflow 1 version. For a CPU-only
release, use the following command:

sudo pip install tensorflow==1.15

For a release with GPU support, use the following command:

sudo pip install tensorflow-gpu==1.15
Tensorflow has many different high level API extensions such as tf.keras and tf.estimator. These
extensions abstract some of the underlying Tensorflow code away. While these libraries are useful
for quickly prototyping code, knowledge of the underlying Tensorflow processes is also important.
Please note that Keras has been tightly integrated with Tensorflow 2; Tensorflow 1 tf.layers has been
replaced with tf.keras.layers in Tensorflow 2. For more information, please consult the Tensorflow
API documentation at https://www.tensorflow.org/versions.
Please work through the following tutorials to become familiar with Tensorflow:
Low level intro - https://www.tensorflow.org/overview/
Tensors - https://www.tensorflow.org/programmers_guide/tensors
Variables - https://www.tensorflow.org/programmers_guide/variables
6.2 CIFAR100 in Tensorflow
We will re-implement the convolutional network from Chapter 4 in Tensorflow for demonstration pur-
poses. We have provided the code for this example in cifar_tensorflow.py. First we will write a function
describing the forward pass of the network. We don't need to write any code for the backward pass,
since the gradients are computed automatically from the computation graph. To im-
plement the network, we will utilize the layers provided in the tf.layers package. Specifically,
we will use tf.layers.conv2d to implement a convolutional layer, tf.layers.max_pooling2d for
max pooling, and tf.layers.dense for fully connected layers. The functions tf.nn.relu and
tf.reshape are used for ReLU and flattening, respectively.
def cnn_model_fn(x):
    """Define 3-layer cnn from Chapter 4"""
    # define network
    # Convolutional Layer #1
    conv1 = tf.layers.conv2d(
        inputs=x,
        filters=16,
        kernel_size=[3, 3],
        padding="same",
        activation=tf.nn.relu)
    # Pooling Layer #1
    pool1 = tf.layers.max_pooling2d(inputs=conv1, pool_size=[2, 2],
                                    strides=2)
    # Convolutional Layer #2 and Pooling Layer #2
    conv2 = tf.layers.conv2d(
        inputs=pool1,
        filters=32,
        kernel_size=[3, 3],
        padding="same",
        activation=tf.nn.relu)
    pool2 = tf.layers.max_pooling2d(inputs=conv2, pool_size=[2, 2],
                                    strides=2)
    # Logits Layer
    pool2_flat = tf.reshape(pool2, [-1, 8 * 8 * 32])
    dense = tf.layers.dense(inputs=pool2_flat, units=4,
                            activation=tf.nn.relu)
    return dense
Next we will load the dataset. We have provided a function cifar100 in the dataset.py file
that loads the dataset into Numpy arrays. Similar to Chapter 3 and Chapter 4, this function takes
a seed and generates a random subset of the dataset with 4 classes.
# Load dataset
train_data, test_data = cifar100(1234)
train_x, train_y = train_data
test_x, test_y = test_data
In Tensorflow instead of writing functions that directly perform some calculations, we actually
are writing functions that define part of a computation graph, which can be run later. To define
inputs in the computation graph, we use tf.placeholder. As the name suggests, this creates a
“placeholder” variable which could correspond to some input values. Later when we call the graph,
we can pass real data to these placeholders.
# placeholder for input variables
x_placeholder = tf.placeholder(tf.float32,
                               shape=(BATCH_SIZE,) + train_x.shape[1:])
y_placeholder = tf.placeholder(tf.int32, shape=(BATCH_SIZE,))
Next we define a few operations (ops) that we will use for training and testing. An operation
is some output or function that we are interested in computing from the computation graph. In
this case, we want an operation to get the output of the network, one to get the loss value, and
an operation to perform the gradient descent update. The gradient descent operation uses the
tf.train.GradientDescentOptimizer class.
# get the loss function and the prediction function for the network
pred_op = cnn_model_fn(x_placeholder)
loss_op = tf.losses.sparse_softmax_cross_entropy(labels=y_placeholder,
                                                 logits=pred_op)
# define optimizer
optimizer = tf.train.GradientDescentOptimizer(LR)
train_op = optimizer.minimize(loss_op)
Next we can start a Tensorflow session. We need to define a session in order to run the previously
defined operations. An operation can be run in a session by using sess.run(op), where sess is the
Tensorflow session, and op is the operation we want to run. As an example, we use the session to run
an operation that initializes the variables in our network using tf.global_variables_initializer.
# start tensorflow session
sess = tf.Session()
# initialization
init = tf.global_variables_initializer()
sess.run(init)
Finally, we can begin training. There are several ways to perform training in Tensorflow. Here
we will manually loop through our data and call the Tensorflow operations for each batch. This
approach should be familiar from Chapters 3 and 4. To pass data into the computation graph, we
use the feed_dict argument of the sess.run command. Here we pass Numpy arrays into the
placeholder values of the graph.
# train loop
for epoch in range(NUM_EPOCHS):
    running_loss = 0.0
    n_batch = 0
    for i in range(0, train_x.shape[0] - BATCH_SIZE, BATCH_SIZE):
        # get batch data
        x_batch = train_x[i:i + BATCH_SIZE]
        y_batch = train_y[i:i + BATCH_SIZE]
        # run step of gradient descent
        feed_dict = {
            x_placeholder: x_batch,
            y_placeholder: y_batch,
        }
        _, loss_value = sess.run([train_op, loss_op],
                                 feed_dict=feed_dict)
        running_loss += loss_value
        n_batch += 1
    print('[Epoch: %d] loss: %.3f' %
          (epoch + 1, running_loss / n_batch))
We can perform testing in a similar way as training. One difficulty is that the static graph requires
a constant batch size. Our dataset might not be able to be evenly split into batches, depending on
the batch size. There are a few possible ways to deal with this. One way (shown below) is to pad
the last batch such that it matches the batch size. After padding, we should make sure that we
don’t take any of the padded outputs, since these won’t have ground truth labels.
# test loop
all_predictions = np.zeros((0, 1))
for i in range(0, test_x.shape[0], BATCH_SIZE):
    x_batch = test_x[i:i + BATCH_SIZE]
    # pad small batch
    padded = BATCH_SIZE - x_batch.shape[0]
    if padded > 0:
        x_batch = np.pad(x_batch,
                         ((0, padded), (0, 0), (0, 0), (0, 0)),
                         'constant')
    # run step
    feed_dict = {x_placeholder: x_batch}
    batch_pred = sess.run(pred_op,
                          feed_dict=feed_dict)
    # recover if padding
    if padded > 0:
        batch_pred = batch_pred[0:-padded]
    # get argmax to get class prediction
    batch_pred = np.argmax(batch_pred, axis=1)
    all_predictions = np.append(all_predictions, batch_pred)
6.3 RNN Background
Convolutional neural networks work well for certain types of data that can be made into fixed size
inputs. But what if the data cannot be assumed to be of fixed size? This is common in problems
that consider some sort of time series input like speech or text.
If the input data cannot be assumed to be a fixed size, we need some way to have our network
automatically adapt to the different sized inputs. One way to do this is to use recurrent neurons,
where a neuron’s output can be used as an input to the same neuron at a different time step
(Figure 6.1). With this formulation we can “unroll” the network for as many time steps as needed to
process the data (Figure 6.2). The unrolled representation can be used to perform backpropagation.
Figure 6.1: Basic RNN Unit.
There are several "flavors" of recurrent units. In this project we will consider three common
variants: vanilla recurrent neurons (RNN), long short-term memory (LSTM) units [32], and gated
recurrent units (GRU) [33]. We will use these recurrent units to create a network that can classify
text.
Figure 6.2: Unrolled Recurrent Network.
6.3.1 RNN
The equation for the hidden state of a vanilla RNN layer takes the following form:

    h_t = \tanh(W_x x_t + W_h h_{t-1} + b_h)                              (6.1)

where x_t is the input at time t, and h_t is the corresponding state vector. The W matrices and b
vectors represent learnable parameters.
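To make the recurrence concrete, the following short Numpy sketch applies Equation 6.1 step by step to a toy input sequence. The sizes and random initialization below are illustrative assumptions, not part of the provided code.

import numpy as np

input_size, hidden_size, seq_len = 4, 3, 5
rng = np.random.RandomState(0)

# Learnable parameters (randomly initialized here only for illustration)
W_x = rng.randn(hidden_size, input_size) * 0.1
W_h = rng.randn(hidden_size, hidden_size) * 0.1
b_h = np.zeros(hidden_size)

x = rng.randn(seq_len, input_size)  # toy input sequence
h = np.zeros(hidden_size)           # initial hidden state

for t in range(seq_len):
    # Equation 6.1: h_t = tanh(W_x x_t + W_h h_{t-1} + b_h)
    h = np.tanh(W_x @ x[t] + W_h @ h + b_h)

print(h)  # hidden state after the last time step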
RNN layers can be used in Tensorflow with the tf.contrib.rnn.BasicRNNCell layer. The main
argument in the initialization of this layer is the output feature dimension (num_units). You must
set this parameter appropriately to achieve good performance.
To use the RNN layer, we wrap it in a tf.contrib.rnn.static_rnn function. This function
takes the RNN cell and a list of tensors as input. Each tensor of the input list corresponds to a
different time step, and the tensor at each time step is of dimensions (batch size, input feature size).
The static_rnn function yields two outputs: the hidden state tensor at all time steps, and the final
value of the hidden state.
The output of the RNN is based on the value of the hidden state. The form of the output is
identical to a fully connected layer:

    y_t = \mathrm{softmax}(W_y h_t + b_y)                                 (6.2)

where y_t is the output at time t. For this project we are only interested in the output at the last
time step. The BasicRNNCell layer itself does not compute this output. To implement the output
we need to use a separate fully connected layer in Tensorflow (tf.layers.dense).
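Putting these pieces together, a minimal sketch of how the cell, static_rnn, and the output layer fit together might look as follows. The batch size, sequence length, feature size, number of classes, and num_units value are all illustrative assumptions.

import tensorflow as tf

# Hypothetical sizes for illustration
BATCH_SIZE, SEQ_LEN, FEAT_SIZE, NUM_CLASSES = 32, 20, 64, 4

# Input: one tensor of shape (batch, time, features)
x_placeholder = tf.placeholder(tf.float32,
                               shape=(BATCH_SIZE, SEQ_LEN, FEAT_SIZE))

# static_rnn expects a Python list with one (batch, features) tensor per time step
inputs = tf.unstack(x_placeholder, axis=1)

cell = tf.contrib.rnn.BasicRNNCell(num_units=128)
outputs, final_state = tf.contrib.rnn.static_rnn(cell, inputs, dtype=tf.float32)

# Equation 6.2 (without the explicit softmax): a dense layer on the last hidden state
logits = tf.layers.dense(inputs=outputs[-1], units=NUM_CLASSES)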
6.3.2 LSTM
The long-short term memory unit (LSTM) is a more complex recurrent node that incorporates
different “gates” to allow or reject information from passing through the network. There is an input
gate, an output gate, and a forget gate that controls the flow of information. Additionally, the
LSTM has both a memory cell and a hidden state. The LSTM is computed as:
    f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)
    i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)
    o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)
    c_t = f_t \circ c_{t-1} + i_t \circ \tanh(W_c x_t + U_c h_{t-1} + b_c)
    h_t = o_t \circ \tanh(c_t)
where \circ is the Hadamard product (i.e., element-wise product), \sigma is the sigmoid function, f_t is the
forget gate, i_t is the input gate, o_t is the output gate, c_t is the memory cell, and h_t is the hidden state.
W, U, and b represent learnable parameters.
LSTM layers can be used in Tensorflow with the rnn.BasicLSTMCell class. As with the RNN, the
main argument in the initialization of this layer is the output feature dimension, which you must
set yourself to achieve good performance. To use the LSTM, we again use the
tf.contrib.rnn.static_rnn function. This function takes the LSTM cell and a list of tensors as
input. Each element of the input list corresponds to a different time step, and the tensor at each
time step is of dimensions (batch size, input feature size). The static_rnn function yields two
outputs: the hidden state tensor at all time steps, and the final value of the hidden state.
The Tensorflow LSTM implementation has no output y. We can create an output using the
hidden state at a certain time step using Equation 6.2. This corresponds to a fully connected layer
with the hidden state as the input.
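As a rough sketch, reusing the illustrative inputs and sizes from the RNN example above, only the cell construction changes. Note that the final state returned for an LSTM is a (c, h) state tuple rather than a single tensor.

# Same wiring as the RNN sketch above, with the cell swapped for an LSTM.
cell = tf.contrib.rnn.BasicLSTMCell(num_units=128)
outputs, final_state = tf.contrib.rnn.static_rnn(cell, inputs, dtype=tf.float32)
# final_state is an LSTMStateTuple (c, h); the per-step outputs are the hidden states h
logits = tf.layers.dense(inputs=outputs[-1], units=NUM_CLASSES)  # Equation 6.2 without softmax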
6.3.3 GRU
The Gated Recurrent unit GRU can be thought of as a simplified version of LSTM. In practice
the performance of the GRU and the LSTM can often be similar. The GRU incorporates only two
gates: an update gate and a reset gate. It does not use a memory cell like the LSTM. The GRU is
computed by:
    z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)
    r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)
    h_t = (1 - z_t) \circ h_{t-1} + z_t \circ \tanh(W_h x_t + U_h (r_t \circ h_{t-1}) + b_h)
where \circ is the Hadamard product, \sigma is the sigmoid function, z_t is the update gate, r_t is the reset
gate, and h_t is the hidden state. W, U, and b represent learnable parameters.
GRU layers can be used in Tensorflow with the rnn.GRUCell class. As with the RNN, the main
argument in the initialization of this layer is the output feature dimension, which you must set
yourself to achieve good performance. To use the GRU, we again use the
tf.contrib.rnn.static_rnn function. This function takes the GRU cell and a list of tensors as
input. Each element of the input list corresponds to a different time step, and the tensor at each
time step is of dimensions (batch size, input feature size). The static_rnn function yields two
outputs: the hidden state tensor at all time steps, and the final value of the hidden state.
The Tensorflow GRU implementation has no output y. We can create an output using the
hidden state at a certain time step using Equation 6.2. This corresponds to a fully connected layer
with the hidden state as the input.
6.4 RNNs for sentence classification
In this project we will utilize recurrent layers to classify sentences.
6.4.1 Datasets
In this project, we consider 4 datasets: sentiment, spam, questions, newsgroups. Each dataset is a
collection of text entries, with a corresponding text label for each entry. Each group will be assigned
two datasets to work on (see Section 6.4.2). Table 6.1 describes each dataset.
The sentiment dataset consists of movie reviews that are either positive or negative [34]. The
spam dataset consists of emails that are either spam or not spam [35].
The questions dataset [36] contains questions classified by the type of answer they expect.
For example, the questions "Who is Snoopy's arch-enemy?" and "What actress has received the most
Oscar nominations?" have the same class. As another example, the questions "What novel inspired
the movie BladeRunner?" and "What's the only work by Michelangelo that bears his signature?"
also have the same class. The classes are varied enough that the network has to understand the
whole sentence to classify it, rather than just looking for key words like "what" or "who".
The newsgroups dataset [37] consists of newsgroup postings from twenty different categories
(Newsgroups are where people posted things on the internet before Facebook and Reddit). Examples
of the newsgroup classes are “comp.sys.mac.hardware”, “talk.politics.guns”, “rec.sport.hockey”.
Each dataset is stored in two CSV files: train.csv and test.csv. Each line of the CSV file
has a sentence and a label. We have provided a load_text_dataset function in dataset.py to load
the datasets as numpy arrays. This function takes as arguments the name of the dataset and the
maximum sequence size. Any sentences that are shorter than the maximum sequence size will be
padded, and any sequences that are longer will be truncated. The dataset loader will translate each
unique word to a unique number using a lookup table. The number of unique words is the vocabulary
size of the dataset. The load_text_dataset function will also print how many unique words (the
vocabulary size) there are in the loaded .csv file; this is the vocabulary size that you should use
for the embedding.
Table 6.1: Datasets for NLP with RNN in Tensorflow

Dataset      Description                                  # Categories   # Train / # Test
Sentiment    Movie review dataset [34]                    2              7996 / 2666
Spam         Enron spam dataset [35]                      2              3879 / 1293
Questions    Learning question classifiers dataset [36]   50             4464 / 1488
Newsgroups   20 newsgroups dataset [37]                   20             14121 / 4707
6.4.2 Network Design
The input to our network is a batch of sentences and the output is a batch of predicted labels. From
the input we first want to perform a word embedding. The embedding layer takes the words, which
have been mapped to integers by the dataset loader, and outputs a vector embedding for each word.
For example, the word “yes” might be mapped to the vector [0.1, 0.7, 0.5] and the word “no” might
be mapped to the vector [1.1, 0.5, 0.3]. The dimension of the embedding is a hyper-parameter that
must be set. We can use the tf.nn.embedding_lookup layer for this. The embedding layer uses an
embedding matrix. For our application the embedding matrix can be randomly initialized.
Next we pass the embedded words into the recurrent layer. In general, we could utilize several
stacks of recurrent layers, but in this project it is only required to consider a single recurrent layer
stack. We want to obtain a class prediction from the output of the recurrent unit, so we can use a
fully connected layer after the recurrent layer. The number of outputs of the fully connected layer
should be equal to the number of classes.
Since we are doing classification, we can use the tf.losses.sparse_softmax_cross_entropy
loss function. This loss function incorporates the softmax internally, so we do not need to explicitly
add a softmax layer to our model.
Training the sequence models is almost identical to training convolutional neural networks. We
still need to loop through our data, and for each batch call the train operation.
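As a point of reference, a minimal sketch of the whole graph construction for sentence classification might look like the following. All hyper-parameter values and variable names here are illustrative assumptions that you will need to tune and adapt for your assigned datasets; an LSTM cell is shown, but rnn.GRUCell can be swapped in.

import tensorflow as tf

# Illustrative hyper-parameters; these are placeholders to tune, not recommended values
VOCAB_SIZE = 10000   # printed by load_text_dataset for your dataset
MAX_SEQ_LEN = 50
EMBED_SIZE = 128
HIDDEN_SIZE = 256
NUM_CLASSES = 2
BATCH_SIZE = 64

# Inputs: word indices and integer labels
x_placeholder = tf.placeholder(tf.int32, shape=(BATCH_SIZE, MAX_SEQ_LEN))
y_placeholder = tf.placeholder(tf.int32, shape=(BATCH_SIZE,))

# Randomly initialized embedding matrix, learned with the rest of the network
embedding_matrix = tf.get_variable("embedding", [VOCAB_SIZE, EMBED_SIZE])
embedded = tf.nn.embedding_lookup(embedding_matrix, x_placeholder)  # (batch, time, embed)

# Recurrent layer: one (batch, embed) tensor per time step
inputs = tf.unstack(embedded, axis=1)
cell = tf.contrib.rnn.BasicLSTMCell(num_units=HIDDEN_SIZE)
outputs, _ = tf.contrib.rnn.static_rnn(cell, inputs, dtype=tf.float32)

# Fully connected output on the last time step; the softmax is inside the loss
logits = tf.layers.dense(inputs=outputs[-1], units=NUM_CLASSES)
loss_op = tf.losses.sparse_softmax_cross_entropy(labels=y_placeholder, logits=logits)
train_op = tf.train.GradientDescentOptimizer(learning_rate=0.1).minimize(loss_op)  # learning rate to tune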
Table 6.2: Layers for text processing

Layer                    Description              Initialization Arguments
tf.nn.embedding_lookup   Embed word vectors       (embedding matrix, input tensor)
rnn.BasicRNNCell         Vanilla RNN unit         (output feature size)
rnn.BasicLSTMCell        LSTM unit                (output feature size)
rnn.GRUCell              GRU unit                 (output feature size)
layers.dense             Fully connected output   (input tensor, output feature size)
Hint: Training RNNs
Training RNNs may be a little more difficult than training convolutional neural net-
works. If training is not working well, you may need to fiddle with the learning rate,
or try another optimizer such as ADAM [38]. ADAM can be used with the Tensorflow
tf.train.AdamOptimizer class.
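For example, assuming the loss operation from the sketch above, switching to ADAM is a two-line change (the learning rate shown is only an illustrative starting point):

# Swap the gradient descent optimizer for ADAM (learning rate is an assumption to tune)
optimizer = tf.train.AdamOptimizer(learning_rate=1e-3)
train_op = optimizer.minimize(loss_op)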
Deliverable: NLP with Tensorflow
Depending on your task number, you are assigned two different datasets and one recurrent
unit type (Table 6.3). For the assigned datasets and recurrent unit type, you must determine
the hyper-parameters (e.g., learning rate, batch size, embedding size, hidden state feature
dimension) of the network to achieve good performance.
Submit code for both datasets named as dataset_unitname.py, where dataset is
replaced by your assigned dataset and unitname is replaced by your assigned recurrent
unit type (lstm, gru)
Provide the final classification accuracy for both datasets
Submit the two trained models
Additional Resources
Tensorflow API Documentation - https://www.tensorflow.org/api_docs/
Tensorflow Tutorials - https://www.tensorflow.org/tutorials/
Tensorflow Github - https://github.com/tensorflow/tensorflow
Tensorflow Discussion Forum - https://www.tensorflow.org/community/forums
Task Assignment
Submission Instructions
Submit your project as a folder named FIRSTNAME_LASTNAME_TASKNUMBER_CH6 and zip the folder for
submission. The plots and saved models that you generated during this lab should be placed in a
folder called results. The grading rubric is shown in Table 6.4. There are no unit tests for the
submitted code, but your code will be graded by attempting to run it.
Table 6.3: Sample tasks for RNN in Tensorflow
Task number Datasets RNN unit
1 Sentiment, Spam LSTM
2 Sentiment, Questions LSTM
3 Sentiment, Newsgroups LSTM
4 Spam, Questions LSTM
5 Spam, Newsgroups LSTM
6 Questions, Newsgroups LSTM
7 Sentiment, Spam GRU
8 Sentiment, Questions GRU
9 Sentiment, Newsgroups GRU
10 Spam, Questions GRU
11 Spam, Newsgroups GRU
12 Questions, Newsgroups GRU
Table 6.4: Grading rubric
Points Description
NOTE: DO NOT SUBMIT THE DATASET
40 Working code for first dataset
40 Working code for second dataset
10 Report final classification accuracy for both datasets
10 Submit saved models for both datasets
Total 100
Bibliography
[1] “Numpy reference.” https://docs.scipy.org/doc/numpy-1.13.0/reference/, 2017.
[2] “Scipy reference.” https://docs.scipy.org/doc/scipy-1.0.0/reference/, 2017.
[3] “Matplotlib pyplot reference.” https://matplotlib.org/api/pyplot_api.html, 2017.
[4] J. Johnson, “Python numpy tutorial.” http://cs231n.github.io/python-numpy-tutorial/,
2017.
[5] “Luma coding in video systems.” https://en.wikipedia.org/wiki/Grayscale#Luma_
coding_in_video_systems, 2017.
[6] Wikipedia, “DFT matrix — Wikipedia, the free encyclopedia.” http://en.wikipedia.org/w/
index.php?title=DFT%20matrix&oldid=811427639, 2017.
[7] J. L. Bentley, “Multidimensional binary search trees used for associative searching,” Commu-
nications of the ACM, vol. 18, no. 9, pp. 509–517, 1975.
[8] D. R. Cox, “The regression analysis of binary sequences,” Journal of the Royal Statistical
Society. Series B (Methodological), pp. 215–242, 1958.
[9] M. Aly, “Survey on multiclass classification methods,” Neural Networks, 2005.
[10] T. P. Minka, “A comparison of numerical optimizers for logistic regression,” 2003.
[11] C. Cortes and V. Vapnik, “Support-vector networks,” Machine learning, vol. 20, no. 3, pp. 273–
297, 1995.
[12] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Transac-
tions on Intelligent Systems and Technology, vol. 2, pp. 27:1–27:27, 2011. Software available at
http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[13] H. Yu and S. Kim, “SVM tutorial: classification, regression and ranking,” in Handbook of
Natural computing, pp. 479–506, Springer, 2012.
[14] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016. http://www.
deeplearningbook.org.
[15] J. Johnson, “Backpropagation for a Linear Layer.” http://cs231n.stanford.edu/handouts/
linear-backprop.pdf. [Online; accessed 22-Dec-2017].
[16] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural
networks,” in Proceedings of the Thirteenth International Conference on Artificial Intelligence
and Statistics, pp. 249–256, 2010.
[17] V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltzmann machines,” in
Proceedings of the 27th international conference on machine learning (ICML-10), pp. 807–814,
2010.
[18] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Pro-
ceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778,
2016.
[19] A. Krizhevsky, “Learning multiple layers of features from tiny images,” Technical Report, 2009.
[20] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional
neural networks,” in Advances in neural information processing systems, pp. 1097–1105, 2012.
[21] A. Lavin and S. Gray, “Fast algorithms for convolutional neural networks,” in Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4013–4021, 2016.
[22] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recog-
nition,” arXiv preprint arXiv:1409.1556, 2014.
[23] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison,
L. Antiga, and A. Lerer, “Automatic differentiation in pytorch,” 2017.
[24] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy,
A. Khosla, M. Bernstein, et al., “Imagenet large scale visual recognition challenge,” Interna-
tional Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
[25] “iNaturalist challenge at FGVC 2017.” https://www.kaggle.com/c/
inaturalist-challenge-at-fgvc-2017. Accessed: 2018-04-11.
[26] E. Learned-Miller, G. B. Huang, A. RoyChowdhury, H. Li, and G. Hua, “Labeled faces in the
wild: A survey,” in Advances in face detection and facial image analysis, pp. 189–248, Springer,
2016.
[27] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva, “Learning deep features for scene
recognition using places database,” in Advances in neural information processing systems,
pp. 487–495, 2014.
[28] “iMaterialist challenge at FGVC 2018.” https://www.kaggle.com/c/
imaterialist-challenge-furniture-2018. Accessed: 2018-04-11.
[29] L. Fei-Fei, R. Fergus, and P. Perona, “Learning generative visual models from few training
examples: An incremental bayesian approach tested on 101 object categories,” Computer vision
and Image understanding, vol. 106, no. 1, pp. 59–70, 2007.
[30] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, “Densely connected convolutional
networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
2017.
[31] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and
A. Rabinovich, "Going deeper with convolutions," in Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pp. 1–9, 2015.
[32] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9,
no. 8, pp. 1735–1780, 1997.
[33] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Ben-
gio, “Learning phrase representations using rnn encoder-decoder for statistical machine trans-
lation,” arXiv preprint arXiv:1406.1078, 2014.
[34] B. Pang, L. Lee, and S. Vaithyanathan, “Thumbs up?: sentiment classification using machine
learning techniques,” in Proceedings of the ACL-02 conference on Empirical methods in natural
language processing-Volume 10, pp. 79–86, Association for Computational Linguistics, 2002.
[35] V. Metsis, I. Androutsopoulos, and G. Paliouras, “Spam filtering with naive bayes-which naive
bayes?,” in Proceedings of the 3rd Conference on Email and Anti-Spam (CEAS 2006), 2006.
[36] X. Li and D. Roth, “Learning question classifiers,” in Proceedings of the 19th international
conference on Computational linguistics-Volume 1, pp. 1–7, Association for Computational Lin-
guistics, 2002.
[37] K. Lang, “Newsweeder: Learning to filter netnews,” in Proceedings of the Twelfth International
Conference on Machine Learning, pp. 331–339, 1995.
[38] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint
arXiv:1412.6980, 2014.