Does bias in the convolutional layer really make a difference to the test accuracy?
I understand that biases are required in small networks to shift the activation function. But in a deep network with multiple convolutional layers, pooling, dropout and other non-linear activations, does the bias really make a difference? A convolutional filter learns local features, and the same bias is used at every position of a given conv output channel.
This is not a duplicate of this link. The above link only explains the role of bias in small neural networks and does not attempt to explain the role of bias in deep networks containing multiple CNN layers, dropout, pooling and non-linear activation functions.
I ran a simple experiment, and the results indicated that removing the bias from the conv layers made no difference to the final test accuracy.
Two models were trained, and the test accuracy is almost the same (slightly better in the one without bias).
Are they being used only for historical reasons?
If using bias provides no gain in accuracy, shouldn't we omit it? That would mean fewer parameters to learn.
I would be thankful if someone with deeper knowledge than me could explain the significance (if any) of these biases in deep networks.
Here is the complete code and the result of the bias-VS-no_bias experiment:
import tensorflow as tf  # TensorFlow 1.x API (placeholders, sessions)

# image_size, num_channels, num_labels, the train/valid/test arrays and
# accuracy() are assumed to be defined earlier (not shown here).

batch_size = 16
patch_size = 5
depth = 16
num_hidden = 64

graph = tf.Graph()

with graph.as_default():

    # Input data.
    tf_train_dataset = tf.placeholder(
        tf.float32, shape=(batch_size, image_size, image_size, num_channels))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)

    # Variables. Note: both model functions below read these same tensors.
    layer1_weights = tf.Variable(tf.truncated_normal(
        [patch_size, patch_size, num_channels, depth], stddev=0.1))
    layer1_biases = tf.Variable(tf.zeros([depth]))
    layer2_weights = tf.Variable(tf.truncated_normal(
        [patch_size, patch_size, depth, depth], stddev=0.1))
    layer2_biases = tf.Variable(tf.constant(1.0, shape=[depth]))
    layer3_weights = tf.Variable(tf.truncated_normal(
        [image_size // 4 * image_size // 4 * depth, num_hidden], stddev=0.1))
    layer3_biases = tf.Variable(tf.constant(1.0, shape=[num_hidden]))
    layer4_weights = tf.Variable(tf.truncated_normal(
        [num_hidden, num_labels], stddev=0.1))
    layer4_biases = tf.Variable(tf.constant(1.0, shape=[num_labels]))

    # Model with bias added after each convolution.
    def model_with_bias(data):
        conv = tf.nn.conv2d(data, layer1_weights, [1, 2, 2, 1], padding='SAME')
        hidden = tf.nn.relu(conv + layer1_biases)
        conv = tf.nn.conv2d(hidden, layer2_weights, [1, 2, 2, 1], padding='SAME')
        hidden = tf.nn.relu(conv + layer2_biases)
        shape = hidden.get_shape().as_list()
        reshape = tf.reshape(hidden, [shape[0], shape[1] * shape[2] * shape[3]])
        hidden = tf.nn.relu(tf.matmul(reshape, layer3_weights) + layer3_biases)
        return tf.matmul(hidden, layer4_weights) + layer4_biases

    # Model without bias in the convolutional layers.
    def model_without_bias(data):
        conv = tf.nn.conv2d(data, layer1_weights, [1, 2, 2, 1], padding='SAME')
        hidden = tf.nn.relu(conv)  # layer1_biases is not added
        conv = tf.nn.conv2d(hidden, layer2_weights, [1, 2, 2, 1], padding='SAME')
        hidden = tf.nn.relu(conv)  # layer2_biases is not added
        shape = hidden.get_shape().as_list()
        reshape = tf.reshape(hidden, [shape[0], shape[1] * shape[2] * shape[3]])
        # Biases are added only in the fully connected layers (layers 3 and 4).
        hidden = tf.nn.relu(tf.matmul(reshape, layer3_weights) + layer3_biases)
        return tf.matmul(hidden, layer4_weights) + layer4_biases

    # Training computation.
    logits_with_bias = model_with_bias(tf_train_dataset)
    loss_with_bias = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(
            labels=tf_train_labels, logits=logits_with_bias))

    logits_without_bias = model_without_bias(tf_train_dataset)
    loss_without_bias = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(
            labels=tf_train_labels, logits=logits_without_bias))

    # Optimizers, one per loss.
    optimizer_with_bias = tf.train.GradientDescentOptimizer(0.05).minimize(loss_with_bias)
    optimizer_without_bias = tf.train.GradientDescentOptimizer(0.05).minimize(loss_without_bias)

    # Predictions for the training, validation, and test data (with bias).
    train_prediction_with_bias = tf.nn.softmax(logits_with_bias)
    valid_prediction_with_bias = tf.nn.softmax(model_with_bias(tf_valid_dataset))
    test_prediction_with_bias = tf.nn.softmax(model_with_bias(tf_test_dataset))

    # Predictions for the model without conv bias.
    train_prediction_without_bias = tf.nn.softmax(logits_without_bias)
    valid_prediction_without_bias = tf.nn.softmax(model_without_bias(tf_valid_dataset))
    test_prediction_without_bias = tf.nn.softmax(model_without_bias(tf_test_dataset))

num_steps = 1001

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print('Initialized')
    for step in range(num_steps):
        # Pick the next minibatch, wrapping around the training set.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        batch_data = train_dataset[offset:(offset + batch_size), :, :, :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        feed_dict = {tf_train_dataset: batch_data, tf_train_labels: batch_labels}
        session.run(optimizer_with_bias, feed_dict=feed_dict)
        session.run(optimizer_without_bias, feed_dict=feed_dict)
    print('Test accuracy(with bias): %.1f%%' % accuracy(test_prediction_with_bias.eval(), test_labels))
    print('Test accuracy(without bias): %.1f%%' % accuracy(test_prediction_without_bias.eval(), test_labels))
Output:
Initialized
Test accuracy(with bias): 90.5%
Test accuracy(without bias): 90.6%
I understand that biases are required in small networks to shift the activation function. But in a deep network that has several CNN layers and other non-linear activations, does the bias make a difference? Omitting the bias term in the above makes almost no difference.
– Aparajuli
Aug 23 at 1:34
2 Answers
Biases are tuned alongside weights by learning algorithms such as
gradient descent. Biases differ from weights in that they are
independent of the output from previous layers. Conceptually, a bias is
caused by input from a neuron with a fixed activation of 1, and so it is
updated by subtracting just the product of the delta value and the
learning rate.
In a large model, removing the bias inputs makes very little difference because each node can make a bias node out of the average activation of all of its inputs, which by the law of large numbers will be roughly normal. At the first layer, the ability for this to happen depends on your input distribution. For MNIST, for example, the input's average activation is roughly constant. On a small network, of course, you need a bias input, but on a large network, removing it makes almost no difference.
See also:
Reference
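The averaging argument in this answer can be checked numerically. The following is a minimal sketch, not the question's experiment: it assumes inputs drawn independently from Uniform(0, 1), and the names `emulation_error` and `w_folded` are made up for illustration. A bias-free unit spreads the desired bias evenly across its input weights, and the gap to the true biased unit shrinks as the fan-in grows. Fan-ins 25 and 400 correspond to 5x5x1 and 5x5x16 patches (the first assumes a single input channel); 6400 is an arbitrary larger value.

```python
import numpy as np

rng = np.random.default_rng(0)
bias = 1.0  # the bias the unit "wants" to have
mu = 0.5    # mean of the assumed Uniform(0, 1) input activations


def emulation_error(fan_in, trials=500):
    """Std. dev. of (bias-free emulation - biased output) over random inputs."""
    x = rng.uniform(0.0, 1.0, size=(trials, fan_in))
    w = rng.normal(0.0, 0.1, size=fan_in)
    with_bias = x @ w + bias
    # Fold the bias into the weights: every input carries bias / (fan_in * mu),
    # so the extra term equals bias * mean(x) / mu, which is close to bias.
    w_folded = w + bias / (fan_in * mu)
    without_bias = x @ w_folded
    return float(np.std(without_bias - with_bias))


errors = {n: emulation_error(n) for n in (25, 400, 6400)}
for n, err in errors.items():
    print(f"fan-in {n:>5}: error std = {err:.4f}")
```

Under these assumptions the error falls roughly as 1/sqrt(fan-in). Real activations are correlated and their mean varies from image to image, so the emulation would be less exact in practice.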
I understand the role of bias in a neural network. However, in a conv layer, where we just want to learn local features such as edges, patterns, moments, etc. that depend on the previous layer, do we really need a bias? Is there any explanation for the results comparing test accuracy with and without bias?
– Aparajuli
Aug 23 at 15:00
@Aparajuli In a large model, removing the bias inputs makes very little difference because each node can make a bias node out of the average activation of all of its inputs, which by the law of large numbers will be roughly normal. At the first layer, the ability for this to happen depends on your input distribution. For MNIST, for example, the input's average activation is roughly constant. On a small network, of course, you need a bias input, but on a large network, removing it makes almost no difference. (But why would you remove it?)
– Amir Hadifar
Aug 23 at 15:12
My point is exactly what you have written: "On a small network, of course you need a bias input, but on a large network, removing it makes almost no difference." If it makes no difference, why add it? Removing the bias means fewer parameters to learn and less training time. There are hundreds or even thousands (in deeper architectures) fewer parameters to learn.
– Aparajuli
Aug 23 at 15:26
@Aparajuli, in today's NN architectures, hundreds of biases are negligible compared to millions of parameters. Unfortunately, I couldn't find a mathematical reason.
– Amir Hadifar
Aug 23 at 19:11
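To put rough numbers on the "negligible" claim for the model in the question, here is a small parameter count. The hyperparameters come from the question's code; `image_size = 28`, `num_channels = 1` and `num_labels = 10` are assumptions, since the question does not state them.

```python
# Hyperparameters taken from the question's code.
patch_size, depth, num_hidden = 5, 16, 64
# Assumed dataset dimensions (not given in the question).
image_size, num_channels, num_labels = 28, 1, 10

weights = {
    "conv1": patch_size * patch_size * num_channels * depth,
    "conv2": patch_size * patch_size * depth * depth,
    "fc3": (image_size // 4) * (image_size // 4) * depth * num_hidden,
    "fc4": num_hidden * num_labels,
}
biases = {"conv1": depth, "conv2": depth, "fc3": num_hidden, "fc4": num_labels}

total_weights = sum(weights.values())
total_biases = sum(biases.values())
conv_biases = biases["conv1"] + biases["conv2"]
conv_bias_share = conv_biases / (total_weights + total_biases)

print(f"weights: {total_weights}, biases: {total_biases}")
print(f"conv biases: {conv_biases} ({conv_bias_share:.4%} of all parameters)")
```

With these assumed sizes, dropping the conv biases removes 32 of about 58 thousand parameters, so the saving in parameters and training time is marginal even for this small model.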
layer1_biases and layer2_biases are NOT in convolutional layers, but in ReLU layers. The existence of ReLU layers makes sense because, as quoted from Wikipedia:
ReLU is the abbreviation of Rectified Linear Units. This layer applies the non-saturating activation function f(x) = max(0, x). It increases the nonlinear properties of the decision function and of the overall network without affecting the receptive fields of the convolution layer.
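Wherever one chooses to attribute the bias, its effect on f(x) = max(0, x) is to move the point at which each channel switches on. Here is a minimal NumPy sketch with made-up values (the arrays `conv` and `biases` are illustrative, not taken from the question's model), using the same NHWC layout and per-channel broadcasting as `conv + layer1_biases`:

```python
import numpy as np


def relu(x):
    return np.maximum(0.0, x)


# Toy conv output in NHWC layout: 1 image, 2x2 positions, 3 channels.
conv = np.array([[[[-1.0, 0.5, 2.0],
                   [0.2, -0.3, -2.0]],
                  [[1.5, -0.8, 0.0],
                   [-0.4, 0.1, 0.7]]]])
biases = np.array([1.0, 0.0, -1.0])  # one bias per output channel

no_bias = relu(conv)
with_bias = relu(conv + biases)  # broadcasts over batch, height and width

# Number of strictly positive activations per channel.
active_no_bias = (no_bias > 0).sum(axis=(0, 1, 2))
active_with_bias = (with_bias > 0).sum(axis=(0, 1, 2))
print(active_no_bias)    # [2 2 2]
print(active_with_bias)  # [3 2 1]
```

A positive bias lets more positions in a channel pass the ReLU, a negative bias lets fewer pass, and a zero bias changes nothing. Without a bias, the threshold is fixed at zero for every channel.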
My point is that biases are required in small networks to shift the activation function. But in a deep network that has multiple CNN layers and other non-linear activations, does the bias make a difference? Omitting the bias term in the above makes almost no difference.
– Aparajuli
Aug 23 at 1:35
Biases are needed for convolutional layers for the same reason why they are needed for other layers. stackoverflow.com/questions/2480650/…
– HSK
Aug 22 at 6:06