[tts] A First Look at WaveNet

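The WaveNet paper is DeepMind's work on generating speech with CNNs; the model is trained directly on raw audio waveforms. An open-source implementation, tensorflow-wavenet, is available on GitHub.
The main points of the paper are:
1. Speech generated by the model sounds very close to a real human voice.
2. WaveNet can also capture the characteristics of different speakers with high fidelity.
3. Trained on music files, the model can generate new, high-fidelity music fragments.
4. With a discriminative component added, it can also be applied to speech recognition.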

Model structure

![wavenet](file:///C:/Users/makui/Desktop/wavenet.png)
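To enlarge the receptive field, the paper uses dilated convolutions (convolutions that skip over inputs). Although audio is a continuous signal, each sample is quantized to 256 levels, which turns prediction into a classification problem. The paper factorizes the joint probability of a waveform into a product of conditionals, $p(x)=\prod_{t=1}^{T} p(x_{t}|x_{1},...,x_{t-1})$, and implements this with a stack of causal convolutional layers.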
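
As a rough illustration of the factorization above, the sketch below evaluates log p(x) as a sum of per-step conditional log-probabilities. `toy_conditional` is a hypothetical stand-in for the network (it comes neither from the paper nor from tensorflow-wavenet); the only property it shares with WaveNet is that it looks at past samples only.

```python
import math

def toy_conditional(history, num_classes=4):
    """Hypothetical stand-in for the network: a distribution over the
    next sample that depends only on past samples (here, the last one)."""
    probs = [1.0] * num_classes
    if history:
        probs[history[-1]] += 2.0  # favour repeating the previous value
    total = sum(probs)
    return [p / total for p in probs]

def sequence_log_prob(x, num_classes=4):
    """log p(x) = sum over t of log p(x_t | x_1, ..., x_{t-1})."""
    log_p = 0.0
    for t, sample in enumerate(x):
        probs = toy_conditional(x[:t], num_classes)  # sees x[0..t-1] only
        log_p += math.log(probs[sample])
    return log_p

print(sequence_log_prob([0, 0, 1]))
```

Because each factor is a categorical distribution over the quantized values, training reduces to a per-step classification loss.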

input

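Raw audio is stored as a sequence of 16-bit integers. To keep the computation tractable, the paper applies μ-law companding: the 16-bit samples are normalized to (-1, 1) and then quantized to 256 values.
$f(x_{t})=\operatorname{sign}(x_{t})\frac{\ln(1+\mu |x_{t}|)}{\ln(1+\mu)},\quad -1<x_{t}<1,\ \mu =255$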

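G711 coding gives clear, natural-sounding speech but compresses poorly; its output is 8 bits per sample, and it comes in two variants, ulaw and alaw. alaw, also called G711a, takes 13-bit input (the high 13 bits of an s16 sample) and is widely used in Europe and China; ulaw, also called G711u, takes 14-bit input and is used mainly in North America and Japan. The encoders are mostly table-driven: a base value plus an average offset.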
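
The companding formula above can be sketched in plain Python as follows. This is a minimal illustration that assumes the input has already been normalized to [-1, 1]; the function names and the rounding rule are choices made for this example, not the G711 table-lookup encoder.

```python
import math

def mu_law_encode(x, quantization_channels=256):
    """Map a sample in [-1, 1] to an integer code in [0, 255]."""
    mu = quantization_channels - 1
    x = max(-1.0, min(1.0, x))  # clip to the valid range
    magnitude = math.log1p(mu * abs(x)) / math.log1p(mu)
    signal = math.copysign(magnitude, x)  # f(x) from the formula, in [-1, 1]
    return int((signal + 1) / 2 * mu + 0.5)  # rescale to [0, mu] and round

def mu_law_decode(code, quantization_channels=256):
    """Approximate inverse: integer code back to a sample in [-1, 1]."""
    mu = quantization_channels - 1
    signal = 2 * (code / mu) - 1
    magnitude = ((1 + mu) ** abs(signal) - 1) / mu
    return math.copysign(magnitude, signal)

for sample in (-1.0, -0.01, 0.0, 0.01, 0.5, 1.0):
    code = mu_law_encode(sample)
    print(sample, code, round(mu_law_decode(code), 4))
```

Small amplitudes get more codes than large ones, which is why 256 levels are enough for intelligible audio.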

convolution

Convolutions

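WaveNet uses causal convolutions, which guarantee that the prediction $p(x_{t+1}|x_{1},...,x_{t})$ contains no information from the future samples $x_{t+1},x_{t+2},...,x_{T}$.
For 1D data, the output label is simply the input shifted by one step.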

input: x: 0-255 int
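output: label: x shifted by 1 toward the negative direction (left). The first t samples are used to predict sample t+1; the vacated position is padded with 0.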
eg: x=[1,2,3,4,5] y=[2,3,4,5,0]
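
The label construction above is a one-line shift. A minimal sketch (the helper name `make_targets` is made up for this example):

```python
def make_targets(samples):
    """Shift left by one step and pad the vacated last position with 0."""
    return samples[1:] + [0]

x = [1, 2, 3, 4, 5]
y = make_targets(x)
# Each pair is (input sample at step t, target = sample at step t+1).
print(list(zip(x, y)))
```

The last pair has no real target, which is why that position is padded.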

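Because audio is sampled very densely, the paper uses dilated convolutions to enlarge the receptive field: with the same number of layers and parameters as an ordinary convolution stack, the receptive field grows exponentially with depth.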
[Figure: conv1]
[Figure: conv2]
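
To make the growth concrete, the sketch below computes the receptive field of a stack of causal convolutions. It assumes every layer has the same filter width and counts only the dilated stack, not any initial causal layer; the helper name is made up for this example.

```python
def receptive_field(dilations, filter_width=2):
    """Number of input samples that can influence one output sample.

    Each layer with dilation d and width k extends the field by (k - 1) * d.
    """
    return (filter_width - 1) * sum(dilations) + 1

plain = receptive_field([1] * 10)                       # 10 ordinary layers
dilated = receptive_field([2 ** i for i in range(10)])  # 1, 2, 4, ..., 512
print(plain, dilated)  # prints: 11 1024
```

Both stacks have ten layers with two taps each, so the parameter count is the same; only the spacing of the taps differs.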

Gated Activation Units

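For modeling audio signals, the paper reports that the gated activation $z=\tanh(W_{f}*x)\odot\sigma(W_{g}*x)$ works better than ReLU.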
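
A minimal element-wise sketch of the gated activation unit, assuming the filter and gate convolutions have already been computed (the two lists below are made-up values, not network outputs):

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def gated_activation(filter_out, gate_out):
    """tanh(filter) * sigmoid(gate), element by element."""
    return [math.tanh(f) * sigmoid(g) for f, g in zip(filter_out, gate_out)]

filter_out = [2.0, 2.0, -2.0]
gate_out = [-6.0, 6.0, 6.0]
# A strongly negative gate suppresses the unit; a strongly positive gate
# lets the tanh value through almost unchanged.
print(gated_activation(filter_out, gate_out))
```

The tanh branch supplies a signed value in (-1, 1) and the sigmoid branch scales it by a factor in (0, 1).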

residual block and skip connection

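WaveNet uses residual blocks and skip connections to speed up convergence and to let gradients propagate through deeper models.
Residual and skip connections were proposed by Kaiming He et al. in the 2015 ResNet paper.
ResNet was motivated by the fact that deeper networks are harder to train, chiefly because of vanishing or exploding gradients. One remedy is normalization: normalized initialization and normalization of intermediate layers (batch normalization). As depth keeps growing, however, training accuracy saturates and then degrades rapidly: adding more layers to a suitably deep model leads to higher training error, and this is not caused by overfitting. ResNet addresses this degradation problem with deep residual learning.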

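The construction adds an identity mapping (f(x)=x) and explicitly lets the layers fit a residual mapping: denote the desired underlying mapping by H(x) and let the stacked non-linear layers fit F(x)=H(x)-x instead. Approximating an identity mapping with several non-linear layers is hard; after reformulating as H(x)=F(x)+x, if the identity mapping is optimal, the weights of the non-linear layers can simply be driven toward 0 so that the block approaches the identity.
Experiments in the ResNet paper show that identity mappings are sufficient to address the degradation problem, and that they are economical (introduce neither extra parameter nor computation complexity).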

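Formulating the input and output: $y=F(x,\{W_{i}\})+W_{s}x$
$W_{s}$ is used only when dimensions need to be matched, for example before and after a pooling operation.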
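
The point that zero weights give the identity can be checked with a toy block. This is a minimal sketch with a single ReLU layer standing in for F and no $W_{s}$ projection (input and output have the same size); it is not the ResNet or WaveNet layer itself.

```python
def residual_block(x, weights):
    """y = F(x) + x, where F(x)_i = relu(sum_j weights[i][j] * x[j])."""
    fx = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weights]
    return [f + v for f, v in zip(fx, x)]

x = [1.0, -2.0, 3.0]
zero_weights = [[0.0] * 3 for _ in range(3)]
# With all weights at 0, F(x) = 0 and the block reduces to the identity.
print(residual_block(x, zero_weights))

some_weights = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0], [0.0, 0.0, 0.5]]
print(residual_block(x, some_weights))
```

The block only has to learn the deviation from the identity, which is the easier target when the identity is already close to optimal.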

Code

Pseudocode

def loss(self, input_batch, name='wavenet'):
	'''Creates a WaveNet network and returns the autoencoding loss.'''
	with tf.name_scope(name):
		# mu-law encoding; returns integers in 0-255
		input_batch = mu_law_encode(input_batch,
									self.quantization_channels)
		encoded = one_hot(input_batch)
		#wavenet model output
		raw_output = self._create_network(encoded)

		with tf.name_scope('loss'):
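			# Shift left by one step (drop the first step) so that every position predicts the next sample.
			# Values shown as scalars for illustration: encoded=[0,0,0,0,1,2,3,4,5], encoded.shape=(1,9,1)
			# shifted.shape=(1,8,1),[0,0,0,1,2,3,4,5]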
			shifted = tf.slice(encoded, [0, 1, 0],
							   [-1, tf.shape(encoded)[1] - 1, -1])
			# Pad one zero step at the end: shifted.shape=(1,9,1),[0,0,0,1,2,3,4,5,0]
			shifted = tf.pad(shifted, [[0, 0], [0, 1], [0, 0]])

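			# Reshape the model output to [-1, quantization_channels]
			prediction = tf.reshape(raw_output,
									[-1, self.quantization_channels])
			# Loss function (TensorFlow 1.0 and later require keyword arguments here)
			loss = tf.nn.softmax_cross_entropy_with_logits(
				logits=prediction,
				labels=tf.reshape(shifted, [-1, self.quantization_channels]))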
			reduced_loss = tf.reduce_mean(loss)

	return reduced_loss

def create_network(input_batch):
	outputs = []
	current_layer = input_batch
	current_layer = create_causal_layer(current_layer)
	# Add all defined dilation layers.	#18 layers
	for layer_index, dilation in enumerate(self.dilations):
		output, current_layer = create_dilation_layer(
			current_layer, layer_index, dilation)
		outputs.append(output)
	# Perform (+) -> ReLU -> 1x1 conv -> ReLU -> 1x1 conv to the output.
	total = sum(outputs)
	transformed1 = tf.nn.relu(total)
	conv1 = tf.nn.conv1d(transformed1, w1, stride=1, padding="SAME")
	transformed2 = tf.nn.relu(conv1)
	conv2 = tf.nn.conv1d(transformed2, w2, stride=1, padding="SAME")

	return conv2


def create_dilation_layer(input_batch, layer_index, dilation):

	'''Creates a single causal dilated convolution layer.
	The layer contains a gated filter that connects to dense output
	and to a skip connection:

		   |-> [gate]   -|        |-> 1x1 conv -> skip output
		   |             |-> (*) -|
	input -|-> [filter] -|        |-> 1x1 conv -|
		   |                                    |-> (+) -> dense output
		   |------------------------------------|

	Where `[gate]` and `[filter]` are causal convolutions with a
	non-linear activation at the output.
	'''

	conv_filter = causal_conv(input_batch, weights_filter, dilation)
	conv_gate = causal_conv(input_batch, weights_gate, dilation)
	out = tf.tanh(conv_filter) * tf.sigmoid(conv_gate)
	transformed = tf.nn.conv1d(
		out, weights_dense, stride=1, padding="SAME", name="dense")
	skip_contribution = tf.nn.conv1d(
		out, weights_skip, stride=1, padding="SAME", name="skip")

	# Return the skip output and (residual + input)
	return skip_contribution, input_batch + transformed


def create_causal_layer(input_batch):
	# The implementation is neat; see the causal_conv walkthrough
	# in the source code section.
	pass
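
To see what the loss pseudocode above computes per time step, here is a plain-Python sketch of softmax cross-entropy against a one-hot label. It is an illustration written for this note, not the TensorFlow implementation. It also shows a side effect of the zero padding: in this formulation an all-zero label row contributes a loss of 0 for that step.

```python
import math

def softmax_cross_entropy(logits, label_row):
    """-sum_k label_k * log softmax(logits)_k for a single time step."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return -sum(l * (v - log_z) for v, l in zip(logits, label_row))

logits = [2.0, 0.5, -1.0]
print(softmax_cross_entropy(logits, [1, 0, 0]))  # target is the likeliest class
print(softmax_cross_entropy(logits, [0, 0, 1]))  # target is the least likely class
print(softmax_cross_entropy(logits, [0, 0, 0]))  # padded step: no target
```

The padded step is still counted when the per-step losses are averaged.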

source code

causal_conv

def time_to_batch(value, dilation, name=None):
    with tf.name_scope('time_to_batch'):
        #value=[[[0],[0],[0],[0],[1],[2],[3],[4],[5]]]
        #value.shape=(1,9,1) dilation=4
        shape = tf.shape(value)
        #pad_elements=4-1-(9+4-1)%4=3
        pad_elements = dilation - 1 - (shape[1] + dilation - 1) % dilation
        #padded=[[[0],[0],[0],[0],[1],[2],[3],[4],[5],[0],[0],[0]]]
        padded = tf.pad(value, [[0, 0], [0, pad_elements], [0, 0]])
        #reshaped=[[[0],[0],[0],[0]],[[1],[2],[3],[4]],[[5],[0],[0],[0]]]
        #reshaped.shape=(3,4,1)
        reshaped = tf.reshape(padded, [-1, dilation, shape[2]])
        #transposed=[[[0],[1],[5]],[[0],[2],[0]],[[0],[3],[0]],[[0],[4],[0]]]
        #transposed.shape=(4,3,1); each row holds samples spaced `dilation` apart
        transposed = tf.transpose(reshaped, perm=[1, 0, 2])
        #return shape=(4,3,1)
        return tf.reshape(transposed, [shape[0] * dilation, -1, shape[2]])


def batch_to_time(value, dilation, name=None):
    with tf.name_scope('batch_to_time'):
        shape = tf.shape(value)
        prepared = tf.reshape(value, [dilation, -1, shape[2]])
        transposed = tf.transpose(prepared, perm=[1, 0, 2])
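        # Restores the time-major layout that time_to_batch received.
        # For the example above the return shape is (1,12,1): the 3 elements
        # padded at the end by time_to_batch are still present.
        return tf.reshape(transposed,
                          [shape[0] // dilation, -1, shape[2]])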
# Example: filter_width=2, dilation=4, value=[[[1],[2],[3],[4],[5]]], value.shape=(1,5,1)
def causal_conv(value, filter_, dilation, name='causal_conv'):
    with tf.name_scope(name):
        # Pad beforehand to preserve causality.
        filter_width = tf.shape(filter_)[0]
        # padding=[[0, 0], [4, 0], [0, 0]]
        padding = [[0, 0], [(filter_width - 1) * dilation, 0], [0, 0]]
        # Prepend 4 zeros along the time axis: shape (1,5,1) -> (1,9,1)
        # padded=[[[0],[0],[0],[0],[1],[2],[3],[4],[5]]]
        padded = tf.pad(value, padding)
        if dilation > 1:
            # return shape=(4,3,1)
            transformed = time_to_batch(padded, dilation)
            
            conv = tf.nn.conv1d(transformed, filter_, stride=1, padding='SAME')

            restored = batch_to_time(conv, dilation)
        else:
            restored = tf.nn.conv1d(padded, filter_, stride=1, padding='SAME')
        # Remove excess elements at the end.
        result = tf.slice(restored,
                          [0, 0, 0],
                          [-1, tf.shape(value)[1], -1])
        # result.shape == value.shape (the slice keeps tf.shape(value)[1] steps)
        return result
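
The idea behind `time_to_batch` can be checked without TensorFlow. The sketch below is a plain-Python illustration for a filter of width 2 (function names are made up for this example): it compares a directly computed dilated causal convolution with the "split into interleaved sub-sequences, run an ordinary convolution, re-interleave" route.

```python
def dilated_causal_conv(x, w, dilation):
    """Direct form: y[t] = w[0] * x[t - dilation] + w[1] * x[t],
    with zeros standing in for samples before t = 0."""
    return [w[0] * (x[t - dilation] if t >= dilation else 0) + w[1] * x[t]
            for t in range(len(x))]

def via_time_to_batch(x, w, dilation):
    """Same result by moving every `dilation`-th sample into its own
    sub-sequence, convolving each with dilation 1, and interleaving back."""
    out = [0] * len(x)
    for phase in range(dilation):
        sub = x[phase::dilation]               # samples spaced `dilation` apart
        out[phase::dilation] = dilated_causal_conv(sub, w, 1)
    return out

x = [1, 2, 3, 4, 5]
w = [1, 10]
print(dilated_causal_conv(x, w, 4))
print(via_time_to_batch(x, w, 4))
```

Treating the sub-sequences as a batch is what lets the original code reuse the ordinary `tf.nn.conv1d` for any dilation.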

dilation_layer

def _create_dilation_layer(self, input_batch, layer_index, dilation):
	'''Creates a single causal dilated convolution layer.

	The layer contains a gated filter that connects to dense output
	and to a skip connection:

		   |-> [gate]   -|        |-> 1x1 conv -> skip output
		   |             |-> (*) -|
	input -|-> [filter] -|        |-> 1x1 conv -|
		   |                                    |-> (+) -> dense output
		   |------------------------------------|

	Where `[gate]` and `[filter]` are causal convolutions with a
	non-linear activation at the output.
	'''
	variables = self.variables['dilated_stack'][layer_index]

	weights_filter = variables['filter']
	weights_gate = variables['gate']

	conv_filter = causal_conv(input_batch, weights_filter, dilation)
	conv_gate = causal_conv(input_batch, weights_gate, dilation)

	if self.use_biases:
		filter_bias = variables['filter_bias']
		gate_bias = variables['gate_bias']
		conv_filter = tf.add(conv_filter, filter_bias)
		conv_gate = tf.add(conv_gate, gate_bias)

	# Combine the filter and gate outputs (gated activation)
	out = tf.tanh(conv_filter) * tf.sigmoid(conv_gate)

	# The 1x1 conv to produce the residual output
	weights_dense = variables['dense']
	transformed = tf.nn.conv1d(
		out, weights_dense, stride=1, padding="SAME", name="dense")

	# The 1x1 conv to produce the skip output
	weights_skip = variables['skip']
	#skip output
	skip_contribution = tf.nn.conv1d(
		out, weights_skip, stride=1, padding="SAME", name="skip")

	if self.use_biases:
		dense_bias = variables['dense_bias']
		skip_bias = variables['skip_bias']
		transformed = transformed + dense_bias
		skip_contribution = skip_contribution + skip_bias

	# Return the skip output and (residual + input)
	return skip_contribution, input_batch + transformed

network

def _create_network(self, input_batch):
	'''Construct the WaveNet network.'''
	outputs = []
	current_layer = input_batch

	current_layer = self._create_causal_layer(current_layer)

	# Add all defined dilation layers.	#18 layers
	with tf.name_scope('dilated_stack'):
		for layer_index, dilation in enumerate(self.dilations):
			with tf.name_scope('layer{}'.format(layer_index)):
				output, current_layer = self._create_dilation_layer(
					current_layer, layer_index, dilation)
				outputs.append(output)

	# Postprocessing layers
	with tf.name_scope('postprocessing'):
		# Perform (+) -> ReLU -> 1x1 conv -> ReLU -> 1x1 conv to
		# postprocess the output.
		# conv weight
		w1 = self.variables['postprocessing']['postprocess1']
		w2 = self.variables['postprocessing']['postprocess2']
		if self.use_biases:
			b1 = self.variables['postprocessing']['postprocess1_bias']
			b2 = self.variables['postprocessing']['postprocess2_bias']

		# Sum the skip-connection outputs of every layer here.
		total = sum(outputs)
		transformed1 = tf.nn.relu(total)
		conv1 = tf.nn.conv1d(transformed1, w1, stride=1, padding="SAME")
		if self.use_biases:
			conv1 = tf.add(conv1, b1)
		transformed2 = tf.nn.relu(conv1)
		conv2 = tf.nn.conv1d(transformed2, w2, stride=1, padding="SAME")
		if self.use_biases:
			conv2 = tf.add(conv2, b2)

	return conv2

Reference

WAVENET
github open source code
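[Emotibot Tech] WaveNet speech synthesis and deep generative models explained
WaveNet: analysis and implementation
How does Google's WaveNet generate sound with deep learning?
Google WaveNet source code explained in detail
Tech | A TensorFlow implementation of DeepMind's speech generation model WaveNet
Notes on Pixel CNN, Wavenet, GCNN
Some ResNet references: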
https://zhuanlan.zhihu.com/p/27082562
https://tracholar.github.io/wiki/machine-learning/residual-network.html
https://zhuanlan.zhihu.com/p/22071346
https://zhuanlan.zhihu.com/p/22447440
http://blog.csdn.net/mao_feng/article/details/52734438
http://blog.csdn.net/bea_tree/article/details/51817142
http://www.voidcn.com/article/p-hnytmiyq-pt.html
http://pengshuang.space/2017/08/05/Resnet-学习笔记/
G711 encoding:
https://en.wikipedia.org/wiki/Μ-law_algorithm
http://www.voidcn.com/article/p-fgyvjsfz-bmq.html
http://www.21ic.com/evm/audio/201705/721797.htm