Predicting a Trigonometric Function with an LSTM

Preface

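I had been saying for a long time that I would hand-write an LSTM prediction example, but I shelved it last semester after hitting an issue while using bucketing, and it has since caught me off guard a few embarrassing times (⊙﹏⊙)b.
Anyway, here is the issue first: https://github.com/apache/incubator-mxnet/issues/8663. No expert has replied to it so far (I have no idea why...).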

Code

Today it came up suddenly again, so I wrote a short test program (perhaps everyone uses gluon now and no longer bothers with the symbol API):

import mxnet as mx
from mxnet import gluon

hidden_sizes = [10, 20, 1]  # LSTM layer widths; the last layer emits the scalar prediction
batch_size = 300
iteration = 300000          # total number of samples
log_freq = 20               # print the mean loss every log_freq batches
ctx = mx.gpu()
opt = 'adam'  # or 'sgd'

unroll_len = 9
# unroll_len + 1 time offsets: 0, 0.01, ..., 0.09
t = mx.nd.arange(0, 1 + unroll_len, ctx=ctx) * 0.01
# one random start time per sample
tt = mx.nd.random.uniform(shape=(iteration, 1), ctx=ctx)
t = (t + tt).T  # (unroll_len + 1, iteration)
y = mx.nd.sin(t[-1]) / 2  # label: half the sine of the last time stamp

model = gluon.rnn.SequentialRNNCell()
with model.name_scope():
    for hidden_size in hidden_sizes:
        model.add(gluon.rnn.LSTMCell(hidden_size))
model.initialize(ctx=ctx)
L = gluon.loss.L2Loss()
trainer = gluon.Trainer(model.collect_params(), opt)
prev_batch_idx = -1
acc_l = mx.nd.array([0, ], ctx=ctx)

for batch_idx in range(iteration // batch_size):
    lo, hi = batch_idx * batch_size, (batch_idx + 1) * batch_size
    # one input of shape (batch_size, 1) per time step
    x_list = [t[i][lo:hi].reshape((-1, 1)) for i in range(unroll_len)]
    label = y[lo:hi].reshape((-1, 1))
    with mx.autograd.record():
        outputs, states = model.unroll(unroll_len, x_list)
        l = L(outputs[-1], label)  # only the last output is supervised
    l.backward()
    trainer.step(batch_size)
    acc_l += l.mean()
    if batch_idx - prev_batch_idx == log_freq:
        print('loss:%.4f' % (acc_l / log_freq).asscalar())
        prev_batch_idx = batch_idx
        acc_l *= 0
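
To make the data layout concrete, here is a minimal pure-Python sketch of how a single sample is put together, assuming the same spacing of 0.01 and the same unroll_len as the program above. The helper name make_sample is made up for illustration and is not part of the original program.

```python
import math
import random

unroll_len = 9   # number of inputs per sample, as in the program above
step = 0.01      # spacing between consecutive time stamps

def make_sample(offset):
    """Hypothetical helper: build one (inputs, label) pair from a start time."""
    # unroll_len + 1 evenly spaced time stamps starting at `offset`
    times = [offset + step * k for k in range(unroll_len + 1)]
    inputs = times[:unroll_len]        # fed to the cell, one value per step
    label = math.sin(times[-1]) / 2    # target for the last output
    return inputs, label

inputs, label = make_sample(random.random())
print(len(inputs), label)
```

So the network sees unroll_len time stamps and is asked for half the sine of the time stamp one step after the last input.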

Note

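adam converges noticeably faster than sgd; see the loss comparison table below.
Is it a problem that there is no relu activation? And does the model become hard to optimize once more layers are stacked?
On the first question: the defining equations of the LSTM have no place for one. On the second question, I found a couple of links: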
https://www.reddit.com/r/MachineLearning/comments/30eges/batch_normalization_or_other_tricks_for_lstms/
https://groups.google.com/forum/#!topic/lasagne-users/EczUQckJggU
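These are related discussions.
There is also this work (http://cn.arxiv.org/abs/1603.09025), which proposes batch normalization (BN) for the hidden-to-hidden transition. Judging from the description and the posted results, it brings no appreciable improvement in convergence speed or accuracy.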
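
For reference on the first question, this is the commonly used LSTM formulation without peephole connections. The symbol names are my own notation rather than anything taken from the code above. The only nonlinearities are the sigmoid gates and the two tanh calls, which is why there is no obvious slot for a relu:

```latex
\begin{aligned}
i_t &= \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i) \\
f_t &= \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f) \\
g_t &= \tanh(W_{xg} x_t + W_{hg} h_{t-1} + b_g) \\
o_t &= \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

The hidden-to-hidden transition mentioned above corresponds to the $W_{h\cdot} h_{t-1}$ terms.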

Optimizer	1	2	3	4	5
ADAM	0.0378	0.0223	0.0059	0.0043	0.0030
SGD	0.0387	0.0335	0.0284	0.0247	0.0214

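Note added 2018.3.12
The loss above is computed only on the last output (outputs[-1]), so, intuitively, the resulting model cannot adapt to varying sequence lengths. For example, printing the outputs predicted at each step: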

	1	2	3	4	5	6	7	8	9
Label	0.3411144	0.35049742	0.35984537	0.36915734	0.37843242	0.38766962	0.39686805	0.40602681	0.41514498
Final Predict	0.0347072	0.06628791	0.09616056	0.12489445	0.15264377	0.17937641	0.20499264	0.22938542	0.25246838
Initial Predict	6.92845606e-06	1.78557530e-05	2.94411457e-05	3.98039047e-05	4.82983705e-05	5.49682954e-05	6.01386819e-05	6.41888546e-05	6.74523835e-05

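The first few outputs have large errors. It is easy to see that learning over variable lengths is, in essence, still a form of mapping learning.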
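
As a quick check of that statement, the sketch below recomputes the per-step absolute error from the Label and Final Predict rows of the table, assuming the columns 1 to 9 are the unrolled steps. The numbers are copied from the table; nothing here is re-run against the model.

```python
# Values copied from the table above (columns 1..9)
label = [0.3411144, 0.35049742, 0.35984537, 0.36915734, 0.37843242,
         0.38766962, 0.39686805, 0.40602681, 0.41514498]
final_predict = [0.0347072, 0.06628791, 0.09616056, 0.12489445, 0.15264377,
                 0.17937641, 0.20499264, 0.22938542, 0.25246838]

# absolute error at each unrolled step
errors = [abs(l - p) for l, p in zip(label, final_predict)]
for step, err in enumerate(errors, start=1):
    print('step %d: |label - predict| = %.4f' % (step, err))
```

The error is largest at step 1 (about 0.31) and shrinks at every step, down to about 0.16 at step 9, the only step the loss actually supervises.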