PointNet++: scannet/train.py

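1. The author appears to have split the ScanNet dataset into a training set and a test set and preprocessed each into a .pickle file.

2. At run time, the code reads the x, y, z coordinates of the 1201 training scenes and the 312 test scenes from the .pickle files.

3. It is worth saving the points to .txt files and inspecting them in CloudCompare.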

2--floor

3--chair

8--sofa

20--cushion

Save the training points to .txt files, one file per scene:

TRAIN_DATASET = scannet_dataset.ScannetDataset(root=DATA_PATH, npoints=NUM_POINT, split='train')
for i in range(len(TRAIN_DATASET.scene_points_list)):
    filename=''.join(["TRAIN_DATASET_",str(i+1),'.txt'])
    np.savetxt(filename, TRAIN_DATASET.scene_points_list[i],fmt="%.8f", delimiter=',')

Save the training labels to .txt files, one file per scene:

for i in range(len(TRAIN_DATASET.semantic_labels_list)):
    filename=''.join(["data/train_dataset/train_label_",str(i+1),'.txt'])
    np.savetxt(filename, TRAIN_DATASET.semantic_labels_list[i],fmt="%d", delimiter=',')

Save the test points to .txt files, one file per scene:

TEST_DATASET = scannet_dataset.ScannetDataset(root=DATA_PATH, npoints=NUM_POINT, split='test')
for i in range(len(TEST_DATASET.scene_points_list)):
    filename=''.join(["data/test_dataset/test_",str(i+1),'.txt'])
    np.savetxt(filename, TEST_DATASET.scene_points_list[i],fmt="%.8f", delimiter=',')

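Save the test labels to .txt files, one file per scene:

for i in range(len(TEST_DATASET.semantic_labels_list)):
    # use a separate file name so the point files are not overwritten; labels are integers
    filename=''.join(["data/test_dataset/test_label_",str(i+1),'.txt'])
    np.savetxt(filename, TEST_DATASET.semantic_labels_list[i],fmt="%d", delimiter=',')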

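Save a training scene together with its labels (scene 1 shown here; the scenes have different point counts, so they are stacked one at a time):

traindata_and_label=np.column_stack((TRAIN_DATASET.scene_points_list[0], TRAIN_DATASET.semantic_labels_list[0]))  # np.column_stack joins the (N,3) points and the N labels into an (N,4) array
filename=''.join(["data/train_dataset/train_data_and_label_",str(1),'.txt'])
np.savetxt(filename, traindata_and_label,fmt="%.8f,%.8f,%.8f,%d", delimiter=',')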

Save a test scene together with its labels (scene 1 shown here):

testdata_and_label=np.column_stack((TEST_DATASET.scene_points_list[0], TEST_DATASET.semantic_labels_list[0]))  # np.column_stack joins the (N,3) points and the N labels into an (N,4) array
filename=''.join(["data/test_dataset/test_data_and_label_",str(1),'.txt'])
np.savetxt(filename, testdata_and_label,fmt="%.8f,%.8f,%.8f,%d", delimiter=',')
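The per-scene export above can be tried without the dataset. The following is a minimal sketch, assuming each scene is an (N, 3) array with a matching length-N label array; the synthetic scenes, the `export_scene` helper, and the temporary output directory are illustrative stand-ins, not part of the original script.

```python
import os
import tempfile
import numpy as np

def export_scene(points, labels, path):
    """Write one scene as x,y,z,label rows, a layout CloudCompare can import."""
    rows = np.column_stack((points, labels))  # (N, 4)
    np.savetxt(path, rows, fmt="%.8f,%.8f,%.8f,%d")
    return rows.shape

# Hypothetical stand-ins for scene_points_list / semantic_labels_list.
rng = np.random.default_rng(0)
scenes = [rng.random((5, 3)), rng.random((8, 3))]
labels = [rng.integers(0, 21, 5), rng.integers(0, 21, 8)]

out_dir = tempfile.mkdtemp()
paths = []
for i, (pts, lab) in enumerate(zip(scenes, labels)):
    path = os.path.join(out_dir, "train_data_and_label_%d.txt" % (i + 1))
    export_scene(pts, lab, path)
    paths.append(path)
```

Writing one file per scene avoids the problem that scenes with different point counts cannot be stacked into a single array.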

    def __getitem__(self, index):
        point_set = self.scene_points_list[index]
        semantic_seg = self.semantic_labels_list[index].astype(np.int32)
        coordmax = np.max(point_set,axis=0)
        coordmin = np.min(point_set,axis=0)
        smpmin = np.maximum(coordmax-[1.5,1.5,3.0], coordmin) # (1)
        smpmin[2] = coordmin[2]
        smpsz = np.minimum(coordmax-smpmin,[1.5,1.5,3.0])
        smpsz[2] = coordmax[2]-coordmin[2]
        isvalid = False
        #global sample_weight  # 2019.11.4
        sample_weight=0  # 2019.11.4

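# (1) Each scene is sampled in blocks with a 1.5 m × 1.5 m footprint. The nominal block height is 3 m, but the code sets the height to that of the scene's bounding box (smpsz[2] = coordmax[2]-coordmin[2]), so scenes lower than 3 m are handled as well.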

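From the paper:

B.2 Virtual Scan Generation

In this section, we describe how we generate labeled virtual scans with non-uniform sampling density from ScanNet scenes. For each scene in ScanNet, we set the camera location 1.5 m above the centroid of the floor plane and rotate the camera orientation in the horizontal plane evenly in 8 directions. In each direction, we use an image plane of size 100 px and cast a ray from the camera through each pixel into the scene. This gives a way to select the visible points in the scene. We can then generate 8 virtual scans for each test scene in this manner, and an example is shown in Figure 9. Notice that the point samples are denser in regions closer to the camera.

B.4 ScanNet Experiment Details

To generate training data from ScanNet scenes, we sample 1.5 m × 1.5 m × 3 m cubes from the initial scene and then keep the cubes where ≥ 2% of the voxels are occupied and ≥ 70% of the surface voxels have valid annotations (the same setup as in [5]). We sample such training cubes on the fly and randomly rotate them about the up-right axis. Augmented points are added to the point set to reach a fixed cardinality (8192 in our case). At test time, we similarly split the test scene into smaller cubes, first obtain a label prediction for every point in the cubes, and then merge the label predictions from all cubes of the same scene. If a point receives different labels from different cubes, we take a majority vote to obtain the final point label prediction.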
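The majority vote described at the end of B.4 can be sketched as follows. This only illustrates the voting step, under the assumption that every cube prediction carries the index of the scene point it belongs to; `majority_vote` and its arguments are hypothetical names, and the sketch is not taken from train.py.

```python
import numpy as np

def majority_vote(point_ids, predicted_labels, num_points, num_classes):
    """Merge per-cube predictions into one label per scene point."""
    votes = np.zeros((num_points, num_classes), dtype=np.int64)
    # np.add.at accumulates correctly even when a point id repeats.
    np.add.at(votes, (point_ids, predicted_labels), 1)
    return votes.argmax(axis=1)

# Point 1 is covered by three cubes and receives labels 2, 2 and 5.
ids = np.array([0, 1, 1, 1, 2, 2])
preds = np.array([3, 2, 2, 5, 4, 4])
merged = majority_vote(ids, preds, num_points=3, num_classes=21)
```

Ties are resolved by `argmax`, which picks the smallest class id; that tie rule is a choice of this sketch.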

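10----10 batches processed; each batch holds 16 scene point sets of 8192 points each

20----20 batches processed; each batch holds 16 scene point sets of 8192 points each

One batch (BATCH_SIZE) is 16 or 32 scenes.

75----total number of batches: with 1201 scenes and 16 scenes per batch, 1201//16=75 batches.

mean loss----average loss

accuracy----accuracy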

Training procedure:

TRAIN_DATASET = scannet_dataset.ScannetDataset(root=DATA_PATH, npoints=NUM_POINT, split='train') #a
TEST_DATASET = scannet_dataset.ScannetDataset(root=DATA_PATH, npoints=NUM_POINT, split='test') #b

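a. Load the training set: 1201 scenes in total, each with a different number of points. One of the scenes is shown in the figure.

Training set: 1201×N×3, where N is the number of points in a scene and 3 stands for x, y, z.

Labels: 1201×N

Weights: over all 1201 scenes, the weight w of a class is computed from x, the fraction of all points that belong to that class, as w = 1/ln(1.2 + x). It is not obvious why this form was chosen; its effect is that rarer classes get larger weights, while the 1.2 offset caps the weight at 1/ln(1.2) ≈ 5.48 as x approaches 0.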
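As a worked illustration of w = 1/ln(1.2 + x), here is a minimal sketch that counts labels over a few toy scenes and turns the class fractions into weights. The `class_weights` helper and the toy label arrays are assumptions made for the example.

```python
import numpy as np

def class_weights(label_arrays, num_classes=21):
    """Compute w = 1 / ln(1.2 + x), with x the share of points in each class."""
    counts = np.zeros(num_classes)
    for seg in label_arrays:
        counts += np.bincount(seg, minlength=num_classes)
    freq = counts / counts.sum()
    return 1.0 / np.log(1.2 + freq)

# Two toy scenes: class 1 is common, class 3 is rare, class 0 never occurs.
toy_labels = [np.array([1, 1, 1, 2]), np.array([1, 1, 2, 3])]
w = class_weights(toy_labels, num_classes=4)
```

Here class 1 covers 5 of 8 points and gets the smallest weight, while the absent class 0 gets the maximum value 1/ln(1.2).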

b. Load the test set.

Test set: 312×N×3, where N is the number of points in a scene and 3 stands for x, y, z.

Labels: 312×N

Weights: every class has weight 1

TEST_DATASET_WHOLE_SCENE = scannet_dataset.ScannetDatasetWholeScene(root=DATA_PATH, npoints=NUM_POINT, split='test')

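Load the test set as whole scenes; the underlying point clouds are the same as those loaded in b.

2. The semantic segmentation network model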

def get_model(point_cloud, is_training, num_class, bn_decay=None):
    """ Semantic segmentation PointNet, input is BxNx3, output BxNxnum_class """
    batch_size = point_cloud.get_shape()[0].value
    num_point = point_cloud.get_shape()[1].value
    end_points = {}
    l0_xyz = point_cloud
    l0_points = None
    end_points['l0_xyz'] = l0_xyz

    # Set Abstraction layers
    l1_xyz, l1_points, l1_indices = pointnet_sa_module(l0_xyz, l0_points, npoint=1024, radius=0.1, nsample=32, mlp=[32,32,64], mlp2=None, group_all=False, is_training=is_training, bn_decay=bn_decay, scope='layer1') #a                      
    l2_xyz, l2_points, l2_indices = pointnet_sa_module(l1_xyz, l1_points, npoint=256, radius=0.2, nsample=32, mlp=[64,64,128], mlp2=None, group_all=False, is_training=is_training, bn_decay=bn_decay, scope='layer2') #b 
    l3_xyz, l3_points, l3_indices = pointnet_sa_module(l2_xyz, l2_points, npoint=64, radius=0.4, nsample=32, mlp=[128,128,256], mlp2=None, group_all=False, is_training=is_training, bn_decay=bn_decay, scope='layer3') #c 
    l4_xyz, l4_points, l4_indices = pointnet_sa_module(l3_xyz, l3_points, npoint=16, radius=0.8, nsample=32, mlp=[256,256,512], mlp2=None, group_all=False, is_training=is_training, bn_decay=bn_decay, scope='layer4') #d 

    # Feature Propagation layers
    l3_points = pointnet_fp_module(l3_xyz, l4_xyz, l3_points, l4_points, [256,256], is_training, bn_decay, scope='fa_layer1')
    l2_points = pointnet_fp_module(l2_xyz, l3_xyz, l2_points, l3_points, [256,256], is_training, bn_decay, scope='fa_layer2')
    l1_points = pointnet_fp_module(l1_xyz, l2_xyz, l1_points, l2_points, [256,128], is_training, bn_decay, scope='fa_layer3')
    l0_points = pointnet_fp_module(l0_xyz, l1_xyz, l0_points, l1_points, [128,128,128], is_training, bn_decay, scope='fa_layer4')

    # FC layers
    net = tf_util.conv1d(l0_points, 128, 1, padding='VALID', bn=True, is_training=is_training, scope='fc1', bn_decay=bn_decay)
    end_points['feats'] = net 
    net = tf_util.dropout(net, keep_prob=0.5, is_training=is_training, scope='dp1')
    net = tf_util.conv1d(net, num_class, 1, padding='VALID', activation_fn=None, scope='fc2')

    return net, end_points

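set abstraction:

sampling and grouping layer

(N,d+C)=(8192,3+0)

l0_xyz, l0_points, npoint=1024, radius=0.1, nsample=32, mlp=[32,32,64]

l0_xyz: B×N×d=16×8192×3, the input point cloud

l0_points: B×N×C=16×8192×0

npoint=1024: the number of sampled centroids

radius=0.1: the ball-query radius, in the same units as the input coordinates

nsample=32: the number of points taken from the ball neighborhood around each centroid

mlp=[32,32,64]: the output widths of the MLP layers that extract the local feature vector

l0_xyz: <points with coordinates only>
l0_points: <the per-point features produced by earlier layers, in addition to the coordinates; the first layer has none>
npoint=1024: <the sampling layer picks 1024 points as centroids; the number is chosen by hand, from experience or experiments>
radius=0.1: <the ball-query radius in the grouping layer is 0.1, in the same units as the input coordinates>
nsample=32: <points are sampled inside the ball of the given radius around each centroid, at most 32 of them; the radius is the dominant constraint>
mlp=[32,32,64]: <the PointNet layer has 3 layers with output feature dimensions 32, 32, 64>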

    grouped_xyz = group_point(xyz, idx) # (batch_size, npoint, nsample, 3) (16*1024*32*3)
    grouped_xyz -= tf.tile(tf.expand_dims(new_xyz, 2), [1,1,nsample,1]) # translation normalization: subtract the centroid from every point in its region

new_xyz, new_points, idx, grouped_xyz = sample_and_group(npoint, radius, nsample, xyz, points, knn, use_xyz)

xyz: B×N×d=16×8192×3, the input point cloud
points: (batch_size, ndataset, channel) TF tensor (16, ndataset, channel)    (16,8192,0)

knn: false

use_xyz: bool, if True concat XYZ with local point features, otherwise just use point features

new_xyz: (batch_size, npoint, 3) TF tensor (16,1024,3), the x, y, z coordinates of the centroids
new_points: (batch_size, npoint, mlp[-1] or mlp2[-1]) TF tensor (16,1024,64), the features extracted by the MLP (this is the shape after the PointNet layer)
idx: (batch_size, npoint, nsample) int32 -- indices for local regions  (16,1024,32)

new_xyz: the coordinates of the 1024 centroids obtained by sampling

Note: farthest point sampling starts from one point (chosen at random), then repeatedly adds the point that is farthest from the set selected so far (the point with the largest value in the distance array D), until the required number of points has been selected.
idx: the indices of the points in each region (16,1024,32)

xyz: (16,8192,3)

new_xyz: (16,1024,3)
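The farthest point sampling rule in the note above can be written in a few lines of NumPy. This is a minimal single-cloud sketch rather than the batched CUDA op used by the network; the function name and the fixed `start` index are assumptions of the example.

```python
import numpy as np

def farthest_point_sampling(points, npoint, start=0):
    """Return the indices of npoint points chosen by farthest point sampling."""
    n = points.shape[0]
    selected = np.zeros(npoint, dtype=np.int64)
    # D: squared distance from every point to the set selected so far.
    dist = np.full(n, np.inf)
    selected[0] = start
    for k in range(1, npoint):
        last = points[selected[k - 1]]
        dist = np.minimum(dist, np.sum((points - last) ** 2, axis=1))
        selected[k] = int(np.argmax(dist))
    return selected

pts = np.array([[0, 0, 0], [1, 0, 0], [10, 0, 0], [5, 0, 0]], dtype=float)
fps_idx = farthest_point_sampling(pts, 3)
```

Starting from point 0, the sketch picks point 2 (the farthest from point 0) and then point 3 (the farthest from both selected points).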

idx, pts_cnt = query_ball_point(radius, nsample, xyz, new_xyz)

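The ball query returns idx and pts_cnt. Regions are formed by radius first, so the number of points in each region is not fixed (at most 32); pts_cnt records how many points each region actually contains.

grouped_xyz: the grouped point set, a 4-D tensor (batch_size, 1024 regions, 32 points per region, 3 coordinates per point)

(16,1024,32,3)
new_points: also the grouped point set, but holding features. In the first layer, where there are no input features, it equals grouped_xyz; otherwise the coordinates can optionally be concatenated with the features before the convolution.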
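To make the shapes of idx, pts_cnt and grouped_xyz concrete, here is a minimal single-cloud NumPy sketch of a ball query followed by grouping. It reuses the name `query_ball_point` for readability but is not the TensorFlow op; padding short regions with the first hit is a choice made in this sketch.

```python
import numpy as np

def query_ball_point(radius, nsample, xyz, new_xyz):
    """Single-cloud ball query: xyz is (N, 3), new_xyz is (S, 3)."""
    d2 = np.sum((new_xyz[:, None, :] - xyz[None, :, :]) ** 2, axis=2)  # (S, N)
    idx = np.zeros((new_xyz.shape[0], nsample), dtype=np.int64)
    pts_cnt = np.zeros(new_xyz.shape[0], dtype=np.int64)
    for s in range(new_xyz.shape[0]):
        inside = np.flatnonzero(d2[s] < radius ** 2)[:nsample]
        pts_cnt[s] = len(inside)
        if len(inside) > 0:
            idx[s, :] = inside[0]              # pad with the first hit
            idx[s, :len(inside)] = inside
    return idx, pts_cnt

xyz = np.array([[0, 0, 0], [0.05, 0, 0], [0.5, 0, 0], [0.52, 0, 0]], dtype=float)
new_xyz = xyz[[0, 2]]                          # two centroids
idx, pts_cnt = query_ball_point(0.1, 3, xyz, new_xyz)
# Gather the neighbors and subtract each centroid (translation normalization).
grouped_xyz = xyz[idx] - new_xyz[:, None, :]   # (S, nsample, 3)
```

Each centroid finds only two neighbors within radius 0.1, so the third slot of every region repeats the first neighbor and pts_cnt is 2 for both.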

b. PointNet layer

Feature Propagation layers:

Features are interpolated from the 3 nearest neighbors with inverse-distance weighting.
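The interpolation can be sketched for a single cloud as follows. The function name, the brute-force neighbor search, and the `eps` guard against division by zero are assumptions of this example, not the network's own op.

```python
import numpy as np

def three_nn_interpolate(xyz1, xyz2, feats2, eps=1e-10):
    """Interpolate feats2 (M, C), defined at xyz2 (M, 3), onto xyz1 (N, 3)."""
    d = np.linalg.norm(xyz1[:, None, :] - xyz2[None, :, :], axis=2)  # (N, M)
    idx = np.argsort(d, axis=1)[:, :3]           # 3 nearest points in xyz2
    dist = np.maximum(np.take_along_axis(d, idx, axis=1), eps)
    w = 1.0 / dist                               # nearer points weigh more
    w /= w.sum(axis=1, keepdims=True)            # weights sum to 1 per point
    return np.sum(feats2[idx] * w[:, :, None], axis=1), w

xyz2 = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [5, 5, 5]], dtype=float)
feats2 = np.array([[1.0], [2.0], [3.0], [100.0]])
xyz1 = np.array([[0, 0, 0], [0.5, 0, 0]], dtype=float)
out, w = three_nn_interpolate(xyz1, xyz2, feats2)
```

The first query point coincides with a known point and therefore recovers its feature almost exactly; the second blends its three nearest neighbors and ignores the distant fourth point.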

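Summary:

(1) furthest point sampling:

The points are traversed with a stride (equal to the number of sample points), the spatial distance between points is computed, and the point with the largest distance is taken as the farthest point from the current one.

For example, with 2048 points in total and 512 sample points, the traversal starts at the first point and computes its distance to points (1+512), (1+512*2), and (1+512*3); the one with the largest distance is taken as the farthest point. This is repeated 512 times to find 512 farthest points. (The strided traversal appears to describe how the implementation visits the points; the selection rule itself is the one given in the note on Farthest Point Sampling.)
(2) GatherOperation: converts point ids into point coordinates.

(3) QueryAndGroup: concatenates the coordinates and the features of the grouped points.

(4) BallQuery: collects the ids of the points whose distance to a centroid is less than radius; several radii can be used.

(5) GroupingOperation: gathers the coordinates according to the grouped idx. (It also passes the point idx and N to backward, where they are used to compute gradients during interpolation.)

(6) ThreeNN: finds the three nearest points around each grouped centroid and returns the corresponding distances and ids. The distances are used to compute the weights: with inverse-distance weighting, nearer points get larger weights and farther points smaller ones.

(7) Three_Interpolation: interpolates each point's feature as the weighted sum of the features of its three nearest neighbors, using the weights from (6). (Advantage: In such a case, the second vector should be weighted higher. On the other hand, when the density of a local region is high, the first vector provides information of finer details since it possesses the ability to inspect at higher resolutions recursively in lower levels.)

The convolution process is also described at: https://blog.csdn.net/wqwqqwqw1231/article/details/90757687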

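3. Randomly shuffle the indices of the 1201 training scenes.

4. Train on the 1201 scenes in batches of 16 scenes each (BATCH_SIZE=16), which gives num_batches=1201//16=75 batches. (With 32 scenes per batch, GPU memory may not be enough.)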
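Steps 3 and 4 amount to the following arithmetic. This is a minimal sketch with the scene count and batch size taken from the text; the variable names are illustrative.

```python
import numpy as np

NUM_SCENES, BATCH_SIZE = 1201, 16

train_idxs = np.arange(NUM_SCENES)
np.random.shuffle(train_idxs)                  # step 3: shuffle scene indices

num_batches = NUM_SCENES // BATCH_SIZE         # step 4: floor division
batches = [train_idxs[b * BATCH_SIZE:(b + 1) * BATCH_SIZE]
           for b in range(num_batches)]
leftover = NUM_SCENES - num_batches * BATCH_SIZE
```

Because of the floor division, one scene is left out of each epoch; the shuffle means it is a different scene each time.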

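5. Before training, the scannet_train.pickle file is loaded. It provides:

(1) The point sets of the 1201 scenes.

(2) The per-point labels: taking n as the number of points in a scene, there are 1201×n labels in total.

(3) Initial weights for the 21 classes, computed from the fraction of all points that carry each class label.

6. Take the point sets and per-point labels of the first 16 scenes in the shuffled index order.

Start with the first scene: min(x,y,z) is -0.022... and max(x,y,z) is 4.37.... Some preprocessing of the raw data must be involved here, but it is unclear how the author arrived at these coordinates. Typically the origin of the coordinate system is moved either to the centroid of the data or to the minimum corner of its bounding box.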

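(7)

smpmin = np.maximum(coordmax-[1.5,1.5,3.0], coordmin)

Sampling proceeds inward from the top-right corner A with a size of 1.5×1.5×h; smpmin is the coordinate of point B, the corner diagonally opposite A in the sampled block.

smpsz = np.minimum(coordmax-smpmin,[1.5,1.5,3.0])
smpsz[2] = coordmax[2]-coordmin[2]

The sampled block has size 1.5×1.5×h.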

(8)

curcenter = point_set[np.random.choice(len(semantic_seg),1)[0],:]

Pick one point of the scene at random as the current center.

            curmin = curcenter-[0.75,0.75,1.5]
            curmax = curcenter+[0.75,0.75,1.5]
            curmin[2] = coordmin[2]
            curmax[2] = coordmax[2]

Take a block of size 1.5×1.5×h centered on this point in x and y, and call it V1; h is the height of the scene's bounding box.

            curchoice = np.sum((point_set>=(curmin-0.2))*(point_set<=(curmax+0.2)),axis=1)==3
            cur_point_set = point_set[curchoice,:]
            cur_semantic_seg = semantic_seg[curchoice]

Expand V1 by 0.2 m on every side (so each edge grows by 0.4 m) and select the points of the current scene that fall inside the expanded block; call this point set A.

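(9)

mask = np.sum((cur_point_set>=(curmin-0.01))*(cur_point_set<=(curmax+0.01)),axis=1)==3

Expand V1 by 0.01 m on every side, call the result V2, and mark the points of A that lie inside V2.

            vidx = np.ceil((cur_point_set[mask,:]-curmin)/(curmax-curmin)*[31.0,31.0,62.0])
            vidx = np.unique(vidx[:,0]*31.0*62.0+vidx[:,1]*62.0+vidx[:,2])
            isvalid = np.sum(cur_semantic_seg>0)/len(cur_semantic_seg)>=0.7 and len(vidx)/31.0/31.0/62.0>=0.02

The coordinates are first normalized along each axis and then multiplied by 31, 31, and 62, which assigns every point to a cell of a 31×31×62 grid over the block; np.unique then keeps one entry per occupied cell.

np.sum(cur_semantic_seg>0)/len(cur_semantic_seg)>=0.7 # the share of points with a non-zero semantic label is >= 70%

These three lines decide whether the sampled block is a valid training sample: at least 70% of its points must be labeled, and at least 2% of the grid cells must be occupied (len(vidx)/31.0/31.0/62.0>=0.02). This matches the ≥70% and ≥2% criteria quoted from section B.4 of the paper.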

        sample_weight *= mask

This line sets the weights of the points that lie outside V2 to 0.

(10)

        choice = np.random.choice(len(cur_semantic_seg), self.npoints, replace=True)
        point_set = cur_point_set[choice,:]

Randomly draw 8192 points (with replacement) from point set A obtained in step (8).

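(11)

dropout_ratio = np.random.random()*0.875 # dropout ratio: 0-0.875

drop_idx = np.where(np.random.random((ps.shape[0]))<=dropout_ratio)[0]

batch_data[i,drop_idx,:] = batch_data[i,0,:]
batch_label[i,drop_idx] = batch_label[i,0]

Some of the 8192 points are dropped at random: the coordinates of a dropped point are replaced by those of the first point, and its weight is set to 0. For some point sets the dropout ratio is very large, and it is unclear how much this affects the final prediction quality.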
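The dropout step can be isolated into a small function for experimentation. This is a sketch on a single point set with synthetic data; `random_point_dropout` and its `rng` argument are illustrative names, and the weight zeroing follows the description in the text.

```python
import numpy as np

def random_point_dropout(points, labels, weights, max_ratio=0.875, rng=None):
    """Overwrite a random subset of points with point 0 and zero their weights."""
    rng = np.random.default_rng() if rng is None else rng
    points, labels, weights = points.copy(), labels.copy(), weights.copy()
    dropout_ratio = rng.random() * max_ratio           # 0 to 0.875
    drop_idx = np.where(rng.random(points.shape[0]) <= dropout_ratio)[0]
    points[drop_idx] = points[0]
    labels[drop_idx] = labels[0]
    weights[drop_idx] = 0.0
    return points, labels, weights, drop_idx

rng = np.random.default_rng(0)
pts = rng.random((8192, 3))
lab = rng.integers(1, 21, 8192)
wts = np.ones(8192)
p2, l2, w2, drop_idx = random_point_dropout(pts, lab, wts, rng=rng)
```

The array keeps its fixed size of 8192 points, which is why dropped points are overwritten rather than removed; their zero weight keeps them out of the loss.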

(12)

return batch_data, batch_label, batch_smpw

batch_data: 16×8192×3

batch_label: 16×8192

batch_smpw: 16×8192

The end result is one batch of 16 scene point sets: the 3D coordinates (x, y, z), the label, and the weight of every point.

(13)

aug_data = provider.rotate_point_cloud_z(batch_data)

Data augmentation: rotate about the z axis by a random angle drawn uniformly from [0, 2π).
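A rotation about the z axis leaves heights unchanged and preserves distances from the axis. The sketch below shows one way to write it in NumPy; the name `rotate_z_sketch` is hypothetical, and the function is an illustration rather than the code of provider.rotate_point_cloud_z.

```python
import numpy as np

def rotate_z_sketch(batch_data, rng=None):
    """Rotate every (N, 3) point set of a (B, N, 3) batch about the z axis."""
    rng = np.random.default_rng() if rng is None else rng
    rotated = np.zeros_like(batch_data)
    for k in range(batch_data.shape[0]):
        angle = rng.uniform() * 2 * np.pi      # one random angle per point set
        c, s = np.cos(angle), np.sin(angle)
        rot = np.array([[c, s, 0.0],
                        [-s, c, 0.0],
                        [0.0, 0.0, 1.0]])
        rotated[k] = batch_data[k].dot(rot)
    return rotated

rng = np.random.default_rng(0)
batch = rng.random((2, 5, 3))
rotated = rotate_z_sketch(batch, rng)
```

Each point set in the batch gets its own angle, so the 16 scenes of a batch are rotated independently.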

(14)

        summary, step, _, loss_val, pred_val = sess.run([ops['merged'], ops['step'],
            ops['train_op'], ops['loss'], ops['pred']], feed_dict=feed_dict)

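This returns the loss and the predictions for one training batch. The prediction for each point is a vector of scores over the 21 classes, and the class with the highest score is the predicted label.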

(15)

total_correct += correct                     
total_seen += (BATCH_SIZE*NUM_POINT)         
loss_sum += loss_val

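Accumulate, over batches (16 scene point sets of 8192 points each), the number of correctly predicted points, the total number of points, and the loss. Once 10 batches have been accumulated, step (16) runs.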

(16)

if (batch_idx+1)%10 == 0:
    log_string(' -- %03d / %03d --' % (batch_idx+1, num_batches))
    log_string('mean loss: %f' % (loss_sum / 10))
    log_string('accuracy: %f' % (total_correct / float(total_seen)))
    total_correct = 0
    total_seen = 0
    loss_sum = 0

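The mean loss and mean accuracy are computed once every 10 batches (out of 1201//16=75 batches in total). The counters are then reset, so batches 11-20 get their own mean loss and accuracy.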
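Steps (14) to (16) can be simulated without a network. In the sketch below random scores stand in for ops['pred'] and a constant stands in for the loss, both assumptions made only so the bookkeeping can run on its own.

```python
import numpy as np

def batch_correct(pred_logits, batch_label):
    """Count correct points: the predicted label is the highest-scoring class."""
    pred_val = np.argmax(pred_logits, axis=2)          # (B, N)
    return int(np.sum(pred_val == batch_label))

rng = np.random.default_rng(0)
B, N, C = 2, 8, 21
total_correct = total_seen = 0
loss_sum = 0.0
log = []
for batch_idx in range(20):
    logits = rng.random((B, N, C))                     # placeholder scores
    labels = rng.integers(0, C, (B, N))
    total_correct += batch_correct(logits, labels)
    total_seen += B * N
    loss_sum += 1.0                                    # placeholder loss value
    if (batch_idx + 1) % 10 == 0:
        log.append((batch_idx + 1, loss_sum / 10, total_correct / float(total_seen)))
        total_correct = total_seen = 0                 # reset for the next 10
        loss_sum = 0.0
```

With 75 batches per epoch this pattern produces 7 reports; batches 71-75 are accumulated but never reported.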

(17)

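In general the mean accuracy of a later group of 10 batches is higher than that of the preceding group, which suggests that the network keeps adjusting its weights along the way.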

(18)

train_one_epoch(sess, ops, train_writer)

This trains one epoch.

(19)

        for epoch in range(MAX_EPOCH):
            log_string('**** EPOCH %03d ****' % (epoch))
            sys.stdout.flush() # force the output buffer to flush

            train_one_epoch(sess, ops, train_writer)
            if (epoch+1)%5==0:      # the original code is: if epoch%5==0:
                acc = eval_one_epoch(sess, ops, test_writer)
                acc = eval_whole_scene_one_epoch(sess, ops, test_writer)   # accuracy over whole scenes
                # compare only right after an evaluation, so that acc is always defined
                if acc > best_acc:
                    best_acc = acc
                    save_path = saver.save(sess, os.path.join(LOG_DIR, "best_model_epoch_%03d.ckpt"%(epoch)))
                    log_string("Model saved in file: %s" % save_path)

An evaluation is run every 5 epochs.

That is, after the 5th, 10th, 15th, 20th, ..., 200th epoch.

# evaluate on randomly chopped scenes
def eval_one_epoch(sess, ops, test_writer):
    """ ops: dict mapping from string to tf ops """
    global EPOCH_CNT
    is_training = False
    test_idxs = np.arange(0, len(TEST_DATASET)) # 312 test scenes
    num_batches = len(TEST_DATASET)//BATCH_SIZE # the test set is split into 19 batches of 16 point sets each

    total_correct = 0 #a
    total_seen = 0 #b
    loss_sum = 0 #c
    total_seen_class = [0 for _ in range(NUM_CLASSES)]  #d
    total_correct_class = [0 for _ in range(NUM_CLASSES)]  #e

    total_correct_vox = 0 #f
    total_seen_vox = 0 #g
    total_seen_class_vox = [0 for _ in range(NUM_CLASSES)] #h
    total_correct_class_vox = [0 for _ in range(NUM_CLASSES)] #i

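a: number of correctly predicted points

b: total number of points evaluated

c: loss, accumulated to obtain the mean loss over the test set (number of batches × batch size × points per set: 19×16×8192)

d: number of points seen in each class

e: number of correctly predicted points in each class

f: number of correct predictions, voxel-based

g: total number evaluated, voxel-based

h: number seen in each class, voxel-based

i: number correctly predicted in each class, voxel-based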

        correct = np.sum((pred_val == batch_label) & (batch_label>0) & (batch_smpw>0)) # evaluate only on 20 categories but not unknown

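Accuracy is computed over the 20 annotated categories only: points labeled 0 (unknown) are excluded by batch_label>0, and points with zero weight by batch_smpw>0.

        for l in range(NUM_CLASSES):         # for each class: how many points of that class the batch contains, and how many were predicted correctly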
            total_seen_class[l] += np.sum((batch_label==l) & (batch_smpw>0))
            total_correct_class[l] += np.sum((pred_val==l) & (batch_label==l) & (batch_smpw>0))
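A tiny worked example makes the masking explicit. The arrays below are made up, and `per_class_counts` is a hypothetical helper that mirrors the loop above for a single batch.

```python
import numpy as np

def per_class_counts(pred_val, batch_label, batch_smpw, num_classes):
    """Per-class seen/correct counts, ignoring points with zero weight."""
    valid = batch_smpw > 0
    seen = np.array([np.sum((batch_label == l) & valid)
                     for l in range(num_classes)])
    correct = np.array([np.sum((pred_val == l) & (batch_label == l) & valid)
                        for l in range(num_classes)])
    return seen, correct

pred = np.array([[1, 2, 2, 0, 1]])
label = np.array([[1, 2, 1, 0, 1]])
smpw = np.array([[1.0, 1.0, 1.0, 1.0, 0.0]])   # the last point has zero weight

seen, correct = per_class_counts(pred, label, smpw, 3)
overall = int(np.sum((pred == label) & (label > 0) & (smpw > 0)))
# Class 0 (unknown) is left out of the per-class average.
class_acc = correct[1:] / np.maximum(seen[1:], 1)
```

The last point is predicted correctly but does not count because its weight is 0, and the correctly predicted class-0 point is excluded from the overall count by the label>0 mask.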

def point_cloud_label_to_surface_voxel_label_fast(point_cloud, label, res=0.0484):
    coordmax = np.max(point_cloud,axis=0)
    coordmin = np.min(point_cloud,axis=0)
    nvox = np.ceil((coordmax-coordmin)/res) # number of voxels along each axis #a.
    vidx = np.ceil((point_cloud-coordmin)/res) #b.
    vidx = vidx[:,0]+vidx[:,1]*nvox[0]+vidx[:,2]*nvox[0]*nvox[1] #c.
    uvidx, vpidx = np.unique(vidx,return_index=True) #d.
    if label.ndim==1:
        uvlabel = label[vpidx]
    else:
        assert(label.ndim==2)
        uvlabel = label[vpidx,:]
    return uvidx, uvlabel, nvox

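Argument 1: point_cloud holds the points of the point set whose weight is non-zero.
Argument 2: the ground-truth labels of a single test scene together with the corresponding predicted labels.

a. The bounding-box extent is divided by res to get the number of voxels along each axis. With res=0.02 this is a factor of 50: L×50=L1, W×50=W1, H×50=H1. (With the default res=0.0484 the factor is 1/0.0484≈20.7.)

b. The minimum coordinate is subtracted from each test point (x,y,z) (N×3), giving x1=x-min(x), y1=y-min(y), z1=z-min(z), and the result is scaled by the same factor: x2=x1×50, y2=y1×50, z2=z1×50.

c. The three voxel indices are flattened into one: x + y*(scaled bounding-box length) + z*(scaled bounding-box length)*(scaled bounding-box width). Before this step vidx is N×3 (for example 4722×3); afterwards it is a vector of length N.

That is: x + y*L1 + z*L1*W1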
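Steps a to d can be traced on four points. The sketch below repeats the index arithmetic of the function in a hypothetical helper, `flat_voxel_ids`, and uses res=0.25 purely to keep the numbers easy to check by hand.

```python
import numpy as np

def flat_voxel_ids(point_cloud, res):
    """Return the flat voxel id of every point plus the unique-voxel lookup."""
    coordmin = np.min(point_cloud, axis=0)
    coordmax = np.max(point_cloud, axis=0)
    nvox = np.ceil((coordmax - coordmin) / res)              # a. voxels per axis
    v = np.ceil((point_cloud - coordmin) / res)              # b. per-axis index
    flat = v[:, 0] + v[:, 1] * nvox[0] + v[:, 2] * nvox[0] * nvox[1]  # c.
    uvidx, vpidx = np.unique(flat, return_index=True)        # d. one per voxel
    return flat, uvidx, vpidx, nvox

pts = np.array([[0.0, 0.0, 0.0],
                [0.1, 0.0, 0.0],
                [0.2, 0.0, 0.0],
                [1.0, 0.5, 0.25]])
flat, uvidx, vpidx, nvox = flat_voxel_ids(pts, res=0.25)
```

Points 1 and 2 fall into the same voxel, so np.unique keeps three voxels, and vpidx records the first point found in each; the voxel then takes that point's label.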

Results on the test set test_dataset:

Mean loss:

Voxel accuracy:

Mean per-class voxel accuracy:

Point accuracy:

Mean per-class accuracy:

Weighted mean per-class accuracy:

(20)

Evaluate whole test scenes with the trained model:

# evaluate on whole scenes to generate numbers provided in the paper
def eval_whole_scene_one_epoch(sess, ops, test_writer):

Loop over the scenes:

    for batch_idx in range(num_batches): # num_batches=312
        if not is_continue_batch:
            batch_data, batch_label, batch_smpw = TEST_DATASET_WHOLE_SCENE[batch_idx]

For each scene, loop over blocks of size 1.5×1.5×H and extract the points inside each one (H is the actual height of the scene):

        for i in range(nsubvolume_x):
            for j in range(nsubvolume_y):
                curmin = coordmin+[i*1.5,j*1.5,0]
                curmax = coordmin+[(i+1)*1.5,(j+1)*1.5,coordmax[2]-coordmin[2]]

            batch_data = batch_data[:BATCH_SIZE,:,:]
            batch_label = batch_label[:BATCH_SIZE,:]
            batch_smpw = batch_smpw[:BATCH_SIZE,:]

The points of the first 16 cubes are taken as one batch for computing accuracy, and the counts are accumulated with those of the first 16 cubes of the next scene, and so on until all 312 scenes have been processed. The accumulated accuracy is reported as the whole-scene point prediction accuracy.