Python 调C 示例《采样方法总结：Alias method》

采样方法总结：Alias method

之前做的论文中，有一处关键的计算涉及到对一个高维度的离散分布求数学期望，这个问题本来可以简单地sum over这个概率分布的support解决，然而这个方案被reviewer质疑运行速度太慢，于是又想出一招用alias method做Monte Carlo采样算积分。用了采样以后程序变得更容易并行化，用C写了下效率大增。由于网上已经有相当多介绍alias method采样方法的文章了，其中不少还非常详细，我再写一遍也不过鹦鹉学舌，因此本文只对实现方法作简单记录。

Naive Implementation:

最早写的这个版本应该是网上所有alias method实现里代码量最少，也最简单暴力的版本了，虽然复杂度并不是最优。

注意：

为了方便在Python中调用，这里把生成alias table的过程写在cpp中，在Python里用numpy的C++接口调用cpp；
因为这个实现过于简单暴力，两种情况下代码会死循环，第一是输入的pdf不合法的情况，即pdf向量中所有元素加起来不等于1.0，第二是浮点数精度不足带来的条件判断不通过问题，这里的代码在我的电脑上运行良好；


用shell（或用 python 调shell ）跑咩？

// TO COMPILE sampler.cpp into sampler.so:
// g++ -std=c++11 sampler.cpp -o sampler.so -shared -fPIC -O2 -D_GLIBCXX_USE_CXX11_ABI=0

extern "C" {
/**
* @brief	Generate an alias table from a discrete distribution
* @param	out		An array of size 2 * length, representing the alias table
Note that this array should be 1.0 in each column after the function
**/
void gen_alias_table(float* out, int* events, int length) {
	int more, less, flag;
	while (true) {
		flag = 1;
		for (int i = 0; i < length; i++) {
			if (out[i] + out[i + length] < 1.0) {
				less = i;
				flag = 0;
				break;
			}
		}
		for (int i = 0; i < length; i++) {
			if (out[i] + out[i + length] > 1.0) {
				more = i;
				flag = 0;
				break;
			}
		}
		if (flag)
			break;
		out[less + length] = 1.0 - out[less];
		out[more] = out[more] - 1.0 + out[less];
		events[less + length] = more;
	}
}
} // extern "C"

Python部分实现如下：

# utils.py
import numpy as np
import ctypes as ct

class discreteSampler(object):
    def __init__(self, probs):
        '''
        This class can sample from high dimensional discrete distribution using the Alias method
        For better efficiency, the generation of alias table is implemented with C++
        :param probs: a 1 dimensional numpy.ndarray representing the PDF for the discrete distribution
        '''
        length = probs.shape[-1]
        self.probs = np.copy(probs)
        self.alias = np.zeros((2, length), dtype=np.float32)
        self.alias[0, :] = self.probs * length
        self.events = np.ones((2, length), dtype=np.int32)
        self.events[0, :] = np.arange(0, length, dtype=np.int32)
        self.events[1, :] = -1.0

        # Load and run the C++ dynamic library
        BASE_DIR = os.path.dirname(os.path.abspath(__file__))
        self.dll = np.ctypeslib.load_library(os.path.join(BASE_DIR, 'sampler.so'), '.')
        self.dll.gen_alias_table(self.alias.ctypes.data_as(ct.c_void_p),
                                 self.events.ctypes.data_as(ct.c_void_p),
                                 ct.c_int(length))

    def generate(self):
        idx = random.randint(0, self.probs.shape[-1] - 1)
        if random.uniform(0.0, 1.0) <= self.alias[0, idx]:
            return idx
        else:
            return self.events[1, idx]

最后用随机数测试下，采样10000次，可视化。

# Testing the sampling using the alias method implementation
import numpy as np
import utils
import matplotlib.pyplot as plt

probs = np.random.uniform(0.0, 1.0, [10])
probs /= np.sum(probs)
idxs = np.arange(0, 10)

sampler = utils.discreteSampler(probs)
collect = list()
for it in range(10000):
    collect.append(sampler.generate())
    
plt.figure()
plt.subplot(121)
plt.bar(idxs, probs)
plt.title('true distribution')
plt.subplot(122)
plt.hist(collect, bins=10, normed=True)
plt.title('sampled distribution')
plt.show()

运行结果：

Python 调C 示例《采样方法总结：Alias method》

采样方法总结：Alias method

相关链接：

Naive Implementation: