K-D Tree

这篇随笔是对Wikipedia上 k-d tree 词条的摘录, 我认为对该词条解释相当生动详细, 是不可多得的好文.

Overview

A $k$-d tree (short for $k$-dimensional tree) is a binary space-partitioning tree for organizing points in a $k$-dimensional space. $k$-d trees are a useful data structure for searches involving a multidimensional search key.

Construction

The canonical method of $k$-d tree construction has the following constraints:

As one moves down the tree, one cycles through the axes used to select the splitting planes.
Points are inserted by selecting the median of the points being put into the subtree, with respect to their coordinates in the axis being used to create the splitting plane.

This method leads to a balanced $k$-d tree, in which each leaf node is approximately the same distance from the root. However, balanced trees are not necessarily optimal for all applications.

Nearest Neighboring Search

Terms:

the split dimensions
the splitting (hyper)plane
"current best"

The **nearest neighbour ** (NN) search algorithm aims to find the point in the tree that is nearest to a given point. This search can be done efficiently by using the tree properties to quickly eliminate large portions of the search space.

Searching for a nearest neighbour in a $k$-d tree proceeds as follows:

Starting with the root node, the algorithm moves down the tree recursively.
Once the algorithm reaches a leaf node, it saves that node point as "current best"
The algorithm unwinds the recursion of the tree, performing the following steps at each node:
1. If the current node is closer than the current best, then it becomes the current best.
2. The algorithm checks whether there could be any points on the other side of the splitting plane that are closer to the search point than the current best. In concept, this is done by intersecting the splitting hyperplane with a hypersphere around the the search point that has a radius equal to the current nearest distance. Since the hyperplanes are all axis-aligned this is implemented as a simple comparison to see whether the distance between the splitting coordinate of the search point and current node is less than the distance (overall coordinates) from the search point to the current best.
  1. If the hypersphere crosses the plane, there could be nearer points on the other side of the plane, so the algorithm must move down the other branch of the tree from the current node looking for closer points, following the same recursive process as the entire search.
  2. If the hypersphere doesn't intersect the splitting plane, then the algorithm continues walking up the tree, and the entire branch on the other side of that node is eliminated.

Generally, the algorithm uses squared distances for comparison to avoid computing square roots. Additionally, it can save computation by holding the squared current best distance in a variable for computation.

The algorithm can be extended in several ways by simple modifications. If can provide the $k $ nearest neighbors to a point by maintaining $k$ current bests instead of just one. A branch is only eliminated when $k$ points have been found and the branch cannot have points closer than any of the $k$ current bests.

Implementation

$k$ 近临 ($k$NN)

#include <bits/stdc++.h>
#define lson id<<1
#define rson id<<1|1
#define sqr(x) (x)*(x)
using namespace std;
using LL=long long;
const int N=5e4+5;

// K-D tree: a special case of binary space partitioning trees

int DIM, idx;

struct Node{
    int key[5];
    bool operator<(const Node &rhs)const{
        return key[idx]<rhs.key[idx];
    }
    void read(){
        for(int i=0; i<DIM; i++)
            scanf("%d", key+i);
    }
    LL dis2(const Node &rhs)const{
        LL res=0;
        for(int i=0; i<DIM; i++)
            res+=sqr(key[i]-rhs.key[i]);
        return res;
    }
    void out(){
        for(int i=0; i<DIM; i++)
            printf("%d%c", key[i], i==DIM-1?'
':' ');
    }
}p[N];

Node a[N<<2];   // K-D tree
bool f[N<<2];

// [l, r)
void build(int id, int l, int r, int dep)
{
    if(l==r) return;    // error-prone
    f[id]=true, f[lson]=f[rson]=false;
    
    // select axis based on depth so that axis cycles through all valid values
    idx=dep%DIM;
    int mid=l+r>>1;

    // sort point list and choose median as pivot element
    nth_element(p+l, p+mid, p+r);
    a[id]=p[mid];
    build(lson, l, mid, dep+1);
    build(rson, mid+1, r, dep+1);
}

using P=pair<LL,Node>;
priority_queue<P> que;

// multidimensional search key

void query(const Node &p, int id, int m, int dep){
    int dim=dep%DIM;
    int x=lson, y=rson;
    // left: <, right >=
    if(p.key[dim]>=a[id].key[dim])
        swap(x, y);

    if(f[x]) query(p, x, m, dep+1);

    P cur{p.dis2(a[id]), a[id]};

    if(que.size()<m){
        que.push(cur);
    }
    else if(cur.first<que.top().first){
        que.pop();
        que.push(cur);
    }
    if(f[y] && sqr(a[id].key[dim]-p.key[dim])<que.top().first)
        query(p, y, m, dep+1);
}

说明:

bool数组f[], 表示一个完全二叉树中的某个节点是否存在, 也可不用完全二叉树的表示法, 而用两个数组lson[]和rson[]表示, 这样的好处还有: 节省空间, 数组可以只开到节点数的2倍.
区间采用左闭右开表示.