Cpusets学习

1. cpusets

1.1 什么是cpusets

cpusets基本功能是限制某一组进程只运行在某些cpu和内存节点上，举个简单例子：系统中有4个进程，4个内存节点，4个cpu.利用cpuset可以让第1，2个进程只运行在第1，2颗cpu上并且只在第1，2个内存节点上分配内存。cpuset是基于cgroup子系统实现(关于cgroup子系统可以参考内核文档 Documentation/cgroups/cgroups.txt.）使用cpuset上述功能可以让系统管理员动态调整进程运行所在的cpu和内存节点。

cpusets是cgroup文件系统中的一个子系统。

1.2 为什么需要cpusets

在大型的计算机系统中，有多颗cpu，若干内存节点。尤其在NUMA架构下，cpu访问不同内存节点的速度不同，这种情况增加了进程调度和进程内存分配目标node管理的难度。比较新的小型系统使用linux内核自带的调度功能和内存管理方案就能得到很好表现，但是在比较大的系统中如果精心调整不同应用所在的cpu和内存节点会大大提高性能表现。

（NUMA架构）

cpuset在以下场景会更有价值：

1. 对于跑了很多相同的应用实例的大型web server
2. 对于跑了不同应用的大型server(例如：同时跑了web server相关应用，又跑了数据库应用）
3. 大型NUMA系统

cpuset必须允许动态调整，并且不影其他不相关cpuset中运行的进程。比如：可以将某个cpu动态的加入到某个cpuset，可以从某个cpuset中将某个cpu移除，可以将进程加入到cpuset，也可以将某个进程从一个cpuset迁移到另一个cpuset。内核的cpuset补丁，提供了最基本的实现上述功能的机制，在实现上最大限度使用原有的cpu和内存节点分配机制，尽可能避免影响现有的调度器，以及内存分配核心功能的代码。

1.3 cpusets是如何实现的

cpusets整体为层级树结构。由一个根节点（root）包含了系统所有的cpu和内存节点资源。由根节点，可以分支出一个或者多个子节点，同时子节点也可以在分支出孙子节点，以此类推。每个子节点所包含的资源都是父节点资源的子集。

为了理解层级树的概念：其实cpuset做为cgroup一个子系统实现，也遵循了cgroup层级树的概念。举个例子：有一个cpuset /，可以在cpuset /下再建立若干个cpuset。比如：建立background，foreground两个子cpuset。background和foreground互为兄弟，cpuset /属于父。子cpuset中的cpu和node节点集合必须是父亲cpuset中cpu和内存节点集合的子集。

对cpuset的操作都是通过cpuset文件系统来完成的，内核没有提供额外的系统调用对cpuset做修改、查看操作。在android下，cpusets的文件系统路径：

/dev/cpuset

在/proc/#pid/status中如下几行也可以说明一个进程运行在哪些cpu上，并且进程分配内存必须在哪些内存节点上：

Cpus_allowed:   ffffffff,ffffffff,ffffffff,ffffffff
Cpus_allowed_list:      0-127
Mems_allowed:   ffffffff,ffffffff
Mems_allowed_list:      0-63

例如（android平台只有8个cpu core，1个内存节点）：

Cpus_allowed:    ff
Cpus_allowed_list:    0-7
Mems_allowed:    1
Mems_allowed_list:    0

每一个cpuset对应cgroup文件系统中的子系统，，目录下有些文件，用来描述cpuset的属性。对应的文件里列表如下：

1.cpuset.cpus 　　//cpuset中的cpu列表
2.cpuset.mems 　　//cpuset中的内存节点列表
3.cpuset.memory_migrate　　//cpuset内存迁移，见1.9
4.cpuset.cpu_exclusive 　　//cpuset是否是cpu互斥的，见1.4
5.cpuset.mem_exclusive 　　//cpuset是否是内存互斥的，见1.4
6.cpuset.mem_hardwall 　　 //cpuset是否是hardwalled的，见1.4
7.cpuset.memory_pressure  　　//内存使用的紧张程度，见1.5
8.cpuset.memory_spread_page 　　//如果被设置了，将该cpuset中进程上下文申请的page cache平均分布到cpuset中的各个节点，见1.6
9.cpuset.memory_spread_slab 　　//如果被设置了，将该cpuset中进程上下文申请到的slab对象平均分布到cpuset中的各个内存节点，见1.6
10.cpuset.sched_load_balance　　//如果被设置了，负载均衡会在cpuset配置的cpus中进行，见1.7
11.cpuset.sched_relax_domain_level　　//当要task迁移时，搜索的范围，见1.8

下面文件只有在根cpuset中才有：
12.cpuset.memory_pressure_enabled　　//使能memory pressure测量的flag

其实在cpusets之前，已经有一套机制来限制某个进程只能被调度到某些cpu上运行(sched_setaffinity)，限制某些进程的内存申请只能在某些内存节点上分配(mbind，set_mempolicy)。

而cpusets进行了扩展：

cpusets是cpu和memory节点的集合，并且对kernel可见的。
每个task struct中有一个指针指向了cgroup数据结构（cpuset是cgroup的一个子系统），通过这个指针，将进程添加到具体的cpuset中。
调用sched_setaffinity、mbind/set_mempolicy对应的cpu必须在对应task的cpuset中
根节点的cpuset包含了所有cpu和memory节点
对任意cpuset，都可以再定义其子cpuset，子cpuset中包含的cpu和内存节点是父cpuset的子集
cpusets的层级结构可以mount到/dev/cpuset，user space可以通过其进行查看和操作
如果一个cpuset被标记为专有，则该cpuset的兄弟cpuset中包含的cpu和内存节点不能和它的cpu和memory节点有交集
可以查看到任一cpuset上所有task的pid

cpusets的实现，需要在现有kernel中添加一些hook函数，而这些hook函数不会添加到kernel关键热点路径上（不影响性能）：

　　1.3.1 在init/main.c中，当系统boot时，初始化根节点的cpuset：

　　start_kernel()--> cpuset_init()

　　start_kernel()--> rest_init()--> kernel_thread(kernel_init, NULL, CLONE_FS) --> kernel_init() --> kernel_init_freeable() --> do_basic_setup() --> cpuset_init_smp()

struct cpuset {
    struct cgroup_subsys_state css;

    unsigned long flags;        /* "unsigned long" so bitops work */

    /*
     * On default hierarchy:
     *
     * The user-configured masks can only be changed by writing to
     * cpuset.cpus and cpuset.mems, and won't be limited by the
     * parent masks.
     *
     * The effective masks is the real masks that apply to the tasks
     * in the cpuset. They may be changed if the configured masks are
     * changed or hotplug happens.
     *
     * effective_mask == configured_mask & parent's effective_mask,
     * and if it ends up empty, it will inherit the parent's mask.
     *
     *
     * On legacy hierachy:
     *
     * The user-configured masks are always the same with effective masks.
     */

    /* user-configured CPUs and Memory Nodes allow to tasks */
    cpumask_var_t cpus_allowed;
    cpumask_var_t cpus_requested;
    nodemask_t mems_allowed;

    /* effective CPUs and Memory Nodes allow to tasks */
    cpumask_var_t effective_cpus;
    nodemask_t effective_mems;

    /*
     * This is old Memory Nodes tasks took on.
     *
     * - top_cpuset.old_mems_allowed is initialized to mems_allowed.
     * - A new cpuset's old_mems_allowed is initialized when some
     *   task is moved into it.
     * - old_mems_allowed is used in cpuset_migrate_mm() when we change
     *   cpuset.mems_allowed and have tasks' nodemask updated, and
     *   then old_mems_allowed is updated to mems_allowed.
     */
    nodemask_t old_mems_allowed;

    struct fmeter fmeter;        /* memory_pressure filter */

    /*
     * Tasks are being attached to this cpuset.  Used to prevent
     * zeroing cpus/mems_allowed between ->can_attach() and ->attach().
     */
    int attach_in_progress;

    /* partition number for rebuild_sched_domains() */
    int pn;

    /* for custom sched domain */
    int relax_domain_level;
};

int __init cpuset_init(void)
{
    int err = 0;

    BUG_ON(!alloc_cpumask_var(&top_cpuset.cpus_allowed, GFP_KERNEL));
    BUG_ON(!alloc_cpumask_var(&top_cpuset.effective_cpus, GFP_KERNEL));
    BUG_ON(!alloc_cpumask_var(&top_cpuset.cpus_requested, GFP_KERNEL));

    cpumask_setall(top_cpuset.cpus_allowed);
    cpumask_setall(top_cpuset.cpus_requested);
    nodes_setall(top_cpuset.mems_allowed);
    cpumask_setall(top_cpuset.effective_cpus);
    nodes_setall(top_cpuset.effective_mems);

    fmeter_init(&top_cpuset.fmeter);
    set_bit(CS_SCHED_LOAD_BALANCE, &top_cpuset.flags);
    top_cpuset.relax_domain_level = -1;　　　　　　　　　　　　//以上这些都是在初始化根节点cpuset的参数

    err = register_filesystem(&cpuset_fs_type);　　　　　　　//注册文件系统
    if (err < 0)
        return err;

    BUG_ON(!alloc_cpumask_var(&cpus_attach, GFP_KERNEL));

    return 0;
}

/**
 * cpuset_init_smp - initialize cpus_allowed
 *
 * Description: Finish top cpuset after cpu, node maps are initialized
 */
void __init cpuset_init_smp(void)
{
    cpumask_copy(top_cpuset.cpus_allowed, cpu_active_mask);
    top_cpuset.mems_allowed = node_states[N_MEMORY];　　　　　　　　　　　　　　　 //node_states[N_MEMORY]是存放了所有online的内存节点，当内存hotplug时，会发生变化
    top_cpuset.old_mems_allowed = top_cpuset.mems_allowed;

    cpumask_copy(top_cpuset.effective_cpus, cpu_active_mask);
    top_cpuset.effective_mems = node_states[N_MEMORY];　　　　　　　　　　　　　　 //在cpu、memory表初始化后，完成cpuset剩余参数的初始化

    register_hotmemory_notifier(&cpuset_track_online_nodes_nb);　　　　　　　　  //注册了一个notify，当cpuset中cpu或者memory发现改变了（hotplug），就会工作

    cpuset_migrate_mm_wq = alloc_ordered_workqueue("cpuset_migrate_mm", 0);　　//创建了一个工作队列
    BUG_ON(!cpuset_migrate_mm_wq);
}

　　1.3.2 在进程fork/exit时，会从对应的cpuset中执行attach/detach：

　　fork调用路径1：_do_fork() -> copy_process() -> cgroup_fork()

/**
 * cgroup_fork - initialize cgroup related fields during copy_process()
 * @child: pointer to task_struct of forking parent process.
 *
 * A task is associated with the init_css_set until cgroup_post_fork()
 * attaches it to the parent's css_set.  Empty cg_list indicates that
* @child isn't holding reference to its css_set.
 */
void cgroup_fork(struct task_struct *child)
{
    RCU_INIT_POINTER(child->cgroups, &init_css_set);　　　　//初始化子进程的css_set（cgroup subsystem set）
    INIT_LIST_HEAD(&child->cg_list);
}

　　fork调用路径2：_do_fork() -> copy_process() -> cgroup_post_fork() -> css_set_move_task() -> cgroup_move_task() -> rcu_assign_pointer(task->cgroups, to);

　　　　　　　　其中，cgroup_move_task()中，会将子进程移动到父进程的cgroup中，操作由css_set指针赋值完成：

　　　　　　　　to指针为父进程的css_set指针，而task->cgroups则是child的css_set指针，最后通过指针赋值，子进程将attach到父进程的cgroup set中。

　　fork调用路径3：_do_fork() -> copy_process() -> cgroup_post_fork() -> 调用每个cgroup子系统的fork(): ss->fork(child) -> cpuset_fork()

/*
 * Make sure the new task conform to the current state of its parent,
 * which could have been changed by cpuset just after it inherits the
 * state from the parent and before it sits on the cgroup's task list.
 */
static void cpuset_fork(struct task_struct *task)　　
{
    if (task_css_is_root(task, cpuset_cgrp_id))
        return;

    set_cpus_allowed_ptr(task, &current->cpus_allowed);
    task->mems_allowed = current->mems_allowed;　　　　//继承父进程current的mem_allowed
}

　　　　　　　　接着：cpuset_fork() ->set_cpus_allowed_ptr() -> __set_cpus_allowed_ptr()如下 -> do_set_cpus_allowed() -> p->sched_class->set_cpus_allowed(p, new_mask);

/*
 * Change a given task's CPU affinity. Migrate the thread to a
 * proper CPU and schedule it away if the CPU it's executing on
 * is removed from the allowed bitmask.
 *
 * NOTE: the caller must have a valid reference to the task, the
 * task must not exit() & deallocate itself prematurely. The
 * call is not atomic; no spinlocks may be held.
 */
static int __set_cpus_allowed_ptr(struct task_struct *p,
                  const struct cpumask *new_mask, bool check)
{
    const struct cpumask *cpu_valid_mask = cpu_active_mask;
    unsigned int dest_cpu;
    struct rq_flags rf;
    struct rq *rq;
    int ret = 0;
    cpumask_t allowed_mask;

    rq = task_rq_lock(p, &rf);
    update_rq_clock(rq);

    if (p->flags & PF_KTHREAD) {　　　　　　　　//所有kernel进程默认可以运行在所有online的cpu上
        /*
         * Kernel threads are allowed on online && !active CPUs
         */
        cpu_valid_mask = cpu_online_mask;
    }

    /*
     * Must re-check here, to close a race against __kthread_bind(),
     * sched_setaffinity() is not guaranteed to observe the flag.
     */
    if (check && (p->flags & PF_NO_SETAFFINITY)) {　　//thread不允许改变cpu亲和度
        ret = -EINVAL;
        goto out;
    }

    if (cpumask_equal(&p->cpus_allowed, new_mask))　　//当前可运行的cpu和要设置的cpu相等，那就不需要重复设置
        goto out;

    cpumask_andnot(&allowed_mask, new_mask, cpu_isolated_mask);　　//将父进程的cpus_allowed中去掉isolate的cpu
    cpumask_and(&allowed_mask, &allowed_mask, cpu_valid_mask);　　 //再从中筛选可以进行task迁移的cpu

    dest_cpu = cpumask_any(&allowed_mask);　　　　　　　　　　　　//最后再筛选出的结果中挑选一个dest_cpu
    if (dest_cpu >= nr_cpu_ids) {　　　　　　　　　　　　　　　　　　//如果dest_cpu的index超过最大cpu的index，则需要重新挑选
        cpumask_and(&allowed_mask, cpu_valid_mask, new_mask);
        dest_cpu = cpumask_any(&allowed_mask);
        if (!cpumask_intersects(new_mask, cpu_valid_mask)) {
            ret = -EINVAL;
            goto out;
        }
    }

    do_set_cpus_allowed(p, new_mask);　　　　　　//详细见下

    if (p->flags & PF_KTHREAD) {
        /*
         * For kernel threads that do indeed end up on online &&
         * !active we want to ensure they are strict per-CPU threads.
         */
        WARN_ON(cpumask_intersects(new_mask, cpu_online_mask) &&
            !cpumask_intersects(new_mask, cpu_active_mask) &&
            p->nr_cpus_allowed != 1);
    }

    /* Can the task run on the task's current CPU? If so, we're done */
    if (cpumask_test_cpu(task_cpu(p), &allowed_mask))　　　　　　//如果task能运行在task本来运行的cpu上，则直接退出
        goto out;

    if (task_running(rq, p) || p->state == TASK_WAKING) {　　//判断task是否处于running或者waking状态（不同task状态，不同的迁移方式）
        struct migration_arg arg = { p, dest_cpu };
        /* Need help from migration thread: drop lock and wait. */
        task_rq_unlock(rq, p, &rf);
        stop_one_cpu(cpu_of(rq), migration_cpu_stop, &arg);　　　　//在当前cpu上执行migration_cpu_stop函数，arg为函数的参数，执行完后stop当前cpu
        tlb_migrate_finish(p->mm);
        return 0;
    } else if (task_on_rq_queued(p)) {　　　　　　　　　　　　//task是否在rq中
        /*
         * OK, since we're going to drop the lock immediately
         * afterwards anyway.
         */
        rq = move_queued_task(rq, &rf, p, dest_cpu);　　//把task迁移到dest_cpu上
    }
out:
    task_rq_unlock(rq, p, &rf);

    return ret;
}

　　　　　　　　调用至对应sched class的set_cpu_allowed（除deadline class，其余class都调用的是set_cpus_allowed_common()），如下：

void set_cpus_allowed_common(struct task_struct *p, const struct cpumask *new_mask)
{
    cpumask_copy(&p->cpus_allowed, new_mask);　　　　//继承父进程的cpus_allowed
    p->nr_cpus_allowed = cpumask_weight(new_mask);　　//计算子进程的cpus_allowed数量
}

/*
 * migration_cpu_stop - this will be executed by a highprio stopper thread
 * and performs thread migration by bumping thread off CPU then
 * 'pushing' onto another runqueue.
 */
static int migration_cpu_stop(void *data)
{
    struct migration_arg *arg = data;
    struct task_struct *p = arg->task;
    struct rq *rq = this_rq();
    struct rq_flags rf;

    /*
     * The original target CPU might have gone down and we might
     * be on another CPU but it doesn't matter.
     */
    local_irq_disable();
    /*
     * We need to explicitly wake pending tasks before running
     * __migrate_task() such that we will not miss enforcing cpus_allowed
     * during wakeups, see set_cpus_allowed_ptr()'s TASK_WAKING test.
     */
    sched_ttwu_pending();

    raw_spin_lock(&p->pi_lock);
    rq_lock(rq, &rf);
    /*
     * If task_rq(p) != rq, it cannot be migrated here, because we're
     * holding rq->lock, if p->on_rq == 0 it cannot get enqueued because
     * we're holding p->pi_lock.
     */
    if (task_rq(p) == rq) {
        if (task_on_rq_queued(p))
            rq = __migrate_task(rq, &rf, p, arg->dest_cpu);　　//迁移到dest_cpu
        else
            p->wake_cpu = arg->dest_cpu;
    }
    rq_unlock(rq, &rf);
    raw_spin_unlock(&p->pi_lock);

    local_irq_enable();
    return 0;
}

　　　　　　　　然后，在_do_fork() -> wake_up_new_task() -> select_task_rq()中，选择满足cpus_allowed的cpu来执行子进程：

/*
 * The caller (fork, wakeup) owns p->pi_lock, ->cpus_allowed is stable.
 */
static inline
int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags,
           int sibling_count_hint)
{
    bool allow_isolated = (p->flags & PF_KTHREAD);

    lockdep_assert_held(&p->pi_lock);

    if (p->nr_cpus_allowed > 1)
        cpu = p->sched_class->select_task_rq(p, cpu, sd_flags, wake_flags,　　//通过调用对应sched class的select_task_rq来选择执行子进程的cpu
                             sibling_count_hint);
    else
        cpu = cpumask_any(&p->cpus_allowed);

    /*
     * In order not to call set_task_cpu() on a blocking task we need
     * to rely on ttwu() to place the task on a valid ->cpus_allowed
     * CPU.
     *
     * Since this is common to all placement strategies, this lives here.
     *
     * [ this allows ->select_task() to simply return task_cpu(p) and
     *   not worry about this generic constraint ]
     */
    if (unlikely(!is_cpu_allowed(p, cpu)) ||
            (cpu_isolated(cpu) && !allow_isolated))
        cpu = select_fallback_rq(task_cpu(p), p, allow_isolated);　　　　//筛选执行cpu满足cpu_allowed

    return cpu;
}

　　1.3.3 在设置cpu亲和度（sched_setaffinity）时，会过滤cpuset中配置：

　　当执行sched_setaffinity设置cpu亲和度的调用路径：sched_setaffinity() -> get_user_cpu_mask()

　　　　　　　　　　　　　　　　　　　　　　　　　 sched_setaffinity() -> sched_setaffinity()

　　首先在get_user_cpu_mask()中，从user space获取要设置的cpu mask；

　　之后在sched_setaffinity()在再进行详细的过滤，从而设置准确的task cpu mask，并挑选出合适的cpu执行task

long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
{
    cpumask_var_t cpus_allowed, new_mask;
    struct task_struct *p;
    int retval;
    int dest_cpu;
    cpumask_t allowed_mask;

    rcu_read_lock();

    p = find_process_by_pid(pid);
    if (!p) {
        rcu_read_unlock();
        return -ESRCH;
    }

    /* Prevent p going away */
    get_task_struct(p);
    rcu_read_unlock();

    if (p->flags & PF_NO_SETAFFINITY) {
        retval = -EINVAL;
        goto out_put_task;
    }
    if (!alloc_cpumask_var(&cpus_allowed, GFP_KERNEL)) {
        retval = -ENOMEM;
        goto out_put_task;
    }
    if (!alloc_cpumask_var(&new_mask, GFP_KERNEL)) {
        retval = -ENOMEM;
        goto out_free_cpus_allowed;
    }
    retval = -EPERM;
    if (!check_same_owner(p)) {
        rcu_read_lock();
        if (!ns_capable(__task_cred(p)->user_ns, CAP_SYS_NICE)) {
            rcu_read_unlock();
            goto out_free_new_mask;
        }
        rcu_read_unlock();
    }

    retval = security_task_setscheduler(p);
    if (retval)
        goto out_free_new_mask;


    cpuset_cpus_allowed(p, cpus_allowed);　　　　　　　　//获取task cpuset中allowed_cpu
    cpumask_and(new_mask, in_mask, cpus_allowed);　　　//将通过user space中获取的cpu mask与task的cpuset进行过滤，找出能同时满足条件的cpu（new_mask）

    /*
     * Since bandwidth control happens on root_domain basis,
     * if admission test is enabled, we only admit -deadline
     * tasks allowed to run on all the CPUs in the task's
     * root_domain.
     */
#ifdef CONFIG_SMP
    if (task_has_dl_policy(p) && dl_bandwidth_enabled()) {
        rcu_read_lock();
        if (!cpumask_subset(task_rq(p)->rd->span, new_mask)) {
            retval = -EBUSY;
            rcu_read_unlock();
            goto out_free_new_mask;
        }
        rcu_read_unlock();
    }
#endif
again:
    cpumask_andnot(&allowed_mask, new_mask, cpu_isolated_mask);　　//从上面得到的new_mask中，过滤掉isolate的cpu，结果保存在allowed_mask
    dest_cpu = cpumask_any_and(cpu_active_mask, &allowed_mask);　　//从cpu_active_mask（处于active状态的cpu：可以进行task迁移）和allowed_mask中找到满足条件的dest_cpu
    if (dest_cpu < nr_cpu_ids) {　　　　　　　　　　　　　　　　　　　　//如果dest_cpu的index小于当前cpu核最大的index
        retval = __set_cpus_allowed_ptr(p, new_mask, true);　　　　//与fork的路径类似，为设置new_mask为task新的cpus_allowed，并将task执行在dest_cpu上
        if (!retval) {
            cpuset_cpus_allowed(p, cpus_allowed);　　　　　　　　　　//再重新获取task的cpus_allowed
            if (!cpumask_subset(new_mask, cpus_allowed)) {　　　　 //判断new_mask是否是cpus_allowed的子集，如果不是，则说明cpuset又被修改了。则需要更新task的cpus_allowed
                /*
                 * We must have raced with a concurrent cpuset
                 * update. Just reset the cpus_allowed to the
                 * cpuset's cpus_allowed
                 */
                cpumask_copy(new_mask, cpus_allowed);　　　　　　　　//赋值new_mask = cpus_allowed
                goto again;　　　　　　　　　　　　　　　　　　　　　　　 //重复上面步骤
            }
        }
    } else {
        retval = -EINVAL;
    }

    if (!retval && !(p->flags & PF_KTHREAD))　　　　　　　　　　　　　　//PF_KTHREAD代表是否是kernel space创建的进程
        cpumask_and(&p->cpus_requested, in_mask, cpu_possible_mask);　　//将user space设置的cpu mask与cpu_possible_mask（平台支持hotplug，所以它包含所有cpu核）
　　　　　　　　　　　　　　　　　　　　　　　　　　　　　　　　　　　　　　　　　　 //取同时满足上面2个条件的cpu，存放到task->cpus_requested中
out_free_new_mask:
    free_cpumask_var(new_mask);
out_free_cpus_allowed:
    free_cpumask_var(cpus_allowed);
out_put_task:
    put_task_struct(p);
    return retval;
}

　　1.3.4 在task迁移时，会遵循cpuset的配置：

　　task迁移本质上就是将此cpu rq中的task，放到另一个cpu的rq上。在上面讲到的fork中，子进程就会在创建之后，进行task迁移，当然同时会遵循cpuset的配置；除了fork，还有更一般的调用_migrate_task的进行task迁移的情况，同样也会对cpuset配置进行过滤。

/*
 * Move (not current) task off this CPU, onto the destination CPU. We're doing
 * this because either it can't run here any more (set_cpus_allowed()
 * away from this CPU, or CPU going down), or because we're
 * attempting to rebalance this task on exec (sched_exec).
 *
 * So we race with normal scheduler movements, but that's OK, as long
 * as the task is no longer on this CPU.
 */
static struct rq *__migrate_task(struct rq *rq, struct rq_flags *rf,
                 struct task_struct *p, int dest_cpu)
{
    /* Affinity changed (again). */
    if (!is_cpu_allowed(p, dest_cpu))
        return rq;

    update_rq_clock(rq);
    rq = move_queued_task(rq, rf, p, dest_cpu);　　//迁移task到dest_cpu

    return rq;
}

　　可以看到有调用is_cpu_allowed()来检测dest_cpu是否在task的cpus_allowed中，并根据是否是内核线程，进行进一步筛选：

/*
 * Per-CPU kthreads are allowed to run on !actie && online CPUs, see
 * __set_cpus_allowed_ptr() and select_fallback_rq().
 */
static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
{
    if (!cpumask_test_cpu(cpu, &p->cpus_allowed))　　//检测dest_cpu是否allow。如果不allow，则return false，且不满足迁移条件
        return false;

    if (is_per_cpu_kthread(p))　　　　 //检测task是否是kernel thread，内核进程默认是可以运行在所有online的cpu上
        return cpu_online(cpu);　　　　//cpu是否online（available to scheduler,可以调度运行）

    return cpu_active(cpu);　　　　　　//cpu是否active（available to migration,可以迁移）
}

　　1.3.5 在系统调用：mbind、set_mempolicy时，会过滤cpuset中的内存节点：

　　mbind调用路径：kernel_mbind() -> do_mbind() -> mpol_set_nodemask()

　　set_mempolicy调用路径：kernel_set_mempolicy() -> do_set_mempolicy() -> mpol_set_nodemask()

　　函数是在创建新mem policy之后，set up内存node mask

/*
 * mpol_set_nodemask is called after mpol_new() to set up the nodemask, if
 * any, for the new policy.  mpol_new() has already validated the nodes
 * parameter with respect to the policy mode and flags.  But, we need to
 * handle an empty nodemask with MPOL_PREFERRED here.
 *
 * Must be called holding task's alloc_lock to protect task's mems_allowed
 * and mempolicy.  May also be called holding the mmap_semaphore for write.
 */
static int mpol_set_nodemask(struct mempolicy *pol,
             const nodemask_t *nodes, struct nodemask_scratch *nsc)
{
    int ret;

    /* if mode is MPOL_DEFAULT, pol is NULL. This is right. */
    if (pol == NULL)
        return 0;
    /* Check N_MEMORY */
    nodes_and(nsc->mask1,
          cpuset_current_mems_allowed, node_states[N_MEMORY]);　　//获取当前task所允许的内存节点，并与online的节点过滤，找出同时满足条件的节点，存到nsc->mask1中

    VM_BUG_ON(!nodes);
    if (pol->mode == MPOL_PREFERRED && nodes_empty(*nodes))
        nodes = NULL;    /* explicit local allocation */
    else {
        if (pol->flags & MPOL_F_RELATIVE_NODES)　　　　　　　　　　　　　　　　//根据flag不同，将上面得到的nsc->mask1，再次通过不同计算，得到不同的node mask存到nsc->mask2
            mpol_relative_nodemask(&nsc->mask2, nodes, &nsc->mask1);
        else
            nodes_and(nsc->mask2, *nodes, nsc->mask1);

        if (mpol_store_user_nodemask(pol))　　　　　　　　　　　　//根据flag不同，保存不同的node mask结果
            pol->w.user_nodemask = *nodes;　　　　　　　　　　　　//保存user space传入的内存节点
        else
            pol->w.cpuset_mems_allowed =
                        cpuset_current_mems_allowed;　　　　　　//保存task允许的内存节点
    }

    if (nodes)
        ret = mpol_ops[pol->mode].create(pol, &nsc->mask2);　　//根据不同的policy，调用不同的create方式，allocate内存节点
    else
        ret = mpol_ops[pol->mode].create(pol, NULL);
    return ret;
}

　　1.3.6 在page_alloc.c中，会限制申请内存节点：

　　 kmalloc() -> kmalloc_large() -> kmalloc_order_trace() -> kmalloc_order() --> alloc_pages() -> alloc_pages_node() -> __alloc_pages_node() -> __alloc_pages() ->__alloc_pages_nodemask() -> get_page_from_freelist() -> __cpuset_zone_allowed()-> __cpuset_node_allowed()

　　这个函数用于判断是否可以申请一个内存节点。逻辑判断顺序如下：

　　中断中申请内存：　　　　　　✔
　　申请的节点在task mem_allowed内：✔
　　OOM被杀的进程：　　　　　 ✔
　　user space申请（user space申请都会设置__GFP_HARDWALL）：✖
　　正在退出的进程：　　　　　✔
　　未设置__GFP_HARDWALL，但是节点在mems_allowed之外，的内核进程申请：需要往上层寻找带有内存互斥exclusive或者hardwall，并对当前节点有影响的上层节点，并判断当前的node是否在其允许范围内。如果在范围内，则允许申请。

/**
 * cpuset_node_allowed - Can we allocate on a memory node?
 * @node: is this an allowed node?
 * @gfp_mask: memory allocation flags
 *
 * If we're in interrupt, yes, we can always allocate.  If @node is set in
 * current's mems_allowed, yes.  If it's not a __GFP_HARDWALL request and this
 * node is set in the nearest hardwalled cpuset ancestor to current's cpuset,
 * yes.  If current has access to memory reserves as an oom victim, yes.
 * Otherwise, no.
 *
 * GFP_USER allocations are marked with the __GFP_HARDWALL bit,
 * and do not allow allocations outside the current tasks cpuset
 * unless the task has been OOM killed.
 * GFP_KERNEL allocations are not so marked, so can escape to the
 * nearest enclosing hardwalled ancestor cpuset.
 *
 * Scanning up parent cpusets requires callback_lock.  The
 * __alloc_pages() routine only calls here with __GFP_HARDWALL bit
 * _not_ set if it's a GFP_KERNEL allocation, and all nodes in the
 * current tasks mems_allowed came up empty on the first pass over
 * the zonelist.  So only GFP_KERNEL allocations, if all nodes in the
 * cpuset are short of memory, might require taking the callback_lock.
 *
 * The first call here from mm/page_alloc:get_page_from_freelist()
 * has __GFP_HARDWALL set in gfp_mask, enforcing hardwall cpusets,
 * so no allocation on a node outside the cpuset is allowed (unless
 * in interrupt, of course).
 *
 * The second pass through get_page_from_freelist() doesn't even call
 * here for GFP_ATOMIC calls.  For those calls, the __alloc_pages()
 * variable 'wait' is not set, and the bit ALLOC_CPUSET is not set
 * in alloc_flags.  That logic and the checks below have the combined
 * affect that:
 *    in_interrupt - any node ok (current task context irrelevant)
 *    GFP_ATOMIC   - any node ok
 *    tsk_is_oom_victim   - any node ok
 *    GFP_KERNEL   - any node in enclosing hardwalled cpuset ok
 *    GFP_USER     - only nodes in current tasks mems allowed ok.
 */
bool __cpuset_node_allowed(int node, gfp_t gfp_mask)
{
    struct cpuset *cs;        /* current cpuset ancestors */
    int allowed;            /* is allocation in zone z allowed? */
    unsigned long flags;

    if (in_interrupt())　　　　　　　　　　　　　　　　　　　　　　　　　　　　　　//在中断中，永远允许内存申请
        return true;
    if (node_isset(node, current->mems_allowed))　　　　　　　　　　　　　　　//节点在task的允许范围内
        return true;
    /*
     * Allow tasks that have access to memory reserves because they have
     * been OOM killed to get memory anywhere.
     */
    if (unlikely(tsk_is_oom_victim(current)))　　　　　　　　　　　　　　　　　　//oom被kill掉的进程，允许申请
        return true;
    if (gfp_mask & __GFP_HARDWALL)    /* If hardwall request, stop here */　　//如果是user space的申请request，并且设置了__GFP_HARDWALL，那么不允许
        return false;

    if (current->flags & PF_EXITING) /* Let dying task have memory */　　　　//正在退出的进程，允许申请
        return true;

    /* Not hardwall and node outside mems_allowed: scan up cpusets */
    spin_lock_irqsave(&callback_lock, flags);

    rcu_read_lock();
    cs = nearest_hardwall_ancestor(task_cs(current));　　　　　　　　　　　　//往上层寻找带有内存互斥exclusive或者hardwall，并对当前节点有影响的上层节点
    allowed = node_isset(node, cs->mems_allowed);　　　　　　　　　　　　　　 //判断当前的node是否在上层节点的允许范围内
    rcu_read_unlock();

    spin_unlock_irqrestore(&callback_lock, flags);
    return allowed;
}

　　1.3.7 在vmscan.c中，限制page恢复到当前cpuset：

　　调用路径：shrink_zones() -> cpuset_zone_allowed() -> __cpuset_zone_allowed() -> __cpuset_node_allowed()

　　shrink_zones()函数是linux内存页回收的核心函数。而对于cpuset中内存节点的限制判断逻辑，与page_alloc.c中的完全一样。

/*
 * This is the direct reclaim path, for page-allocating processes.  We only
 * try to reclaim pages from zones which will satisfy the caller's allocation
 * request.
 *
 * If a zone is deemed to be full of pinned pages then just give it a light
 * scan then give up on it.
 */
static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
{
    struct zoneref *z;
    struct zone *zone;
    unsigned long nr_soft_reclaimed;
    unsigned long nr_soft_scanned;
    gfp_t orig_mask;
    pg_data_t *last_pgdat = NULL;

    /*
     * If the number of buffer_heads in the machine exceeds the maximum
     * allowed level, force direct reclaim to scan the highmem zone as
     * highmem pages could be pinning lowmem pages storing buffer_heads
     */
    orig_mask = sc->gfp_mask;
    if (buffer_heads_over_limit) {
        sc->gfp_mask |= __GFP_HIGHMEM;
        sc->reclaim_idx = gfp_zone(sc->gfp_mask);
    }

    for_each_zone_zonelist_nodemask(zone, z, zonelist,
                    sc->reclaim_idx, sc->nodemask) {
        /*
         * Take care memory controller reclaiming has small influence
         * to global LRU.
         */
        if (global_reclaim(sc)) {
            if (!cpuset_zone_allowed(zone,
                         GFP_KERNEL | __GFP_HARDWALL))　　　　　　//设置了__GFP_HARDWALL，所以只能是在task允许的范围内，或者在中断，或者是oom受害者，才可以进行内存回收
                continue;

            /*
             * If we already have plenty of memory free for
             * compaction in this zone, don't free any more.
             * Even though compaction is invoked for any
             * non-zero order, only frequent costly order
             * reclamation is disruptive enough to become a
             * noticeable problem, like transparent huge
             * page allocations.
             */
            if (IS_ENABLED(CONFIG_COMPACTION) &&
                sc->order > PAGE_ALLOC_COSTLY_ORDER &&
                compaction_ready(zone, sc)) {
                sc->compaction_ready = true;
                continue;
            }

            /*
             * Shrink each node in the zonelist once. If the
             * zonelist is ordered by zone (not the default) then a
             * node may be shrunk multiple times but in that case
             * the user prefers lower zones being preserved.
             */
            if (zone->zone_pgdat == last_pgdat)
                continue;

            /*
             * This steals pages from memory cgroups over softlimit
             * and returns the number of reclaimed pages and
             * scanned pages. This works for global memory pressure
             * and balancing, not for a memcg's limit.
             */
            nr_soft_scanned = 0;
            nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone->zone_pgdat,
                        sc->order, sc->gfp_mask,
                        &nr_soft_scanned);
            sc->nr_reclaimed += nr_soft_reclaimed;
            sc->nr_scanned += nr_soft_scanned;
            /* need some check for avoid more shrink_zone() */
        }

        /* See comment about same check for global reclaim above */
        if (zone->zone_pgdat == last_pgdat)
            continue;
        last_pgdat = zone->zone_pgdat;
        shrink_node(zone->zone_pgdat, sc);
    }

    /*
     * Restore to original mask to avoid the impact on the caller if we
     * promoted it to __GFP_HIGHMEM.
     */
    sc->gfp_mask = orig_mask;
}

init cpusets代码分析中，还要分析hotplug部分。。。。。。。。待续

1.4 什么是互斥cpuset?

cpu/mem互斥

cpuset中有两个bool类型的参数cpuset.mem_exclusive,cpuset.cpu_exclusive，分别表示当前的cpuset是否是内存互斥和cpu互斥的，自此同一层级的cpuset不能共享互斥的资源。

我们在上面已经理解了层级树的概念，那么cpu互斥就是：如果某个cpuset是cpu互斥的，那么该cpuset中的cpu是不能被该cpuset的兄弟cpuset共享的，当然该cpuset中的cpu是能够和该cpuset的祖先cpuset和子孙cpuset共享的。

注意：设置exclusive时，需要其父节点也为exclusive。否则会设置失败。所以，最终还是说明cpuset的互斥，就是从根节点开始，就已经互斥了。（当前android平台下，默认并没有设置cpu/内存互斥）

内存互斥也是一样，只不过互斥的资源换成了内存节点。

内存mem_hardwall

cpuset还有一个bool类型的参数cpuset.mem_hardwall，如果cpuset被设置成内存互斥（mem_exclusive）或者被设置了cpuset.mem_hardwall，那么都认为这个cpuset是mem_hardwalled。

mem_hardwalled的cpuset中的进程在内核态申请的内存(主要有直接申请的页，以及通过slab系统分配的内存，还有文件系统page cache)会受到cpuset的限制，进而在该cpuset内的内存节点中分配内存。对比：不管cpuset是否是mem_hardwalled，cpuset中的进程在用户态申请的内存都会受到cpuset中内存节点的限制。所以得出：而mem_exclusive除了提供了内存互斥的功能，也包含了cpuset.mem_hardwall的功能。

通过上面，得到如下结论：

1、cpuset.mem_hardwall限制的是cpuset中进程在内核态申请的内存（必须符合cpuset的内存限制）。而用户态申请内存，不管有没有设置mem_hardwall，都会受到cpuset限制。

2、cpuset mem_hardwall=1，那么进程内核态的内存分配会受cpuset中内存节点的限制；反之，内核态则不受cpuset中内存节点的限制。

3、mem_exclusive包含了mem_hardwall的功能。

1.5 memory_pressure

memory_pressure提供了一个监控cpuset内存压力的方法。它会记录为了满足内存足够剩余空间的要求，释放cpuset中内存节点的次数。它是一个int型数值，单位为：每秒该cpuset尝试回收内存的次数*1000，半衰期为10秒。

这个属性可以有助于监控和管理当前cpuset下，各个任务对内存的影响，从而进行进一步地内存节点调配，或任务分配。

使能这个功能需要提前对节点：/dev/cpuset/memory_pressure_enabled，写入1之后，memory_pressure才会开始计算。

数据的计算可以被任何一个attach到该cpuset上的进程执行。当进行调用页回收时，就会执行计算。因为在每组查询上避免对任务列表进行扫描，系统上减少了监视此度量标准的批处理调度程序所施加的系统负载。

1.6 memory spread

有2个flag在cpuset中用于控制kernel从哪里申请内存，并将之用于文件系统得buffer或者相关的kernel内核数据结构。它们分别是cpuset.memory_spread_page 和 cpuset.memory_spread_slab。对其，写入1，则使能；写入0，则关闭使能。

如果cpuset.memory_spread_page被设置了，文件系统的buffer（page cache）将会允许从所有内存mem_allowed的节点中，而不是仅在task所在的节点中寻找free page。

如果cpuset.memory_spread_slab被设置了，slab cache（例如inode，dentry）将会允许从所有内存mem_allowed的节点中，而不是仅在task所在的节点中寻找free page。

默认情况下，flag是disable状态，内存申请会遵循NUMA mempolicy和cpuset的配置；而当打开了memory spread之后，会忽视NUMA mempolicy，并进行spread。一旦再次disable flag，则会继续遵循NUMA mempolicy。

1.7 sched_load_balance

负载均衡（load balance）

kernel调度器会自动进行均衡task的负载：一个cpu没有被充分利用，那么在此CPU上的kernel code就会去寻找overload的CPU，并把task移到空闲的CPU上。分配任务（Task Placement）遵循cpuset配置和cpu亲和度相关系统调用（sched_setaffinity）。

调度域（sched domain）

但是负载均衡算法的开销，以及对一些kernel共享的核心数据结构（比如task list）的影响，会随着执行负载均衡cpu核数的增加而成倍增加。所以调度器会将cpu分为几个调度域，而负载均衡仅会作用在每一个调度域之内。简而言之，在2个更小的调度域内进行负载均衡的开销，比在一个大的调度域内进行负载均衡的开销更小。但是在调度域之间就不会进行负载均衡了。不同调度域之间没有重叠部分，有的cpu可能不在任何调度域内，因为它不需要进行负载均衡。

cpuset.sched_load_balance

cpuset.sched_load_balance被设置之后，需要当前cpuset.cpus中的cpu是在同一个的调度域中，从而在内cpuset.cpus内进行负载均衡。

cpuset.sched_load_balance被disable之后，就是停止cpuset.cpus内的负载均衡。但是如果cpuset.cpus与其他cpuset中配置有重叠，并且设置了cpuset.sched_load_balance，那么仍然会在其cpus中进行负载均衡，例如，在根节点的cpuset中设置cpuset.sched_load_balance，那么就会在在所有cpu中进行负载均衡。

因此，在设置cpuset.sched_load_balance的时候，应该保持top层的节点不设置cpuset.sched_load_balance，而只在child层或者一些更小的cpuset中，进行设置。

同时，由于cpuset是呈层级网状结构的；而cpuset.sched_load_balance是一维flat结构，它们不会出现重叠，每个cpu最多只会出现在一个调度域中。

那么假如在2个cpuset中有出现cpus部分重叠，并且都设置了cpuset.sched_load_balance，那么我们应该重新构造一个单独的、包含这2个cpuset的扩展集合。因为我们遵循不将task移出当前cpuset中的规则，这样可以防止调度器产生多余的开销。

假如在2个cpuset中有出现cpus部分重叠，并且其中之一设置了cpuset.sched_load_balance，那么就会出现负载均衡仅在重叠的cpu中进行。针对这样的情况，我们就不能把一些可能需要较多CPU资源的task放在这样的cpuset中。

我们当前SDM845 Android平台，是在top cpuset下，已经打开fully负载均衡的。也就是一个调度域涵盖了整个系统，其他cpuset的设置已经不生效了。

1.8 sched_relax_domain_level

在调度域中，主要有2种触发迁移task的方式：1. tick周期性负载均衡 2. 出现某些调度events。

当一个task被唤醒，就会被迁移到idle cpu上。例如：当task A在CPU X上运行，并唤醒了同在CPU X上的另一个task B。那么调度器就会把task B迁移到CPU Y（CPU Y为CPU X的兄弟）。这样task B可以直接在CPU Y上运行，避免了等待task A执行的情况。当CPU的task跑完了，在进入idle之前，会尝试从其他繁忙的CPU上迁移一些task，帮繁忙的CPU分担一些task。

当然，寻找可迁移的task和需要迁移的CPU会产生一些开销，但调度器并不是每次都会在域中搜索所有的CPU。实际上，在一些架构下，因调度events触发时，搜索的范围会被限制在cpu所在的socket或者node节点；而在周期性负载均衡时，搜索范围则是所有cpu。例如：假设CPU Z与CPU X相对较远（非兄弟），那么尽管CPU Z处于idle，而CPU X以及其兄弟CPU都很繁忙，调度器还是不会将刚唤醒的task B迁移到CPU Z上，因为CPU Z不在搜索范围内。最后，要么task B等待task A执行结束后被执行，要么等到下次周期性负载均衡时进行迁移。

cpuset.sched_relax_domain_level

sched_relax_domain_level属性允许了用户对搜索范围的修改。它用int型的值来体现搜索范围的设置，如下：

-1 : no request. use system default or follow request of others.
0 : no search.
1 : search siblings (hyperthreads in a core).
2 : search cores in a package.
3 : search cpus in a node [= system wide on non-NUMA system]
4 : search nodes in a chunk of node [on NUMA system]
5 : search system wide [on NUMA system]

这个属性影响的是cpuset所在的调度域，所以sched_load_balance这个flag不能disable，因为disable了的话，就没有调度域了。

如果有多个cpuset部分重叠，因此他们会形成一个单独的调度域，那么sched_relax_domain_level就会使用其中的最大值。注：如果有1个设置的值为0，而其他都为-1，那么生效的值为0。

修改这个属性，会同时产生正面和负面影响，当我们不确定的时候，不要改动此值。

1.9 cpuset.memory_migrate

一般情况下，allocate一个page之后，只要它一直保持着allocate，那么这个page就会一直保持着，即便cpuset.mems后来发生了改变。
而当设置了这个flag之后，当task迁移到新的cpuset下时，就会将在原先cpuset所allocate的内存迁移到新的cpuset下，并且是与原先相对应的page位置。
比如，原先内存申请在原cpuset的第2个内存节点；那么在新cpuset也会使用第2个内存节点。迁移操作时，就会尽可能地将新cpuset中的第2个节点释放出来。

2. 使用example和语法

Andoird下cpuset的路径：

htc_imedugl:/ # ls /dev/cpuset
audio-app             effective_cpus          mems                     
background            effective_mems          notify_on_release        
camera-daemon         foreground              release_agent            
camera-video          mem_exclusive           restricted               
cgroup.clone_children mem_hardwall            sched_load_balance       
cgroup.procs          memory_migrate          sched_relax_domain_level 
cgroup.sane_behavior  memory_pressure         system-background        
cpu_exclusive         memory_pressure_enabled tasks                    
cpus                  memory_spread_page      top-app                  
memory_spread_slab

查看cpus和mems资源：

htc_imedugl:/dev/cpuset # cat cpus                                             
0-7
htc_imedugl:/dev/cpuset # cat mems
0

当前Android下，内存节点mems资源仅有一个。所以可以理解为cpuset仅控制了cpu core资源的分配。

2.1 创建新cpuset

在根节点下，创建一个新的cpuset，命名为 cpuset-test：

mkdir /dev/cpuset/cpuset-test

创建好后，目录如下：

htc_imedugl:/ # ls /dev/cpuset/cpuset-test                                     
cgroup.clone_children mem_exclusive      mems                     
cgroup.procs          mem_hardwall       notify_on_release        
cpu_exclusive         memory_migrate     sched_load_balance       
cpus                  memory_pressure    sched_relax_domain_level 
effective_cpus        memory_spread_page tasks                    
effective_mems        memory_spread_slab

但是其中cpus，mems都为空。这个需要手动设置填充：

未设置前，为空（此时进程是不能attach进来的）：
htc_imedugl:/ # cat dev/cpuset/cpuset-test/mems

htc_imedugl:/ # cat dev/cpuset/cpuset-test/cpus

2.2 修改cpuset

预设cpus为2-3，那么attach到此cpuset下的进程，就会限制在CPU 2-3上运行；而mems则是使用根节点的值 0，不受限制。

填充：
htc_imedugl:/dev/cpuset/cpuset-test # echo 2-3 > cpus
htc_imedugl:/dev/cpuset/cpuset-test # echo 0 > mems

htc_imedugl:/dev/cpuset/cpuset-test # cat cpus                                         
2-3
htc_imedugl:/dev/cpuset/cpuset-test # cat mems                                         
0

2.3 进程attach

假设我们创建一个sh进程，获得它的pid：

htc_imedugl:/dev/cpuset/ljj # while true; do a=a+1; done &
[1] 25035

此时默认该进程是挂在根节点的cpuset下：

htc_imedugl:/dev/cpuset/cpuset-test # cat /proc/25035/cgroup
4:cpuset:/
3:cpu:/
2:schedtune:/
1:cpuacct:/uid_0/pid_5099
0::/

再把该进程attach到cpuset-test下面：

把进程的pid echo到cpuset-test/tasks，就可以完成attach：
htc_imedugl:/dev/cpuset/cpuset-test # echo 25035 > tasks

htc_imedugl:/dev/cpuset/cpuset-test # cat tasks                                
25035

并且查看进程信息：
htc_imedugl:/dev/cpuset/cpuset-test # cat /proc/25035/cgroup
4:cpuset:/cpuset-test
3:cpu:/
2:schedtune:/
1:cpuacct:/uid_0/pid_5099
0::/

抓取systrace，可以看到进程确实运行在CPU2，CPU3上：

假如再修改cpus限制为core 7:

htc_imedugl:/dev/cpuset/cpuset-test # echo 7 > cpus
htc_imedugl:/dev/cpuset/cpuset-test # cat cpus                                 
7

再抓取systrace，可以看到25035进程已经被限制运行在CPU 7上：

2.4 删除cpuset

当cpuset中仍然有进程attach，是无法删除的：

rmdir: cpuset-test/: Device or resource busy

所以，我们先要把tasks移出，然后执行：

htc_imedugl:/dev/cpuset # rmdir cpuset-test/

就可以把cpuset删除了

2.5 其他flag设置

设置flag的方法都是写入int值的方式。

例如，cpu互斥：

echo 1 > cpu_exclusive