Cpusets学习

1. cpusets

1.1 什么是cpusets

cpusets基本功能是限制某一组进程只运行在某些cpu和内存节点上,举个简单例子:系统中有4个进程,4个内存节点,4个cpu.利用cpuset可以让第1,2个进程只运行在第1,2颗cpu上并且只在第1,2个内存节点上分配内存。cpuset是基于cgroup子系统实现(关于cgroup子系统可以参考内核文档 Documentation/cgroups/cgroups.txt.)使用cpuset上述功能可以让系统管理员动态调整进程运行所在的cpu和内存节点。

cpusets是cgroup文件系统中的一个子系统。

1.2 为什么需要cpusets

在大型的计算机系统中,有多颗cpu,若干内存节点。尤其在NUMA架构下,cpu访问不同内存节点的速度不同,这种情况增加了进程调度和进程内存分配目标node管理的难度。比较新的小型系统使用linux内核自带的调度功能和内存管理方案就能得到很好表现,但是在比较大的系统中如果精心调整不同应用所在的cpu和内存节点会大大提高性能表现。 

(NUMA架构)

cpuset在以下场景会更有价值:

1. 对于跑了很多相同的应用实例的大型web server
2. 对于跑了不同应用的大型server(例如:同时跑了web server相关应用,又跑了数据库应用)
3. 大型NUMA系统

cpuset必须允许动态调整,并且不影其他不相关cpuset中运行的进程。比如:可以将某个cpu动态的加入到某个cpuset,可以从某个cpuset中将某个cpu移除,可以将进程加入到cpuset,也可以将某个进程从一个cpuset迁移到另一个cpuset。内核的cpuset补丁,提供了最基本的实现上述功能的机制,在实现上最大限度使用原有的cpu和内存节点分配机制,尽可能避免影响现有的调度器,以及内存分配核心功能的代码。

1.3 cpusets是如何实现的

cpusets整体为层级树结构。由一个根节点(root)包含了系统所有的cpu和内存节点资源。由根节点,可以分支出一个或者多个子节点,同时子节点也可以在分支出孙子节点,以此类推。每个子节点所包含的资源都是父节点资源的子集。

为了理解层级树的概念:其实cpuset做为cgroup一个子系统实现,也遵循了cgroup层级树的概念。举个例子:有一个cpuset /,可以在cpuset /下再建立若干个cpuset。比如:建立background,foreground两个子cpuset。background和foreground互为兄弟,cpuset /属于父。子cpuset中的cpu和node节点集合必须是父亲cpuset中cpu和内存节点集合的子集。

对cpuset的操作都是通过cpuset文件系统来完成的,内核没有提供额外的系统调用对cpuset做修改、查看操作。在android下,cpusets的文件系统路径:

/dev/cpuset

在/proc/#pid/status中如下几行也可以说明一个进程运行在哪些cpu上,并且进程分配内存必须在哪些内存节点上:

Cpus_allowed:   ffffffff,ffffffff,ffffffff,ffffffff
Cpus_allowed_list:      0-127
Mems_allowed:   ffffffff,ffffffff
Mems_allowed_list:      0-63

例如(android平台只有8个cpu core,1个内存节点):

Cpus_allowed:    ff
Cpus_allowed_list:    0-7
Mems_allowed:    1
Mems_allowed_list:    0

每一个cpuset对应cgroup文件系统中的子系统,,目录下有些文件,用来描述cpuset的属性。对应的文件里列表如下:

1.cpuset.cpus   //cpuset中的cpu列表
2.cpuset.mems   //cpuset中的内存节点列表
3.cpuset.memory_migrate  //cpuset内存迁移,见1.9
4.cpuset.cpu_exclusive   //cpuset是否是cpu互斥的,见1.4
5.cpuset.mem_exclusive   //cpuset是否是内存互斥的,见1.4
6.cpuset.mem_hardwall    //cpuset是否是hardwalled的,见1.4
7.cpuset.memory_pressure    //内存使用的紧张程度,见1.5
8.cpuset.memory_spread_page   //如果被设置了,将该cpuset中进程上下文申请的page cache平均分布到cpuset中的各个节点,见1.6
9.cpuset.memory_spread_slab   //如果被设置了,将该cpuset中进程上下文申请到的slab对象平均分布到cpuset中的各个内存节点,见1.6
10.cpuset.sched_load_balance  //如果被设置了,负载均衡会在cpuset配置的cpus中进行,见1.7
11.cpuset.sched_relax_domain_level  //当要task迁移时,搜索的范围,见1.8

下面文件只有在根cpuset中才有:
12.cpuset.memory_pressure_enabled  //使能memory pressure测量的flag

其实在cpusets之前,已经有一套机制来限制某个进程只能被调度到某些cpu上运行(sched_setaffinity),限制某些进程的内存申请只能在某些内存节点上分配(mbind,set_mempolicy)。

而cpusets进行了扩展:

  1. cpusets是cpu和memory节点的集合,并且对kernel可见的。
  2. 每个task struct中有一个指针指向了cgroup数据结构(cpuset是cgroup的一个子系统),通过这个指针,将进程添加到具体的cpuset中。
  3. 调用sched_setaffinity、mbind/set_mempolicy对应的cpu必须在对应task的cpuset中
  4. 根节点的cpuset包含了所有cpu和memory节点
  5. 对任意cpuset,都可以再定义其子cpuset,子cpuset中包含的cpu和内存节点是父cpuset的子集
  6. cpusets的层级结构可以mount到/dev/cpuset,user space可以通过其进行查看和操作
  7. 如果一个cpuset被标记为专有,则该cpuset的兄弟cpuset中包含的cpu和内存节点不能和它的cpu和memory节点有交集
  8. 可以查看到任一cpuset上所有task的pid

cpusets的实现,需要在现有kernel中添加一些hook函数,而这些hook函数不会添加到kernel关键热点路径上(不影响性能):

  1.3.1 在init/main.c中,当系统boot时,初始化根节点的cpuset:

  start_kernel()--> cpuset_init()

  start_kernel()--> rest_init()--> kernel_thread(kernel_init, NULL, CLONE_FS) --> kernel_init() --> kernel_init_freeable() --> do_basic_setup() --> cpuset_init_smp()

struct cpuset {
    struct cgroup_subsys_state css;

    unsigned long flags;        /* "unsigned long" so bitops work */

    /*
     * On default hierarchy:
     *
     * The user-configured masks can only be changed by writing to
     * cpuset.cpus and cpuset.mems, and won't be limited by the
     * parent masks.
     *
     * The effective masks is the real masks that apply to the tasks
     * in the cpuset. They may be changed if the configured masks are
     * changed or hotplug happens.
     *
     * effective_mask == configured_mask & parent's effective_mask,
     * and if it ends up empty, it will inherit the parent's mask.
     *
     *
     * On legacy hierachy:
     *
     * The user-configured masks are always the same with effective masks.
     */

    /* user-configured CPUs and Memory Nodes allow to tasks */
    cpumask_var_t cpus_allowed;
    cpumask_var_t cpus_requested;
    nodemask_t mems_allowed;

    /* effective CPUs and Memory Nodes allow to tasks */
    cpumask_var_t effective_cpus;
    nodemask_t effective_mems;

    /*
     * This is old Memory Nodes tasks took on.
     *
     * - top_cpuset.old_mems_allowed is initialized to mems_allowed.
     * - A new cpuset's old_mems_allowed is initialized when some
     *   task is moved into it.
     * - old_mems_allowed is used in cpuset_migrate_mm() when we change
     *   cpuset.mems_allowed and have tasks' nodemask updated, and
     *   then old_mems_allowed is updated to mems_allowed.
     */
    nodemask_t old_mems_allowed;

    struct fmeter fmeter;        /* memory_pressure filter */

    /*
     * Tasks are being attached to this cpuset.  Used to prevent
     * zeroing cpus/mems_allowed between ->can_attach() and ->attach().
     */
    int attach_in_progress;

    /* partition number for rebuild_sched_domains() */
    int pn;

    /* for custom sched domain */
    int relax_domain_level;
};
int __init cpuset_init(void)
{
    int err = 0;

    BUG_ON(!alloc_cpumask_var(&top_cpuset.cpus_allowed, GFP_KERNEL));
    BUG_ON(!alloc_cpumask_var(&top_cpuset.effective_cpus, GFP_KERNEL));
    BUG_ON(!alloc_cpumask_var(&top_cpuset.cpus_requested, GFP_KERNEL));

    cpumask_setall(top_cpuset.cpus_allowed);
    cpumask_setall(top_cpuset.cpus_requested);
    nodes_setall(top_cpuset.mems_allowed);
    cpumask_setall(top_cpuset.effective_cpus);
    nodes_setall(top_cpuset.effective_mems);

    fmeter_init(&top_cpuset.fmeter);
    set_bit(CS_SCHED_LOAD_BALANCE, &top_cpuset.flags);
    top_cpuset.relax_domain_level = -1;            //以上这些都是在初始化根节点cpuset的参数

    err = register_filesystem(&cpuset_fs_type);       //注册文件系统
    if (err < 0)
        return err;

    BUG_ON(!alloc_cpumask_var(&cpus_attach, GFP_KERNEL));

    return 0;
}
/**
 * cpuset_init_smp - initialize cpus_allowed
 *
 * Description: Finish top cpuset after cpu, node maps are initialized
 */
void __init cpuset_init_smp(void)
{
    cpumask_copy(top_cpuset.cpus_allowed, cpu_active_mask);
    top_cpuset.mems_allowed = node_states[N_MEMORY];                //node_states[N_MEMORY]是存放了所有online的内存节点,当内存hotplug时,会发生变化
    top_cpuset.old_mems_allowed = top_cpuset.mems_allowed;

    cpumask_copy(top_cpuset.effective_cpus, cpu_active_mask);
    top_cpuset.effective_mems = node_states[N_MEMORY];               //在cpu、memory表初始化后,完成cpuset剩余参数的初始化

    register_hotmemory_notifier(&cpuset_track_online_nodes_nb);          //注册了一个notify,当cpuset中cpu或者memory发现改变了(hotplug),就会工作

    cpuset_migrate_mm_wq = alloc_ordered_workqueue("cpuset_migrate_mm", 0);  //创建了一个工作队列
    BUG_ON(!cpuset_migrate_mm_wq);
}

  1.3.2 在进程fork/exit时,会从对应的cpuset中执行attach/detach:

  fork调用路径1:_do_fork() -> copy_process() -> cgroup_fork()

/**
 * cgroup_fork - initialize cgroup related fields during copy_process()
 * @child: pointer to task_struct of forking parent process.
 *
 * A task is associated with the init_css_set until cgroup_post_fork()
 * attaches it to the parent's css_set.  Empty cg_list indicates that
* @child isn't holding reference to its css_set.
 */
void cgroup_fork(struct task_struct *child)
{
    RCU_INIT_POINTER(child->cgroups, &init_css_set);    //初始化子进程的css_set(cgroup subsystem set)
    INIT_LIST_HEAD(&child->cg_list);
}

  fork调用路径2:_do_fork() -> copy_process() -> cgroup_post_fork() -> css_set_move_task() -> cgroup_move_task() -> rcu_assign_pointer(task->cgroups, to);

        其中,cgroup_move_task()中,会将子进程移动到父进程的cgroup中,操作由css_set指针赋值完成:

        to指针为父进程的css_set指针,而task->cgroups则是child的css_set指针,最后通过指针赋值,子进程将attach到父进程的cgroup set中。

  fork调用路径3:_do_fork() -> copy_process() -> cgroup_post_fork() -> 调用每个cgroup子系统的fork(): ss->fork(child) -> cpuset_fork()

/*
 * Make sure the new task conform to the current state of its parent,
 * which could have been changed by cpuset just after it inherits the
 * state from the parent and before it sits on the cgroup's task list.
 */
static void cpuset_fork(struct task_struct *task)  
{
    if (task_css_is_root(task, cpuset_cgrp_id))
        return;

    set_cpus_allowed_ptr(task, &current->cpus_allowed);
    task->mems_allowed = current->mems_allowed;    //继承父进程current的mem_allowed
}

         接着:cpuset_fork() ->set_cpus_allowed_ptr() -> __set_cpus_allowed_ptr()如下 -> do_set_cpus_allowed() -> p->sched_class->set_cpus_allowed(p, new_mask);

/*
 * Change a given task's CPU affinity. Migrate the thread to a
 * proper CPU and schedule it away if the CPU it's executing on
 * is removed from the allowed bitmask.
 *
 * NOTE: the caller must have a valid reference to the task, the
 * task must not exit() & deallocate itself prematurely. The
 * call is not atomic; no spinlocks may be held.
 */
static int __set_cpus_allowed_ptr(struct task_struct *p,
                  const struct cpumask *new_mask, bool check)
{
    const struct cpumask *cpu_valid_mask = cpu_active_mask;
    unsigned int dest_cpu;
    struct rq_flags rf;
    struct rq *rq;
    int ret = 0;
    cpumask_t allowed_mask;

    rq = task_rq_lock(p, &rf);
    update_rq_clock(rq);

    if (p->flags & PF_KTHREAD) {        //所有kernel进程默认可以运行在所有online的cpu上
        /*
         * Kernel threads are allowed on online && !active CPUs
         */
        cpu_valid_mask = cpu_online_mask;
    }

    /*
     * Must re-check here, to close a race against __kthread_bind(),
     * sched_setaffinity() is not guaranteed to observe the flag.
     */
    if (check && (p->flags & PF_NO_SETAFFINITY)) {  //thread不允许改变cpu亲和度
        ret = -EINVAL;
        goto out;
    }

    if (cpumask_equal(&p->cpus_allowed, new_mask))  //当前可运行的cpu和要设置的cpu相等,那就不需要重复设置
        goto out;

    cpumask_andnot(&allowed_mask, new_mask, cpu_isolated_mask);  //将父进程的cpus_allowed中去掉isolate的cpu
    cpumask_and(&allowed_mask, &allowed_mask, cpu_valid_mask);   //再从中筛选可以进行task迁移的cpu

    dest_cpu = cpumask_any(&allowed_mask);            //最后再筛选出的结果中挑选一个dest_cpu
    if (dest_cpu >= nr_cpu_ids) {                  //如果dest_cpu的index超过最大cpu的index,则需要重新挑选
        cpumask_and(&allowed_mask, cpu_valid_mask, new_mask);
        dest_cpu = cpumask_any(&allowed_mask);
        if (!cpumask_intersects(new_mask, cpu_valid_mask)) {
            ret = -EINVAL;
            goto out;
        }
    }

    do_set_cpus_allowed(p, new_mask);      //详细见下

    if (p->flags & PF_KTHREAD) {
        /*
         * For kernel threads that do indeed end up on online &&
         * !active we want to ensure they are strict per-CPU threads.
         */
        WARN_ON(cpumask_intersects(new_mask, cpu_online_mask) &&
            !cpumask_intersects(new_mask, cpu_active_mask) &&
            p->nr_cpus_allowed != 1);
    }

    /* Can the task run on the task's current CPU? If so, we're done */
    if (cpumask_test_cpu(task_cpu(p), &allowed_mask))      //如果task能运行在task本来运行的cpu上,则直接退出
        goto out;

    if (task_running(rq, p) || p->state == TASK_WAKING) {  //判断task是否处于running或者waking状态(不同task状态,不同的迁移方式)
        struct migration_arg arg = { p, dest_cpu };
        /* Need help from migration thread: drop lock and wait. */
        task_rq_unlock(rq, p, &rf);
        stop_one_cpu(cpu_of(rq), migration_cpu_stop, &arg);    //在当前cpu上执行migration_cpu_stop函数,arg为函数的参数,执行完后stop当前cpu
        tlb_migrate_finish(p->mm);
        return 0;
    } else if (task_on_rq_queued(p)) {            //task是否在rq中
        /*
         * OK, since we're going to drop the lock immediately
         * afterwards anyway.
         */
        rq = move_queued_task(rq, &rf, p, dest_cpu);  //把task迁移到dest_cpu上
    }
out:
    task_rq_unlock(rq, p, &rf);

    return ret;
}

         调用至对应sched class的set_cpu_allowed(除deadline class,其余class都调用的是set_cpus_allowed_common()),如下:

void set_cpus_allowed_common(struct task_struct *p, const struct cpumask *new_mask)
{
    cpumask_copy(&p->cpus_allowed, new_mask);    //继承父进程的cpus_allowed
    p->nr_cpus_allowed = cpumask_weight(new_mask);  //计算子进程的cpus_allowed数量
}

        

/*
 * migration_cpu_stop - this will be executed by a highprio stopper thread
 * and performs thread migration by bumping thread off CPU then
 * 'pushing' onto another runqueue.
 */
static int migration_cpu_stop(void *data)
{
    struct migration_arg *arg = data;
    struct task_struct *p = arg->task;
    struct rq *rq = this_rq();
    struct rq_flags rf;

    /*
     * The original target CPU might have gone down and we might
     * be on another CPU but it doesn't matter.
     */
    local_irq_disable();
    /*
     * We need to explicitly wake pending tasks before running
     * __migrate_task() such that we will not miss enforcing cpus_allowed
     * during wakeups, see set_cpus_allowed_ptr()'s TASK_WAKING test.
     */
    sched_ttwu_pending();

    raw_spin_lock(&p->pi_lock);
    rq_lock(rq, &rf);
    /*
     * If task_rq(p) != rq, it cannot be migrated here, because we're
     * holding rq->lock, if p->on_rq == 0 it cannot get enqueued because
     * we're holding p->pi_lock.
     */
    if (task_rq(p) == rq) {
        if (task_on_rq_queued(p))
            rq = __migrate_task(rq, &rf, p, arg->dest_cpu);  //迁移到dest_cpu
        else
            p->wake_cpu = arg->dest_cpu;
    }
    rq_unlock(rq, &rf);
    raw_spin_unlock(&p->pi_lock);

    local_irq_enable();
    return 0;
}

         然后,在_do_fork() -> wake_up_new_task() -> select_task_rq()中,选择满足cpus_allowed的cpu来执行子进程:

/*
 * The caller (fork, wakeup) owns p->pi_lock, ->cpus_allowed is stable.
 */
static inline
int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags,
           int sibling_count_hint)
{
    bool allow_isolated = (p->flags & PF_KTHREAD);

    lockdep_assert_held(&p->pi_lock);

    if (p->nr_cpus_allowed > 1)
        cpu = p->sched_class->select_task_rq(p, cpu, sd_flags, wake_flags,  //通过调用对应sched class的select_task_rq来选择执行子进程的cpu
                             sibling_count_hint);
    else
        cpu = cpumask_any(&p->cpus_allowed);

    /*
     * In order not to call set_task_cpu() on a blocking task we need
     * to rely on ttwu() to place the task on a valid ->cpus_allowed
     * CPU.
     *
     * Since this is common to all placement strategies, this lives here.
     *
     * [ this allows ->select_task() to simply return task_cpu(p) and
     *   not worry about this generic constraint ]
     */
    if (unlikely(!is_cpu_allowed(p, cpu)) ||
            (cpu_isolated(cpu) && !allow_isolated))
        cpu = select_fallback_rq(task_cpu(p), p, allow_isolated);    //筛选执行cpu满足cpu_allowed

    return cpu;
}

  1.3.3 在设置cpu亲和度(sched_setaffinity)时,会过滤cpuset中配置:

  当执行sched_setaffinity设置cpu亲和度的调用路径:sched_setaffinity() -> get_user_cpu_mask()

                          sched_setaffinity() -> sched_setaffinity()

  首先在get_user_cpu_mask()中,从user space获取要设置的cpu mask;

  之后在sched_setaffinity()在再进行详细的过滤,从而设置准确的task cpu mask,并挑选出合适的cpu执行task

long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
{
    cpumask_var_t cpus_allowed, new_mask;
    struct task_struct *p;
    int retval;
    int dest_cpu;
    cpumask_t allowed_mask;

    rcu_read_lock();

    p = find_process_by_pid(pid);
    if (!p) {
        rcu_read_unlock();
        return -ESRCH;
    }

    /* Prevent p going away */
    get_task_struct(p);
    rcu_read_unlock();

    if (p->flags & PF_NO_SETAFFINITY) {
        retval = -EINVAL;
        goto out_put_task;
    }
    if (!alloc_cpumask_var(&cpus_allowed, GFP_KERNEL)) {
        retval = -ENOMEM;
        goto out_put_task;
    }
    if (!alloc_cpumask_var(&new_mask, GFP_KERNEL)) {
        retval = -ENOMEM;
        goto out_free_cpus_allowed;
    }
    retval = -EPERM;
    if (!check_same_owner(p)) {
        rcu_read_lock();
        if (!ns_capable(__task_cred(p)->user_ns, CAP_SYS_NICE)) {
            rcu_read_unlock();
            goto out_free_new_mask;
        }
        rcu_read_unlock();
    }

    retval = security_task_setscheduler(p);
    if (retval)
        goto out_free_new_mask;


    cpuset_cpus_allowed(p, cpus_allowed);        //获取task cpuset中allowed_cpu
    cpumask_and(new_mask, in_mask, cpus_allowed);   //将通过user space中获取的cpu mask与task的cpuset进行过滤,找出能同时满足条件的cpu(new_mask)

    /*
     * Since bandwidth control happens on root_domain basis,
     * if admission test is enabled, we only admit -deadline
     * tasks allowed to run on all the CPUs in the task's
     * root_domain.
     */
#ifdef CONFIG_SMP
    if (task_has_dl_policy(p) && dl_bandwidth_enabled()) {
        rcu_read_lock();
        if (!cpumask_subset(task_rq(p)->rd->span, new_mask)) {
            retval = -EBUSY;
            rcu_read_unlock();
            goto out_free_new_mask;
        }
        rcu_read_unlock();
    }
#endif
again:
    cpumask_andnot(&allowed_mask, new_mask, cpu_isolated_mask);  //从上面得到的new_mask中,过滤掉isolate的cpu,结果保存在allowed_mask
    dest_cpu = cpumask_any_and(cpu_active_mask, &allowed_mask);  //从cpu_active_mask(处于active状态的cpu:可以进行task迁移)和allowed_mask中找到满足条件的dest_cpu
    if (dest_cpu < nr_cpu_ids) {                    //如果dest_cpu的index小于当前cpu核最大的index
        retval = __set_cpus_allowed_ptr(p, new_mask, true);    //与fork的路径类似,为设置new_mask为task新的cpus_allowed,并将task执行在dest_cpu上
        if (!retval) {
            cpuset_cpus_allowed(p, cpus_allowed);          //再重新获取task的cpus_allowed
            if (!cpumask_subset(new_mask, cpus_allowed)) {     //判断new_mask是否是cpus_allowed的子集,如果不是,则说明cpuset又被修改了。则需要更新task的cpus_allowed
                /*
                 * We must have raced with a concurrent cpuset
                 * update. Just reset the cpus_allowed to the
                 * cpuset's cpus_allowed
                 */
                cpumask_copy(new_mask, cpus_allowed);        //赋值new_mask = cpus_allowed
                goto again;                        //重复上面步骤
            }
        }
    } else {
        retval = -EINVAL;
    }

    if (!retval && !(p->flags & PF_KTHREAD))              //PF_KTHREAD代表是否是kernel space创建的进程
        cpumask_and(&p->cpus_requested, in_mask, cpu_possible_mask);  //将user space设置的cpu mask与cpu_possible_mask(平台支持hotplug,所以它包含所有cpu核)
                                           //取同时满足上面2个条件的cpu,存放到task->cpus_requested中
out_free_new_mask:
    free_cpumask_var(new_mask);
out_free_cpus_allowed:
    free_cpumask_var(cpus_allowed);
out_put_task:
    put_task_struct(p);
    return retval;
}

  1.3.4 在task迁移时,会遵循cpuset的配置:

   task迁移本质上就是将此cpu rq中的task,放到另一个cpu的rq上。在上面讲到的fork中,子进程就会在创建之后,进行task迁移,当然同时会遵循cpuset的配置;除了fork,还有更一般的调用_migrate_task的进行task迁移的情况,同样也会对cpuset配置进行过滤。

/*
 * Move (not current) task off this CPU, onto the destination CPU. We're doing
 * this because either it can't run here any more (set_cpus_allowed()
 * away from this CPU, or CPU going down), or because we're
 * attempting to rebalance this task on exec (sched_exec).
 *
 * So we race with normal scheduler movements, but that's OK, as long
 * as the task is no longer on this CPU.
 */
static struct rq *__migrate_task(struct rq *rq, struct rq_flags *rf,
                 struct task_struct *p, int dest_cpu)
{
    /* Affinity changed (again). */
    if (!is_cpu_allowed(p, dest_cpu))
        return rq;

    update_rq_clock(rq);
    rq = move_queued_task(rq, rf, p, dest_cpu);  //迁移task到dest_cpu

    return rq;
}

  可以看到有调用is_cpu_allowed()来检测dest_cpu是否在task的cpus_allowed中,并根据是否是内核线程,进行进一步筛选:

/*
 * Per-CPU kthreads are allowed to run on !actie && online CPUs, see
 * __set_cpus_allowed_ptr() and select_fallback_rq().
 */
static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
{
    if (!cpumask_test_cpu(cpu, &p->cpus_allowed))  //检测dest_cpu是否allow。如果不allow,则return false,且不满足迁移条件
        return false;

    if (is_per_cpu_kthread(p))     //检测task是否是kernel thread,内核进程默认是可以运行在所有online的cpu上
        return cpu_online(cpu);    //cpu是否online(available to scheduler,可以调度运行)

    return cpu_active(cpu);      //cpu是否active(available to migration,可以迁移)
}

  1.3.5 在系统调用:mbind、set_mempolicy时,会过滤cpuset中的内存节点:

  mbind调用路径:kernel_mbind() -> do_mbind() -> mpol_set_nodemask() 

  set_mempolicy调用路径:kernel_set_mempolicy() -> do_set_mempolicy() -> mpol_set_nodemask()

  函数是在创建新mem policy之后,set up内存node mask

/*
 * mpol_set_nodemask is called after mpol_new() to set up the nodemask, if
 * any, for the new policy.  mpol_new() has already validated the nodes
 * parameter with respect to the policy mode and flags.  But, we need to
 * handle an empty nodemask with MPOL_PREFERRED here.
 *
 * Must be called holding task's alloc_lock to protect task's mems_allowed
 * and mempolicy.  May also be called holding the mmap_semaphore for write.
 */
static int mpol_set_nodemask(struct mempolicy *pol,
             const nodemask_t *nodes, struct nodemask_scratch *nsc)
{
    int ret;

    /* if mode is MPOL_DEFAULT, pol is NULL. This is right. */
    if (pol == NULL)
        return 0;
    /* Check N_MEMORY */
    nodes_and(nsc->mask1,
          cpuset_current_mems_allowed, node_states[N_MEMORY]);  //获取当前task所允许的内存节点,并与online的节点过滤,找出同时满足条件的节点,存到nsc->mask1中

    VM_BUG_ON(!nodes);
    if (pol->mode == MPOL_PREFERRED && nodes_empty(*nodes))
        nodes = NULL;    /* explicit local allocation */
    else {
        if (pol->flags & MPOL_F_RELATIVE_NODES)                //根据flag不同,将上面得到的nsc->mask1,再次通过不同计算,得到不同的node mask存到nsc->mask2
            mpol_relative_nodemask(&nsc->mask2, nodes, &nsc->mask1);
        else
            nodes_and(nsc->mask2, *nodes, nsc->mask1);

        if (mpol_store_user_nodemask(pol))            //根据flag不同,保存不同的node mask结果
            pol->w.user_nodemask = *nodes;            //保存user space传入的内存节点
        else
            pol->w.cpuset_mems_allowed =
                        cpuset_current_mems_allowed;      //保存task允许的内存节点
    }

    if (nodes)
        ret = mpol_ops[pol->mode].create(pol, &nsc->mask2);  //根据不同的policy,调用不同的create方式,allocate内存节点
    else
        ret = mpol_ops[pol->mode].create(pol, NULL);
    return ret;
}

  1.3.6 在page_alloc.c中,会限制申请内存节点:

    kmalloc() -> kmalloc_large() -> kmalloc_order_trace() -> kmalloc_order() --> alloc_pages() -> alloc_pages_node() -> __alloc_pages_node() -> __alloc_pages()  ->__alloc_pages_nodemask() -> get_page_from_freelist() -> __cpuset_zone_allowed()-> __cpuset_node_allowed()

   这个函数用于判断是否可以申请一个内存节点。逻辑判断顺序如下:

  •   中断中申请内存:      ✔
  •   申请的节点在task mem_allowed内:✔
  •   OOM被杀的进程:       ✔
  •   user space申请(user space申请都会设置__GFP_HARDWALL):✖
  •   正在退出的进程:      ✔
  •   未设置__GFP_HARDWALL,但是节点在mems_allowed之外,的内核进程申请:需要往上层寻找带有内存互斥exclusive或者hardwall,并对当前节点有影响的上层节点,并判断当前的node是否在其允许范围内。如果在范围内,则允许申请。
/**
 * cpuset_node_allowed - Can we allocate on a memory node?
 * @node: is this an allowed node?
 * @gfp_mask: memory allocation flags
 *
 * If we're in interrupt, yes, we can always allocate.  If @node is set in
 * current's mems_allowed, yes.  If it's not a __GFP_HARDWALL request and this
 * node is set in the nearest hardwalled cpuset ancestor to current's cpuset,
 * yes.  If current has access to memory reserves as an oom victim, yes.
 * Otherwise, no.
 *
 * GFP_USER allocations are marked with the __GFP_HARDWALL bit,
 * and do not allow allocations outside the current tasks cpuset
 * unless the task has been OOM killed.
 * GFP_KERNEL allocations are not so marked, so can escape to the
 * nearest enclosing hardwalled ancestor cpuset.
 *
 * Scanning up parent cpusets requires callback_lock.  The
 * __alloc_pages() routine only calls here with __GFP_HARDWALL bit
 * _not_ set if it's a GFP_KERNEL allocation, and all nodes in the
 * current tasks mems_allowed came up empty on the first pass over
 * the zonelist.  So only GFP_KERNEL allocations, if all nodes in the
 * cpuset are short of memory, might require taking the callback_lock.
 *
 * The first call here from mm/page_alloc:get_page_from_freelist()
 * has __GFP_HARDWALL set in gfp_mask, enforcing hardwall cpusets,
 * so no allocation on a node outside the cpuset is allowed (unless
 * in interrupt, of course).
 *
 * The second pass through get_page_from_freelist() doesn't even call
 * here for GFP_ATOMIC calls.  For those calls, the __alloc_pages()
 * variable 'wait' is not set, and the bit ALLOC_CPUSET is not set
 * in alloc_flags.  That logic and the checks below have the combined
 * affect that:
 *    in_interrupt - any node ok (current task context irrelevant)
 *    GFP_ATOMIC   - any node ok
 *    tsk_is_oom_victim   - any node ok
 *    GFP_KERNEL   - any node in enclosing hardwalled cpuset ok
 *    GFP_USER     - only nodes in current tasks mems allowed ok.
 */
bool __cpuset_node_allowed(int node, gfp_t gfp_mask)
{
    struct cpuset *cs;        /* current cpuset ancestors */
    int allowed;            /* is allocation in zone z allowed? */
    unsigned long flags;

    if (in_interrupt())                              //在中断中,永远允许内存申请
        return true;
    if (node_isset(node, current->mems_allowed))               //节点在task的允许范围内
        return true;
    /*
     * Allow tasks that have access to memory reserves because they have
     * been OOM killed to get memory anywhere.
     */
    if (unlikely(tsk_is_oom_victim(current)))                  //oom被kill掉的进程,允许申请
        return true;
    if (gfp_mask & __GFP_HARDWALL)    /* If hardwall request, stop here */  //如果是user space的申请request,并且设置了__GFP_HARDWALL,那么不允许
        return false;

    if (current->flags & PF_EXITING) /* Let dying task have memory */    //正在退出的进程,允许申请
        return true;

    /* Not hardwall and node outside mems_allowed: scan up cpusets */
    spin_lock_irqsave(&callback_lock, flags);

    rcu_read_lock();
    cs = nearest_hardwall_ancestor(task_cs(current));            //往上层寻找带有内存互斥exclusive或者hardwall,并对当前节点有影响的上层节点
    allowed = node_isset(node, cs->mems_allowed);               //判断当前的node是否在上层节点的允许范围内
    rcu_read_unlock();

    spin_unlock_irqrestore(&callback_lock, flags);
    return allowed;
}

  1.3.7 在vmscan.c中,限制page恢复到当前cpuset:

   调用路径:shrink_zones() -> cpuset_zone_allowed() -> __cpuset_zone_allowed() ->  __cpuset_node_allowed()

   shrink_zones()函数是linux内存页回收的核心函数。而对于cpuset中内存节点的限制判断逻辑,与page_alloc.c中的完全一样。

/*
 * This is the direct reclaim path, for page-allocating processes.  We only
 * try to reclaim pages from zones which will satisfy the caller's allocation
 * request.
 *
 * If a zone is deemed to be full of pinned pages then just give it a light
 * scan then give up on it.
 */
static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
{
    struct zoneref *z;
    struct zone *zone;
    unsigned long nr_soft_reclaimed;
    unsigned long nr_soft_scanned;
    gfp_t orig_mask;
    pg_data_t *last_pgdat = NULL;

    /*
     * If the number of buffer_heads in the machine exceeds the maximum
     * allowed level, force direct reclaim to scan the highmem zone as
     * highmem pages could be pinning lowmem pages storing buffer_heads
     */
    orig_mask = sc->gfp_mask;
    if (buffer_heads_over_limit) {
        sc->gfp_mask |= __GFP_HIGHMEM;
        sc->reclaim_idx = gfp_zone(sc->gfp_mask);
    }

    for_each_zone_zonelist_nodemask(zone, z, zonelist,
                    sc->reclaim_idx, sc->nodemask) {
        /*
         * Take care memory controller reclaiming has small influence
         * to global LRU.
         */
        if (global_reclaim(sc)) {
            if (!cpuset_zone_allowed(zone,
                         GFP_KERNEL | __GFP_HARDWALL))      //设置了__GFP_HARDWALL,所以只能是在task允许的范围内,或者在中断,或者是oom受害者,才可以进行内存回收
                continue;

            /*
             * If we already have plenty of memory free for
             * compaction in this zone, don't free any more.
             * Even though compaction is invoked for any
             * non-zero order, only frequent costly order
             * reclamation is disruptive enough to become a
             * noticeable problem, like transparent huge
             * page allocations.
             */
            if (IS_ENABLED(CONFIG_COMPACTION) &&
                sc->order > PAGE_ALLOC_COSTLY_ORDER &&
                compaction_ready(zone, sc)) {
                sc->compaction_ready = true;
                continue;
            }

            /*
             * Shrink each node in the zonelist once. If the
             * zonelist is ordered by zone (not the default) then a
             * node may be shrunk multiple times but in that case
             * the user prefers lower zones being preserved.
             */
            if (zone->zone_pgdat == last_pgdat)
                continue;

            /*
             * This steals pages from memory cgroups over softlimit
             * and returns the number of reclaimed pages and
             * scanned pages. This works for global memory pressure
             * and balancing, not for a memcg's limit.
             */
            nr_soft_scanned = 0;
            nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone->zone_pgdat,
                        sc->order, sc->gfp_mask,
                        &nr_soft_scanned);
            sc->nr_reclaimed += nr_soft_reclaimed;
            sc->nr_scanned += nr_soft_scanned;
            /* need some check for avoid more shrink_zone() */
        }

        /* See comment about same check for global reclaim above */
        if (zone->zone_pgdat == last_pgdat)
            continue;
        last_pgdat = zone->zone_pgdat;
        shrink_node(zone->zone_pgdat, sc);
    }

    /*
     * Restore to original mask to avoid the impact on the caller if we
     * promoted it to __GFP_HIGHMEM.
     */
    sc->gfp_mask = orig_mask;
}

init cpusets代码分析中,还要分析hotplug部分。。。。。。。。待续  

1.4  什么是互斥cpuset?

cpu/mem互斥

cpuset中有两个bool类型的参数cpuset.mem_exclusive,cpuset.cpu_exclusive,分别表示当前的cpuset是否是内存互斥和cpu互斥的,自此同一层级的cpuset不能共享互斥的资源。

我们在上面已经理解了层级树的概念,那么cpu互斥就是:如果某个cpuset是cpu互斥的,那么该cpuset中的cpu是不能被该cpuset的兄弟cpuset共享的,当然该cpuset中的cpu是能够和该cpuset的祖先cpuset和子孙cpuset共享的。

注意:设置exclusive时,需要其父节点也为exclusive。否则会设置失败。所以,最终还是说明cpuset的互斥,就是从根节点开始,就已经互斥了。(当前android平台下,默认并没有设置cpu/内存互斥)

内存互斥也是一样,只不过互斥的资源换成了内存节点。

内存mem_hardwall

cpuset还有一个bool类型的参数cpuset.mem_hardwall,如果cpuset被设置成内存互斥(mem_exclusive)或者被设置了cpuset.mem_hardwall,那么都认为这个cpuset是mem_hardwalled。

mem_hardwalled的cpuset中的进程在内核态申请的内存(主要有直接申请的页,以及通过slab系统分配的内存,还有文件系统page cache)会受到cpuset的限制,进而在该cpuset内的内存节点中分配内存。对比:不管cpuset是否是mem_hardwalled,cpuset中的进程在用户态申请的内存都会受到cpuset中内存节点的限制。所以得出:而mem_exclusive除了提供了内存互斥的功能,也包含了cpuset.mem_hardwall的功能。

通过上面,得到如下结论:

1、cpuset.mem_hardwall限制的是cpuset中进程在内核态申请的内存(必须符合cpuset的内存限制)。而用户态申请内存,不管有没有设置mem_hardwall,都会受到cpuset限制。

2、cpuset mem_hardwall=1,那么进程内核态的内存分配会cpuset中内存节点的限制;反之,内核态则不受cpuset中内存节点的限制。

3、mem_exclusive包含了mem_hardwall的功能。

1.5 memory_pressure

memory_pressure提供了一个监控cpuset内存压力的方法。它会记录为了满足内存足够剩余空间的要求,释放cpuset中内存节点的次数。它是一个int型数值,单位为:每秒该cpuset尝试回收内存的次数*1000,半衰期为10秒。

这个属性可以有助于监控和管理当前cpuset下,各个任务对内存的影响,从而进行进一步地内存节点调配,或任务分配。

使能这个功能需要提前对节点:/dev/cpuset/memory_pressure_enabled,写入1之后,memory_pressure才会开始计算。

数据的计算可以被任何一个attach到该cpuset上的进程执行。当进行调用页回收时,就会执行计算。因为在每组查询上避免对任务列表进行扫描,系统上减少了监视此度量标准的批处理调度程序所施加的系统负载。

 1.6 memory spread

有2个flag在cpuset中用于控制kernel从哪里申请内存,并将之用于文件系统得buffer或者相关的kernel内核数据结构。它们分别是cpuset.memory_spread_page 和 cpuset.memory_spread_slab。对其,写入1,则使能;写入0,则关闭使能。

如果cpuset.memory_spread_page被设置了,文件系统的buffer(page cache)将会允许从所有内存mem_allowed的节点中,而不是仅在task所在的节点中寻找free page。

如果cpuset.memory_spread_slab被设置了,slab cache(例如inode,dentry)将会允许从所有内存mem_allowed的节点中,而不是仅在task所在的节点中寻找free page。

默认情况下,flag是disable状态,内存申请会遵循NUMA mempolicy和cpuset的配置;而当打开了memory spread之后,会忽视NUMA mempolicy,并进行spread。一旦再次disable flag,则会继续遵循NUMA mempolicy。

1.7 sched_load_balance 

负载均衡(load balance)

kernel调度器会自动进行均衡task的负载:一个cpu没有被充分利用,那么在此CPU上的kernel code就会去寻找overload的CPU,并把task移到空闲的CPU上。分配任务(Task Placement)遵循cpuset配置和cpu亲和度相关系统调用(sched_setaffinity)。

调度域(sched domain)

但是负载均衡算法的开销,以及对一些kernel共享的核心数据结构(比如task list)的影响,会随着执行负载均衡cpu核数的增加而成倍增加。所以调度器会将cpu分为几个调度域,而负载均衡仅会作用在每一个调度域之内。简而言之,在2个更小的调度域内进行负载均衡的开销,比在一个大的调度域内进行负载均衡的开销更小。但是在调度域之间就不会进行负载均衡了。不同调度域之间没有重叠部分,有的cpu可能不在任何调度域内,因为它不需要进行负载均衡。

cpuset.sched_load_balance

cpuset.sched_load_balance被设置之后,需要当前cpuset.cpus中的cpu是在同一个的调度域中,从而在内cpuset.cpus内进行负载均衡。

cpuset.sched_load_balance被disable之后,就是停止cpuset.cpus内的负载均衡。但是如果cpuset.cpus与其他cpuset中配置有重叠,并且设置了cpuset.sched_load_balance,那么仍然会在其cpus中进行负载均衡,例如,在根节点的cpuset中设置cpuset.sched_load_balance,那么就会在在所有cpu中进行负载均衡。

因此,在设置cpuset.sched_load_balance的时候,应该保持top层的节点不设置cpuset.sched_load_balance,而只在child层或者一些更小的cpuset中,进行设置。

同时,由于cpuset是呈层级网状结构的;而cpuset.sched_load_balance是一维flat结构,它们不会出现重叠,每个cpu最多只会出现在一个调度域中。

那么假如在2个cpuset中有出现cpus部分重叠,并且都设置了cpuset.sched_load_balance,那么我们应该重新构造一个单独的、包含这2个cpuset的扩展集合。因为我们遵循不将task移出当前cpuset中的规则,这样可以防止调度器产生多余的开销。

假如在2个cpuset中有出现cpus部分重叠,并且其中之一设置了cpuset.sched_load_balance,那么就会出现负载均衡仅在重叠的cpu中进行。针对这样的情况,我们就不能把一些可能需要较多CPU资源的task放在这样的cpuset中。

我们当前SDM845 Android平台,是在top cpuset下,已经打开fully负载均衡的。也就是一个调度域涵盖了整个系统,其他cpuset的设置已经不生效了。

1.8 sched_relax_domain_level

在调度域中,主要有2种触发迁移task的方式:1. tick周期性负载均衡  2. 出现某些调度events。

当一个task被唤醒,就会被迁移到idle cpu上。例如:当task A在CPU X上运行,并唤醒了同在CPU X上的另一个task B。那么调度器就会把task B迁移到CPU Y(CPU Y为CPU X的兄弟)。这样task B可以直接在CPU Y上运行,避免了等待task A执行的情况。当CPU的task跑完了,在进入idle之前,会尝试从其他繁忙的CPU上迁移一些task,帮繁忙的CPU分担一些task。

当然,寻找可迁移的task和需要迁移的CPU会产生一些开销,但调度器并不是每次都会在域中搜索所有的CPU。实际上,在一些架构下,因调度events触发时,搜索的范围会被限制在cpu所在的socket或者node节点;而在周期性负载均衡时,搜索范围则是所有cpu。例如:假设CPU Z与CPU X相对较远(非兄弟),那么尽管CPU Z处于idle,而CPU X以及其兄弟CPU都很繁忙,调度器还是不会将刚唤醒的task B迁移到CPU Z上,因为CPU Z不在搜索范围内。最后,要么task B等待task A执行结束后被执行,要么等到下次周期性负载均衡时进行迁移。

cpuset.sched_relax_domain_level

sched_relax_domain_level属性允许了用户对搜索范围的修改。它用int型的值来体现搜索范围的设置,如下:

-1 : no request. use system default or follow request of others.
0 : no search.
1 : search siblings (hyperthreads in a core).
2 : search cores in a package.
3 : search cpus in a node [= system wide on non-NUMA system]
4 : search nodes in a chunk of node [on NUMA system]
5 : search system wide [on NUMA system]

这个属性影响的是cpuset所在的调度域,所以sched_load_balance这个flag不能disable,因为disable了的话,就没有调度域了。

如果有多个cpuset部分重叠,因此他们会形成一个单独的调度域,那么sched_relax_domain_level就会使用其中的最大值。注:如果有1个设置的值为0,而其他都为-1,那么生效的值为0。

修改这个属性,会同时产生正面和负面影响,当我们不确定的时候,不要改动此值。 

1.9 cpuset.memory_migrate

一般情况下,allocate一个page之后,只要它一直保持着allocate,那么这个page就会一直保持着,即便cpuset.mems后来发生了改变。
而当设置了这个flag之后,当task迁移到新的cpuset下时,就会将在原先cpuset所allocate的内存迁移到新的cpuset下,并且是与原先相对应的page位置。
比如,原先内存申请在原cpuset的第2个内存节点;那么在新cpuset也会使用第2个内存节点。迁移操作时,就会尽可能地将新cpuset中的第2个节点释放出来。

2. 使用example和语法

Andoird下cpuset的路径:

htc_imedugl:/ # ls /dev/cpuset
audio-app             effective_cpus          mems                     
background            effective_mems          notify_on_release        
camera-daemon         foreground              release_agent            
camera-video          mem_exclusive           restricted               
cgroup.clone_children mem_hardwall            sched_load_balance       
cgroup.procs          memory_migrate          sched_relax_domain_level 
cgroup.sane_behavior  memory_pressure         system-background        
cpu_exclusive         memory_pressure_enabled tasks                    
cpus                  memory_spread_page      top-app                  
memory_spread_slab

查看cpus和mems资源:

htc_imedugl:/dev/cpuset # cat cpus                                             
0-7
htc_imedugl:/dev/cpuset # cat mems
0

当前Android下,内存节点mems资源仅有一个。所以可以理解为cpuset仅控制了cpu core资源的分配。

2.1 创建新cpuset

在根节点下,创建一个新的cpuset,命名为 cpuset-test:

mkdir /dev/cpuset/cpuset-test

 创建好后,目录如下:

htc_imedugl:/ # ls /dev/cpuset/cpuset-test                                     
cgroup.clone_children mem_exclusive      mems                     
cgroup.procs          mem_hardwall       notify_on_release        
cpu_exclusive         memory_migrate     sched_load_balance       
cpus                  memory_pressure    sched_relax_domain_level 
effective_cpus        memory_spread_page tasks                    
effective_mems        memory_spread_slab 

但是其中cpus,mems都为空。这个需要手动设置填充:

未设置前,为空(此时进程是不能attach进来的):
htc_imedugl:/ # cat dev/cpuset/cpuset-test/mems

htc_imedugl:/ # cat dev/cpuset/cpuset-test/cpus

2.2 修改cpuset

预设cpus为2-3,那么attach到此cpuset下的进程,就会限制在CPU 2-3上运行;而mems则是使用根节点的值 0,不受限制。

填充:
htc_imedugl:/dev/cpuset/cpuset-test # echo 2-3 > cpus
htc_imedugl:/dev/cpuset/cpuset-test # echo 0 > mems

htc_imedugl:/dev/cpuset/cpuset-test # cat cpus                                         
2-3
htc_imedugl:/dev/cpuset/cpuset-test # cat mems                                         
0

2.3 进程attach

假设我们创建一个sh进程,获得它的pid:

htc_imedugl:/dev/cpuset/ljj # while true; do a=a+1; done &
[1] 25035

此时默认该进程是挂在根节点的cpuset下:

htc_imedugl:/dev/cpuset/cpuset-test # cat /proc/25035/cgroup
4:cpuset:/
3:cpu:/
2:schedtune:/
1:cpuacct:/uid_0/pid_5099
0::/

再把该进程attach到cpuset-test下面:

把进程的pid echo到cpuset-test/tasks,就可以完成attach:
htc_imedugl:/dev/cpuset/cpuset-test # echo 25035 > tasks htc_imedugl:/dev/cpuset/cpuset-test # cat tasks 25035 并且查看进程信息: htc_imedugl:/dev/cpuset/cpuset-test # cat /proc/25035/cgroup 4:cpuset:/cpuset-test 3:cpu:/ 2:schedtune:/ 1:cpuacct:/uid_0/pid_5099 0::/

 抓取systrace,可以看到进程确实运行在CPU2,CPU3上:

假如再修改cpus限制为core 7:

htc_imedugl:/dev/cpuset/cpuset-test # echo 7 > cpus
htc_imedugl:/dev/cpuset/cpuset-test # cat cpus                                 
7

再抓取systrace,可以看到25035进程已经被限制运行在CPU 7上:

2.4 删除cpuset

当cpuset中仍然有进程attach,是无法删除的:

rmdir: cpuset-test/: Device or resource busy

所以,我们先要把tasks移出,然后执行:

htc_imedugl:/dev/cpuset # rmdir cpuset-test/

 就可以把cpuset删除了

2.5 其他flag设置

设置flag的方法都是写入int值的方式。

例如,cpu互斥:

echo 1 > cpu_exclusive
原文地址:https://www.cnblogs.com/lingjiajun/p/12378474.html