调度器26—Linux内核中的各种时间频率 Hello

一、各种时间的打印

1. per-cpu的各种类型的使用时间

# ls -l /proc/stat
-r--r--r-- 1 root root 0 2021-01-01 19:46 /proc/stat
# cat /proc/stat
cpu  203632 46353 386930 31815547 3869 274339 68486 0 0 0
cpu0 26704 7709 39012 3916272 49 87626 23620 0 0 0
cpu1 14682 9898 25125 4055433 68 8755 3338 0 0 0
cpu2 5588 8202 7818 4098854 47 2215 901 0 0 0
cpu3 21765 10971 40654 4014299 341 19606 3900 0 0 0
cpu4 28157 1362 52559 3983416 725 25697 6661 0 0 0
cpu5 58390 2212 140189 3718682 1273 96146 17063 0 0 0
cpu6 42753 1587 70162 3930832 1008 32193 11836 0 0 0
cpu7 5588 4407 11408 4097755 355 2097 1164 0 0 0
intr 71408793 0 32194638 9259224 0 0 56084 91247 0 0 0 0 0 0 0 0 0 0 0 0 0 23940117 0 0 0 0 1022833 0 0 0 0 0 0 0 0 739 1176966 83 213 253 2243389 758 207033 6503 1916 0 0 9173 0 12210 0 0 0 0 0 140 0 0 10 2058 554 0 0 0 18070 0 0 5083 0 0 0 0 224 0 48 0 0 0 2984 0 0 0 29162 0 49591 0 9466 0 0 0 0 0 0 0 0 159 159 0 0 374 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 8365 0 0 0 0 25095 0 0 0 3686 0 0 7767 0 0 0 0 0 0 0 0 0 16034 0 0 0 0 0 231848 0 0 0 25090 0 0 0 3558 0 0 8736 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3144 0 3036 181465 0 0 1400 2 1403 1 504929 32592 637 0 0 12 15 0 0 3 0 3 30 0 0 2 0 6653 9 0 279 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 168 0 0 0 0 96 0 8 0 0 0 0 0 0 0 0 0 0 520 40 0 0 0 0 131 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 98 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 133 0 1 7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 24 8 0 0 0 2 67 98 126 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ctxt 61029826
btime 1609501574
processes 27212
procs_running 1
procs_blocked 0
softirq 8564148 1172 1338008 1 3 243852 0 1611 5229125 0 1750376

对应的时间类型定义在内核头文件 include/linux/kernel_stat.h,上图中 cpu[0...7] 后的数值跟这些类型依次对应:

/*
 * 'kernel_stat.h' contains the definitions needed for doing
 * some kernel statistics (CPU usage, context switches ...),
 * used by rstatd/perfmeter
 */
enum cpu_usage_stat {
    CPUTIME_USER, //用户空间占用cpu时间
    CPUTIME_NICE, //高nice任务(第优先级),用户空间占用时间
    CPUTIME_SYSTEM, //内核态占用cpu时间
    CPUTIME_SOFTIRQ, //软中断占用cpu时间
    CPUTIME_IRQ, //硬中断占用cpu时间
    CPUTIME_IDLE, //cpu空闲时间
    CPUTIME_IOWAIT, //cpu等待io时间
    CPUTIME_STEAL, //GuestOS等待real cpu时间
    CPUTIME_GUEST, //GuestOS消耗的时间
    CPUTIME_GUEST_NICE, //高nice任务(第优先级),GuestOS消耗的时间
    NR_STATS,
};

打印函数为 fs/proc/stat.c 中的 show_stat(),单位为 jiffie。在linux系统中,cputime模块具有重要的意义。它记录了设备中所有cpu在各个状态下经过的时间。我们所熟悉的top工具就是用cputime换算出的cpu利用率。

2. per-cluster的在其各个频点下驻留的时间

cpufreq_stats 模块的开启需要使能 CONFIG_CPU_FREQ_STAT 宏。当系统使能该特性后,cpufreq driver sysfs下生成 stats 目录:

/sys/devices/system/cpu/cpufreq/policy0/stats # ls -l
total 0
--w-------    reset //可以对统计进行reset
-r--r--r--    time_in_state //本cluster在各频点下驻留的时间,单位jiffy
-r--r--r--    total_trans //频点之间总切换次数
-r--r--r--    trans_table //频点转换表

# cat /sys/devices/system/cpu/cpufreq/policy0/stats/time_in_state
1800000 5647
1700000 7
...
200000 4221664

表示的是该 cpufreq policy 内分别处于各个频点的时间,单位为 jiffies。有了这个功能,我们就能获取每个 cluster 运行最多的频点是哪些,进而针对性的对系统功耗性能进行优化。

3. per-线程在各个频点下驻留的时间

# cat /proc/913/time_in_state
cpu0
1800000 0
...
1250000 2638
...
200000 0
cpu4
2850000 0
...
200000 0
cpu7
3050000 0
...
1300000 9

该节点记录了该线程在各个 cpufreq policy 的各个频点下驻留的时间, 单位为 clock_t。clock_t 是由 USER_HZ 来决定,该系统中 USER_HZ 为250,则 clock_t 代表4ms。

4. per-cpu的cpuidle time

# ls -l /sys/devices/system/cpu/cpu0/cpuidle
drwxr-xr-x    driver
drwxr-xr-x    state0
drwxr-xr-x    state1
drwxr-xr-x    state2
drwxr-xr-x    state3
drwxr-xr-x    state4
drwxr-xr-x    state5
drwxr-xr-x    state6

# ls -l /sys/devices/system/cpu/cpu0/cpuidle/state0
...
-r--r--r-- 1 root root 4096 2021-01-02 19:51 time

# cat /sys/devices/system/cpu/cpu0/cpuidle/state*/time
2675541339
13746613328
0
0
460
24621035515
0

cpuidle time 模块的工作就是记录每个cpu在各层深度中睡了多久,即每次开机以来,每个核在每个 C-state下的时长,单位为 us。


二、各种时间统计原理

1. per-cpu的各种类型的使用时间

cputime 模块代码位于 kernel/sched/cputime.c。由上图可见,统计的时间精度是1个tick。当每次timer中断来临时,kernel经过由中断处理函数调用到 irqtime_account_process_tick()(需要使能特性宏 CONFIG_IRQ_TIME_ACCOUNTING,将irq/softirq的统计囊括其中)。通过判断当前task是否为 softirq/user tick/idle进程/guest系统进程/内核进程,将经历的cpu时间(通常为1个tick)统计到各个类型中去。

/*
 * Account a tick to a process and cpustat
 * @p: the process that the CPU time gets accounted to
 * @user_tick: is the tick from userspace
 * @rq: the pointer to rq
 *
 * Tick demultiplexing follows the order
 * - pending hardirq update
 * - pending softirq update
 * - user_time
 * - idle_time
 * - system time
 *   - check for guest_time
 *   - else account as system_time
 *
 * Check for hardirq is done both for system and user time as there is
 * no timer going off while we are on hardirq and hence we may never get an
 * opportunity to update it solely in system time.
 * p->stime and friends are only updated on system time and not on irq
 * softirq as those do not count in task exec_runtime any more.
 */
static void irqtime_account_process_tick(struct task_struct *p, int user_tick, int ticks)
{
    u64 other, cputime = TICK_NSEC * ticks;

    /*
     * When returning from idle, many ticks can get accounted at
     * once, including some ticks of steal, irq, and softirq time.
     * Subtract those ticks from the amount of time accounted to
     * idle, or potentially user or system time. Due to rounding,
     * other time can exceed ticks occasionally.
     */
    other = account_other_time(ULONG_MAX);
    if (other >= cputime)
        return;

    cputime -= other;

    if (this_cpu_ksoftirqd() == p) {
        /*
         * ksoftirqd time do not get accounted in cpu_softirq_time.
         * So, we have to handle it separately here.
         * Also, p->stime needs to be updated for ksoftirqd.
         */
        account_system_index_time(p, cputime, CPUTIME_SOFTIRQ);
    } else if (user_tick) {
        account_user_time(p, cputime);
    } else if (p == this_rq()->idle) {
        account_idle_time(cputime);
    } else if (p->flags & PF_VCPU) { /* System time or guest time */
        account_guest_time(p, cputime);
    } else {
        account_system_index_time(p, cputime, CPUTIME_SYSTEM);
    }
}

2. per-cluster的在其各个频点下驻留的时间

cpufreq_times 模块代码位于 drivers/cpufreq/cpufreq_times.c,它的更新涉及到 cpufreq driver 与 cputime 两个模块。当 cpufreq policy 频率改变时,cpufreq driver 通过 cpufreq_notify_transition(普通调频模式)或者 cpufreq_driver_fast_switch(快速调频模式)调用 cpufreq_times_record_transition 函数,通知 cpufreq_times 模块当前该 policy 处于哪一个频点。当 cputime 模块接收到 timer 中断后,会调用 cpufreq_acct_update_power(),将该 tick 添加到 cpufreq_times 模块当前任务及当前频点的统计上。

3. per-线程在各个频点下驻留的时间

cpufreq_stats 模块代码位于 drivers/cpufreq/cpufreq_stats.c。它的更新有些类似于 cpufreq_times, 但与其不同的是只涉及 cpufreq driver 一个外部模块。当 cpufreq policy 频率改变时,cpufreq driver 通过 cpufreq_notify_transition(普通调频模式)或者 cpufreq_driver_fast_switch(快速调频模式)调用 cpufreq_times_record_transition 函数调用 cpufreq_stats_record_transition 函数,通知 cpufreq_stats 模块此刻发生调频以及要切换到哪一个目标频点。cpufreq_state 模块则调用 cpufreq_stats_update 获取当前 jiffies, 并与上一次更新时的 jiffies 相减,最后将差值添加到上个频点的时间统计中:

//drivers\cpufreq\cpufreq_stats.c
static void cpufreq_stats_update(struct cpufreq_stats *stats, unsigned long long time)
{
    unsigned long long cur_time = get_jiffies_64();

    stats->time_in_state[stats->last_index] += cur_time - time;
    stats->last_time = cur_time;
}

4. per-cpu的cpuidle time

cpuidle time 模块代码在 drivers/cpuidle/cpuidle.c。当某个 cpu runqueue 上没有 runnable 状态的任务时,该cpu调度到idle进程,经过层层调用,最后执行到 cpuidle_enter_state()函数。

/**
 * cpuidle_enter_state - enter the state and update stats
 * @dev: cpuidle device for this cpu
 * @drv: cpuidle driver for this cpu
 * @index: index into the states table in @drv of the state to enter
 */
int cpuidle_enter_state(struct cpuidle_device *dev, struct cpuidle_driver *drv, int index) //drivers/cpuidle/cpuidle.c
{
    int entered_state;
    ktime_t time_start, time_end;
    
    ...
    time_start = ns_to_ktime(local_clock());
    ...
    entered_state = target_state->enter(dev, drv, index);
    ...
    time_end = ns_to_ktime(local_clock());
    ...
    diff = ktime_sub(time_end, time_start);
    ...
    dev->last_residency_ns = diff;
    dev->states_usage[entered_state].time_ns += diff;
    ...
}
原文地址:https://www.cnblogs.com/hellokitty2/p/15666357.html