Simultaneous Multithreading: Maximizing On-Chip Parallelism (close reading)

Time

2020.11.02

Summary

1. introduce several SM models
2. evaluate the performance of those models relative to superscalar execution and fine-grain multithreading
3. show how to tune the cache hierarchy for SM processors
4. demonstrate the potential performance and real-estate advantages of SM

Research Objective

Simultaneous Multithreading (SM)

Problem Statement

1. Although simultaneous multithreading has excellent potential to increase processor utilization, it can add complexity to the design.

2.The objective of SM is to increase processor utilization.

3. In fact, the instruction throughput of the various SM models is somewhat hampered by the sharing of the caches and TLBs.

Method(s)

1. We have developed a simulation environment that defines an implementation of a simultaneous multithreaded architecture.
2. Our workload is the SPEC92 benchmark suite, used to gauge the raw instruction throughput achievable by multithreaded superscalar processors.
3. We compile each program with the Multiflow trace scheduling compiler, modified to produce Alpha code scheduled for our target machine.
4. The following models reflect several possible design choices for a combined multithreaded, superscalar processor. (The basic machine is a wide superscalar with 10 functional units capable of issuing 8 instructions per cycle.)
The models are:
4.1 Fine-Grain Multithreading
Only one thread issues instructions each cycle. This model does not feature simultaneous multithreading.
4.2 SM:Full Simultaneous Issue
A completely flexible simultaneously multithreaded superscalar, in which all threads compete for every issue slot each cycle. This is the least realistic model in terms of hardware complexity.
4.2.1 SM:Single Issue, SM:Dual Issue, and SM:Four Issue
These three models limit the number of instructions each thread can issue, or have active in the scheduling window.
4.2.2 SM:Limited Connection
Each hardware context is directly connected to exactly one of each type of functional unit.
5. Our study focuses on the organization of the first-level (L1) caches, comparing the use of private per-thread caches to shared caches for both instructions and data. (We assume that L2 and L3 caches are shared among all threads.) All experiments use the 4-issue model with up to 8 threads.
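The issue models above can be sketched as a toy simulation. This is not the paper's simulator: the 8 threads, their per-cycle ready-instruction counts, and the greedy slot-filling policy below are all invented for illustration. Each model differs only in how many instructions a single thread may contribute to the 8 shared issue slots.

```python
# Toy sketch of the issue models: fill an 8-slot issue window under
# different per-thread caps. All numbers are hypothetical.

ISSUE_WIDTH = 8

def issue(ready, per_thread_cap, threads_per_cycle=None):
    """Greedily fill the issue slots for one cycle.
    ready: ready-instruction count per thread.
    per_thread_cap: max instructions one thread may issue.
    threads_per_cycle=1 models fine-grain multithreading
    (only the single best thread issues this cycle)."""
    slots = ISSUE_WIDTH
    issued = 0
    order = range(len(ready))
    if threads_per_cycle == 1:
        order = [max(order, key=lambda t: ready[t])]
    for t in order:
        take = min(ready[t], per_thread_cap, slots)
        issued += take
        slots -= take
        if slots == 0:
            break
    return issued

ready = [3, 1, 4, 2, 0, 2, 1, 3]   # hypothetical ready counts, 8 threads

fine_grain = issue(ready, per_thread_cap=ISSUE_WIDTH, threads_per_cycle=1)
sm_single  = issue(ready, per_thread_cap=1)   # SM:Single Issue
sm_four    = issue(ready, per_thread_cap=4)   # SM:Four Issue
sm_full    = issue(ready, per_thread_cap=ISSUE_WIDTH)  # SM:Full Simultaneous
print(fine_grain, sm_single, sm_four, sm_full)  # 4 7 8 8
```

Even in this toy cycle, limiting each thread to 4 issues already fills all 8 slots, which mirrors the paper's finding that the four-issue model gets nearly the performance of full simultaneous issue.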

Evaluation

1.To place our evaluation in the context of modern superscalar processors, we simulate a base architecture derived from the 300 MHz Alpha 21164.
2. We evaluate MPs with 1, 2, and 4 issues per cycle on each processor, and SM processors with 4 and 8 issues per cycle.

Conclusion

1. Our results show the limited ability of superscalar execution and traditional multithreading to increase instruction throughput in future processors.
1.1 an 8-issue superscalar architecture fails to sustain 1.5 instructions per cycle.
1.2 a fine-grain multithreaded processor (capable of switching contexts every cycle at no cost) utilizes only about 40% of a wide superscalar's issue capacity.
2. Simultaneous Multithreading (SM) provides a significant improvement in instruction throughput, and is limited primarily by the issue bandwidth of the processor.
3. Simultaneous multithreading is superior to multiprocessing in its ability to utilize processor resources.
4.Traditional multithreading (coarse-grain or fine-grain) can fill cycles that contribute to vertical waste.
5. This result is similar to previous studies [2, 1, 19, 14, 33, 31] of both coarse-grain and fine-grain multithreading on single-issue processors, which concluded that multithreading is only beneficial for 2 to 5 threads.
6. The simultaneous multithreading models achieve maximum speedups over single-thread superscalar execution ranging from 3.2 to 4.2, with an issue rate as high as 6.3 IPC.
7. The four-issue model gets nearly the performance of the full simultaneous issue model, and even the dual-issue model is quite competitive, reaching 94% of full simultaneous issue at 8 threads.
8. The SM results may be optimistic in two respects.

Notes

1. The binding between thread and functional unit is completely dynamic.
2. Simultaneous multithreading combines the multiple-instruction-issue features of modern superscalars with the latency-hiding ability of multithreaded architectures.
3. Multiple instruction issue is limited by instruction dependencies and long-latency operations (Figure 1 in the paper illustrates this).
3.1 vertical waste:when the processor issues no instructions in a cycle.
3.2 horizontal waste:when not all issue slots can be filled in a cycle.
3.3 Superscalar execution both introduces horizontal waste and increases the amount of vertical waste.
4. Traditional multithreading hides memory and functional unit latencies, attacking vertical waste.
5. Simultaneous multithreading attacks both horizontal and vertical waste.
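The waste categories in notes 3 to 5 can be computed directly from a per-cycle issue trace. The trace below is invented for illustration, assuming an 8-issue machine: vertical waste comes from completely idle cycles, horizontal waste from unused slots in non-idle cycles.

```python
# Sketch: classify wasted issue slots from a per-cycle issue trace.
# The trace is hypothetical; the machine is 8-issue.

ISSUE_WIDTH = 8
trace = [3, 0, 8, 1, 0, 5, 2, 0]  # instructions issued each cycle

vertical = sum(ISSUE_WIDTH for n in trace if n == 0)       # fully idle cycles
horizontal = sum(ISSUE_WIDTH - n for n in trace if n > 0)  # partially filled cycles
total_slots = ISSUE_WIDTH * len(trace)

print(vertical, horizontal)                   # 24 21
print((vertical + horizontal) / total_slots)  # 0.703125 of slots wasted
```

Traditional multithreading can only recover the `vertical` portion (by giving idle cycles to another thread); SM can attack both terms by filling leftover slots within a cycle as well.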
6. a traditional means of achieving parallelism is the conventional multiprocessor.
7. caches are more strained by a multithreaded workload than a single-thread workload, due to a decrease in locality [21, 33, 1, 31].
8. Simultaneous multithreading vs. small-scale, single-chip multiprocessing (MP)
8.1 both have multiple register sets, multiple functional units, and high issue bandwidth on a single chip.
8.2 The key difference is in the way those resources are partitioned and scheduled:
8.2.1 the multiprocessor statically partitions resources, devoting a fixed number of functional units to each thread;
8.2.2 the SM processor allows the partitioning to change every cycle;
9.The distance between the load/store units and the data cache can have a large impact on cache access time.
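The contrast in note 8 between static and dynamic partitioning can be sketched numerically. The per-thread demand figures below are invented for illustration: two threads share 8 issue slots, either split statically between two 4-wide processors (MP) or repartitioned every cycle (SM).

```python
# Sketch: static (MP) vs dynamic (SM) partitioning of 8 issue slots
# between two threads. Demand numbers are hypothetical.

demand = [6, 1]  # ready instructions per thread this cycle

# MP: each processor owns a fixed 4-wide issue slice for its thread.
mp_issued = sum(min(d, 4) for d in demand)   # 4 + 1 = 5

# SM: all 8 slots are repartitioned among the threads each cycle.
slots = 8
sm_issued = 0
for d in demand:
    take = min(d, slots)
    sm_issued += take
    slots -= take
print(mp_issued, sm_issued)  # 5 7
```

When demand is uneven, the static split strands slots on the under-loaded processor while the over-loaded one stalls; dynamic partitioning lets the busy thread absorb the slack, which is the paper's argument for SM's superior resource utilization.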

Terminology

1. lockup-free (non-blocking) caches
Reference: https://my.oschina.net/u/4254706/blog/4504914
2.The SPEC92 benchmark suite
The SPEC92 benchmark suite consists of twenty public-domain, non-trivial programs that are widely used to measure the performance of computer systems
3. benchmark suite
Definition: a set of benchmark programs together with specific rules governing control conditions and procedures, including input data, expected output, and the platform/environment under test.
4.vertical waste (completely idle cycles)
5.horizontal waste (unused issue slots in a non-idle cycle)
6. I-Cache (instruction cache)
The I-Cache holds the instructions the processor needs; during the fetch stage, the program counter (PC) supplies an address to the I-Cache and the processor retrieves the required instructions.
7. D-Cache (data cache)
The D-Cache stores data and supplies the values at the addresses referenced by load/store instructions; those addresses come from ALU results.
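The cache terms above, together with "direct-mapped" from the word list below, can be illustrated with a toy address split. The parameters (32-byte lines, 256 sets) are chosen for illustration only; they are not the paper's cache configuration.

```python
# Toy direct-mapped cache lookup: split an address into tag / index /
# offset fields. Sizes are illustrative, not the paper's configuration.

LINE_SIZE = 32   # bytes per cache line
NUM_SETS = 256   # lines in a direct-mapped cache (one block per set)

def split_address(addr):
    offset = addr % LINE_SIZE                 # byte within the line
    index = (addr // LINE_SIZE) % NUM_SETS    # the one line this block maps to
    tag = addr // (LINE_SIZE * NUM_SETS)      # distinguishes blocks sharing a line
    return tag, index, offset

# Two addresses one cache-size (8 KB) apart map to the same set:
# in a direct-mapped cache they conflict even if the rest is empty.
print(split_address(0x1234))                         # (0, 145, 20)
print(split_address(0x1234 + LINE_SIZE * NUM_SETS))  # (1, 145, 20)
```

This conflict behavior is one reason a multithreaded workload strains caches more than a single-thread one (note 7): interleaved threads touch more distinct blocks that compete for the same lines.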

Words

speculated: conjectured
magnitude: size, extent
develop
deviate from: depart from
control hazard: a pipeline hazard caused by branches
to be common: to be prevalent
interleaving: alternating execution of multiple streams
direct-mapped: each memory block maps to exactly one cache location
virtually identical: almost the same
be extrapolated from: be inferred from
uniprocessor: a single-processor machine
additive: cumulative
composite bar: a stacked bar in a chart
different degree: varying extent
performance results
insight into: understanding of
the forwarding logic: bypass logic
vary considerably: differ greatly
granularity: the size of the unit of work
contention: competition for a shared resource
commensurate: proportionate

Sentence

1. evaluate the performance of those models relative to superscalar and fine-grain multithreading
2. To place our evaluation in the context of modern superscalar processors, we simulate a base architecture derived from the 300 MHz Alpha 21164.
3. we have developed a simulation environment that defines an implementation of a simultaneous multithreaded architecture
4. caching of partially decoded instructions
5. fast emulated execution
6. augment first for
7. An instruction cache access occurs whenever the program counter crosses a 32-byte boundary.
8. the executable with the lowest single-thread execution time on our target hardware was used for all experiments
9. if the next instruction cannot be scheduled in the same cycle as the current instruction
10. achieve most of the available gains in that regard
11. 2 times the throughput

Timeline

Original post: https://www.cnblogs.com/call-me-dasheng/p/13913781.html