Simultaneous Multithreading: Maximizing On-Chip Parallelism (2)

Time

2020.10.28

Summary

Section 2 defines in detail our basic machine model, the workloads that we measure, and the simulation environment that we constructed.

Section 3 evaluates the performance of a single-threaded superscalar architecture.

Section 4 presents the performance of a range of SM architectures and compares them to the superscalar architecture, as well as to a fine-grain multithreaded processor.

Section 5 explores the effect of cache design alternatives on the performance of simultaneous multithreading.

Section 6 compares the SM approach with conventional multiprocessor architectures.

Section 7 discusses related work.

Section 8 summarizes our results.

Research Objective

Problem Statement

Method(s)

Our goal is to evaluate several architectural alternatives as defined in the previous section: wide superscalars, traditional multithreaded processors, simultaneous multithreaded processors, and small-scale multiple-issue multiprocessors. To do this, we have developed a simulation environment that defines an implementation of a simultaneous multithreaded architecture.

Evaluation

Conclusion

Our results show the limits of superscalar execution and traditional multithreading to increase instruction throughput in future processors.

We compare these two approaches and show that simultaneous multithreading is potentially superior to multiprocessing in its ability to utilize processor resources.

Notes

A more traditional means of achieving parallelism is the conventional multiprocessor.

The Standard Performance Evaluation Corporation (SPEC) is an American non-profit corporation that aims to "produce, establish, maintain and endorse a standardized set" of performance benchmarks for computers. SPEC benchmarks are widely used to evaluate the performance of computer systems.

Words

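throughput
amount of work completed per unit of time
viable
feasible
outperforms
performs better than
tradeoffs
compromises between competing goals
Methodology
the set of methods used in a study
model
to build a model of
hit rates
fraction of accesses that are found in the cache
deviates
departs from
pipeline
to split into overlapping stages
scheduling window
the group of instructions the scheduler can choose from
complement
to supplement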
direct-mapped
(of a cache) each address maps to exactly one location
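hint
a suggestion
particular address
a specific address
accommodate
to adapt to
the raw instruction
the original, unprocessed instruction
uniprocessor applications
applications written for a single processor
benchmark
a standard test used to measure performance
permutations
orderings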
compilation
translating source code into machine code (not the same as assembly)
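Bottlenecks
points that limit overall performance
specific
particular
bounding
limiting
idle cycle
a cycle in which no work is done
appropriately
suitably
quantified
expressed as a number
composite
combined
dominant
prevailing, main
parallelism
the ability to do several things at the same time
coarse-grain or fine-grain
divided into large units or into small units
assumptions
things taken to be true without proof
serially
one after another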
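issue
to send an instruction to a functional unit for execution
partitioning
dividing into parts
vary considerably
differ greatly
IPC (Instructions Per Clock)
the number of instructions completed per clock cycle
bound
a limit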
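
To make the IPC entry concrete, here is a minimal sketch of how IPC and issue-slot utilization could be computed from raw counts. The function names and the numbers in the usage note are hypothetical illustrations, not figures or code from the paper.

```python
def ipc(instructions_completed: int, cycles: int) -> float:
    """Instructions completed per clock cycle."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return instructions_completed / cycles


def issue_slot_utilization(instructions_issued: int, cycles: int, issue_width: int) -> float:
    """Fraction of available issue slots that were actually used.

    An issue_width-wide processor offers issue_width slots every cycle,
    so the total number of slots is cycles * issue_width.
    """
    if cycles <= 0 or issue_width <= 0:
        raise ValueError("cycles and issue_width must be positive")
    return instructions_issued / (cycles * issue_width)
```

For example, with made-up counts of 300 instructions over 200 cycles on an 8-issue machine, `ipc(300, 200)` gives 1.5 and `issue_slot_utilization(300, 200, 8)` gives 0.1875, meaning most issue slots in that hypothetical run went unused.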

Sentence

two close organizational alternatives
two closely related ways of organizing the hardware
in the number of pipeline stages required for instruction issue
the number of pipeline stages needed to issue an instruction
Our simulator uses emulation-based instruction-level simulation
Our simulator works at the level of individual instructions and is driven by emulating them
Each of the B runs uses a different ordering of the benchmarks
Each of the B runs orders the benchmarks differently
the Multiflow trace scheduling compiler
the trace-scheduling compiler from Multiflow (Multiflow is a proper name, not "multi-stream")
It is thus unlikely that
So it is improbable that
Not only is there no dominant cause of wasted cycles — there appears to be no dominant solution.
No single cause dominates the wasted cycles, and no single fix appears to dominate either.
Figure 3 shows the performance of the various models as a function of the number of threads.
Figure 3 plots how the performance of each model changes with the number of threads.
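
The sentence about the B runs each using a different ordering of the benchmarks can be illustrated with a small sketch. It assumes cyclic rotation as the way to generate the orderings, which is one simple scheme that places every benchmark in every position exactly once; these notes do not record which scheme the paper actually uses, and the benchmark names are placeholders.

```python
def rotated_orderings(benchmarks: list[str]) -> list[list[str]]:
    """Return B orderings of B benchmarks, one per run.

    Ordering i is the input list rotated left by i positions, so each
    benchmark appears once in each position across the B runs.
    (Assumed scheme for illustration only.)
    """
    b = len(benchmarks)
    return [benchmarks[i:] + benchmarks[:i] for i in range(b)]


# Placeholder names, not the paper's workload.
for run, order in enumerate(rotated_orderings(["bench_a", "bench_b", "bench_c"])):
    print(run, order)
```

Rotation is used here only because it is the shortest way to guarantee B distinct orderings; any set of B permutations with the same property would serve equally well.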