The Renewed Case for the Reduced Instruction Set
Computer: Avoiding ISA Bloat with Macro-Op Fusion
for RISC-V
Christopher Celio
Daniel Dabbelt
David A. Patterson
Krste Asanović
Electrical Engineering and Computer Sciences
University of California at Berkeley
Technical Report No. UCB/EECS-2016-130
http://www.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-130.html
July 8, 2016
Copyright © 2016, by the author(s).
All rights reserved.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission.
Acknowledgement
The authors would like to thank Scott Beamer, Brian Case, David Ditzel,
and Eric Love for their valuable feedback. Research partially funded by
DARPA Award Number HR0011-12-2-0016, the Center for Future
Architecture Research, a member of STARnet, a Semiconductor
Research Corporation program sponsored by MARCO and DARPA, and
ASPIRE Lab industrial sponsors and affiliates Intel, Google, HPE, Huawei,
LGE, Nokia, NVIDIA, Oracle, and Samsung. Any opinions, findings,
conclusions, or recommendations in this paper are solely those of the
authors and do not necessarily reflect the position or the policy of the
sponsors.
The Renewed Case for the Reduced Instruction Set Computer:
Avoiding ISA Bloat with Macro-Op Fusion for RISC-V
Christopher Celio, Palmer Dabbelt, David Patterson, Krste Asanović
Department of Electrical Engineering and Computer Sciences, University of California, Berkeley
celio@eecs.berkeley.edu
Abstract—This
report makes the case that a well-designed
Reduced Instruction Set Computer (RISC) can match, and even
exceed, the performance and code density of existing commercial
Complex Instruction Set Computers (CISC) while maintaining
the simplicity and cost-effectiveness that underpin the original
RISC goals [12].
We begin by comparing the dynamic instruction counts and
dynamic instruction bytes fetched for the popular proprietary
ARMv7, ARMv8, IA-32, and x86-64 Instruction Set Architectures
(ISAs) against the free and open RISC-V RV64G and RV64GC
ISAs when running the SPEC CINT2006 benchmark suite. RISC-
V was designed as a very small ISA to support a wide range
of implementations, and has a less mature compiler toolchain.
However, we observe that on SPEC CINT2006 RV64G executes
on average 16% more instructions than x86-64, 3% more
instructions than IA-32, 9% more instructions than ARMv8, but
4% fewer instructions than ARMv7.
CISC x86 implementations break up complex instructions into
smaller internal RISC-like micro-ops, and the RV64G instruction
count is within 2% of the x86-64 retired micro-op count.
RV64GC, the compressed variant of RV64G, is the densest ISA
studied, fetching 8% fewer dynamic instruction bytes than
x86-64. We observed that much of the increased RISC-V instruction
count is due to a small set of common multi-instruction idioms.
Exploiting this fact, the RV64G and RV64GC effective instruction
count can be reduced by 5.4% on average by leveraging
macro-op fusion.
Combining the compressed RISC-V ISA exten-
sion with macro-op fusion provides both the densest ISA and the
fewest dynamic operations retired per program, reducing the
motivation to add more instructions to the ISA. This approach
retains a single simple ISA suitable for both low-end and high-
end implementations, where high-end implementations can boost
performance through microarchitectural techniques.
Compiler tool chains are a continual work-in-progress, and the
results shown are a snapshot of the state as of July 2016 and are
subject to change.
I. INTRODUCTION
The Instruction Set Architecture (ISA) specifies the set of
instructions that a processor must understand and the expected
effects of each instruction. One of the goals of the RISC-V
project was to produce an ISA suitable for a wide range of
implementations from tiny microcontrollers to the largest
supercomputers [14]. Hence, RISC-V was designed with a much
smaller number of simple standard instructions compared to
other popular ISAs, including other RISC-inspired ISAs. A
simple ISA is clearly a benefit for a small resource-constrained
microcontroller, but how much performance is lost for
high-performance implementations by not supporting the numerous
instruction variants provided by popular proprietary ISAs?
A casual observer might argue that a processor’s performance
increases when it executes fewer instructions for a given
program, but in reality, the performance is more accurately
described by the Iron Law of Performance [8]:
$$\frac{\text{seconds}}{\text{program}} = \frac{\text{instructions}}{\text{program}} \ast \frac{\text{cycles}}{\text{instruction}} \ast \frac{\text{seconds}}{\text{cycle}}$$
The ISA is just an abstract boundary; behind the scenes
the processor may choose to implement instructions in any
number of ways that trade off cycles/instruction, or CPI, and
seconds/cycle (the reciprocal of clock frequency).
For example, a fairly powerful x86 instruction is the repeat
move instruction (rep movs), which copies C bytes of data
from one memory location to another:
// pseudo-code for a ‘repeat move’ instruction
for (i=0; i < C; i++)
    d[i] = s[i];
Implementations of the x86 ISA break up the repeat move
instruction into smaller operations, or micro-ops, that
individually perform the required operations of loading the data
from the old location, storing the data to the new location,
incrementing the address pointers, and checking to see if the
end condition has been met. Therefore, a raw comparison of
instruction counts may hide a significant amount of work and
complexity to execute a particular benchmark.
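As a rough illustration of why raw instruction counts can mislead, the sketch below models a repeat move being broken into micro-ops and then applies the Iron Law to two hypothetical machines. The micro-op names, the one-micro-op-per-cycle assumption, the 1 ns cycle time, and the function names are all assumptions made for this example; real x86 implementations use optimized microcode and will behave differently.

```python
# Simplified model only: the micro-op names and timing assumptions below
# are illustrative and do not describe any real x86 implementation.

def crack_repeat_move(count):
    """Expand one 'repeat move' of `count` bytes into micro-op names."""
    micro_ops = []
    for _ in range(count):
        micro_ops.extend(["load", "store", "increment_pointers", "check_end"])
    return micro_ops


def execution_time(instructions, cycles_per_instruction, seconds_per_cycle):
    """Iron Law: seconds/program = instructions * CPI * seconds/cycle."""
    return instructions * cycles_per_instruction * seconds_per_cycle


BYTES_TO_COPY = 64        # hypothetical value of C
SECONDS_PER_CYCLE = 1e-9  # hypothetical 1 GHz clock

micro_ops = crack_repeat_move(BYTES_TO_COPY)

# Machine A: a single complex instruction that is assumed to take one
# cycle per micro-op, so its CPI is large.
time_complex = execution_time(1, len(micro_ops), SECONDS_PER_CYCLE)

# Machine B: the same work expressed as simple instructions, each assumed
# to take one cycle.
time_simple = execution_time(len(micro_ops), 1, SECONDS_PER_CYCLE)

print(len(micro_ops), time_complex, time_simple)
```

Under these assumptions the two machines differ by a factor of 256 in instruction count yet finish in the same time, because the CPI term absorbs the difference.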
In contrast to the process of generating many micro-ops
from a single ISA instruction, several commercial
microprocessors perform macro-op fusion, where several ISA
instructions are fused in the decode stage and handled as one
internal operation. As an example, compare-and-branch is a
very commonly executed idiom, and the RISC-V ISA includes
a full register-register magnitude comparison in its branch
instructions. However, both ARM and x86 typically require two
ISA instructions to specify a compare-and-branch. The first
instruction performs the comparison and sets a condition code,
and the second instruction performs the jump-on-condition-code.
While it would seem that ARM and x86 would have
a penalty of one additional instruction on nearly every loop
compared to RISC-V, the reality is more complicated. Both
ARM and Intel employ the technique of macro-op fusion,
in which the processor front-end detects these two-instruction
compare-and-branch sequences in the instruction stream and
“fuses” them together into a single macro-op, which can then
be handled as a single compare-and-branch instruction by the
processor back-end to reduce the effective dynamic instruction
count.
Footnote 1: The reality can be even more complicated. Depending
on the microarchitecture, the front-end may fuse the two
instructions together to save decode, allocation, and commit
bandwidth, but break them apart in the execution pipeline for
critical path or complexity reasons [6].
Macro-op fusion is a very powerful technique to lower the
effective instruction count. One of the main contributions of
this report is to show that macro-op fusion, in combination
with the existing compressed instruction set extensions for
RISC-V, can provide the effect of a richer instruction set
for RISC-V without requiring any ISA extensions, thus en-
abling support for both low-end implementations and high-
end implementations from a single simple common code base.
The resulting ISA design can provide both a low number of
effective instructions executed and a low number of dynamic
instruction bytes fetched.
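The compare-and-branch fusion described in this section can be sketched as a simple pass over a decoded instruction stream. This is a toy model, not the logic of any real front-end: the tuple format, the x86-style mnemonics chosen for the example, and the function name are assumptions, and real hardware applies further conditions before fusing.

```python
# Toy decoder pass. The instruction tuples and the set of fusible jumps
# are assumptions made for this sketch.
CONDITIONAL_JUMPS = {"je", "jne", "jl", "jle", "jg", "jge"}


def fuse_compare_and_branch(instructions):
    """Fuse each adjacent (cmp, conditional jump) pair into one macro-op."""
    macro_ops = []
    i = 0
    while i < len(instructions):
        current = instructions[i]
        following = instructions[i + 1] if i + 1 < len(instructions) else None
        if (current[0] == "cmp" and following is not None
                and following[0] in CONDITIONAL_JUMPS):
            # One internal operation carries the compare operands and
            # the branch target.
            macro_ops.append(
                ("cmp+" + following[0],) + current[1:] + following[1:])
            i += 2
        else:
            macro_ops.append(current)
            i += 1
    return macro_ops


loop_body = [
    ("add", "rax", "rbx"),
    ("add", "rcx", "1"),
    ("cmp", "rcx", "rdx"),
    ("jne", "loop"),
]
fused = fuse_compare_and_branch(loop_body)
print(len(loop_body), "instructions ->", len(fused), "macro-ops")
```

In this example the four-instruction loop body is handled as three operations, which is the effect the report refers to as a lower effective instruction count.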
II. METHODOLOGY
In this section, we describe the benchmark suite and
methodology used to obtain dynamic instruction counts, dy-
namic instruction bytes, and effective instructions executed for
the ISAs under consideration.
A. SPEC CINT2006
We used the SPEC CINT2006 benchmark suite [9] for
comparing the different ISAs. SPECInt2006 is composed of
35 different workloads across 12 different benchmarks with a
focus on desktop and workstation-class applications such as
compilation, simulation, decoding, and artificial intelligence.
These applications are largely CPU-intensive with working
sets of tens of megabytes and a required total memory usage
of less than 2 GB.
B. GCC Compiler
We used GCC for all targets as it is widely used and
the only compiler available for all systems. Vendor-specific
compilers will surely provide different results, but we did
not analyze them here. All benchmarks were compiled using
the latest GNU gcc 5.3 with the parameters shown in
Table I. The 400.perlbench benchmark requires specifying
-std=gnu98 to compile under gcc 5.3. We used the Speckle
suite to compile and execute SPECInt using reference
inputs to completion [2]. The benchmarks were compiled
statically to make it easier to analyze the binaries.
Unless otherwise specified, data was collected using the perf
utility [1] while running the benchmarks on native hardware.
C. RISC-V RV64
The RISC-V ISA is a free and open ISA produced by
the University of California, Berkeley and first released in
2010 [3]. For this report, we will use the standard RISC-
V RV64G ISA variant, which contains all ISA extensions
for executing 64-bit “general-purpose” code [14]. We will
also explore the “C” Standard Extension for Compressed
Instructions (RVC). All instructions in RV64G are 4 bytes
in size; however, the C extension adds 2-byte forms of the
most common instructions. The resulting RV64GC ISA is very
dense, both statically and dynamically [13].
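Because RV64GC instructions are either 2 or 4 bytes long, the average instruction size follows directly from the fraction of executed instructions that use the compressed form. The sketch below works through that arithmetic; the function names are hypothetical, and the 3.00 bytes per instruction is the RV64GC average reported later in this report.

```python
def average_instruction_bytes(compressed_fraction):
    """Mean size when a fraction of executed instructions are 2-byte forms."""
    return 2 * compressed_fraction + 4 * (1 - compressed_fraction)


def compressed_fraction_for(average_bytes):
    """Invert the formula: fraction of 2-byte instructions for a given mean."""
    return (4 - average_bytes) / 2


# With no compressed instructions every instruction is 4 bytes (RV64G).
print(average_instruction_bytes(0.0))

# Fraction of 2-byte instructions implied by a 3.00-byte average.
print(compressed_fraction_for(3.00))
```

An average of 3.00 bytes therefore corresponds to about half of the dynamically executed instructions using a 2-byte encoding.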
We cross-compiled RV64G and RV64GC benchmarks using
the compiler settings shown in Table I. The RV64GC
benchmarks were built using a compressed glibc library.
The benchmarks were then executed using the spike ISA
simulator running on top of Linux version 3.14, which was
compiled against version 1.7 of the RISC-V privileged ISA. A
side-channel process grabbed the retired instruction count at
the beginning and end of each workload. We did not analyze
RV32G, as there does not yet exist an RV32 port of the Linux
operating system.
For the 483.xalancbmk benchmark, 34% of the RISC-V
instruction count is taken up by an OS kernel spin-loop waiting
on the test-harness I/O. These instructions are an artifact of
our testing infrastructure and were removed from any further
analysis.
D. ARMv7
The 32-bit ARMv7 benchmarks were compiled and executed
on a Samsung Exynos 5250 (Cortex-A15). The -march=native
flag resolves to the ARMv7ve ISA and the -mtune=native
flag resolves to the cortex-a15 processor.
E. ARMv8
The 64-bit ARMv8 benchmarks were compiled and
executed on a Snapdragon 410c (Cortex-A53). The -march
flag was set to the ARMv8-a ISA and the -mtune flag was
set to the cortex-a53 processor. The errata flags
-mfix-cortex-a53-835769 and -mfix-cortex-a53-843419 are set.
The 1 GB of RAM on the 410c board is not sufficient to run
some of the workloads from 401.bzip2, 403.gcc, and 429.mcf.
To manage this issue, we used a swapfile to provide access
to a larger pool of memory and only measured user-level
instruction counts for the problematic workloads.
F. IA-32
The IA-32 benchmarks targeted the i686 architecture and were
compiled and executed on an Intel Xeon E5-2667v2 (Ivy
Bridge).
G. x86-64
The x86-64 benchmarks were compiled and executed on an
Intel Xeon E5-2667v2 (Ivy Bridge). The -march flag resolves
to the ivybridge ISA.
H. Instruction Count Histogram Collection
Histograms of the instruction counts for RV64G, RV64GC,
and x86-64 were collected, allowing us to more easily compare
the hot loops across ISAs. We were also able to compute the
dynamic instruction bytes by cross-referencing the histogram
data with the static objdump data. x86-64 histograms were
collected by writing a histogram-building tool for the Intel Pin
dynamic binary translation tool [11]. Histograms for RV64G
and RV64GC were collected using an existing histogram tool
built into the RISC-V spike ISA simulator.
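The cross-referencing step described above amounts to weighting each static instruction's encoded size by the number of times it executed. The following is a minimal sketch of that calculation, assuming the histogram and the objdump output have already been parsed into dictionaries keyed by instruction address; the addresses, counts, sizes, and function name are made up for illustration and are not taken from the report's tooling.

```python
def dynamic_bytes(execution_counts, instruction_sizes):
    """Sum (times executed) * (encoded size) over every static instruction."""
    return sum(count * instruction_sizes[pc]
               for pc, count in execution_counts.items())


# Hypothetical parsed data: address -> execution count, address -> size.
execution_counts = {0x1000: 500, 0x1002: 500, 0x1006: 1}
instruction_sizes = {0x1000: 2, 0x1002: 4, 0x1006: 4}

total_bytes = dynamic_bytes(execution_counts, instruction_sizes)
total_instructions = sum(execution_counts.values())

print(total_bytes, total_instructions, total_bytes / total_instructions)
```

Dividing the total bytes by the total instruction count gives the average bytes fetched per executed instruction for the workload.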
TABLE I: Compiler options for gcc 5.3.
ISA      compiler                       flags
RV64G    riscv64-unknown-gnu-linux-g++  -O3 -static
RV64GC   riscv64-unknown-gnu-linux-g++  -O3 -mrvc -mno-save-restore -static
IA-32    g++-5                          -O3 -m32 -march=ivybridge -mtune=native -static
x86-64   g++-5                          -O3 -march=ivybridge -mtune=native -static
ARMv7ve  g++                            -O3 -march=armv7ve -mtune=cortex-a15 -static
ARMv8-a  g++-5                          -O3 -march=armv8-a -mtune=cortex-a53 -static
                                        -mfix-cortex-a53-835769 -mfix-cortex-a53-843419
I. SIMD ISA Extensions
Although a vector extension is planned for RISC-V, there
is no existing vector facility. To compare against the scalar
RV64G ISA, we verified that the ARM and x86 code was
compiled in a manner that generally avoided generating any
SIMD or vector instructions for the SPECInt2006 benchmarks.
An analysis of the x86-64 histograms showed that, with
the exception of the memset routine in 403.gcc and a
strcmp routine in 471.omnetpp, no SSE instructions
appeared among the 80% most executed instructions.
To further reinforce this conclusion, we built a gcc and
glibc x86-64 toolchain that explicitly forbade MMX and
AVX extensions. Vectorization analysis was also disabled. The
resulting instruction counts for SPECInt2006 were virtually
unchanged.
Although the MMX and AVX extensions may be disabled in
gcc, it is not possible to disable SSE instruction generation
as it is a mandatory part of the x86-64 floating point ABI.
However, we note that the only significant usage of SSE
instructions was 128-bit SSE stores found in the memset
routine in 403.gcc (≈20%) and a very small usage (<2%)
of packed SIMD found in strcmp in 471.omnetpp.
III. RESULTS
All comparisons between ISAs in this report are based on
the geometric mean across the 12 SPECInt2006 benchmarks.
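As a check on how these averages are formed, the sketch below computes the geometric mean of two columns of Table II: the RV64G instruction counts and the x86-64 retired micro-op counts, both normalized to x86-64. The per-benchmark values are copied from Table II; the function and variable names are assumptions made for this example.

```python
import math


def geometric_mean(values):
    """Geometric mean computed as the exponential of the mean logarithm."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Per-benchmark values from Table II, in benchmark order
# (400.perlbench through 483.xalancbmk), normalized to x86-64.
RV64G = [1.17, 1.33, 1.36, 0.94, 1.18, 1.16,
         1.29, 0.83, 1.64, 1.05, 0.99, 1.15]
X86_64_MICRO_OPS = [1.13, 1.12, 1.19, 1.04, 1.11, 1.47,
                    1.07, 0.88, 1.47, 1.24, 1.04, 1.07]

rv64g_geomean = geometric_mean(RV64G)
micro_op_geomean = geometric_mean(X86_64_MICRO_OPS)

print(round(rv64g_geomean, 2), round(micro_op_geomean, 2))
print(round(rv64g_geomean / micro_op_geomean, 3))
```

The two results should agree with the geomean row of Table II after rounding, and their ratio shows how close the RV64G instruction count is to the x86-64 retired micro-op count.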
A. Instruction Counts
TABLE II: Total dynamic instructions normalized to x86-64.
benchmark       x86-64     x86-64  IA-32  ARMv7  ARMv8  RV64G  RV64GC+
                micro-ops                                      fusion
400.perlbench   1.13       1.00    1.04   1.16   1.07   1.17   1.14
401.bzip2       1.12       1.00    1.05   1.04   1.03   1.33   1.08
403.gcc         1.19       1.00    1.03   1.29   0.97   1.36   1.34
429.mcf         1.04       1.00    1.07   1.19   1.02   0.94   0.93
445.gobmk       1.11       1.00    1.00   1.19   1.10   1.18   1.11
456.hmmer       1.47       1.00    1.19   1.45   1.21   1.16   1.16
458.sjeng       1.07       1.00    1.06   1.22   1.12   1.29   1.16
462.libquantum  0.88       1.00    1.62   1.38   0.95   0.83   0.83
464.h264ref     1.47       1.00    1.03   1.17   1.14   1.64   1.46
471.omnetpp     1.24       1.00    1.20   1.08   0.98   1.05   1.03
473.astar       1.04       1.00    1.11   1.17   1.05   0.99   0.89
483.xalancbmk   1.07       1.00    1.10   1.18   1.05   1.15   1.14
geomean         1.14       1.00    1.12   1.21   1.06   1.16   1.09
As shown in Figure 1 (and Table II), RV64G executes 16%
more instructions than x86-64, 3% more instructions than
IA-32, 9% more instructions than ARMv8, and 4% fewer
instructions than ARMv7. The raw instruction counts can be
found in Figure VI.
B. Micro-op Counts
The number of x86-64 retired micro-ops was also collected
and is reported in Figure 1. On average, the Intel Ivy Bridge
processor used in this study emitted 1.14 micro-ops per
x86-64 instruction, which puts the RV64G instruction count
within 2% of the x86-64 retired micro-op count.
C. Dynamic Instruction Bytes
TABLE III: Total dynamic bytes normalized to x86-64.
benchmark       x86-64  ARMv7   ARMv8   RV64G   RV64GC
400.perlbench   1.00    1.21    1.11    1.22    0.92
401.bzip2       1.00    1.07    1.07    1.38    1.06
403.gcc         1.00    1.40    1.05    1.47    1.03
429.mcf         1.00    1.40    1.20    1.11    0.83
445.gobmk       1.00    1.18    1.09    1.17    0.87
456.hmmer       1.00    1.41    1.18    1.13    0.90
458.sjeng       1.00    1.19    1.09    1.25    0.92
462.libquantum  1.00    1.90    1.30    1.14    0.82
464.h264ref     1.00    1.14    1.12    1.61    1.28
471.omnetpp     1.00    1.17    1.06    1.13    0.79
473.astar       1.00    1.22    1.10    1.03    0.82
483.xalancbmk   1.00    1.28    1.14    1.24    0.91
geomean         1.00    1.28    1.12    1.23    0.92
The total dynamic instruction bytes fetched is reported
in Figure 2 (and Table III). RV64G, with its fixed 4-byte
instruction size, fetches 23% more bytes per program than
x86-64. Unexpectedly, x86-64 is not very dense, averaging
3.71 bytes per instruction (with a standard deviation of 0.34
bytes). Like RV64G, both ARMv7 and ARMv8 use a fixed
4-byte instruction size.
Using the RISC-V “C” Compressed ISA extension,
RV64GC fetches 8% fewer dynamic instruction bytes relative
to x86-64, with an average of 3.00 bytes per instruction.
There are only three benchmarks (401.bzip2, 403.gcc,
464.h264ref) where RV64GC fetches more dynamic bytes
than x86-64, and two of those three benchmarks make heavy
use of memset and memcpy. RV64GC also fetches
considerably fewer bytes than either ARMv7 or ARMv8.
IV. DISCUSSION
We discuss briefly the three outliers where RISC-V performs
poorly, as well as general trends observed across all of the