
The Renewed Case for the Reduced Instruction Set Computer: Avoiding ISA Bloat with Macro-Op Fusion for RISC-V

Christopher Celio

Daniel Dabbelt

David A. Patterson

Krste Asanović

Electrical Engineering and Computer Sciences

University of California at Berkeley

Technical Report No. UCB/EECS-2016-130

http://www.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-130.html

July 8, 2016

Permission to make digital or hard copies of all or part of this work for

personal or classroom use is granted without fee provided that copies are

not made or distributed for profit or commercial advantage and that copies

bear this notice and the full citation on the first page. To copy otherwise, to

republish, to post on servers or to redistribute to lists, requires prior specific

permission.

Acknowledgement

The authors would like to thank Scott Beamer, Brian Case, David Ditzel, and Eric Love for their valuable feedback. Research partially funded by DARPA Award Number HR0011-12-2-0016, the Center for Future Architecture Research, a member of STARnet, a Semiconductor Research Corporation program sponsored by MARCO and DARPA, and ASPIRE Lab industrial sponsors and affiliates Intel, Google, HPE, Huawei, LGE, Nokia, NVIDIA, Oracle, and Samsung. Any opinions, findings, conclusions, or recommendations in this paper are solely those of the authors and do not necessarily reflect the position or the policy of the sponsors.


Christopher Celio, Palmer Dabbelt, David Patterson, Krste Asanović

Department of Electrical Engineering and Computer Sciences, University of California, Berkeley

celio@eecs.berkeley.edu

Abstract: This report makes the case that a well-designed Reduced Instruction Set Computer (RISC) can match, and even exceed, the performance and code density of existing commercial Complex Instruction Set Computers (CISC) while maintaining the simplicity and cost-effectiveness that underpin the original RISC goals [12].

We begin by comparing the dynamic instruction counts and dynamic instruction bytes fetched for the popular proprietary ARMv7, ARMv8, IA-32, and x86-64 Instruction Set Architectures (ISAs) against the free and open RISC-V RV64G and RV64GC ISAs when running the SPEC CINT2006 benchmark suite. RISC-V was designed as a very small ISA to support a wide range of implementations, and has a less mature compiler toolchain. However, we observe that on SPEC CINT2006 RV64G executes on average 16% more instructions than x86-64, 3% more instructions than IA-32, 9% more instructions than ARMv8, but 4% fewer instructions than ARMv7.

CISC x86 implementations break up complex instructions into smaller internal RISC-like micro-ops, and the RV64G instruction count is within 2% of the x86-64 retired micro-op count.

RV64GC, the compressed variant of RV64G, is the densest ISA studied, fetching 8% fewer dynamic instruction bytes than x86-64. We observed that much of the increased RISC-V instruction count is due to a small set of common multi-instruction idioms. Exploiting this fact, the RV64G and RV64GC effective instruction count can be reduced by 5.4% on average by leveraging macro-op fusion. Combining the compressed RISC-V ISA extension with macro-op fusion provides both the densest ISA and the fewest dynamic operations retired per program, reducing the motivation to add more instructions to the ISA. This approach retains a single simple ISA suitable for both low-end and high-end implementations, where high-end implementations can boost performance through microarchitectural techniques.

Compiler toolchains are a continual work in progress, and the results shown are a snapshot of the state as of July 2016 and are subject to change.

I. INTRODUCTION

The Instruction Set Architecture (ISA) specifies the set of instructions that a processor must understand and the expected effects of each instruction. One of the goals of the RISC-V project was to produce an ISA suitable for a wide range of implementations from tiny microcontrollers to the largest supercomputers [14]. Hence, RISC-V was designed with a much smaller number of simple standard instructions compared to other popular ISAs, including other RISC-inspired ISAs. A simple ISA is clearly a benefit for a small resource-constrained microcontroller, but how much performance is lost for high-performance implementations by not supporting the numerous instruction variants provided by popular proprietary ISAs?

A casual observer might argue that a processor's performance increases when it executes fewer instructions for a given program, but in reality, the performance is more accurately described by the Iron Law of Performance [8]:

$$\frac{\text{seconds}}{\text{program}} = \frac{\text{instructions}}{\text{program}} \times \frac{\text{cycles}}{\text{instruction}} \times \frac{\text{seconds}}{\text{cycle}}$$

The ISA is just an abstract boundary; behind the scenes the processor may choose to implement instructions in any number of ways that trade off cycles/instruction, or CPI, and seconds/cycle, which is set by the clock frequency.

For example, a fairly powerful x86 instruction is the repeat move instruction (rep movs), which copies C bytes of data from one memory location to another:

    // pseudo-code for a 'repeat move' instruction
    for (i=0; i < C; i++)
        d[i] = s[i];

Implementations of the x86 ISA break up the repeat move instruction into smaller operations, or micro-ops, that individually perform the required operations of loading the data from the old location, storing the data to the new location, incrementing the address pointers, and checking to see if the end condition has been met. Therefore, a raw comparison of instruction counts may hide a significant amount of the work and complexity required to execute a particular benchmark.
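To make the point about raw instruction counts concrete, the Iron Law of Performance can be evaluated for two hypothetical machines. This is only an illustrative sketch: the function name `execution_time` and every number below are assumptions chosen for the example, not measurements from this report.

```python
def execution_time(instructions, cycles_per_instruction, seconds_per_cycle):
    """Iron Law: seconds/program = instructions/program
    * cycles/instruction * seconds/cycle."""
    return instructions * cycles_per_instruction * seconds_per_cycle


# Hypothetical machine A: fewer, more complex instructions with a higher CPI.
time_a = execution_time(instructions=1.00e9,
                        cycles_per_instruction=1.30,
                        seconds_per_cycle=0.5e-9)

# Hypothetical machine B: 16% more instructions, lower CPI, same clock.
time_b = execution_time(instructions=1.16e9,
                        cycles_per_instruction=1.00,
                        seconds_per_cycle=0.5e-9)

print(f"A: {time_a:.3f} s, B: {time_b:.3f} s")
```

Under these assumed numbers, the machine that executes more instructions still finishes first, which is why instruction count alone is not a reliable proxy for performance.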

In contrast to the process of generating many micro-ops from a single ISA instruction, several commercial microprocessors perform macro-op fusion, where several ISA instructions are fused in the decode stage and handled as one internal operation. As an example, compare-and-branch is a very commonly executed idiom, and the RISC-V ISA includes a full register-register magnitude comparison in its branch instructions. However, both ARM and x86 typically require two ISA instructions to specify a compare-and-branch. The first instruction performs the comparison and sets a condition code, and the second instruction performs the jump-on-condition-code. While it would seem that ARM and x86 would have a penalty of one additional instruction on nearly every loop compared to RISC-V, the reality is more complicated. Both ARM and Intel employ the technique of macro-op fusion, in which the processor front-end detects these two-instruction compare-and-branch sequences in the instruction stream and "fuses" them together into a single macro-op, which can then be handled as a single compare-and-branch instruction by the processor back-end to reduce the effective dynamic instruction count.
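The front-end behavior described above can be sketched as a pass over a decoded instruction stream that pairs a compare with the conditional branch that immediately follows it. This is a simplified software model, not how any particular processor implements fusion; the `Inst` type, the opcode names, and `fuse_compare_and_branch` are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Inst:
    opcode: str
    operands: tuple = ()


# Hypothetical opcode names for an ISA with separate compare and branch.
COMPARE_OPS = {"cmp"}
BRANCH_OPS = {"beq", "bne", "blt", "bge"}


def fuse_compare_and_branch(stream):
    """Group instructions into macro-ops (tuples of one or two Insts).

    A compare immediately followed by a conditional branch becomes one
    two-instruction macro-op; everything else passes through unfused.
    """
    macro_ops = []
    i = 0
    while i < len(stream):
        cur = stream[i]
        nxt = stream[i + 1] if i + 1 < len(stream) else None
        if (cur.opcode in COMPARE_OPS and nxt is not None
                and nxt.opcode in BRANCH_OPS):
            macro_ops.append((cur, nxt))
            i += 2
        else:
            macro_ops.append((cur,))
            i += 1
    return macro_ops


loop_body = [
    Inst("load", ("r1", "r2")),
    Inst("add", ("r3", "r3", "r1")),
    Inst("cmp", ("r2", "r4")),
    Inst("bne", ("loop",)),
]
fused = fuse_compare_and_branch(loop_body)
print(len(loop_body), "ISA instructions ->", len(fused), "macro-ops")
```

The effective instruction count is the number of macro-ops, so each fused pair lowers it by one without any change to the ISA or the binary.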

Footnote: The reality can be even more complicated. Depending on the micro-architecture, the front-end may fuse the two instructions together to save decode, allocation, and commit bandwidth, but break them apart in the execution pipeline for critical path or complexity reasons [6].

Macro-op fusion is a very powerful technique to lower the effective instruction count. One of the main contributions of this report is to show that macro-op fusion, in combination with the existing compressed instruction set extensions for RISC-V, can provide the effect of a richer instruction set for RISC-V without requiring any ISA extensions, thus enabling support for both low-end implementations and high-end implementations from a single simple common code base. The resulting ISA design can provide both a low number of effective instructions executed and a low number of dynamic instruction bytes fetched.

II. METHODOLOGY

In this section, we describe the benchmark suite and methodology used to obtain dynamic instruction counts, dynamic instruction bytes, and effective instructions executed for the ISAs under consideration.

A. SPEC CINT2006

We used the SPEC CINT2006 benchmark suite [9] for comparing the different ISAs. SPECInt2006 is composed of 35 different workloads across 12 different benchmarks with a focus on desktop and workstation-class applications such as compilation, simulation, decoding, and artificial intelligence. These applications are largely CPU-intensive with working sets of tens of megabytes and a required total memory usage of less than 2 GB.

B. GCC Compiler

We used GCC for all targets as it is widely used and the only compiler available for all systems. Vendor-specific compilers will surely provide different results, but we did not analyze them here. All benchmarks were compiled using the latest GNU gcc 5.3 with the parameters shown in Table I. The 400.perlbench benchmark requires specifying -std=gnu98 to compile under gcc 5.3. We used the Speckle suite to compile and execute SPECInt using reference inputs to completion [2]. The benchmarks were compiled statically to make it easier to analyze the binaries. Unless otherwise specified, data was collected using the perf utility [1] while running the benchmarks on native hardware.

C. RISC-V RV64

The RISC-V ISA is a free and open ISA produced by the University of California, Berkeley and first released in 2010 [3]. For this report, we will use the standard RISC-V RV64G ISA variant, which contains all ISA extensions for executing 64-bit "general-purpose" code [14]. We will also explore the "C" Standard Extension for Compressed Instructions (RVC). All instructions in RV64G are 4 bytes in size; however, the C extension adds 2-byte forms of the most common instructions. The resulting RV64GC ISA is very dense, both statically and dynamically [13].

We cross-compiled RV64G and RV64GC benchmarks using the compiler settings shown in Table I. The RV64GC benchmarks were built using a compressed glibc library. The benchmarks were then executed using the spike ISA simulator running on top of Linux version 3.14, which was compiled against version 1.7 of the RISC-V privileged ISA. A side-channel process grabbed the retired instruction count at the beginning and end of each workload. We did not analyze RV32G, as there does not yet exist an RV32 port of the Linux operating system.

For the 483.xalancbmk benchmark, 34% of the RISC-V instruction count is taken up by an OS kernel spin-loop waiting on the test-harness I/O. These instructions are an artifact of our testing infrastructure and were removed from any further analysis.

D. ARMv7

The 32-bit ARMv7 benchmarks were compiled and executed on a Samsung Exynos 5250 (Cortex-A15). The -march=native flag resolves to the ARMv7ve ISA and the -mtune=native flag resolves to the cortex-a15 processor.

E. ARMv8

The 64-bit ARMv8 benchmarks were compiled and executed on a Snapdragon 410c (Cortex-A53). The -march flag was set to the ARMv8-a ISA and the -mtune flag was set to the cortex-a53 processor. The errata flags -mfix-cortex-a53-835769 and -mfix-cortex-a53-843419 are set. The 1 GB of RAM on the 410c board is not sufficient to run some of the workloads from 401.bzip2, 403.gcc, and 429.mcf. To manage this issue, we used a swapfile to provide access to a larger pool of memory and only measured user-level instruction counts for the problematic workloads.

F. IA-32

The IA-32 benchmarks targeted the i686 architecture and were compiled and executed on an Intel Xeon E5-2667v2 (Ivy Bridge).

G. x86-64

The x86-64 benchmarks were compiled and executed on an Intel Xeon E5-2667v2 (Ivy Bridge). The -march flag resolves to the ivybridge ISA.

H. Instruction Count Histogram Collection

Histograms of the instruction counts for RV64G, RV64GC, and x86-64 were collected, allowing us to more easily compare the hot loops across ISAs. We were also able to compute the dynamic instruction bytes by cross-referencing the histogram data with the static objdump data. x86-64 histograms were collected by writing a histogram-building tool for the Intel Pin dynamic binary translation tool [11]. Histograms for RV64G and RV64GC were collected using an existing histogram tool built into the RISC-V spike ISA simulator.
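The cross-referencing step can be sketched as a weighted sum: each instruction address contributes its execution count times its encoded size. The sketch below assumes the histogram and the objdump output have already been parsed into dictionaries keyed by address; the function name, the dictionary layout, and all data values are hypothetical.

```python
def dynamic_instruction_bytes(exec_counts, inst_sizes):
    """Sum (execution count * encoded size) over every executed address."""
    return sum(count * inst_sizes[pc] for pc, count in exec_counts.items())


# Hypothetical execution histogram: address -> times executed.
exec_counts = {0x1000: 500, 0x1002: 500, 0x1006: 499, 0x100A: 1}

# Hypothetical static sizes from disassembly: address -> bytes.
inst_sizes = {0x1000: 2, 0x1002: 4, 0x1006: 4, 0x100A: 2}

total_insts = sum(exec_counts.values())
total_bytes = dynamic_instruction_bytes(exec_counts, inst_sizes)
print(total_insts, "instructions,", total_bytes, "bytes,",
      round(total_bytes / total_insts, 2), "bytes/instruction")
```

Dividing the byte total by the instruction total gives the average dynamic instruction size, the same kind of bytes-per-instruction figure quoted in the results.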

TABLE I: Compiler options for gcc 5.3.

ISA       compiler                        flags
RV64G     riscv64-unknown-gnu-linux-g++   -O3 -static
RV64GC    riscv64-unknown-gnu-linux-g++   -O3 -mrvc -mno-save-restore -static
IA-32     g++-5                           -O3 -m32 -march=ivybridge -mtune=native -static
x86-64    g++-5                           -O3 -march=ivybridge -mtune=native -static
ARMv7ve   g++                             -O3 -march=armv7ve -mtune=cortex-a15 -static
ARMv8-a   g++-5                           -O3 -march=armv8-a -mtune=cortex-a53 -static -mfix-cortex-a53-835769 -mfix-cortex-a53-843419

I. SIMD ISA Extensions

Although a vector extension is planned for RISC-V, there is no existing vector facility. To compare against the scalar RV64G ISA, we verified that the ARM and x86 code was compiled in a manner that generally avoided generating any SIMD or vector instructions for the SPECInt2006 benchmarks.

An analysis of the x86-64 histograms showed that, with the exception of the memset routine in 403.gcc and a strcmp routine in 471.omnetpp, no generated SSE instructions appeared among the 80% most executed instructions.

To further reinforce this conclusion, we built a gcc and glibc x86-64 toolchain that explicitly forbade MMX and AVX extensions. Vectorization analysis was also disabled. The resulting instruction counts for SPECInt2006 were virtually unchanged.

Although the MMX and AVX extensions may be disabled in gcc, it is not possible to disable SSE instruction generation as it is a mandatory part of the x86-64 floating point ABI. However, we note that the only significant usage of SSE instructions was 128-bit SSE stores found in the memset routine in 403.gcc (≈20%) and a very small usage (<2%) of packed SIMD found in strcmp in 471.omnetpp.

III. RESULTS

All comparisons between ISAs in this report are based on the geometric mean across the 12 SPECInt2006 benchmarks.
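As a quick illustration of how such a summary figure is formed, the sketch below computes a geometric mean from the twelve per-benchmark RV64GC values listed in Table III (dynamic bytes normalized to x86-64). The helper name `geometric_mean` is an assumption of this example; the inputs are the rounded table entries, so the result can only be expected to agree with the reported geomean to about two decimal places.

```python
import math


def geometric_mean(values):
    """Geometric mean computed through logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# RV64GC column of Table III, 400.perlbench through 483.xalancbmk.
rv64gc_bytes = [0.92, 1.06, 1.03, 0.83, 0.87, 0.90,
                0.92, 0.82, 1.28, 0.79, 0.82, 0.91]

geomean = geometric_mean(rv64gc_bytes)
arithmetic = sum(rv64gc_bytes) / len(rv64gc_bytes)
print(f"geometric mean: {geomean:.2f}, arithmetic mean: {arithmetic:.2f}")
```

The geometric mean is the usual choice for normalized ratios because it is less sensitive to a single outlier benchmark than the arithmetic mean.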

A. Instruction Counts

As shown in Figure 1 (and Table II), RV64G executes 16% more instructions than x86-64, 3% more instructions than IA-32, 9% more instructions than ARMv8, and 4% fewer instructions than ARMv7. The raw instruction counts can be found in Figure VI.

TABLE II: Total dynamic instructions normalized to x86-64.

benchmark        x86-64      x86-64   IA-32   ARMv7   ARMv8   RV64G   RV64GC
                 micro-ops                                            + fusion
400.perlbench    1.13        1.00     1.04    1.16    1.07    1.17    1.14
401.bzip2        1.12        1.00     1.05    1.04    1.03    1.33    1.08
403.gcc          1.19        1.00     1.03    1.29    0.97    1.36    1.34
429.mcf          1.04        1.00     1.07    1.19    1.02    0.94    0.93
445.gobmk        1.11        1.00     1.00    1.19    1.10    1.18    1.11
456.hmmer        1.47        1.00     1.19    1.45    1.21    1.16    --
458.sjeng        1.07        1.00     1.06    1.22    1.12    1.29    1.16
462.libquantum   0.88        1.00     1.62    1.38    0.95    0.83    --
464.h264ref      1.47        1.00     1.03    1.17    1.14    1.64    1.46
471.omnetpp      1.24        1.00     1.20    1.08    0.98    1.05    1.03
473.astar        1.04        1.00     1.11    1.17    1.05    0.99    0.89
483.xalancbmk    1.07        1.00     1.10    1.18    1.05    1.15    1.14
geomean          1.14        1.00     1.12    1.21    1.06    1.16    1.09

(--: value not present in the extracted table text.)

B. Micro-op Counts

The number of x86-64 retired micro-ops was also collected and is reported in Figure 1. On average, the Intel Ivy Bridge processor used in this study emitted 1.14 micro-ops per x86-64 instruction, which puts the RV64G instruction count within 2% of the x86-64 retired micro-op count.
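The relative figures quoted in this section follow from simple ratios of the geometric means. The sketch below recomputes them from the rounded geomean row of Table II; the helper `percent_more` is a name introduced here for illustration only.

```python
# Geomean dynamic instruction counts from Table II, normalized to x86-64.
geomean = {
    "x86-64 micro-ops": 1.14,
    "x86-64": 1.00,
    "IA-32": 1.12,
    "ARMv7": 1.21,
    "ARMv8": 1.06,
    "RV64G": 1.16,
}


def percent_more(a, b):
    """Percent by which a exceeds b (negative if a is smaller)."""
    return (a / b - 1.0) * 100.0


relative = {
    isa: percent_more(geomean["RV64G"], count)
    for isa, count in geomean.items()
    if isa != "RV64G"
}
for isa, pct in relative.items():
    print(f"RV64G vs {isa}: {pct:+.1f}%")
```

Because the table entries are rounded to two decimals, a recomputed percentage can differ from the quoted one by up to about a point. For example, the IA-32 ratio comes out nearer 3.6% than the 3% stated in the text, which was presumably derived from unrounded counts.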

C. Dynamic Instruction Bytes

TABLE III: Total dynamic bytes normalized to x86-64.

benchmark        x86-64   ARMv7   ARMv8   RV64G   RV64GC
400.perlbench    1.00     1.21    1.11    1.22    0.92
401.bzip2        1.00     1.07    --      1.38    1.06
403.gcc          1.00     1.40    1.05    1.47    1.03
429.mcf          1.00     1.40    1.20    1.11    0.83
445.gobmk        1.00     1.18    1.09    1.17    0.87
456.hmmer        1.00     1.41    1.18    1.13    0.90
458.sjeng        1.00     1.19    1.09    1.25    0.92
462.libquantum   1.00     1.90    1.30    1.14    0.82
464.h264ref      1.00     1.14    1.12    1.61    1.28
471.omnetpp      1.00     1.17    1.06    1.13    0.79
473.astar        1.00     1.22    1.10    1.03    0.82
483.xalancbmk    1.00     1.28    1.14    1.24    0.91
geomean          1.00     1.28    1.12    1.23    0.92

(--: value not present in the extracted table text; the 401.bzip2 row lists only one ARM value, shown here under ARMv7.)

The total dynamic instruction bytes fetched is reported in Figure 2 (and Table III). RV64G, with its fixed 4-byte instruction size, fetches 23% more bytes per program than x86-64. Unexpectedly, x86-64 is not very dense, averaging 3.71 bytes per instruction (with a standard deviation of 0.34 bytes). Like RV64G, both ARMv7 and ARMv8 use a fixed 4-byte instruction size.

Using the RISC-V "C" Compressed ISA extension, RV64GC fetches 8% fewer dynamic instruction bytes relative to x86-64, with an average of 3.00 bytes per instruction. There are only three benchmarks (401.bzip2, 403.gcc, 464.h264ref) where RV64GC fetches more dynamic bytes than x86-64, and two of those three benchmarks make heavy use of memset and memcpy. RV64GC also fetches considerably fewer bytes than either ARMv7 or ARMv8.
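One way to read the 3.00 bytes-per-instruction average: since RV64GC instructions are either 2 or 4 bytes, the average size fixes the share of executed instructions that use the 2-byte form. The sketch below works this out; both function names are hypothetical, and the calculation assumes no other instruction lengths occur.

```python
def compressed_fraction(avg_bytes_per_inst, short_size=2, long_size=4):
    """Fraction of dynamic instructions that must be short-form to reach
    the given average size, assuming only two instruction lengths."""
    return (long_size - avg_bytes_per_inst) / (long_size - short_size)


def average_size(fraction_short, short_size=2, long_size=4):
    """Average instruction size for a given short-form fraction."""
    return fraction_short * short_size + (1.0 - fraction_short) * long_size


frac = compressed_fraction(3.00)
print(f"{frac:.0%} of executed instructions are 2-byte forms")
```

Under that assumption, an average of 3.00 bytes corresponds to half of the dynamically executed instructions being compressed.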

IV. DISCUSSION

We discuss briefly the three outliers where RISC-V performs poorly, as well as general trends observed across all of the
