
computer architecture


Chapter 5 of Computer Architecture; a good book.


Solutions to Case Studies and Exercises

Chapter 5 Solutions

Case Study 1: Single-Chip Multicore Multiprocessor

5.1

a. P0: read 120 → P0.B0: (S, 120, 0020) returns 0020

b. P0: write 120 ← 80 → P0.B0: (M, 120, 0080); P3.B0: (I, 120, 0020)

c. P3: write 120 ← 80 → P3.B0: (M, 120, 0080)

d. P1: read 110 → P1.B2: (S, 110, 0010) returns 0010

e. P0: write 108 ← 48 → P0.B1: (M, 108, 0048); P3.B1: (I, 108, 0008)

f. P0: write 130 ← 78 → P0.B2: (M, 130, 0078); M: 110 ← 0030 (writeback to memory)

g. P3: write 130 ← 78 → P3.B2: (M, 130, 0078)

5.2

P0: read 120, Read miss, satisfied by memory

P0: read 128, Read miss, satisfied by P1’s cache

P0: read 130, Read miss, satisfied by memory, writeback 110

Implementation 1: 100 + 40 + 10 + 100 + 10 = 260 stall cycles

Implementation 2: 100 + 130 + 10 + 100 + 10 = 350 stall cycles

P0: read 100, Read miss, satisfied by memory

P0: write 108 ← 48, Write hit, sends invalidate

P0: write 130 ← 78, Write miss, satisfied by memory, write back 110

Implementation 1: 100 + 15 + 10 + 100 = 225 stall cycles

Implementation 2: 100 + 15 + 10 + 100 = 225 stall cycles

P1: read 120, Read miss, satisfied by memory

P1: read 128, Read hit

P1: read 130, Read miss, satisfied by memory

Implementation 1: 100 + 0 + 100 = 200 stall cycles

Implementation 2: 100 + 0 + 100 = 200 stall cycles

P1: read 100, Read miss, satisfied by memory

P1: write 108 ← 48, Write miss, satisfied by memory, write back 128

P1: write 130 ← 78, Write miss, satisfied by memory

Implementation 1: 100 + 100 + 10 + 100 = 310 stall cycles

Implementation 2: 100 + 100 + 10 + 100 = 310 stall cycles
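The totals in 5.2 can be reproduced by summing per-event latencies. The sketch below is only an illustration: the event names and the `stall_cycles` helper are hypothetical, the latencies are read off the sums above, and labelling the recurring 10-cycle term as a writeback is an assumption.

```python
# Per-event latencies in cycles, inferred from the sums in the text.
# Naming the 10-cycle term "writeback" is an assumption.
LATENCY = {
    1: {"memory": 100, "cache": 40, "writeback": 10, "invalidate": 15},
    2: {"memory": 100, "cache": 130, "writeback": 10, "invalidate": 15},
}


def stall_cycles(events, implementation):
    """Sum the stall cycles for a list of event names (hypothetical helper)."""
    costs = LATENCY[implementation]
    return sum(costs[event] for event in events)


# read 120 (memory); read 128 (P1's cache plus the 10-cycle term);
# read 130 (memory, with 110 written back).
reads_p0 = ["memory", "cache", "writeback", "memory", "writeback"]

# read 100 (memory); write 108 (invalidate); write 130 (write back 110, memory).
writes_p0 = ["memory", "invalidate", "writeback", "memory"]

for impl in (1, 2):
    print(impl, stall_cycles(reads_p0, impl), stall_cycles(writes_p0, impl))
```

Only the cache-to-cache latency differs between the two implementations, which is why the sequences that never hit another cache cost the same in both.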

5.3

See Figure S.28


[Figure S.28: Protocol diagram. States: Invalid, Shared, Modified, Owned. Recoverable transition labels: CPU read (place read miss on bus); CPU write (place write miss on bus); CPU write (place invalidate on bus); CPU read hit; CPU write hit; read miss (writeback block; abort memory access); write miss for this block (writeback block; abort memory access); write miss or invalidate for this block.]

5.4

(Showing results for implementation 1)

P1: read 110, Read miss, P0’s cache

P3: read 110, Read miss, MSI satisfies in memory, MOSI satisfies in P0’s cache

P0: read 110, Read hit

MSI: 40 + 10 + 100 + 0 = 150 stall cycles

MOSI: 40 + 10 + 40 + 10 + 0 = 100 stall cycles
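The gap between the two totals above comes from a single access: the second read miss goes to memory under MSI but is served by P0's cache under MOSI. A minimal sketch, assuming the implementation 1 latencies used in the sums (the function name and constants are illustrative, not from the source):

```python
MEMORY = 100        # read miss satisfied by memory
CACHE = 40          # read miss satisfied by another cache
EXTRA = 10          # 10-cycle term that accompanies each cache-supplied miss


def read_110_sequence(protocol):
    """Stall cycles for P1: read 110, P3: read 110, P0: read 110."""
    p1 = CACHE + EXTRA                       # served by P0's cache in both protocols
    if protocol == "MOSI":
        p3 = CACHE + EXTRA                   # P0 still supplies the block
    else:
        p3 = MEMORY                          # MSI falls back to memory
    p0 = 0                                   # read hit
    return p1 + p3 + p0


print(read_110_sequence("MSI"), read_110_sequence("MOSI"))
```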

P1: read 120, Read miss, satisfied in memory

P3: read 120, Read hit

P0: read 120, Read miss, satisfied in memory

Both protocols: 100 + 0 + 100 = 200 stall cycles

P0: write 120 ← 80, Write miss, invalidates P3

P3: read 120, Read miss, P0’s cache

P0: read 120, Read hit

Both protocols: 100 + 40 + 10 + 0 = 150 stall cycles

P0: write 108 ← 88, Send invalidate, invalidate P3

P3: read 108, Read miss, P0’s cache

P0: write 108 ← 98, Send invalidate, invalidate P3

Both protocols: 15 + 40 + 10 + 15 = 80 stall cycles

5.5

See Figure S.29

[Figure S.29: Diagram for a MESI protocol. States: Invalid, Shared, Modified, Excl. Recoverable transition labels: CPU read, other shared block (place read miss on bus); CPU write (place write miss on bus); CPU read hit; CPU write hit; read miss; write miss for this block (writeback block; abort memory access); write miss or invalidate for this block.]

5.6

p0: read 100, Read miss, satisfied in memory, no sharers MSI: S, MESI: E

p0: write 100 ← 40, MSI: send invalidate, MESI: silent transition from E to M

MSI: 100 + 15 = 115 stall cycles

MESI: 100 + 0 = 100 stall cycles

p0: read 120, Read miss, satisfied in memory, sharers both to S

p0: write 120 ← 60, Both send invalidates

Both: 100 + 15 = 115 stall cycles

p0: read 100, Read miss, satisfied in memory, no sharers MSI: S, MESI: E

p0: read 120, Read miss, memory, silently replace 120 from S or E

Both: 100 + 100 = 200 stall cycles, silent replacement from E


p0: read 100, Read miss, satisfied in memory, no sharers MSI: S, MESI: E

p1: write 100 ← 60, Write miss, satisfied in memory regardless of protocol

Both: 100 + 100 = 200 stall cycles, don’t supply data in E state (some protocols do)

p0: read 100, Read miss, satisfied in memory, no sharers MSI: S, MESI: E

p0: write 100 ← 60, MSI: send invalidate, MESI: silent transition from E to M

p1: write 100 ← 40, Write miss, P0’s cache, writeback data to memory

MSI: 100 + 15 + 40 + 10 = 165 stall cycles

MESI: 100 + 0 + 40 + 10 = 150 stall cycles
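The MSI and MESI totals in 5.6 differ only when a block is read with no other sharers and then written: MESI loads it in E and upgrades silently, while MSI loads it in S and must send an invalidate. A small sketch of that rule, using the latencies that appear in the sums (the function name and the `other_sharers` flag are illustrative assumptions):

```python
MEMORY = 100      # read miss satisfied by memory
INVALIDATE = 15   # upgrade from S by sending an invalidate


def read_then_write(protocol, other_sharers):
    """Stall cycles for a read miss from memory followed by a write to the same block."""
    read = MEMORY
    if protocol == "MESI" and not other_sharers:
        write = 0            # block arrived in E; E -> M is a silent transition
    else:
        write = INVALIDATE   # block is in S; the write must invalidate other copies
    return read + write


print(read_then_write("MSI", other_sharers=False))   # read 100 then write 100
print(read_then_write("MESI", other_sharers=False))
print(read_then_write("MESI", other_sharers=True))   # read 120 then write 120
```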

5.7

Assume the processors acquire the lock in order. P0 will acquire it first, incurring 100 stall cycles to retrieve the block from memory. P1 and P3 will stall until P0’s critical section ends (ping-ponging the block back and forth) 1000 cycles later. P0 will stall for (about) 40 cycles while it fetches the block to invalidate it; then P1 takes 40 cycles to acquire it. P1’s critical section is 1000 cycles, plus 40 to handle the write miss at release. Finally, P3 grabs the block for a final 40 cycles of stall. So, P0 stalls for 100 cycles to acquire, 10 to give it to P1, 40 to release the lock, and a final 10 to hand it off to P1, for a total of 160 stall cycles. P1 essentially stalls until P0 releases the lock, which will be 100 + 1000 + 10 + 40 = 1150 cycles, plus 40 to get the lock, 10 to give it to P3, 40 to get it back to release the lock, and a final 10 to hand it back to P3. This is a total of 1250 stall cycles. P3 stalls until P1 hands off the released lock, which will be 1150 + 40 + 10 + 1000 + 40 = 2240 cycles. Finally, P3 gets the lock 40 cycles later, so it stalls a total of 2280 cycles.
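The per-processor totals in this answer are plain sums of the latencies it names. The sketch below re-derives them; the constant names are illustrative and the values are taken directly from the paragraph.

```python
MEMORY = 100      # first acquire, block fetched from memory
CACHE = 40        # fetching the block from another cache
HANDOFF = 10      # handing the block to another processor
CRITICAL = 1000   # length of one critical section

# P0: acquire, give the block away, fetch it back to release, hand it off again.
p0 = MEMORY + HANDOFF + CACHE + HANDOFF

# P1 waits for P0's acquire, critical section, and release ...
p1_wait = MEMORY + CRITICAL + HANDOFF + CACHE
# ... then acquires, gives the block away, fetches it back to release, hands it off.
p1 = p1_wait + CACHE + HANDOFF + CACHE + HANDOFF

# P3 waits for P1's acquire, critical section, and release, then acquires.
p3_wait = p1_wait + CACHE + HANDOFF + CRITICAL + CACHE
p3 = p3_wait + CACHE

print(p0, p1, p3)
```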

The optimized spin lock will have many fewer stall cycles than the regular spin lock because it spends most of the critical section sitting in a spin loop (which, while useless, is not defined as a stall cycle). Using the analysis below for the interconnect transactions, the stall cycles will be 3 read memory misses (300), 1 upgrade (15) and 1 write miss to a cache (40 + 10) and 1 write miss to memory (100), 1 read cache miss to cache (40 + 10), 1 write miss to memory (100), 1 read miss to cache and 1 read miss to memory (40 + 10 + 100), followed by an upgrade (15) and a write miss to cache (40 + 10), and finally a write miss to cache (40 + 10) followed by a read miss to cache (40 + 10) and an upgrade (15). So approximately 945 cycles total.
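The 945-cycle figure can be checked by listing the events in the order the paragraph gives them. This is only a tally of the stated terms; the labels and variable names are illustrative.

```python
MEMORY = 100          # miss satisfied by memory
TO_CACHE = 40 + 10    # miss satisfied by another cache
UPGRADE = 15          # upgrade of a shared block

events = [
    ("3 read misses to memory", 3 * MEMORY),
    ("upgrade", UPGRADE),
    ("write miss to cache", TO_CACHE),
    ("write miss to memory", MEMORY),
    ("read miss to cache", TO_CACHE),
    ("write miss to memory", MEMORY),
    ("read miss to cache", TO_CACHE),
    ("read miss to memory", MEMORY),
    ("upgrade", UPGRADE),
    ("write miss to cache", TO_CACHE),
    ("write miss to cache", TO_CACHE),
    ("read miss to cache", TO_CACHE),
    ("upgrade", UPGRADE),
]

total = sum(cycles for _, cycles in events)
print(total)
```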

Approximately 31 interconnect transactions. The first processor to win arbitration for the interconnect gets the block on its first try (1); the other two ping-pong the block back and forth during the critical section. Because the latency is 40 cycles, this will occur about 25 times (25). The first processor does a write to release the lock, causing another bus transaction (1), and the second processor does a transaction to perform its test and set (1). The last processor gets the block (1) and spins on it until the second processor releases it (1). Finally the last processor grabs the block (1).
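The count of 31 follows from the 40-cycle transfer latency and the critical-section length. The sketch below assumes the 1000-cycle critical section used in the stall-cycle answer earlier in 5.7; the variable names are illustrative.

```python
CRITICAL_SECTION = 1000   # cycles, taken from the stall-cycle analysis in 5.7
TRANSFER_LATENCY = 40     # cycles per cache-to-cache transfer

# The two waiting processors trade the block once per transfer latency.
ping_pongs = CRITICAL_SECTION // TRANSFER_LATENCY

transactions = [
    1,            # first processor gets the block
    ping_pongs,   # ping-ponging during the first critical section
    1,            # first processor writes to release the lock
    1,            # second processor performs its test and set
    1,            # last processor gets the block
    1,            # second processor releases the lock
    1,            # last processor grabs the block
]

print(ping_pongs, sum(transactions))
```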


Approximately 15 interconnect transactions. Assume processors acquire the lock in order. All three processors do a test, causing a read miss, then a test and set, causing the first processor to upgrade and the other two to write miss (6). The losers sit in the test loop, and one of them needs to get back a shared block first (1). When the first processor releases the lock, it takes a write miss (1) and then the two losers take read misses (2). Both have their test succeed, so the new winner does an upgrade and the new loser takes a write miss (2). The loser spins on an exclusive block until the winner releases the lock (1). The loser first tests the block (1) and then test-and-sets it, which requires an upgrade (1).

5.8

Latencies in implementation 1 of Figure 5.36 are used.

P0: write 110

P0: read 108

Hit in P0’s cache, no stall cycles for either TSO or SC

P0: write 100, Miss, TSO satisfies write in write buffer (0 stall cycles); SC must wait until it receives the data (100 stall cycles)

P0: read 108, Hit, but must wait for preceding operation: TSO = 0, SC = 100

P0: write 110, Hit in P0’s cache, no stall cycles for either TSO or SC

P0: write 100, Miss, TSO satisfies write in write buffer (0 stall cycles); SC must wait until it receives the data (100 stall cycles)

P0: write 100, Miss, TSO satisfies write in write buffer (0 stall cycles); SC must wait until it receives the data (100 stall cycles)

P0: write 110, Hit, but must wait for preceding operation: TSO = 0, SC = 100

Case Study 2: Simple Directory-Based Coherence

5.9

a. P0,0: read 100: L1 hit returns 0x0010, state unchanged (M)

b. P0,0: read 128: L1 miss and L2 miss will replace B1 in L1 and B1 in L2, which has address 108. L1 will have 128 in B1 (shared); L2 also will have it (DS, P0,0). Memory directory entry for 108 will become <DS, C1>. Memory directory entry for 128 will become <DS, C0>.

c, d, …, h: follow same approach
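Addresses 128 and 108 both land in block B1, which is why the read of 128 evicts 108. The index function below is a guess inferred from the entries in 5.1 (120 in B0, 108 in B1, 110 and 130 in B2): it assumes hexadecimal addresses, 8-byte blocks, and four direct-mapped blocks, none of which is stated in this excerpt. `block_index` is a hypothetical helper.

```python
BLOCK_BYTES = 8   # assumed block size
NUM_BLOCKS = 4    # assumed number of direct-mapped blocks (B0..B3)


def block_index(addr_hex):
    """Map a hexadecimal address string to a block number under the assumptions above."""
    addr = int(addr_hex, 16)
    return (addr // BLOCK_BYTES) % NUM_BLOCKS


for addr in ("100", "108", "110", "120", "128", "130"):
    print(addr, "-> B%d" % block_index(addr))
```

Under these assumptions 100 and 120 also share B0, which matches the replacement described in 5.6 when p0 reads 100 and then 120.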
