From 41e2b8f9d359d5a0be77c548920355bd3796fb39 Mon Sep 17 00:00:00 2001
From: Luke Kenneth Casson Leighton <lkcl@lkcl.net>
Date: Tue, 4 Dec 2018 13:48:40 +0000
Subject: [PATCH] reword

---
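Note for reviewers: the second hunk says a predicated-out element never
enters the instruction queue, so a following instruction can use that slot
in the same cycle. A minimal sketch of that behaviour is given below. It is
an illustration only, not the project's implementation: the names `expand`
and `fill_queue`, the four-slot queue width and the `(op, VL, predicate)`
instruction tuples are all assumptions made for this example.

```python
from typing import List, Tuple

# Hypothetical types for illustration: an element op is (operation, element index),
# a vector instruction is (operation, vector length, predicate bitmask).
ElementOp = Tuple[str, int]
VectorInstr = Tuple[str, int, int]


def expand(op: str, vl: int, predicate: int) -> List[ElementOp]:
    """Expand one vector instruction into element ops.

    Elements whose predicate bit is clear produce no element op at all,
    so they never occupy a queue slot.
    """
    return [(op, i) for i in range(vl) if (predicate >> i) & 1]


def fill_queue(program: List[VectorInstr], slots: int) -> List[ElementOp]:
    """Fill one cycle's worth of queue slots, in program order."""
    queue: List[ElementOp] = []
    for op, vl, predicate in program:
        for element_op in expand(op, vl, predicate):
            if len(queue) == slots:
                return queue  # queue is full for this cycle
            queue.append(element_op)
    return queue


# "add" has only elements 0 and 2 enabled; "mul" has all four enabled.
program = [("add", 4, 0b0101), ("mul", 4, 0b1111)]
queue = fill_queue(program, slots=4)
print(queue)
```

With the assumed four slots, the two masked-off "add" elements leave two
slots free, and the first two "mul" elements take them in the same cycle,
even though "mul" is a different operation.
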
 updates/003_2018dec04_microarchitecture.mdwn | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/updates/003_2018dec04_microarchitecture.mdwn b/updates/003_2018dec04_microarchitecture.mdwn
index 2d6af92..55bae1d 100644
--- a/updates/003_2018dec04_microarchitecture.mdwn
+++ b/updates/003_2018dec04_microarchitecture.mdwn
@@ -29,11 +29,12 @@ another uses register 10, *both* of them could actually be "redirected"
 to use register 112, for example.  One of those could even be changed
 to 32-bit operations whilst the other is set to 16-bit element widths.
 
-Our initial thoughts were to try a standard simple in-order SIMD architecture,
+Our initial thoughts advocated a standard simple in-order SIMD architecture,
 with predication bits passed down into the SIMD ALUs.  If a bit is "off",
 that "lane" within the ALU does not calculate a result, saving power.
-However, a pre-analysis engine is required that re-orders the registers,
-packs lanes of data together so that it fits into one SIMD ALU, and, on
+However, in SV, when the element width is set to 32, 16 or 8-bit, a
+pre-issue engine is required that re-orders *parts* of the registers,
+packing lanes of data together so that they fit into one SIMD ALU, and, on
 exit from the ALU, it may be necessary to split and "redirect" parts of the
 data to *multiple* actual 64-bit registers.  In other words, bit-level
 (or byte-level) manipulation is required, both pre- and post- ALU.
@@ -48,7 +49,9 @@ different paradigm from standard vector processors, where a loop allocates
 elements to "lanes", and if a predication bit is not set, the lane
 runs "empty".  By contrast, with the multi-issue execution model, an
 operation that is predicated out means that the element-based instruction
-does not even make it into the instruction queue.  Thus, unlike in a
+does not even make it into the instruction queue, leaving that slot free for
+use by following instructions, even in the same cycle, and even if the
+operation is totally different.  Thus, unlike in a
 traditional vectore architecture, ALUs may be occupied by elements from 
 other "Lanes", because of the pre-existing decoupling between the multi-issue
 instruction queue and the ALUs.
-- 
2.30.2