reword

diff --git a/updates/003_2018dec04_microarchitecture.mdwn b/updates/003_2018dec04_microarchitecture.mdwn

index 2d6af927aa01f81540e8f0cba6ec55cc4a5b39df..88093b9539fc184391119edc96dffd0336818bf2 100644
--- a/updates/003_2018dec04_microarchitecture.mdwn
+++ b/updates/003_2018dec04_microarchitecture.mdwn
@@ -29,11 +29,12 @@ another uses register 10, *both* of them could actually be "redirected"
  to use register 112, for example.  One of those could even be changed
  to 32-bit operations whilst the other is set to 16-bit element widths.
  
-Our initial thoughts were to try a standard simple in-order SIMD architecture,
+Our initial thoughts advocated a standard simple in-order SIMD architecture,
  with predication bits passed down into the SIMD ALUs.  If a bit is "off",
  that "lane" within the ALU does not calculate a result, saving power.
-However, a pre-analysis engine is required that re-orders the registers,
-packs lanes of data together so that it fits into one SIMD ALU, and, on
+However, in SV, when the element width is set to 32, 16 or 8-bit, a
+pre-issue engine is required that re-orders *parts* of the registers,
+packing lanes of data together so that it fits into one SIMD ALU, and, on
  exit from the ALU, it may be necessary to split and "redirect" parts of the
  data to *multiple* actual 64-bit registers.  In other words, bit-level
  (or byte-level) manipulation is required, both pre- and post- ALU.
@@ -48,15 +49,15 @@ different paradigm from standard vector processors, where a loop allocates
  elements to "lanes", and if a predication bit is not set, the lane
  runs "empty".  By contrast, with the multi-issue execution model, an
  operation that is predicated out means that the element-based instruction
-does not even make it into the instruction queue.  Thus, unlike in a
+does not even make it into the instruction queue, leaving it free for
+use by following instructions, even in the same cycle, and even if the
+operation is totally different.  Thus, unlike in a
  traditional vector architecture, ALUs may be occupied by elements from
-other "Lanes", because of the pre-existing decoupling between the multi-issue
-instruction queue and the ALUs.
+other "Lanes", because the pre-existing decoupling between the multi-issue
+instruction queue and the ALUs is efficiently leveraged.
  
  Simple!
  
-[[reorder_buffer.jpg]]
-
  There are many other benefits to a multi-issue microarchitecture, and
  these are being discussed
  [here](http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2018-December/000198.html)
@@ -70,6 +71,8 @@ to hit a speed wall.  That in particular means that, firstly, it's extremely
  commonly taught in Universities, and, secondly, patents on the algorithm
  have long since expired.
  
+[[reorder_buffer.jpg]]
+
  Also, there are both memory hazards and register hazards that a Reorder
  Buffer augmented Tomasulo algorithm takes care of, whilst also allowing
  for branch prediction and really simple roll-back, preservation of
@@ -81,7 +84,7 @@ extend the Reorder Buffer tags to accomodate SIMD-style characteristics.
  We also may need to have simple Branch Prediction, because some of the
  loops in [Kazan](https://salsa.debian.org/Kazan-team/kazan/) are particularly
  tight.  A Reorder Buffer can easily be used to implement Branch Prediction,
-because, just as with an Exception, the ROB needs to be cleared out
+because, just as with an Exception, the ROB can be cleared out
  (flushed) if the branch is mispredicted.  As it is necessary to respect
  Exceptions, the logic has to exist to clear out the ROB: Branch Prediction
  simply uses this pre-existing logic.
@@ -103,9 +106,24 @@ to Reorder Buffers:
  * There's no clear way to handle branch prediction, where the Reorder
    Buffer of Tomasulo handles it really cleanly.
  
+However, there are downsides to Reorder Buffers:
+
+* The Common Data Bus may become a serious bottleneck, as it delivers
+  data from multiple ALUs which may be generating results simultaneously.
+  To keep up with result generation, *multiple* CDBs may be needed, which
+  results in each receiver having multiple ports.
+* The Destination field in the ROB has to act as a key in a CAM (Content
+  Addressable Memory).  As a result, power consumption of the ROB may be
+  quite high.  It may or may not be possible to reduce power consumption
+  by testing an "active" bitfield (separate from but augmenting the ROB)
+  to indicate whether Destination Registers are in use.  If inactive,
+  the CAM lookup need not take place.
+
  Whilst nothing's firmly set in stone, here, as we have a Charter that
  requires unanimous decision-making from contributors, so far it's leaning
  towards Reorder Buffers and Tomasulo as a good, clean fit.  In part that
  is down to more research having been done on that particular algorithm.
+For completeness, scoreboarding and explicit register renaming need
+to be properly and comprehensively investigated.
  More as it happens...
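
Review note: the "active" bitfield idea added in the last hunk can be sketched in software as below. This is only a toy model, not the project's design: the class name `ReorderBuffer`, the 128-register file size, the per-register `pending` counters and the `cam_searches` counter are all assumptions made for illustration. It shows that a source-register lookup only pays for a CAM-style search of the Destination fields when the corresponding "active" bit is set.

```python
class ReorderBuffer:
    """Toy ROB whose Destination lookup is gated by an 'active' bitfield.

    Hypothetical model for illustration only; a real ROB is hardware.
    """

    def __init__(self, num_regs=128):       # 128 registers is an assumption
        self.entries = []                   # (tag, dest_reg), oldest first
        self.pending = [0] * num_regs       # in-flight writers per register
        self.active = 0                     # bit r set => register r is a live destination
        self.cam_searches = 0               # how many CAM-style searches were needed
        self.next_tag = 0

    def allocate(self, dest_reg):
        """Add an in-flight instruction that will write dest_reg."""
        tag = self.next_tag
        self.next_tag += 1
        self.entries.append((tag, dest_reg))
        self.pending[dest_reg] += 1
        self.active |= 1 << dest_reg
        return tag

    def retire(self):
        """Commit the oldest entry; clear its bit once no writer remains."""
        tag, dest_reg = self.entries.pop(0)
        self.pending[dest_reg] -= 1
        if self.pending[dest_reg] == 0:
            self.active &= ~(1 << dest_reg)
        return tag

    def lookup(self, src_reg):
        """Return the tag of the newest in-flight writer of src_reg, or None."""
        if not (self.active >> src_reg) & 1:
            return None                     # bit clear: the CAM search is skipped
        self.cam_searches += 1              # bit set: search the Destination fields
        for tag, dest_reg in reversed(self.entries):
            if dest_reg == src_reg:
                return tag
        return None
```

In this sketch, a lookup for a register with no in-flight writer returns immediately and leaves `cam_searches` unchanged, which stands in for the power the hunk hopes to save. Whether that saving survives in real hardware is, as the text says, still open.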