# Top Level page for core architecture

The primary design is based around the CDC 6600 (not a historically
exact replica of its design), specifically its Dependency Matrices,
which provide superscalar out-of-order execution and full register
renaming with very little in the way of gates or power consumption.
Modifying the 6600 concept to be multi-issue, thanks to help from Mitch
Alsup, is near-trivial, with O(N) linear complexity. Additionally,
Mitch helped us to add "Precise exceptions", which uses the same
pathway as branch speculation and predication.

The use of Dependency Matrices allows variable-completion-time ALUs,
including dynamic pipelines and blocking FSMs, to be mixed together:
the Dependency Matrices, maintaining a Directed Acyclic Graph of all
Read-Write hazards, simply take care of it.
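
As a rough behavioural illustration (plain Python, not the nmigen HDL;
the class and method names here are hypothetical), each Function Unit
can be pictured as owning a row in a bitmatrix: on issue it sets a bit
against every in-flight FU it has a RAW, WAR or WAW hazard with, and it
may only proceed once its row is clear:

```python
# Toy behavioural model of a 6600-style FU-to-FU Dependency Matrix.
# Hypothetical sketch for illustration only: not the actual HDL.
class DepMatrix:
    def __init__(self, n_fus):
        self.waits_on = [set() for _ in range(n_fus)]  # one row per FU
        self.busy = [None] * n_fus      # (dest, srcs) of in-flight ops

    def issue(self, fu, dest, srcs):
        """Record RAW/WAR/WAW hazard bits against every busy FU."""
        for j, slot in enumerate(self.busy):
            if slot is None or j == fu:
                continue
            d_j, s_j = slot
            if (d_j in srcs          # RAW: j writes a reg we read
                    or d_j == dest   # WAW: j writes the reg we write
                    or dest in s_j): # WAR: j reads the reg we write
                self.waits_on[fu].add(j)
        self.busy[fu] = (dest, set(srcs))

    def can_proceed(self, fu):
        return not self.waits_on[fu]   # row clear: hazards all resolved

    def complete(self, fu):
        """On write-commit, clear this FU's column in every row."""
        self.busy[fu] = None
        for row in self.waits_on:
            row.discard(fu)
```

Because hazard bits only ever point from a later instruction to an
earlier (already in-flight) one, the resulting graph is acyclic, and
clearing a column on completion is what resolves the hazards.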

The selection of the 6600 as the core engine has far-reaching
implications. Note: the standard academic literature on the 6600 -
all of it - completely and systematically fails to comprehend or explain
why it is so elegant. In fact, several modern microarchitectures have
*reinvented* aspects of the 6600, not realising that the 6600 was the
first ever microarchitecture to provide full register renaming combined
with out-of-order execution in such a superbly gate-efficient fashion.

For anyone wishing to understand that there is a direct one-for-one
equivalence between properly and fully implemented Scoreboards (not:
"implementing the Q-Table patent and then thinking that's all there is
to it") and the Tomasulo algorithm, there is a page describing how to
convert from Tomasulo to Scoreboards: [[tomasulo_transformation]].
The disservice that the standard academic literature has done to
Scoreboards by focussing exclusively on the Q-Tables is equivalent to
implementing a Tomasulo Reorder Buffer (only) and then claiming
(accurately) that this one component is not an Out-of-Order system.

# Basic principle

The basic principle: the front-end ISA is variable-length Vectorised,
with a hardware-level for-loop in front of a predicated SIMD backend
suite of ALUs. Instructions issued at the front-end are first
SIMD-grouped, then the remaining "elements" (or groups of SIMD'd
elements) are thrown at the multi-issue OoO execution engine, and the
augmented-6600 Matrices are left to their own devices.
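
As a software analogy (a hypothetical sketch, not the actual issue
logic; the function and parameter names are invented for illustration),
the hardware for-loop walks the vector elements, packs as many adjacent
elements as fit the back-end SIMD width into one group, and hands each
group to the OoO engine as an ordinary operation:

```python
# Software analogy of the vector front-end "hardware for-loop"
# (hypothetical sketch): elements are SIMD-grouped first, then each
# group is issued as one ordinary op into the OoO engine.
def issue_vector_op(opcode, vl, elwidth_bits, simd_width_bits, issue):
    per_group = simd_width_bits // elwidth_bits  # elements per group
    el = 0
    while el < vl:
        n = min(per_group, vl - el)          # last group may be partial
        issue(opcode, first_elem=el, nelems=n)  # one OoO op per group
        el += n

# e.g. VL=10 8-bit elements on a 32-bit SIMD backend
# -> 3 ops covering 4+4+2 elements
issue_vector_op("add", vl=10, elwidth_bits=8, simd_width_bits=32,
                issue=lambda op, first_elem, nelems:
                    print(op, first_elem, nelems))
```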

Predication, branch speculation, register file bypass and exceptions
all use the same mechanism: shadowing (thanks to Mitch for explaining
how this is done). Shadowing holds a latch that blocks the *write*
commit phase of the OoO Matrices but not the *execution* phase, simply
by hooking into GOWRITE. Once the result is definitely known to be
safe to proceed, the shadow latch is dropped.

If an exception occurs (or a predicate bit is retrospectively found to
be zero, or a branch is found to go the other way), "GODIE" is called
instead. Because all downstream dependent instructions are also held
by the same shadow line, no register writes nor memory writes were
allowed to occur, and consequently the instructions *can* be cancelled
with impunity.
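
A minimal behavioural model of shadowing (plain Python; class and
method names here are hypothetical): execution is never blocked, but
the GOWRITE commit is gated until every shadow is either dropped
(speculation confirmed) or fires GODIE (result discarded before any
write could occur):

```python
# Minimal behavioural model of a shadow latch (hypothetical sketch).
# Execution proceeds freely; only the write-commit phase is gated.
class ShadowedOp:
    def __init__(self):
        self.shadows = set()  # open speculation: branch/predicate/trap
        self.result = None
        self.dead = False

    def execute(self, value):
        self.result = value     # execution phase is never held up

    def gowrite_allowed(self):
        return not self.shadows and not self.dead

    def drop_shadow(self, s):   # speculation good: release the latch
        self.shadows.discard(s)

    def godie(self):            # speculation failed: cancel with
        self.dead = True        # impunity (no write ever happened)
        self.result = None

op = ShadowedOp()
op.shadows.add("branch")         # issued under a branch shadow
op.execute(42)                   # runs speculatively...
assert not op.gowrite_allowed()  # ...but cannot commit yet
op.drop_shadow("branch")         # branch went the predicted way
assert op.gowrite_allowed()      # commit may now proceed
```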

# Dynamic SIMD partitioning

There are no separate scalar ALUs and separate SIMD ALUs: the ALUs are
dynamically partitioned. This adds around 50% more silicon compared to
a scalar-only ALU; however, it saves 200% or more by removing the need
for duplicated scalar and SIMD ALUs.
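
The core trick can be sketched behaviourally in plain Python
(hypothetical, for illustration only): a small partition mask opens or
closes the carry chain at each byte boundary, so the very same adder
performs one 32-bit add, two 16-bit adds, or four 8-bit adds:

```python
# Behavioural sketch of a dynamically-partitioned 32-bit adder
# (hypothetical, illustrative only): a 3-entry partition mask opens or
# closes the carry chain at each byte boundary, so one ALU serves
# 1x32-bit, 2x16-bit or 4x8-bit adds.
def partitioned_add(a, b, partition_open):
    """partition_open[i] True = carry crosses the boundary above byte i."""
    result, carry = 0, 0
    for byte in range(4):
        s = ((a >> (8 * byte)) & 0xFF) + ((b >> (8 * byte)) & 0xFF) + carry
        result |= (s & 0xFF) << (8 * byte)
        # carry propagates only where the partition gate is open
        carry = (s >> 8) if byte < 3 and partition_open[byte] else 0
    return result

# one 32-bit add: all gates open
assert partitioned_add(0x0000FFFF, 1, [True, True, True]) == 0x00010000
# four independent 8-bit adds: all gates closed (each lane wraps alone)
assert partitioned_add(0x00FF00FF, 0x01010101,
                       [False, False, False]) == 0x01000100
```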

The only reason this design can even remotely be considered is the use
of standard Python Software Engineering Object-Oriented techniques on
top of nmigen, which industry-standard HDLs such as VHDL and Verilog
completely lack.

See [[architecture/dynamic_simd]] for details.

# Dynamic Pipeline length adjustment

There are pipeline register bypasses on every other pipeline stage in
the ALU, simply implemented as a combinatorial mux. This allows pairs
of pipeline stages *either* to become a single higher-latency
combinatorial block *or* to remain a straight standard chain of
clock-synced pipeline stages, just like all other pipeline stages.

Dynamically.

This means that in low clock rate modes the latency of the whole
pipeline may be reduced: the number of effective stages is halved.
In higher clock rate modes, where the long combinatorial delay of the
merged stages would otherwise be a serious limiting factor, the
intermediary latches are enabled, doubling the number of pipeline
stages, each with a much shorter delay, and the problem is solved.
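
A behavioural sketch of the idea (plain Python standing in for the
nmigen mux; the names are hypothetical): each bypassable pipeline
register is a mux between its clocked output and a straight
combinatorial pass-through, so pairs of stages merge or split on
demand:

```python
# Behavioural sketch (hypothetical): a pipeline register with a
# combinatorial bypass mux. bypass=True merges this stage into the
# next, forming one longer combinatorial block; bypass=False keeps the
# normal clock-synced register.
class BypassableReg:
    def __init__(self, bypass):
        self.bypass = bypass
        self.q = None          # the clocked register

    def clock(self, d):
        """One clock edge: returns what downstream logic sees."""
        if self.bypass:
            return d           # mux selects the combinatorial path
        self.q, out = d, self.q
        return out             # mux selects the registered path

def run_pipeline(regs, inputs):
    """Push values through a chain of stage registers, one per clock."""
    outs = []
    for d in inputs + [None] * len(regs):   # flush with bubbles
        for r in regs:
            d = r.clock(d)
        outs.append(d)
    return [o for o in outs if o is not None]

# 4 registers, bypass enabled on every other one: effective depth is
# 2 stages, not 4 - results emerge two clocks after issue
regs = [BypassableReg(bypass=(i % 2 == 1)) for i in range(4)]
print(run_pipeline(regs, [1, 2, 3]))   # [1, 2, 3]
```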

The only reason why this ingenious and elegant trick (deployed first by
IBM in the 1990s) can be considered is down to the fact that the 6600
Style Dependency Matrices do not care about actual completion time:
they only care about availability of the result.

Links:

* <http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2020-March/005459.html>

# 6600 Engine

See [[architecture/6600scoreboard]]

# Decoder

TODO, see [[architecture/decoder]]

# Memory and Cache arrangement

Section TODO, with its own page: [[architecture/memory_and_cache]].
LD/ST accesses are controlled by the 6600-style Dependency Matrices.

# Bus arrangement

Wishbone was chosen. TODO: expand on why (related to patents).

# Register Files

See [[architecture/regfile]]

# Computation Unit

See [[architecture/compunit]]

# IOMMU

Section TODO. An IOMMU is an integral part of protecting processes
from directly accessing peripherals (and other memory areas) that
they shouldn't.