add Vectorisation note

[libreriscv.git] / openpower / sv / rfc / ls012.mdwn
diff --git a/openpower/sv/rfc/ls012.mdwn b/openpower/sv/rfc/ls012.mdwn

index b6472b5e59ca3cf45c3f4b8f31cde6d8a19d5914..43c3226e5026739a99d8b81dd27f845e0ef1bf0c 100644 (file)
--- a/openpower/sv/rfc/ls012.mdwn
+++ b/openpower/sv/rfc/ls012.mdwn
@@ -2,15 +2,15 @@
  
  **Date: 2023apr10. v1**
  
-* Funded by NLnet Grants under EU Horizon 2020 and 2023
+* Funded by NLnet Grants under EU Horizon Grants 101069594 825310
  * <https://git.openpower.foundation/isa/PowerISA/issues/121>
  * <https://bugs.libre-soc.org/show_bug.cgi?id=1051>
  * <https://bugs.libre-soc.org/show_bug.cgi?id=1052>
  
  The purpose of this RFC is:
  
-* to give a full list of the upcoming Scalar opcodes developed by Libre-SOC
-  (respecting and being cognisant that *all* of them are Vectorisable)
+* to give a full list of upcoming Scalar opcodes developed by Libre-SOC
+  (being cognisant that *all* of them are Vectoriseable)
  * to give OPF Members and non-Members alike the opportunity to comment and get
    involved early in RFC submission
  * formally agree a priority order on an iterative basis with new versions
@@ -19,13 +19,13 @@ The purpose of this RFC is:
    not proposed at all,
  * keep readers summarily informed of ongoing RFC submissions, with new versions
    of this RFC,
-* and for IBM (in their capacity as Allocator of Opcode resources)
-  to get a clear overall advance picture of the Opcode Allocation needs
-  *prior* to actual RFC submission
+* for IBM (in their capacity as Allocator of Opcodes)
+  to get a clear advance picture of Opcode Allocation
+  *prior* to submission
  
-As this is a Formal ISA RFC the evaluation shall ultimatly define
+As this is a Formal ISA RFC the evaluation shall ultimately define
  (in advance of the actual submission of the instructions themselves)
-which instructions will be submitted over the next 8-18 months.
+which instructions will be submitted over the next 1-18 months.
  
  *It is expected that readers visit and interact with the Libre-SOC
  resources in order to do due-diligence on the prioritisation
@@ -46,17 +46,18 @@ or may not be Vectoriseable, but that every "Defined Word" should have
  merits on its own, not just when Vectorised.  An example of a borderline
  Vectoriseable Defined Word is `mv.swizzle` which only really becomes
  high-priority for Audio/Video, Vector GPU and HPC Workloads, but has
-less merit as a Scalar-only operation.
+less merit as a Scalar-only operation, yet when SVP64Single-Prefixed
+can be part of an atomic Compare-and-Swap sequence.
  
  Although one of the top world-class ISAs,
  Power ISA Scalar (SFFS) has not been significantly advanced in 12
  years: IBM's primary focus has understandably been on PackedSIMD VSX.
  Unfortunately, with VSX being 914 instructions and 128-bit it is far too
-much for any new team to consider (10 years development effort) and far
+much for any new team to consider (10+ years development effort) and far
  outside of Embedded or Tablet/Desktop/Laptop power budgets. Thus bringing
  Power Scalar up-to-date to modern standards *and on its own merits*
  is a reasonable goal, and the advantages of the reduced focus is that
-SFFS remains RISC-paradigm, and  that lessons can be learned from other
+SFFS remains RISC-paradigm, with lessons being learned from other
  ISAs from the intervening years.  Good examples here include `bmask`.
  
  SVP64 Prefixing - also known by the terms "Zero-Overhead-Loop-Prefixing"
@@ -68,7 +69,7 @@ their value when Vector-Prefixed, *as well as* SVP64Single-Prefixed.
  **Target areas**
  
  Whilst entirely general-purpose there are some categories that these
-instructions are targetting: Bitmanipulation, Big-integer, cryptography,
+instructions are targeting: Bit-manipulation, Big-integer, cryptography,
  Audio/Visual, High-Performance Compute, GPU workloads and DSP.
  
  **Instruction count guide and approximate priority order**
@@ -96,6 +97,8 @@ Summary tables are created below by different sort categories. Additional
  columns (and tables) as necessary can be requested to be added as part of update revisions
  to this RFC.
  
+\newpage{}
+
  # Target Area summaries
  
  Please note that there are some instructions developed thanks to NLnet
@@ -136,7 +139,7 @@ required for Warshall Transitive Closure (on top of a cumulatively-applied
  max instruction).
  
  The Management Instructions themselves are all Scalar Operations, so
-PO1-Prefixing is perfecly reasonable.  SVP64 Management instructions of
+PO1-Prefixing is perfectly reasonable.  SVP64 Management instructions of
  which there are only 6 are all 5 or 6 bit XO, meaning that the opcode
  space they take up in EXT0xx is not alarmingly high for their intrinsic
  strategic value.
@@ -154,7 +157,7 @@ There are a **lot** of operations here, and they also bring Power
  ISA up-to-date to IEEE754-2019.  Fortunately the number of critical
  instructions is quite low, but the caveat is that if those operations
  are utilised to synthesise other IEEE754 operations (divide by `pi` for
-example) full bitlevel accuracy (a hard requirement for IEEE754) is lost.
+example) full bit-level accuracy (a hard requirement for IEEE754) is lost.
  
  Also worth noting that the Khronos Group defines minimum acceptable
  bit-accuracy levels for 3D Graphics: these are **nowhere near** the full
@@ -169,8 +172,8 @@ when 3D Graphics simply has no need for full accuracy.
  Found at [[sv/av_opcodes]] these do not require Saturated variants
  because Saturation is added via [[sv/svp64]] (Vector Prefixing) and via
  [[sv/svp64_single]] Scalar Prefixing. This is important to note for
-Opcode Allocation because placing these operations in the UnVectoriseble
-areas would irrediemably damage their value.  Unlike PackedSIMD ISAs
+Opcode Allocation because placing these operations in the UnVectoriseable
+areas would irredeemably damage their value.  Unlike PackedSIMD ISAs
  the actual number of AV Opcodes is remarkably small once the usual
  cascading-option-multipliers (SIMD width, bitwidth, saturation,
  HI/LO) are abstracted out to RISC-paradigm Prefixing, leaving just
@@ -188,7 +191,7 @@ DSP can do full FFT triple loops in one VLIW group.
  
  It should be pretty clear this is high priority.
  
-With SVP64  [[sv/remap]] providing the Loop Schedules it falls to
+With SVP64 [[sv/remap]] providing the Loop Schedules it falls to
  the Scalar side of the ISA to add the prerequisite "Twin Butterfly"
  operations, typically performing for example one multiply but in-place
  subtracting that product from one operand and adding it to the other.
@@ -207,13 +210,13 @@ hot-loops is considered high priority.
  An additional need is to do popcount on CR Field bit vectors but adding
  such instructions to the *Condition Register* side was deemed to be far
  too much. Therefore, priority was given instead to transferring several
-CR Field bits into GPRs, whereupon the full set of tandard Scalar GPR
+CR Field bits into GPRs, whereupon the full set of Standard Scalar GPR
  Logical Operations may be used. This strategy has the side-effect of
  keeping the CRweird group down to only five instructions.
  
  ## Big-integer Math
  
-[[sv/biginteger]]  has always been a high priority area for commercial
+[[sv/biginteger]] has always been a high priority area for commercial
  applications, privacy, Banking, as well as HPC Numerical Accuracy:
  libgmp as well as cryptographic uses in Asymmetric Ciphers. poly1305
  and ec25519 are finding their way into everyday use via OpenSSL.
@@ -234,14 +237,14 @@ require a 128-bit shifter to replace the existing Scalar Power ISA
  64-bit shifters.
  
  The reduction in instruction count these operations bring, in critical
-hotloops, is remarkably high, to the extent where a Scalar-to-Vector
+hot loops, is remarkably high, to the extent where a Scalar-to-Vector
  operation of *arbitrary length* becomes just the one Vector-Prefixed
  instruction.
  
  Whilst these are 5-6 bit XO their utility is considered high strategic
  value and as such are strongly advocated to be in EXT04. The alternative
  is to bring back a 64-bit Carry SPR but how it is retrospectively
-applicable to pre-existing Scalar Power ISA mutiply, divide, and shift
+applicable to pre-existing Scalar Power ISA multiply, divide, and shift
  operations at this late stage of maturity of the Power ISA is an entire
  area of research on its own deemed unlikely to be achievable.
  
@@ -258,7 +261,7 @@ Similar arguments apply to the GPR-INT move operations, proposed in
  [[ls006]], with the opportunity taken to add rounding modes present
  in other ISAs that Power ISA VSX PackedSIMD does not have. Javascript
  rounding, one of the worst offenders of Computer Science, requires a
-phenomental 35 instructions with *six branches* to emulate in Power
+phenomenal 35 instructions with *six branches* to emulate in Power
  ISA! For desktop as well as Server HTML/JS back-end execution of
  javascript this becomes an obvious priority, recognised already by ARM
  as just one example.
@@ -278,7 +281,7 @@ priority, and again just like in the CRweird group the opportunity was
  taken to work on *all* bits of a CR Field rather than just one bit as
  is done with the existing CR operations crand, cror etc.
  
-The other high strategic value instruction is `grevlut` (and  `grevluti`
+The other high strategic value instruction is `grevlut` (and `grevluti`
  which can generate a remarkably large number of regular-patterned magic
  constants).  The grevlut set require of the order of 20,000 gates but
  provide an astonishing plethora of innovative bit-permuting instructions
@@ -314,12 +317,12 @@ introduce mv Swizzle operations, which can always be Macro-op fused
  in exactly the same way that ARM SVE predicated-move extends 3-operand
  "overwrite" opcodes to full independent 3-in 1-out.
  
-# BMI (bitmanipulation) group.
+## BMI (bit-manipulation) group.
  
  Whilst the [[sv/vector_ops]] instructions are only two in number, in
  reality the `bmask` instruction has a Mode field allowing it to cover
  **24** instructions, more than have been added to any other CPUs by
-ARM, Intel or AMD.  Analyis of the BMI sets of these CPUs shows simple
+ARM, Intel or AMD.  Analysis of the BMI sets of these CPUs shows simple
  patterns that can greatly simplify both Decode and implementation. These
  are sufficiently commonly used, saving instruction count regularly,
  that they justify going into EXT0xx.
@@ -336,17 +339,17 @@ instructions into one. However it is still not a huge priority unlike
  Very easily justified.  As explained in [[ls002]] these always saves one
  LD L1/2/3 D-Cache memory-lookup operation, by virtue of the Immediate
  FP value being in the I-Cache side.  It is such a high priority that
-these instuctions are easily justifiable adding into EXT0xx, despite
+these instructions are easily justifiable adding into EXT0xx, despite
  requiring a 16-bit immediate.  By designing the second-half instruction
-as a Read-Modify-Write it saves on XO bitlength (only 5 bits), and can be
+as a Read-Modify-Write it saves on XO bit-length (only 5 bits), and can be
  macro-op fused with its first-half to store a full IEEE754 FP32 immediate
  into a register.
  
  There is little point in putting these instructions into EXT2xx. Their
  very benefit and inherent value *is* as 32-bit instructions, not 64-bit
-ones. Likewise there is less value in taking up EXT1xx Enoding space
+ones. Likewise there is less value in taking up EXT1xx Encoding space
  because EXT1xx only brings an additional 16 bits (approx) to the table,
-and that is provided already by the second-half instuction.
+and that is provided already by the second-half instruction.
  
  Thus they qualify as both high priority and also EXT0xx candidates.
  
@@ -385,6 +388,188 @@ Being a 10-bit XO it would be somewhat punitive to place these in EXT2xx
  when their whole purpose and value is to reduce binary size in Address
  offset computation, thus they are best placed in EXT0xx.
  
+\newpage{}
+
+# Vectorisation: SVP64 and SVP64Single
+
+To be submitted as part of [[ls001]], [[ls008]], [[ls009]] and [[ls010]],
+with SVP64Single to follow in a subsequent RFC, SVP64 is conceptually
+identical to the 40+ year old 8086 `REP` instruction and the Zilog Z80
+`CPIR` and `LDIR` instructions.  Parallelism is best achieved by exploiting
+a Multi-Issue Out-of-Order Micro-architecture.  It is extremely important
+to bear in mind that at no time does SVP64 add even one single actual
+Vector instruction.  It is a *pure* RISC-paradigm Prefixing concept only.
+
+This has some implications which need unpacking.  Firstly: in the future,
+the Prefixing may be applied to VSX.  The only reason it was not included
+in the initial proposal of SVP64 is that, due to the number of VSX
+instructions, the Due Diligence required is obviously five times higher
+than the 3+ years work done so far on the SFFS Subset.
+
+Secondly: **any** Scalar instruction involving registers **automatically**
+becomes a candidate for Vector-Prefixing.  This in turn means that when
+a new instruction is proposed, it becomes a hard requirement to consider
+not only the implications of its inclusion as a Scalar-only instruction,
+but how it will best be utilised as a Vectorised instruction **as well**.
+Extreme examples of this are the Big-Integer 3-in 2-out instructions that
+use one 64-bit register effectively as a Carry-in and Carry-out. The
+instructions were designed in a *Scalar* context to be inline-efficient
+in hardware (use of Operand-Forwarding to reduce the chain down to 2-in 1-out),
+but in a *Vector* context it is extremely straightforward to Micro-code
+an entire batch onto 128-bit SIMD pipelines, 256-bit SIMD pipelines, and
+to perform a large internal Forward-Carry-Propagation on for example the
+Vectorised-Multiply instruction.
+
+Thirdly: as far as Opcode Allocation is concerned, SVP64 needs to be
+considered as an independent stand-alone instruction (just like `REP`).
+In other words, the Suffix **never** gets decoded as a completely different
+instruction just because of the Prefix.  The cost of doing so is simply
+too high in hardware.
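
The `REP`-like behaviour described above can be modelled, very loosely, as a loop wrapped around an unmodified Scalar operation. The sketch below is illustrative Python only, not taken from the SVP64 specification: the names `svp64_prefix` and `scalar_add` and the flat `regs` list are hypothetical, and features such as predication, element-width overrides and REMAP are deliberately left out.

```python
MASK64 = (1 << 64) - 1

def scalar_add(regs, rt, ra, rb):
    """An ordinary Scalar operation: two source registers, one result."""
    regs[rt] = (regs[ra] + regs[rb]) & MASK64

def svp64_prefix(scalar_op, regs, vl, rt, ra, rb):
    """Hypothetical model of Vector-Prefixing.

    The Suffix (scalar_op) is never reinterpreted as a different
    instruction: it is simply repeated VL times, with each register
    number stepped by the element index.
    """
    for i in range(vl):
        scalar_op(regs, rt + i, ra + i, rb + i)

# Assumed toy register file: register n initially holds the value n.
regs = list(range(32))
svp64_prefix(scalar_add, regs, vl=4, rt=8, ra=16, rb=24)
```

With `vl=4` the same Scalar add is applied to four consecutive register triples, which is the sense in which no new Vector instruction is being added.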
+
+--------
+
+# Guidance for evaluation
+
+Deciding which instructions go into an ISA is extremely complex, costly,
+and a huge responsibility. In public standards mistakes are irrevocable,
+and in the case of an ISA the Opcode Allocation is a finite resource,
+meaning that mistakes punish future instructions as well.  This section
+therefore provides some Evaluation Guidance on the decision process,
+particularly for people new to ISA development, given that this RFC
+is circulated widely and publicly.  Constructive feedback from experienced
+ISA Architects is welcomed to improve this section.
+
+**Does anyone want it?**
+
+Sounds like an obvious question but if there is no driving need (no
+"Stakeholder") then why is the instruction being proposed? If it is
+purely out of curiosity or part of a Research effort not intended for
+production then it's probably best left in the EXT022 Sandbox.
+
+**How many registers does it need?**
+
+The basic RISC Paradigm is not only to make instruction encoding simple
+(often "wasting" encoding space compared to highly-compacted ISAs such
+as x86), but also to keep the number of registers used down to a minimum.
+
+Counter-examples are FMAC which had to be added to IEEE754 because the
+*internal* product requires more accuracy than can fit into a register
+(it is well-known that FMUL followed by FADD performs an additional
+rounding on the intermediate register which loses accuracy compared to
+FMAC).  Another would be a dot-product instruction, which again requires
+an accumulator of at least double the width of the two vector inputs.
+And in the AMDGPU ISA, there are Texture-mapping instructions taking up
+to an astounding *twelve* input operands!
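
The double-rounding effect mentioned for FMUL followed by FADD can be reproduced in a few lines. This is a minimal sketch, assuming IEEE754 binary64 Python floats; the fused result is emulated with exact rational arithmetic and a single final rounding, and the function names are hypothetical.

```python
from fractions import Fraction

def fmul_then_fadd(a, b, c):
    """Two operations, two roundings: the product is rounded to a
    64-bit float before the addition takes place."""
    product = a * b
    return product + c

def fused_multiply_add(a, b, c):
    """Emulated FMAC: compute a*b+c exactly, then round once."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)

# a*b is exactly 1 - 2**-60, which is not representable in binary64
# and rounds to 1.0 when stored in an intermediate register.
a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

unfused = fmul_then_fadd(a, b, c)
fused = fused_multiply_add(a, b, c)
```

Here the unfused sequence loses the result entirely, whereas the single-rounding version keeps the small residual, which is why the wider *internal* product matters.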
+
+Going too far in minimising register usage has a downside, however, which
+must be traded off against the next question. Both MIPS and RISC-V lack
+Condition Codes, which means that emulating x86 Branch-Conditional requires
+*ten* MIPS instructions.
+
+The downside of creating too complex instructions is that the Dependency
+Hazard Management in high-performance multi-issue out-of-order
+microarchitectures becomes infeasibly large, and even simple in-order
+systems may have performance severely compromised by an overabundance
+of stalls.  Also worth remembering is that register file ports are
+insanely costly: not just to design, they also use considerable power.
+
+That said there do exist genuine reasons why more registers are better than
+fewer: Compare-and-Swap has huge benefits but is costly to implement,
+and DCT/FFT Twin-Butterfly instructions allow creation of in-place
+in-register algorithms reducing the number of registers needed and
+thus saving power due to making the *overall* algorithm more efficient,
+as opposed to micro-focussing on a localised power increase.
+
+**How many register files does it use?**
+
+Complex instructions pulling in data from multiple register files can
+create unnecessary issues surrounding Dependency Hazard Management in
+Out-of-Order systems.  As a general rule it is better to keep complex
+instructions reading and writing to the same register file, relying
+on much simpler (1-in 1-out) instructions to transfer data between
+register files.
+
+**Can other existing instructions (plural) do the same job?**
+
+The general rule being: if two or more instructions can do the
+same job, leave it out...  *unless* the number of occurrences of
+that instruction being missing is causing huge increases in binary
+size.  RISC-V has gone too far in this regard, as explained here:
+<https://news.ycombinator.com/item?id=24459314>
+
+Good examples are LD-ST-Indexed-shifted (multiply RB by 2, 4, 8 or 16)
+which are high-priority instructions in x86 and ARM, but lacking in
+Power ISA, MIPS, and RISC-V. With many critical hot-loops in Computer
+Science having to perform shift and add as explicit instructions,
+adding LD/ST-shifted should be considered high priority, except that
+the sheer *number* of such instructions needing to be added takes us
+into the next question.
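
To make the shift-and-add saving concrete, the Effective Address calculation can be sketched as follows. This is an illustrative model only: the function names and the 2-bit field `sm` (0..3 selecting x2, x4, x8 or x16) are assumptions for the example, not the proposed encoding.

```python
MASK64 = (1 << 64) - 1

def ea_indexed(ra, rb):
    """Existing-style Indexed addressing: EA = RA + RB."""
    return (ra + rb) & MASK64

def ea_indexed_shifted(ra, rb, sm):
    """Assumed Indexed-shifted addressing: EA = RA + (RB << (sm + 1)),
    where the 2-bit field sm selects x2, x4, x8 or x16."""
    if not 0 <= sm <= 3:
        raise ValueError("sm is a 2-bit field")
    return (ra + (rb << (sm + 1))) & MASK64

base, index = 0x1000, 5

# Without the shifted form: an explicit shift, then the Indexed access.
scaled = (index << 3) & MASK64
two_step = ea_indexed(base, scaled)

# With the shifted form: the x8 scaling is folded into the one access.
one_step = ea_indexed_shifted(base, index, sm=2)
```

Both paths reach the same address for an array of 8-byte elements; the difference is one fewer instruction (and one fewer temporary register) per access in the hot-loop.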
+
+**How costly is the encoding?**
+
+This can either be a single instruction that is costly (several operands
+or a few long ones) or it could be a group of simpler ones that purely
+due to their number increases overall encoding cost.  An example of an
+extremely costly instruction would be those with their own Primary Opcode:
+addi is a good candidate.  However the sheer overwhelming number of
+times that instruction is used easily makes a case for its inclusion.
+
+Mentioned above was Load-Store-Indexed-Shifted, which only needs 2
+bits to specify how much to shift: x2 x4 x8 or x16. And they are all
+a 10-bit XO Field, so not that costly for any one given instruction.
+Unfortunately there are *around 30* Load-Store-Indexed Instructions in the
+Power ISA, which means an extra *five* bits taken up of precious XO space.
+Then let us not forget the two needed for the Shift amount. Now the group
+as a whole costs as much as a *three* bit XO (10 - 5 - 2 = 3).
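
The arithmetic behind that estimate can be written out explicitly. The figures (10-bit XO, around 30 instructions, 2 bits of shift amount) come from the text above; the helper name `xo_bits_consumed` is hypothetical, and this is only a rough accounting sketch, not an allocation tool.

```python
def xo_bits_consumed(num_instructions, extra_field_bits=0):
    """Bits of XO space needed to distinguish num_instructions
    encodings, plus any per-instruction field carved from the XO."""
    select_bits = (num_instructions - 1).bit_length()  # ceil(log2(n))
    return select_bits + extra_field_bits

XO_BITS = 10                    # each instruction alone is a 10-bit XO
group_bits = xo_bits_consumed(30, extra_field_bits=2)  # 5 + 2
effective_xo = XO_BITS - group_bits                    # XO-equivalent cost
share_of_space = 2**group_bits / 2**XO_BITS            # fraction consumed
```

On these assumptions the group occupies one eighth of a 10-bit XO space, the same footprint as a single instruction with a 3-bit XO.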
+
+Is this a worthwhile tradeoff? Honestly it could well be.  And that's
+the decision process that the OpenPOWER ISA Working Group could use some
+assistance on, to make the evaluation easier.
+
+**How many gates does it need?**
+
+`grevlut` comes in at an astonishing 20,000 gates, where for comparison
+an FP64 Multiply typically takes between 12,000 and 15,000.  Not counting
+the cost in hardware terms is just asking for trouble.
+
+**How long will it take to complete?**
+
+In the case of divide or Transcendentals the algorithms needed are so
+complex that simple implementations can often take an astounding 128
+clock cycles to complete.  Other instructions waiting for the results
+will back up and eventually stall, where in-order systems pretty much
+just stall straight away.
+
+Less extreme examples include instructions that take only a few cycles
+to complete, but if used in tight loops with Conditional Branches, an
+Out-of-Order system with Speculative capability may need significantly
+more Reservation Stations to hold in-flight data for instructions which
+take longer than those which do not.
+
+**Can one instruction do the job of many?**
+
+Large numbers of disparate instructions adversely affect resource
+utilisation in In-Order systems.  However it is not always that simple:
+every one of the Power ISA "add" and "subtract" instructions, as shown by
+the Microwatt source code, may be micro-coded as one single instruction
+where RA may optionally be inverted, output likewise, and Carry-In set to
+1, 0 or XER.CA.  From these options the *entire* suite of add/subtract
+may be synthesised (for subtract, inverting RA and adding an extra 1
+produces the 2s-complement of RA).
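
The single-adder idea can be sketched as below. This is a hedged illustration written for this note, not code taken from Microwatt: `add_unit` and the wrapper names are hypothetical, and only a handful of the Power ISA add/subtract family is shown.

```python
MASK64 = (1 << 64) - 1

def add_unit(ra, rb, invert_ra=False, carry_in=0, invert_out=False):
    """One 64-bit adder with optional RA inversion, optional output
    inversion, and a selectable Carry-In (0, 1 or the XER.CA value)."""
    a = (~ra & MASK64) if invert_ra else (ra & MASK64)
    total = a + (rb & MASK64) + carry_in
    result = total & MASK64
    carry_out = total >> 64
    if invert_out:
        result = ~result & MASK64
    return result, carry_out

def add(ra, rb):
    return add_unit(ra, rb)[0]

def subf(ra, rb):
    # Subtract-from: RB - RA, computed as ~RA + RB + 1.
    return add_unit(ra, rb, invert_ra=True, carry_in=1)[0]

def neg(ra):
    # Negate: ~RA + 0 + 1, the 2s-complement of RA.
    return add_unit(ra, 0, invert_ra=True, carry_in=1)[0]

def adde(ra, rb, ca):
    # Add-extended: Carry-In taken from XER.CA; returns (result, CA).
    return add_unit(ra, rb, carry_in=ca)
```

Each wrapper differs only in the option bits passed to the one shared adder, which is the point being made about micro-coding.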
+
+`bmask` for example is to be proposed as a single instruction with
+a 5-bit "Mode" operand, greatly simplifying some micro-architectural
+implementations. Likewise the FP-INT conversion instructions are grouped
+as a set of four, instead of over 30 separate instructions.  Aside from
+anything else, this strategy makes the ISA Working Group's evaluation task
+easier, as well as reducing the work of writing a Compliance Test Suite.
+
+**Summary**
+
+There are many tradeoffs here, and the list of considerations is long. If
+you know of any others, please do submit feedback so that they may be
+included here.  Then the evaluation process may take place: again,
+constructive feedback as to which instructions are a priority is also
+appreciated.  The above helps explain the columns in the tables that follow.
  
  # Tables
  
@@ -421,10 +606,13 @@ The key to headings and sections are as follows:
    instead.  see [[sv/po9_encoding]].
  * **regs** - a guide to register usage, to how costly Hazard Management
    will be, in hardware:
-  - 1R: reads one GPR/FPR/SPR/CR.
-  - 1W: writes one GPR/FPR/SPR/CR. 
-  - 1r: reads one CR *Field* (not necessarily the entire CR)
-  - 1w: writes one CR *Field* (not necessarily the entire CR)
+
+```
+     - 1R: reads one GPR/FPR/SPR/CR.
+     - 1W: writes one GPR/FPR/SPR/CR.
+     - 1r: reads one CR *Field* (not necessarily the entire CR)
+     - 1w: writes one CR *Field* (not necessarily the entire CR)
+```
  
  [[!inline pages="openpower/sv/rfc/ls012/areas.mdwn" raw=yes ]]
  [[!inline pages="openpower/sv/rfc/ls012/xo_cost.mdwn" raw=yes ]]