From 958bd7104ab9c31f44199ff7b4a0130994e31f5d Mon Sep 17 00:00:00 2001
From: Luke Kenneth Casson Leighton
Date: Tue, 11 Apr 2023 10:22:57 +0100
Subject: [PATCH] whitespace

---
 openpower/sv/rfc/ls012.mdwn | 199 +++++++++++++++++++------------------
 1 file changed, 103 insertions(+), 96 deletions(-)

diff --git a/openpower/sv/rfc/ls012.mdwn b/openpower/sv/rfc/ls012.mdwn
index d24a68aad..b4fdc7a9d 100644
--- a/openpower/sv/rfc/ls012.mdwn
+++ b/openpower/sv/rfc/ls012.mdwn
@@ -391,136 +391,143 @@ offset computation, thus they are best placed in EXT0xx.
 
 # Guidance for evaluation
 
-Deciding which instructions go into an ISA is extremely complex, costly, and a huge
-responsibility. In public standards mistakes are irrevocable, and in the case of an ISA
-the Opcode Allocation is a finite resource, meaning that mistakes punish future instructions
-as well. This section therefore provides some Evaluation Guidance on the decision process.
+Deciding which instructions go into an ISA is extremely complex, costly,
+and a huge responsibility. In public standards mistakes are irrevocable,
+and in the case of an ISA the Opcode Allocation is a finite resource,
+meaning that mistakes punish future instructions as well. This section
+therefore provides some Evaluation Guidance on the decision process.
 
 **Does anyone want it?**
 
-Sounds like an obvious question but if there is no driving need (no "Stakeholder")
-then why is the instruction being proposed? If it is purely out of curiosity or
-part of a Research effort not intended for production then it's probably best left in the
-EXT022 Sandbox.
+Sounds like an obvious question but if there is no driving need (no
+"Stakeholder") then why is the instruction being proposed? If it is
+purely out of curiosity or part of a Research effort not intended for
+production then it's probably best left in the EXT022 Sandbox.
 
 **How many registers does it need?**
 
-The basic RISC Paradigm is not only to make instruction encoding simple (often
-"wasting" encoding space compared to highly-compacted ISAs such as x86), but
-also to keep the number of registers used down to a minimum.
+The basic RISC Paradigm is not only to make instruction encoding simple
+(often "wasting" encoding space compared to highly-compacted ISAs such
+as x86), but also to keep the number of registers used down to a minimum.
 
 Counter-examples are FMAC which had to be added to IEEE754 because the
-*internal* product requires more accuracy than can fit into a register.
-Another would be a dot-product instruction, which again requires an accumulator
-of at least double the width of the two vector inputs. And in the AMDGPU
-ISA, there are Texture-mapping instructions taking up to an astounding
-*twelve* input operands!
-
-The downside of going too far however has to be a trade-off with the next
-question. Both MIPS and RISC-V lack Condition Codes, which means that emulating
-x86 Branch-Conditional requires *ten* MIPS instructions.
-
-The downside of creating too complex instructions is that the Dependency Hazard
-Management in high-performance multi-issue out-of-order microarchitectures
-becomes infeasibly large, and even simple in-order systems may have performance
-severely compromised by an overabundance of stalls. Also worth remembering
-is that register file ports are insanely costly, not just to design but also
-use considerable power.
-
-That said there do exist genuine reasons why more registers is better than less:
-Compare-and-Swap has huge benefits but is costly to implement, and DCT/FFT Twin-Butterfly
-instructions allow creation of in-place in-register algorithms reducing the number
-of registers needed and thus saving power due to making the *overall* algorithm
-more efficient, as opposed to micro-focussing on a localised power increase.
+*internal* product requires more accuracy than can fit into a register
+(it is well-known that FMUL followed by FADD performs an additional
+rounding on the intermediate register which loses accuracy compared to
+FMAC). Another would be a dot-product instruction, which again requires
+an accumulator of at least double the width of the two vector inputs.
+And in the AMDGPU ISA, there are Texture-mapping instructions taking up
+to an astounding *twelve* input operands!
+
+The downside of going too far however has to be a trade-off with the
+next question. Both MIPS and RISC-V lack Condition Codes, which means
+that emulating x86 Branch-Conditional requires *ten* MIPS instructions.
+
+The downside of creating too complex instructions is that the Dependency
+Hazard Management in high-performance multi-issue out-of-order
+microarchitectures becomes infeasibly large, and even simple in-order
+systems may have performance severely compromised by an overabundance
+of stalls. Also worth remembering is that register file ports are
+insanely costly, not just to design but also use considerable power.
+
+That said there do exist genuine reasons why more registers is better than
+less: Compare-and-Swap has huge benefits but is costly to implement,
+and DCT/FFT Twin-Butterfly instructions allow creation of in-place
+in-register algorithms reducing the number of registers needed and
+thus saving power due to making the *overall* algorithm more efficient,
+as opposed to micro-focussing on a localised power increase.
 
 **How many register files does it use?**
 
-Complex instructions pulling in data from multiple register files can create unnecessary
-issues surrounding Dependency Hazard Management in Out-of-Order systems. As a general
-rule it is better to keep complex instructions reading and writing to the same
-register file, relying on much simpler (1-in 1-out) instructions to transfer data
-between register files.
+Complex instructions pulling in data from multiple register files can
+create unnecessary issues surrounding Dependency Hazard Management in
+Out-of-Order systems. As a general rule it is better to keep complex
+instructions reading and writing to the same register file, relying
+on much simpler (1-in 1-out) instructions to transfer data between
+register files.
 
 **Can other existing instructions (plural) do the same job**
 
-The general
-rule being: if two or more instructions can do the same job, leave it out...
-*unless* the number of occurrences of that instruction being missing is causing
-huge increases in binary size. RISC-V has gone too far in this regard,
-as explained here:
+The general rule being: if two or more instructions can do the
+same job, leave it out... *unless* the number of occurrences of
+that instruction being missing is causing huge increases in binary
+size. RISC-V has gone too far in this regard, as explained here:
+
 
 Good examples are LD-ST-Indexed-shifted (multiply RB by 2, 4 8 or 16)
 which are high-priority instructions in x86 and ARM, but lacking in
 Power ISA, MIPS, and RISC-V. With many critical hot-loops in Computer
-Science having to perform shift and add as explicit instructions, adding
-LD/ST-shifted should be considered high priority, except that the sheer
-*number* of such instructions needing to be added takes us into the next
-question
+Science having to perform shift and add as explicit instructions,
+adding LD/ST-shifted should be considered high priority, except that
+the sheer *number* of such instructions needing to be added takes us
+into the next question
 
 **How costly is the encoding?**
 
-This can either be a single instruction that is costly (several
-operands or a few long ones) or it could be a
-group of simpler ones that purely due to their number increases overall
-encoding cost. An example of an extreme costly instruction would be
-those with their own Primary Opcode: addi is a good candidate. However
-the sheer overwhelming
-number of times that instruction is used easily makes a case for its inclusion.
-
-Mentioned above was Load-Store-Indexed-Shifted, which only needs 2 bits
-to specify how much to shift: x2 x4 x8 or x16. And they are all a 10-bit XO
-Field, so not that costly for any one given instruction.
-Unfortunately there are *around 30* Load-Store-Indexed Instructions in the Power ISA,
-which means an extra *five* bits taken up of precious XO space.
-Then let us not forget
-the two needed for the Shift amount. Now we are up to *three* bit XO for the group.
-
-Is this a worthwhile tradeoff? Honestly it could well be. And that's the decision
-process that the OpenPOWER ISA Working Group could use some assistance on, to make
-the evaluation easier.
+This can either be a single instruction that is costly (several operands
+or a few long ones) or it could be a group of simpler ones that purely
+due to their number increases overall encoding cost. An example of an
+extreme costly instruction would be those with their own Primary Opcode:
+addi is a good candidate. However the sheer overwhelming number of
+times that instruction is used easily makes a case for its inclusion.
+
+Mentioned above was Load-Store-Indexed-Shifted, which only needs 2
+bits to specify how much to shift: x2 x4 x8 or x16. And they are all
+a 10-bit XO Field, so not that costly for any one given instruction.
+Unfortunately there are *around 30* Load-Store-Indexed Instructions in the
+Power ISA, which means an extra *five* bits taken up of precious XO space.
+Then let us not forget the two needed for the Shift amount. Now we are
+up to *three* bit XO for the group.
+
+Is this a worthwhile tradeoff? Honestly it could well be. And that's
+the decision process that the OpenPOWER ISA Working Group could use some
+assistance on, to make the evaluation easier.
 
 **How many gates does it need?**
 
-`grevlut` comes in at an astonishing 20,000 gates, where for comparison an FP64
-Multiply typically takes between 12 to 15,000. Not counting the cost in hardware
-terms is just asking for trouble.
+`grevlut` comes in at an astonishing 20,000 gates, where for comparison
+an FP64 Multiply typically takes between 12 to 15,000. Not counting
+the cost in hardware terms is just asking for trouble.
 
 **How long will it take to complete?**
 
-In the case of divide or Transcendentals the algorithms needed are so complex that simple
-implementations can often take an astounding 128 clock cycles to complete.
-Other instructions waiting for the results will back up and eventually stall,
-where in-order systems pretty much just stall straight away.
+In the case of divide or Transcendentals the algorithms needed are so
+complex that simple implementations can often take an astounding 128
+clock cycles to complete. Other instructions waiting for the results
+will back up and eventually stall, where in-order systems pretty much
+just stall straight away.
 
-Less extreme examples include instructions that take only a few cycles to complete,
-but if used in tight loops with Conditional Branches, an Out-of-Order system with
-Speculative capability may need significantly more Reservation Stations to hold
-in-flight data for instructions which take longer than those which do not.
+Less extreme examples include instructions that take only a few cycles
+to complete, but if used in tight loops with Conditional Branches, an
+Out-of-Order system with Speculative capability may need significantly
+more Reservation Stations to hold in-flight data for instructions which
+take longer than those which do not.
 
 **Can one instruction do the job of many?**
 
-Large numbers of disparate instructions adversely affects resource utilisation in
-In-Order systems. However it is not always that simple: every one of the Power
-ISA "add" and "subtract" instructions, as shown by the Microwatt source code, may
-be micro-coded as one single instruction where RA may optionally be inverted,
-output likewise, and Carry-In set to 1, 0 or XER.CA. From these options the
-*entire* suite of add/subtract may be synthesised (subtract by inverting RA and
-adding an extra 1 it produces a 2s-complement of RA).
-
-`bmask` for example is to be proposed as a single instruction with a 5-bit "Mode"
-operand, greatly simplifying some micro-architectural implementations. Likewise
-the FP-INT conversion instructions are grouped as a set of four, instead of
-over 30 separate instructions. Aside from anything this strategy makes
-the ISA Working Group's evaluation task easier, as well as reducing the work
-of writing a Compliance Test Suite.
+Large numbers of disparate instructions adversely affects resource
+utilisation in In-Order systems. However it is not always that simple:
+every one of the Power ISA "add" and "subtract" instructions, as shown by
+the Microwatt source code, may be micro-coded as one single instruction
+where RA may optionally be inverted, output likewise, and Carry-In set to
+1, 0 or XER.CA. From these options the *entire* suite of add/subtract
+may be synthesised (subtract by inverting RA and adding an extra 1 it
+produces a 2s-complement of RA).
+
+`bmask` for example is to be proposed as a single instruction with
+a 5-bit "Mode" operand, greatly simplifying some micro-architectural
+implementations. Likewise the FP-INT conversion instructions are grouped
+as a set of four, instead of over 30 separate instructions. Aside from
+anything this strategy makes the ISA Working Group's evaluation task
+easier, as well as reducing the work of writing a Compliance Test Suite.
 
 **Summary**
 
-There are many tradeoffs here, it is a huge list of considerations: any others
-known about please do submit feedback so they may be included, here.
-Then the evaluation process may take place: again, constructive feedback on
-that as to which instructions are a priority also appreciated. The above
-helps explain the columns in the tables that follow.
+There are many tradeoffs here, it is a huge list of considerations: any
+others known about please do submit feedback so they may be included,
+here. Then the evaluation process may take place: again, constructive
+feedback on that as to which instructions are a priority also appreciated.
+The above helps explain the columns in the tables that follow.
 
 # Tables
 
-- 
2.30.2