From 1cfd84edd40bb50a2bf0dc08525d73ef7eaf61df Mon Sep 17 00:00:00 2001 From: Luke Kenneth Casson Leighton Date: Wed, 12 Apr 2023 20:31:34 +0100 Subject: [PATCH] big whitespace cleanup and indentation of code-blocks with ```s --- openpower/sv/svp64/appendix.mdwn | 582 ++++++++++++++++--------------- 1 file changed, 295 insertions(+), 287 deletions(-) diff --git a/openpower/sv/svp64/appendix.mdwn b/openpower/sv/svp64/appendix.mdwn index e11e9299d..a414bf266 100644 --- a/openpower/sv/svp64/appendix.mdwn +++ b/openpower/sv/svp64/appendix.mdwn @@ -62,6 +62,7 @@ and producing, at the end, a single bit Carry out. High performance implementations may exploit this observation to deploy efficient Parallel Carry Lookahead. +``` # assume VL=4, this results in 4 sequential ops (below) sv.adde r0.v, r4.v, r8.v @@ -70,6 +71,7 @@ Parallel Carry Lookahead. adde r1, r5, r9 # takes carry from previous ... adde r3, r7, r11 # likewise +``` It can clearly be seen that the carry chains from one 64 bit add to the next, the end result being that a @@ -98,52 +100,53 @@ The final results, which are part of the SVP64 Specification, are here: [[openpower/opcode_regs_deduped]] * Firstly, every instruction's mnemonic (`add RT, RA, RB`) was analysed -from reading the markdown formatted version of the Scalar pseudocode -which is machine-readable and found in [[openpower/isatables]]. The -analysis gives, by instruction, a "Register Profile". `add RT, RA, RB` -for example is given a designation `RM-2R-1W` because it requires -two GPR reads and one GPR write. -* Secondly, the total number of registers was added up (2R-1W is 3 registers) -and if less than or equal to three then that instruction could be given an -EXTRA3 designation. Four or more is given an EXTRA2 designation because -there are only 9 bits available. + from reading the markdown formatted version of the Scalar pseudocode which + is machine-readable and found in [[openpower/isatables]]. 
The analysis + gives, by instruction, a "Register Profile". `add RT, RA, RB` for + example is given a designation `RM-2R-1W` because it requires two GPR + reads and one GPR write. +* Secondly, the total number of registers was added up (2R-1W is 3 + registers) and if less than or equal to three then that instruction + could be given an EXTRA3 designation. Four or more is given an EXTRA2 + designation because there are only 9 bits available. * Thirdly, the instruction was analysed to see if Twin or Single -Predication was suitable. As a general rule this was if there -was only a single operand and a single result (`extw` and LD/ST) -however it was found that some 2 or 3 operand instructions also -qualify. Given that 3 of the 9 bits of EXTRA had to be sacrificed for use -in Twin Predication, some compromises were made, here. LDST is -Twin but also has 3 operands in some operations, so only EXTRA2 can be used. + Predication was suitable. As a general rule this was if there + was only a single operand and a single result (`extw` and LD/ST) + however it was found that some 2 or 3 operand instructions also + qualify. Given that 3 of the 9 bits of EXTRA had to be sacrificed for use + in Twin Predication, some compromises were made, here. LDST is + Twin but also has 3 operands in some operations, so only EXTRA2 can be used. * Fourthly, a packing format was decided: for 2R-1W an EXTRA3 indexing -could have been decided -that RA would be indexed 0 (EXTRA bits 0-2), RB indexed 1 (EXTRA bits 3-5) -and RT indexed 2 (EXTRA bits 6-8). In some cases (LD/ST with update) -RA-as-a-source is given a **different** EXTRA index from RA-as-a-result -(because it is possible to do, and perceived to be useful). Rc=1 -co-results (CR0, CR1) are always given the same EXTRA index as their -main result (RT, FRT). 
-* Fifthly, in an automated process the results of the analysis -were outputted in CSV Format for use in machine-readable form -by sv_analysis.py - -This process was laborious but logical, and, crucially, once a -decision is made (and ratified) cannot be reversed. -Qualifying future Power ISA Scalar instructions for SVP64 -is **strongly** advised to utilise this same process and the same -sv_analysis.py program as a canonical method of maintaining the -relationships. Alterations to that same program which -change the Designation is **prohibited** once finalised (ratified -through the Power ISA WG Process). It would -be similar to deciding that `add` should be changed from X-Form + could have been decided that RA would be indexed 0 (EXTRA bits 0-2), RB + indexed 1 (EXTRA bits 3-5) and RT indexed 2 (EXTRA bits 6-8). In some + cases (LD/ST with update) RA-as-a-source is given a **different** EXTRA + index from RA-as-a-result (because it is possible to do, and perceived + to be useful). Rc=1 co-results (CR0, CR1) are always given the same + EXTRA index as their main result (RT, FRT). +* Fifthly, in an automated process the results of the analysis were + outputted in CSV Format for use in machine-readable form by sv_analysis.py + + +This process was laborious but logical, and, crucially, once a decision +is made (and ratified) cannot be reversed. Qualifying future Power ISA +Scalar instructions for SVP64 is **strongly** advised to utilise this +same process and the same sv_analysis.py program as a canonical method +of maintaining the relationships. Alterations to that same program +which change the Designation is **prohibited** once finalised (ratified +through the Power ISA WG Process). It would be similar to deciding that +`add` should be changed from X-Form to D-Form. ## Single Predication -This is a standard mode normally found in Vector ISAs. every element in every source Vector and in the destination uses the same bit of one single predicate mask. 
+This is a standard mode normally found in Vector ISAs. every element +in every source Vector and in the destination uses the same bit of one +single predicate mask. -In SVSTATE, for Single-predication, implementors MUST increment both srcstep and dststep, but depending on whether sz and/or dz are set, srcstep and -dststep can still potentially become different indices. Only when sz=dz -is srcstep guaranteed to equal dststep at all times. +In SVSTATE, for Single-predication, implementors MUST increment both +srcstep and dststep, but depending on whether sz and/or dz are set, +srcstep and dststep can still potentially become different indices. +Only when sz=dz is srcstep guaranteed to equal dststep at all times. Note that in some Mode Formats there is only one flag (zz). This indicates that *both* sz *and* dz are set to the same. @@ -240,21 +243,20 @@ in general to any 2P instruction. This extreme power and flexibility comes down to the fact that SVP64 is not actually a Vector ISA: it is a loop-abstraction-concept that -is applied *in general* to Scalar operations, just like the x86 -`REP` instruction (if put on steroids). +is applied *in general* to Scalar operations, just like the x86 `REP` +instruction (if put on steroids). ## Pack/Unpack The pack/unpack concept of VSX `vpack` is abstracted out as Sub-Vector -reordering. -Two bits in the `SVSHAPE` [[sv/spr]] -enable either "packing" or "unpacking" -on the subvectors vec2/3/4. +reordering. Two bits in the `SVSHAPE` [[sv/spr]] enable either "packing" +or "unpacking" on the subvectors vec2/3/4. 
-First, illustrating a -"normal" SVP64 operation with `SUBVL!=1:` (assuming no elwidth overrides), -note that the VL loop is outer and the SUBVL loop inner: +First, illustrating a "normal" SVP64 operation with `SUBVL!=1:` (assuming +no elwidth overrides), note that the VL loop is outer and the SUBVL +loop inner: +``` def index(): for i in range(VL): for j in range(SUBVL): @@ -262,12 +264,14 @@ note that the VL loop is outer and the SUBVL loop inner: for idx in index(): operation_on(RA+idx) +``` For pack/unpack (again, no elwidth overrides), note that now there is the option to swap the SUBVL and VL loop orders. In effect the Pack/Unpack performs a Transpose of the subvector elements. Illustrated this time with a GPR mv operation: +``` # yield an outer-SUBVL or inner VL loop with SUBVL def index_p(outer): if outer: @@ -282,6 +286,7 @@ Illustrated this time with a GPR mv operation: # walk through both source and dest indices simultaneously for src_idx, dst_idx in zip(index_p(PACK), index_p(UNPACK)): move_operation(RT+dst_idx, RA+src_idx) +``` "yield" from python is used here for simplicity and clarity. The two Finite State Machines for the generation of the source @@ -293,51 +298,49 @@ vec3 will be redistributed such that Sub-elements 0 are packed together, Sub-elements 1 are packed together, as are Sub-elements 2. +``` srcstep=0 srcstep=1 0 1 2 3 4 5 dststep=0 dststep=1 dststep=2 0 3 1 4 2 5 +``` -Setting of both `PACK` and `UNPACK` is neither prohibited nor -`UNDEFINED` because the reordering is fully deterministic, and -additional REMAP reordering may be applied. Combined with -Matrix REMAP this would -give potentially up to 4 Dimensions of reordering. +Setting of both `PACK` and `UNPACK` is neither prohibited nor `UNDEFINED` +because the reordering is fully deterministic, and additional REMAP +reordering may be applied. Combined with Matrix REMAP this would give +potentially up to 4 Dimensions of reordering. 
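The vec3 redistribution table shown earlier (source order `0 1 2 3 4 5`, destination order `0 3 1 4 2 5`) can be reproduced with a short Python sketch. This is an illustration under assumptions, not the specification's pseudocode: `VL=2` and `SUBVL=3` are inferred from that table, and the helper names `linear_order` and `transposed_order` are hypothetical.

```python
# Illustrative sketch only: VL and SUBVL are inferred from the table,
# and both helper names are hypothetical (not taken from the spec).
VL = 2      # number of vector elements (assumed)
SUBVL = 3   # sub-elements per element, i.e. vec3 (assumed)

def linear_order(vl, subvl):
    """VL loop outer, SUBVL loop inner: the 'normal' ordering."""
    for i in range(vl):
        for j in range(subvl):
            yield i * subvl + j

def transposed_order(vl, subvl):
    """SUBVL loop outer, VL loop inner: the two loop orders swapped."""
    for j in range(subvl):
        for i in range(vl):
            yield i * subvl + j

normal = list(linear_order(VL, SUBVL))
swapped = list(transposed_order(VL, SUBVL))
print(normal)   # sub-elements in plain sequential order
print(swapped)  # sub-elements 0 grouped first, then 1, then 2
```

Swapping the loop nesting is all that is needed to obtain the transpose: the index expression `i * subvl + j` is identical in both generators, only the order in which `i` and `j` advance differs.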
-Pack/Unpack has quirky interactions on -[[sv/mv.swizzle]] because it can set a different subvector length for -destination, and has a slightly different pseudocode algorithm -for Vertical-First Mode. +Pack/Unpack has quirky interactions on [[sv/mv.swizzle]] because it can +set a different subvector length for destination, and has a slightly +different pseudocode algorithm for Vertical-First Mode. Pack/Unpack is enabled (set up) through [[sv/svstep]]. ## Reduce modes -Reduction in SVP64 is deterministic and somewhat of a misnomer. A normal -Vector ISA would have explicit Reduce opcodes with defined characteristics -per operation: in SX Aurora there is even an additional scalar argument -containing the initial reduction value, and the default is either 0 -or 1 depending on the specifics of the explicit opcode. -SVP64 fundamentally has to -utilise *existing* Scalar Power ISA v3.0B operations, which presents some -unique challenges. +Reduction in SVP64 is deterministic and somewhat of a misnomer. +A normal Vector ISA would have explicit Reduce opcodes with defined +characteristics per operation: in SX Aurora there is even an additional +scalar argument containing the initial reduction value, and the default +is either 0 or 1 depending on the specifics of the explicit opcode. +SVP64 fundamentally has to utilise *existing* Scalar Power ISA v3.0B +operations, which presents some unique challenges. The solution turns out to be to simply define reduction as permitting deterministic element-based schedules to be issued using the base Scalar operations, and to rely on the underlying microarchitecture to resolve -Register Hazards at the element level. This goes back to -the fundamental principle that SV is nothing more than a Sub-Program-Counter -sitting between Decode and Issue phases. - -For Scalar Reduction, -Microarchitectures *may* take opportunities to parallelise the reduction -but only if in doing so they preserve strict Program Order at the Element Level. 
-Opportunities where this is possible include an `OR` operation -or a MIN/MAX operation: it may be possible to parallelise the reduction, -but for Floating Point it is not permitted due to different results -being obtained if the reduction is not executed in strict Program-Sequential -Order. +Register Hazards at the element level. This goes back to the fundamental +principle that SV is nothing more than a Sub-Program-Counter sitting +between Decode and Issue phases. + +For Scalar Reduction, Microarchitectures *may* take opportunities to +parallelise the reduction but only if in doing so they preserve strict +Program Order at the Element Level. Opportunities where this is possible +include an `OR` operation or a MIN/MAX operation: it may be possible to +parallelise the reduction, but for Floating Point it is not permitted +due to different results being obtained if the reduction is not executed +in strict Program-Sequential Order. In essence it becomes the programmer's responsibility to leverage the pre-determined schedules to desired effect. @@ -348,18 +351,15 @@ Scalar Reduction per se does not exist, instead is implemented in SVP64 as a simple and natural relaxation of the usual restriction on the Vector Looping which would terminate if the destination was marked as a Scalar. Scalar Reduction by contrast *keeps issuing Vector Element Operations* -even though the destination register is marked as scalar. -Thus it is up to the programmer to be aware of this, observe some -conventions, and thus end up achieving the desired outcome of scalar -reduction. - -It is also important to appreciate that there is no -actual imposition or restriction on how this mode is utilised: there -will therefore be several valuable uses (including Vector Iteration -and "Reverse-Gear") -and it is up to the programmer to make best use of the -(strictly deterministic) capability -provided. +even though the destination register is marked as scalar. 
Thus it is +up to the programmer to be aware of this, observe some conventions, +and thus end up achieving the desired outcome of scalar reduction. + +It is also important to appreciate that there is no actual imposition or +restriction on how this mode is utilised: there will therefore be several +valuable uses (including Vector Iteration and "Reverse-Gear") and it is +up to the programmer to make best use of the (strictly deterministic) +capability provided. In this mode, which is suited to operations involving carry or overflow, one register must be assigned, by convention by the programmer to be the @@ -377,139 +377,133 @@ one register must be assigned, by convention by the programmer to be the *Note that issuing instructions in Scalar reduce mode such as `setb` are neither `UNDEFINED` nor prohibited, despite them not making much -sense at first glance. -Scalar reduce is strictly defined behaviour, and the cost in -hardware terms of prohibition of seemingly non-sensical operations is too great. -Therefore it is permitted and required to be executed successfully. -Implementors **MAY** choose to optimise such instructions in instances -where their use results in "extraneous execution", i.e. where it is clear -that the sequence of operations, comprising multiple overwrites to -a scalar destination **without** cumulative, iterative, or reductive -behaviour (no "accumulator"), may discard all but the last element -operation. Identification -of such is trivial to do for `setb` and `cmp`: the source register type is -a completely different register file from the destination. -Likewise Scalar reduction when the destination is a Vector -is as if the Reduction Mode was not requested. However it would clearly -be unacceptable to perform such optimisations on cache-inhibited LD/ST, -so some considerable care needs to be taken.* +sense at first glance. 
Scalar reduce is strictly defined behaviour, +and the cost in hardware terms of prohibition of seemingly non-sensical +operations is too great. Therefore it is permitted and required to +be executed successfully. Implementors **MAY** choose to optimise +such instructions in instances where their use results in "extraneous +execution", i.e. where it is clear that the sequence of operations, +comprising multiple overwrites to a scalar destination **without** +cumulative, iterative, or reductive behaviour (no "accumulator"), may +discard all but the last element operation. Identification of such +is trivial to do for `setb` and `cmp`: the source register type is a +completely different register file from the destination. Likewise Scalar +reduction when the destination is a Vector is as if the Reduction Mode +was not requested. However it would clearly be unacceptable to perform +such optimisations on cache-inhibited LD/ST, so some considerable care +needs to be taken.* Typical applications include simple operations such as `ADD r3, r10.v, r3` where, clearly, r3 is being used to accumulate the addition of all elements of the vector starting at r10. +``` # add RT, RA,RB but when RT==RA for i in range(VL): iregs[RA] += iregs[RB+i] # RT==RA +``` However, *unless* the operation is marked as "mapreduce" (`sv.add/mr`) -SV ordinarily -**terminates** at the first scalar operation. Only by marking the -operation as "mapreduce" will it continue to issue multiple sub-looped -(element) instructions in `Program Order`. - -To perform the loop in reverse order, the ```RG``` (reverse gear) bit must be set. This may be useful in situations where the results may be different -(floating-point) if executed in a different order. Given that there is -no actual prohibition on Reduce Mode being applied when the destination -is a Vector, the "Reverse Gear" bit turns out to be a way to apply Iterative -or Cumulative Vector operations in reverse. 
`sv.add/rg r3.v, r4.v, r4.v` -for example will start at the opposite end of the Vector and push -a cumulative series of overlapping add operations into the Execution units of -the underlying hardware. +SV ordinarily **terminates** at the first scalar operation. Only by +marking the operation as "mapreduce" will it continue to issue multiple +sub-looped (element) instructions in `Program Order`. + +To perform the loop in reverse order, the ```RG``` (reverse gear) bit +must be set. This may be useful in situations where the results may be +different (floating-point) if executed in a different order. Given that +there is no actual prohibition on Reduce Mode being applied when the +destination is a Vector, the "Reverse Gear" bit turns out to be a way to +apply Iterative or Cumulative Vector operations in reverse. `sv.add/rg +r3.v, r4.v, r4.v` for example will start at the opposite end of the +Vector and push a cumulative series of overlapping add operations into +the Execution units of the underlying hardware. Other examples include shift-mask operations where a Vector of inserts -into a single destination register is required (see [[sv/bitmanip]], bmset), -as a way to construct -a value quickly from multiple arbitrary bit-ranges and bit-offsets. -Using the same register as both the source and destination, with Vectors -of different offsets masks and values to be inserted has multiple -applications including Video, cryptography and JIT compilation. +into a single destination register is required (see [[sv/bitmanip]], +bmset), as a way to construct a value quickly from multiple arbitrary +bit-ranges and bit-offsets. Using the same register as both the source +and destination, with Vectors of different offsets masks and values to +be inserted has multiple applications including Video, cryptography and +JIT compilation. 
+``` # assume VL=4: # * Vector of shift-offsets contained in RC (r12.v) # * Vector of masks contained in RB (r8.v) # * Vector of values to be masked-in in RA (r4.v) # * Scalar destination RT (r0) to receive all mask-offset values sv.bmset/mr r0, r4.v, r8.v, r12.v +``` -Due to the Deterministic Scheduling, -Subtract and Divide are still permitted to be executed in this mode, -although from an algorithmic perspective it is strongly discouraged. -It would be better to use addition followed by one final subtract, -or in the case of divide, to get better accuracy, to perform a multiply -cascade followed by a final divide. +Due to the Deterministic Scheduling, Subtract and Divide are still +permitted to be executed in this mode, although from an algorithmic +perspective it is strongly discouraged. It would be better to use +addition followed by one final subtract, or in the case of divide, to get +better accuracy, to perform a multiply cascade followed by a final divide. Note that single-operand or three-operand scalar-dest reduce is perfectly -well permitted: the programmer may still declare one register, used as -both a Vector source and Scalar destination, to be utilised as -the "accumulator". In the case of `sv.fmadds` and `sv.maddhw` etc -this naturally fits well with the normal expected usage of these -operations. +well permitted: the programmer may still declare one register, used +as both a Vector source and Scalar destination, to be utilised as the +"accumulator". In the case of `sv.fmadds` and `sv.maddhw` etc this +naturally fits well with the normal expected usage of these operations. If an interrupt or exception occurs in the middle of the scalar mapreduce, the scalar destination register **MUST** be updated with the current (intermediate) result, because this is how ```Program Order``` is -preserved (Vector Loops are to be considered to be just another way of issuing instructions -in Program Order). 
In this way, after return from interrupt, -the scalar mapreduce may continue where it left off. This provides -"precise" exception behaviour. - -Note that hardware is perfectly permitted to perform multi-issue -parallel optimisation of the scalar reduce operation: it's just that -as far as the user is concerned, all exceptions and interrupts **MUST** -be precise. +preserved (Vector Loops are to be considered to be just another way +of issuing instructions in Program Order). In this way, after return +from interrupt, the scalar mapreduce may continue where it left off. +This provides "precise" exception behaviour. +Note that hardware is perfectly permitted to perform multi-issue parallel +optimisation of the scalar reduce operation: it's just that as far as +the user is concerned, all exceptions and interrupts **MUST** be precise. ## Fail-on-first -Data-dependent fail-on-first has two distinct variants: one for LD/ST -(see [[sv/ldst]], -the other for arithmetic operations (actually, CR-driven) -[[sv/normal]] and CR operations [[sv/cr_ops]]. -Note in each -case the assumption is that vector elements are required appear to be -executed in sequential Program Order, element 0 being the first. - -* LD/ST ffirst treats the first LD/ST in a vector (element 0) as an - ordinary one. Exceptions occur "as normal". However for elements 1 - and above, if an exception would occur, then VL is **truncated** to the - previous element. +Data-dependent fail-on-first has two distinct variants: one for LD/ST (see +[[sv/ldst]], the other for arithmetic operations (actually, CR-driven) +[[sv/normal]] and CR operations [[sv/cr_ops]]. Note in each case the +assumption is that vector elements are required appear to be executed +in sequential Program Order, element 0 being the first. + +* LD/ST ffirst (not to be confused with *Data-Dependent* LD/ST ffirst) + treats the first LD/ST in a vector (element 0) as an ordinary one. + Exceptions occur "as normal". 
However for elements 1 and above, if an + exception would occur, then VL is **truncated** to the previous element. * Data-driven (CR-driven) fail-on-first activates when Rc=1 or other CR-creating operation produces a result (including cmp). Similar to - branch, an analysis of the CR is performed and if the test fails, the - vector operation terminates and discards all element operations - above the current one (and the current one if VLi is not set), - and VL is truncated to either - the *previous* element or the current one, depending on whether - VLi (VL "inclusive") is set. - -Thus the new VL comprises a contiguous vector of results, -all of which pass the testing criteria (equal to zero, less than zero). - -The CR-based data-driven fail-on-first is new and not found in ARM -SVE or RVV. At the same time it is also "old" because it is a generalisation -of the Z80 -[Block compare](https://rvbelzen.tripod.com/z80prgtemp/z80prg04.htm) + branch, an analysis of the CR is performed and if the test fails, + the vector operation terminates and discards all element operations + above the current one (and the current one if VLi is not set), and + VL is truncated to either the *previous* element or the current one, + depending on whether VLi (VL "inclusive") is set. + +Thus the new VL comprises a contiguous vector of results, all of which +pass the testing criteria (equal to zero, less than zero). + +The CR-based data-driven fail-on-first is new and not +found in ARM SVE or RVV. At the same time it is also +"old" because it is a generalisation of the Z80 [Block +compare](https://rvbelzen.tripod.com/z80prgtemp/z80prg04.htm) instructions, especially -[CPIR](http://z80-heaven.wikidot.com/instructions-set:cpir) -which is based on CP (compare) as the ultimate "element" (suffix) -operation to which the repeat (prefix) is applied. 
-It is extremely useful for reducing instruction count, -however requires speculative execution involving modifications of VL -to get high performance implementations. An additional mode (RC1=1) -effectively turns what would otherwise be an arithmetic operation -into a type of `cmp`. The CR is stored (and the CR.eq bit tested -against the `inv` field). -If the CR.eq bit is equal to `inv` then the Vector is truncated and -the loop ends. -Note that when RC1=1 the result elements are never stored, only the CRs. +[CPIR](http://z80-heaven.wikidot.com/instructions-set:cpir) which is +based on CP (compare) as the ultimate "element" (suffix) operation +to which the repeat (prefix) is applied. It is extremely useful for +reducing instruction count, however requires speculative execution +involving modifications of VL to get high performance implementations. +An additional mode (RC1=1) effectively turns what would otherwise be an +arithmetic operation into a type of `cmp`. The CR is stored (and the +CR.eq bit tested against the `inv` field). If the CR.eq bit is equal to +`inv` then the Vector is truncated and the loop ends. Note that when +RC1=1 the result elements are never stored, only the CRs. VLi is only available as an option when `Rc=0` (or for instructions -which do not have Rc). When set, the current element is always -also included in the count (the new length that VL will be set to). -This may be useful in combination with "inv" to truncate the Vector -to *exclude* elements that fail a test, or, in the case of implementations -of strncpy, to include the terminating zero. +which do not have Rc). When set, the current element is always also +included in the count (the new length that VL will be set to). This may +be useful in combination with "inv" to truncate the Vector to *exclude* +elements that fail a test, or, in the case of implementations of strncpy, +to include the terminating zero. 
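The effect of VLi on the truncated length can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the specification's algorithm: the hypothetical `test` callable stands in for the CR bit test (including any `inv` handling), and `ffirst_truncate` is an invented name used only for illustration.

```python
# Simplified model of data-dependent fail-first truncation.
# Assumptions: `test` abstracts the per-element CR bit test, and
# `ffirst_truncate` is a hypothetical helper, not a spec function.
def ffirst_truncate(values, test, vl, vli=False):
    """Return the new VL after scanning elements in Program Order."""
    for i in range(vl):
        if not test(values[i]):
            # Test failed at element i: stop here.
            # With VLi the failing element is counted, otherwise not.
            return i + 1 if vli else i
    return vl  # no element failed: VL is left unchanged

# strncpy-like example: stop at the first zero byte.
data = list(b"hi\x00xx")

def nonzero(byte):
    return byte != 0

vl_exclusive = ffirst_truncate(data, nonzero, len(data))
vl_inclusive = ffirst_truncate(data, nonzero, len(data), vli=True)
print(vl_exclusive, vl_inclusive)
```

With VLi clear, the new VL covers only the elements that passed the test; with VLi set, it also covers the element that failed, which in the strncpy-style case is the terminating zero.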
In CR-based data-driven fail-on-first there is only the option to select and test one bit of each CR (just as with branch BO). For more complex @@ -537,24 +531,21 @@ ffirst LD/ST operations on an aligned boundary. Likewise, to reduce workloads or balance resources. CR-based data-dependent first on the other hand MUST not truncate VL -arbitrarily to a length decided by the hardware: VL MUST only be -truncated based explicitly on whether a test fails. -This because it is a precise test on which algorithms -will rely. +arbitrarily to a length decided by the hardware: VL MUST only be truncated +based explicitly on whether a test fails. This because it is a precise +test on which algorithms will rely. -*Note: there is no reverse-direction for Data-dependent Fail-First. -REMAP will need to be activated to invert the ordering of element -traversal.* +*Note: there is no reverse-direction for Data-dependent Fail-First. REMAP +will need to be activated to invert the ordering of element traversal.* ### Data-dependent fail-first on CR operations (crand etc) -Operations that actually produce or alter CR Field as a result -do not also in turn have an Rc=1 mode. However it makes no -sense to try to test the 4 bits of a CR Field for being equal -or not equal to zero. Moreover, the result is already in the -form that is desired: it is a CR field. Therefore, -CR-based operations have their own SVP64 Mode, described -in [[sv/cr_ops]] +Operations that actually produce or alter CR Field as a result do not +also in turn have an Rc=1 mode. However it makes no sense to try to test +the 4 bits of a CR Field for being equal or not equal to zero. Moreover, +the result is already in the form that is desired: it is a CR field. +Therefore, CR-based operations have their own SVP64 Mode, described in +[[sv/cr_ops]] There are two primary different types of CR operations: @@ -568,16 +559,16 @@ More details can be found in [[sv/cr_ops]]. Pred-result mode may not be applied on CR-based operations. 
-Although CR operations (mtcr, crand, cror) may be Vectorised, -predicated, pred-result mode applies to operations that have -an Rc=1 mode, or make sense to add an RC1 option. +Although CR operations (mtcr, crand, cror) may be Vectorised, predicated, +pred-result mode applies to operations that have an Rc=1 mode, or make +sense to add an RC1 option. -Predicate-result merges common CR testing with predication, saving on -instruction count. In essence, a Condition Register Field test -is performed, and if it fails it is considered to have been -*as if* the destination predicate bit was zero. Given that -there are no CR-based operations that produce Rc=1 co-results, -there can be no pred-result mode for mtcr and other CR-based instructions +Predicate-result merges common CR testing with predication, saving +on instruction count. In essence, a Condition Register Field test is +performed, and if it fails it is considered to have been *as if* the +destination predicate bit was zero. Given that there are no CR-based +operations that produce Rc=1 co-results, there can be no pred-result +mode for mtcr and other CR-based instructions Arithmetic and Logical Pred-result, which does have Rc=1 or for which RC1 Mode makes sense, is covered in [[sv/normal]] @@ -594,41 +585,44 @@ SV is applied. Numbering relationships for CR fields are already complex due to being in BE format (*the relationship is not clearly explained in the v3.0B -or v3.1 specification*). However with some care and consideration -the exact same mapping used for INT and FP regfiles may be applied, -just to the upper bits, as explained below. Firstly and most -importantly a new notation -`CR{field number}` is used to indicate access to a particular -Condition Register Field (as opposed to the notation `CR[bit]` -which accesses one bit of the 32 bit Power ISA v3.0B -Condition Register). +or v3.1 specification*). 
However with some care and consideration the +exact same mapping used for INT and FP regfiles may be applied, just to +the upper bits, as explained below. Firstly and most importantly a new +notation `CR{field number}` is used to indicate access to a particular +Condition Register Field (as opposed to the notation `CR[bit]` which +accesses one bit of the 32 bit Power ISA v3.0B Condition Register). `CR{n}` refers to `CR0` when `n=0` and consequently, for CR0-7, is defined, in v3.0B pseudocode, as: +``` CR{n} = CR[32+n*4:35+n*4] +``` -For SVP64 the relationship for the sequential -numbering of elements is to the CR **fields** within -the CR Register, not to individual bits within the CR register. +For SVP64 the relationship for the sequential numbering of elements is to +the CR **fields** within the CR Register, not to individual bits within +the CR register. The `CR{n}` notation is designed to give *linear sequential numbering* in the Vector domain on a straight sequential Vector Loop. In OpenPOWER v3.0/1, BF/BT/BA/BB are all 5 bits. The top 3 bits (0:2) -select one of the 8 CRs; the bottom 2 bits (3:4) select one of 4 bits -*in* that CR (EQ/LT/GT/SO). The numbering was determined (after 4 months of +select one of the 8 CRs; the bottom 2 bits (3:4) select one of 4 bits *in* +that CR (EQ/LT/GT/SO). The numbering was determined (after 4 months of analysis and research) to be as follows: +``` CR_index = (BA>>2) # top 3 bits bit_index = (BA & 0b11) # low 2 bits CR_reg = CR{CR_index} # get the CR # finally get the bit from the CR. CR_bit = (CR_reg & (1<>2) # top 3 bits if spec[0]: # vector mode, 0-124 increments of 4 @@ -657,6 +653,7 @@ algorithm to determine CR\_reg is modified to as follows: CR_reg = CR{CR_index} # get the CR # finally get the bit from the CR. CR_bit = (CR_reg & (1< 0 ... 
etc +``` If a "cumulated" CR based analysis of results is desired (a la VSX CR6) then a followup instruction must be performed, setting "reduce" mode on @@ -764,6 +763,7 @@ despite being auto-generated, are part of the Specification. illustration of normal mode add operation: zeroing not included, elwidth overrides not included. if there is no predicate, it is set to all 1s +``` function op_add(rd, rs1, rs2) # add not VADD! int i, id=0, irs1=0, irs2=0; predval = get_pred_val(FALSE, rd); @@ -780,6 +780,7 @@ overrides not included. if there is no predicate, it is set to all 1s STATE.srcoffs = 0; # reset return; } +``` This has several modes: @@ -810,7 +811,9 @@ mark instructions as "prefixed". A reasonable (prototype) starting point: +``` svp64 [field=value]* +``` Fields: @@ -824,7 +827,9 @@ similar to x86 "rex" prefix. For actual assembler: +``` sv.asmcode/mode.vec{N}.ew=8,sw=16,m={pred},sm={pred} reg.v, src.s +``` Qualifiers: @@ -889,22 +894,21 @@ Its application by default is that: For more complex applications a REMAP Schedule must be used -*Programmers's note: -if passed a predicate mask with only one bit set, this algorithm -takes no action, similar to when a predicate mask is all zero.* +*Programmers's note: if passed a predicate mask with only one bit set, +this algorithm takes no action, similar to when a predicate mask is +all zero.* *Implementor's Note: many SIMD-based Parallel Reduction Algorithms are implemented in hardware with MVs that ensure lane-crossing is minimised. -The mistake which would be catastrophic to SVP64 to make is to then -limit the Reduction Sequence for all implementors -based solely and exclusively on what one -specific internal microarchitecture does. -In SIMD ISAs the internal SIMD Architectural design is exposed and imposed on the programmer. 
Cray-style Vector ISAs on the other hand provide convenient, -compact and efficient encodings of abstract concepts.* -**It is the Implementor's responsibility to produce a design -that complies with the above algorithm, -utilising internal Micro-coding and other techniques to transparently -insert micro-architectural lane-crossing Move operations +The mistake which would be catastrophic to SVP64 to make is to then limit +the Reduction Sequence for all implementors based solely and exclusively +on what one specific internal microarchitecture does. In SIMD ISAs +the internal SIMD Architectural design is exposed and imposed on the +programmer. Cray-style Vector ISAs on the other hand provide convenient, +compact and efficient encodings of abstract concepts.* **It is the +Implementor's responsibility to produce a design that complies with the +above algorithm, utilising internal Micro-coding and other techniques to +transparently insert micro-architectural lane-crossing Move operations if necessary or desired, to give the level of efficiency or performance required.** @@ -914,6 +918,7 @@ Element-width overrides are best illustrated with a packed structure union in the c programming language. 
The following should be taken literally, and assume always a little-endian layout: +``` #pragma pack typedef union { uint8_t b[]; @@ -924,11 +929,13 @@ literally, and assume always a little-endian layout: } el_reg_t; elreg_t int_regfile[128]; +``` Accessing (get and set) of registers given a value, register (in `elreg_t` form), and that all arithmetic, numbering and pseudo-Memory format is LE-endian and LSB0-numbered below: +``` elreg_t& get_polymorphed_reg(elreg_t const& reg, bitwidth, offset): el_reg_t res; // result res.l = 0; // TODO: going to need sign-extending / zero-extending @@ -961,6 +968,7 @@ LE-endian and LSB0-numbered below: int_regfile[reg].i[offset] = val elif bitwidth == 64: int_regfile[reg].l[offset] = val +``` In effect the GPR registers r0 to r127 (and corresponding FPRs fp0 to fp127) are reinterpreted to be "starting points" in a byte-addressable @@ -974,6 +982,7 @@ write-enable lines, just like most SRAMs, in order to avoid READ-MODIFY-WRITE. An example ADD operation with predication and element width overrides: +```  for (i = 0; i < VL; i++) if (predval & 1<