2021-01-31

* GCC: Started debugging regressions in the stage1 non-svp64
  compiler. Noticed that the renaming of mov to @altivec_mov
  removed expanders for some modes used by altivec but not by
  svp64. Reintroduced them, and added floating-point svp64 mov
  patterns. Split out of the main patch a preparation patch that
  could be submitted upstream right away, for it just prepares for
  register renumbering. Fixed conditional register usage that,
  when svp64 was not enabled, caused the LAST/MAX_* non-SVP64
  registers to be marked as fixed, which caused the frame pointer
  not to be preserved across calls. Fixed the *logue routines to
  account for the register renumbering. Fixed the svp64 add
  expander to use a correct expansion of <VI_unit>, covering V2DI
  at power8. With that, we're down to a single C regression, when
  not enabling svp64. The expected behavior is for the compiler to
  optimize gcc.target/powerpc/dform-3.c's gpr function so that p->c
  is (re)loaded into e.g. r10&r11 with a vsx_movv2df_64bit, because
  the MEM cost for its reload is negative; the svp64-modified
  compiler instead keeps both instructions, first loading into a
  VSX reg, then splitting it into a pair of GPRs. It is a
  performance bug, but the generated code works. Trying a
  bootstrap! Stage2 wouldn't build because of /* within comments
  in rs6000-modes.def; adjusted the commented-out entries I'd put
  in to avoid that. The memory move costs were off because of the
  use of literal regno 32 when computing the costs for FPR classes.
  Regstrapped the prepping patch successfully, then went back to
  the patch that introduces svp64 support, still disabled. -msvp64
  is still slightly broken, but without enabling it we may be down
  to no regressions; testing should confirm. (14:37)

2021-01-30

* GCC: Fixed the boundaries of the loops that disable SVP64
  registers when SVP64 is not enabled. Fixed macros used for
  parameter and return value assignment to reflect the new FP
  numbers. Required at least one register operand for the svp64
  vector mov pattern. Added emission of the altivec insn when not
  using svp64. Introduced a first svp64 reload change for
  preferred_reload, to avoid trying to reload constants into
  altivec registers. A lot more work will be needed for svp64
  reloads. A non-svp64 native compiler builds stage1, but the
  compiler is still pretty broken, with thousands of regressions.
  An svp64 one builds stage1 and fails in libgcc, with a bunch of
  asm failures because of (unsupported) sv.* opcodes, and one
  reload failure in decContext, which I started investigating.
  (8:22)

2021-01-27

* µW (1:10)

2021-01-26

* VCoffee (1:41)

2021-01-24

* GCC: Introduced vector modes, registers, classes, constraints,
  renumbered and remapped registers, went over literals referring
  to register numbers, and started implementation of
  move/load/store and add for the V*DI integral types. Still have
  to test that the compiler still works after the renumbering. The
  new insns are not generated yet; I haven't made the new registers
  usable for anything. (12:13)

2021-01-22

* 578: Specifying and debating the task with Luke and, later,
  Jacob. Difficulties in conveying the requirements and overcoming
  the complexities involved in figuring out how to parse each asm
  operand in Python, underspecification of the input language, and
  disagreement as to the complexity and the amount of work required
  to duplicate existing binutils functionality in Python, and then
  to duplicate this work one more time into binutils later, led
  Luke to take it upon himself.
* 579: Talked to Jacob a bit about potential implementation
  strategies. The need to build an immediate constant to use as
  the operand to .long/svp64 makes for plenty of complexity, even
  in C++. I'm again unhappy with a plan that involves so much
  intentional waste of effort. I'm also very surprised by the
  estimated amount of work involved in this task, compared with
  578, which is a much bigger one, with all the rewriting of an asm
  parser, and likely more rewriting as the extended asm syntax
  evolves. And thus pretty much a full workday ended up wasted,
  most of it complaining about planning to waste work. (8:29)

2021-01-19

* Virtual Coffee (1:39)

2021-01-13

* Microwatt meeting (1:08)

2021-01-07

* 572: New, split out of 570, on what .[sv], elwidth, subvl
  affect in load/store ops: the address [vector] or the in-memory
  [vector]?

2021-01-06

* 570: New. It's not specified whether the sub-dword bytes
  selected by elwidth get byte-reversed into LE before or after the
  selection. The specs say we convert loaded words to LE as
  quickly as possible, so that all internal operations are LE, but
  this would lead to reversal of sub-register vector elements when
  loading, even when using svp64 loads with the correct
  elwidth_src.
* 569: New. Also concerned about how to get bit arrays properly
  loaded into predicate registers so that the *bits* are reversed
  to match LE requirements.
* 568: New. After getting clarification from Jacob about setvl's
  behavior: VL gets set to MIN(VL, MAXVL); you can count on its not
  being a smaller value. This is documented only in pseudocode; it
  could be made more self-evident. (3:13)
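
  A minimal Python sketch of this reading of setvl's clamping; the
  function and argument names are mine, not from the spec:

    def setvl(requested_vl, maxvl):
        # Per the clarification in 568: the effective VL is the
        # requested VL clamped to MAXVL, and callers can rely on it
        # not coming back smaller than that.
        return min(requested_vl, maxvl)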

2021-01-05

* 567: Cesar filed it for me; I clarified it a bit further.

2021-01-04

* 560: Tried to show I understand the effects of loads and
  byte-swapping loads in both endiannesses, and restated my
  suggestion of iteration order matching the natural memory layout
  of arrays/vectors. (1:46)

2021-01-03

* 560: Pointed out the circular reasoning of assuming LE when
  showing it works for LE and BE; stated the problem with BE and
  how the current BE status is incompatible both with PPC vectors
  and with how svp64 vectors are said to be expected to work.
  Recommended ruling BE out entirely for now; if the approach is
  not to look into the problems, this will result in broken,
  self-inconsistent specs that we'll either have to discontinue or
  carry indefinitely.
* 558: Looked at the riscv implementation, particularly commit
  4922a0bed80f8fa1b7d507eee6f94fb9c34bfc32, the testcases in
  299ed9a4eaa569a5fc2b2291911ebf55318e85e4, the reduction of
  redundant setvli in e71a47e3cd553cec24afbc752df2dc2214dd3850, and
  5fa22b2687b1f6ca1558fb487fc07e83ffe95707, which enables vl not to
  be a power of two.
* 560: Wrote up on significance, ordering, endianness and related
  conventions. (6:21)

2020-12-30

* 559: Luke split out the issue of whether we should have
  automatic detection and reversal of direction of vectors, so that
  they always behave as if parallel, even if implemented as
  sequential. Jacob pointed out that reversal is not enough for
  some 3-operand cases.
* SVP64: Second review call.
* 562: Filed, on elwidth encoding.
* 558: Raised the need for the compiler to be able to save and
  restore VL, if it's exposed separately from maxvl; also brought
  up calling conventions.
* 560: Commented on a potential endianness issue: the identity of
  a register as a scalar and as the first element of a vector
  starting at that register. More questions on issues that arise
  in big-endian mode, and on compatibility we may wish to aim for.
  Some difficulty in getting even a conversation going on
  endianness-influenced sub-register iteration order; presented a
  simple scenario that demonstrates the fundamental programming
  problems that will arise out of favoring LE as we seem to.
* 558: Explained why disregarding things the compiler will do on
  its own, and arguing it shouldn't do that, doesn't make the
  initial project simpler, but harder, and also more fragile and
  likely to be throw-away code in the end. Argued in favor of
  seeing where we want to get to in the end, and then mapping out
  what it takes to get the features we want for the first stage, so
  that it's a step in the general direction of the end goal. (6:43)

2020-12-28

* 558: Commented on vector modes, insns, regalloc, scheduling,
  auto-vectorization, intrinsics, and the possibilities of vector
  length and component modes as parameters to template insns and
  intrinsics, and of mechanical generation thereof. (2:22)

2020-12-26

* SVP64: Reviewed overview and proposed encoding, posted more
  questions. (2:30)

2020-12-25

* Email backlog.
* SVP64: More studying, more making sense. Asked about
  parallelism vs dependencies. (3:02)

* 550: Implemented the first cut at the svp64 prefix in the
  assembler, namely a 32-bit pseudo-insn that takes a 24-bit
  immediate operand, encoding it as an insn with EXT01 as the major
  opcode, MSB0 bits 7 and 9 also set, and the top two bits of the
  immediate shuffled into bits 6 and 8. Added the patch to
  bugzilla and to the wiki. Updated status. (1:41)
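
  The following Python sketch is my reading of the bit layout just
  described, using MSB0 numbering (bit 0 is the most significant
  bit of the 32-bit word); it illustrates the description, it is
  not the binutils patch itself:

    def svp64_prefix(imm24):
        # 32-bit pseudo-insn carrying a 24-bit immediate: EXT01 as
        # the major opcode (assumed here to be primary opcode 1, in
        # MSB0 bits 0-5), MSB0 bits 7 and 9 set, the top two
        # immediate bits shuffled into MSB0 bits 6 and 8, and the
        # remaining 22 immediate bits in MSB0 bits 10-31.
        assert 0 <= imm24 < (1 << 24)
        word = 0b000001 << 26                      # EXT01 major opcode
        word |= 1 << (31 - 7)                      # MSB0 bit 7
        word |= 1 << (31 - 9)                      # MSB0 bit 9
        word |= ((imm24 >> 23) & 1) << (31 - 6)    # imm bit 23 -> MSB0 bit 6
        word |= ((imm24 >> 22) & 1) << (31 - 8)    # imm bit 22 -> MSB0 bit 8
        word |= imm24 & ((1 << 22) - 1)            # imm bits 21..0 -> MSB0 bits 10-31
        return word

  For example, a zero immediate yields the bare prefix word
  0x05400000 under this reading.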

2020-12-23

* SVP64: Review meeting.
* 555: Reduce flag/s for fma. Commented on the possibilities.
  (1:26)

2020-12-20

* 532: Implemented logic for mode-switching 32-bit insns with 6
  bits for the opcode, a 16-bit embedded compressed insn, and 10
  bits corresponding to subsequent insns, to tell whether or not
  each of them is compressed. This nearly doubled the compression
  rate, using one such mode-switching insn per 3 compressed insns.
  (1:48)
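
  The bit budget adds up: 6 + 16 + 10 = 32. Below is a hedged
  Python sketch of such a packing; the placement of the fields
  within the word is my own illustration, not necessarily the
  implemented layout:

    def pack_mode_switch(opcode6, embedded16, compressed_flags10):
        # One 32-bit mode-switching insn: a 6-bit opcode, a 16-bit
        # compressed insn carried inline, and 10 flag bits, one per
        # following insn, saying whether that insn is compressed.
        # Field placement is illustrative only.
        assert opcode6 < (1 << 6)
        assert embedded16 < (1 << 16)
        assert compressed_flags10 < (1 << 10)
        return (opcode6 << 26) | (embedded16 << 10) | compressed_flags10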

2020-12-14

* 532: Reported on compression ratio findings and analyses.
  (1:06)

2020-12-13

* 532: Questioned some bullets under 16-imm opcodes. Implemented
  condition register and system opcodes, 16-imm opcodes, extended
  load and store to cover 16-imm modes, condition bit expression
  parsing and finally bc 16-imm and bclr 10- and 16-bit opcodes.
  Tested a bit by visual inspection, introduced logic to backtrack
  into 32-bit and count such pairs as 10-bit nop + 16-imm insn,
  followed by 32-bit. Fixed size estimation: count[2] was still
  counted as 16+16-imm, rather than a single 16-imm. (5:30)

2020-12-06

* 532: Adjusted the logic in comp16-v1-skel.py for 16-bit 16-imm
  rather than the 16+16 I'd invented. Implemented the most
  relevant opcodes for 10-bit, and many of the 16-bit ones too.
  Not yet implemented are conditional branches and the Immediate,
  CR and System opcodes. With all of nop, unconditional branch,
  ld/st, arithmetic, logical and floating-point, we get less than
  3% compression in GCC, with not-entirely-unreasonable reg
  subsets. It's not looking good. (8:27)

2020-12-02

* Microwatt meeting.
* 238: Added some thoughts on bl and blr, and implications about
  modes. Also detailed my worries about how to preserve dynamic
  state, specifically switch-back-to-compressed-after-insn, across
  interrupts. (1:44)

2020-11-30

* 238: Settled the N-without-M issue; it was likely an error in
  the tables. Raised an inconsistency in the decoder pseudocode's
  reversal of M and N. Returned to the uncertainty and the need
  for specifying how to handle conflicts between
  standard-then-compressed followed by 10-bit with M=0. Raised the
  issue of missing documentation that branch targets are always
  uncompressed, not just 32-bit aligned. Raised the issue of the
  purpose of the M and N bits, particularly in unconditional
  branches. Explained why I believe the phase 1 decoder has to
  look at the Cmaj.m bits to tell whether or not N is there,
  brought up the crnand and crand encodings as an example, and
  asked whether crand with M=0 should switch to 32-bit mode for
  only one insn, because the bit that usually holds N is 1, or
  permanently, because there's no N field in the applicable
  encoding. (2:33)

* 238: Detailed the motivations for my proposal of bit-shuffling
  in the 16-bit encoding, to reduce wires and selections in the
  realigning muxer. Restated my question on N without M, as I
  can't relate the answer to the question; it appears to have been
  misunderstood. Further expanded on the advantages of moving the
  Cmaj.m and M bits as suggested, even going as far as enabling an
  extended compressed opcode reusing the bit that signals a match
  for a 10-bit insn in uncompressed mode. (3:29)

2020-11-29

* 238: Noted some apparent contradictions in the rejection of
  extended 16-bit insns in the face of 16+16-bit insns. Luke hit
  me with the clarification that there's no such thing as a
  16+16-bit insn in compressed mode, and I could see how I'd
  totally made it up by myself by reviewing the proposal. Hit upon
  and asked other questions: what's the N for when there's no M,
  and what are the SV prefixes mentioned there, now that I no
  longer assume them to be something like extend-next. Then I
  recorded some thoughts on minimizing the bits the muxer has to
  look into, by mapping the bits that encode N, Cmaj.m and M onto
  the same bits that, in traditional mode, encode the primary
  opcode. Next, I was hit by the realization that, if we change
  the perspective from "uncompressed insns used to be 32-bit only"
  to "uncompressed insns can be 32- or 16-bit depending on the
  opcode", on account of the 10-bit insns, then we already need to
  take the opcode into account to tell whether we're looking at a
  16- or 32-bit insn; so why is that ok there, but not ok in
  compressed mode? Finally, I proposed an encoding scheme that
  encodes the lengths of subsequent insns in an early insn,
  achieving more coverage for 16-bit insns, a better compression
  limit, far more flexible mode switching, enabling savings in far
  sparser settings, and without eating up a pair of primary
  opcodes: the 32-bit mode-switching insn could even be an extended
  opcode, though it would probably not have as many pre-length
  encoding bits then. It would fit an entire 16-bit insn, which
  could do useful work, or queue up further pre-length bits, which
  correspond to static upcoming insns and tell whether to decode
  them as 32-bit or as (pairs of?) 16-bit ones. Compared max
  ratio, representation overhead, and break-even density. Shared
  some more thoughts on 48- and 64-bit insns. (7:39)

* 532: Got a little confused about some encodings; it's not clear
  whether the N and M bits in 16-bit instructions have a uniform
  interpretation, or whether some proposed opcodes are repurposing
  them. I'm surprised by such short immediate operands in the
  immediate instructions, if they don't get a 16-bit extension, or
  otherwise by the apparent requirement for an extended 16-bit
  immediate for something as simple as an mr encoded as addi.
  Asked for clarification. Not sure how to proceed before I get
  it; the logic of the estimator would be too significantly
  impacted. (2:48)

2020-11-28

* 532: Figured out and implemented the logic to infer mode
  switching for best compression under the "attempt 1" proposed
  encoding, namely with 10-bit insns, 16-bit insns, 16+16-bit
  insns, and 32-bit insns. 10-bit insns appear in uncompressed
  mode, and can be followed by insns in either mode; 16-bit ones
  appear in compressed mode, and can remain in compressed mode, or
  switch to uncompressed mode for 1 insn or for good; 16+16-bit
  ones appear in compressed mode, and cannot switch modes; 32-bit
  ones appear only in uncompressed mode, or in the single-insn slot
  after a 16-bit insn that requests it. If we find a 16-bit insn
  while we're in uncompressed mode, we use a 10-bit nop to
  tentatively switch. Insns that can be encoded in 10 bits, but
  appear in compressed mode, had better be encoded in 16 bits, for
  that offers further subsequent encoding options, without
  downsides for size estimation. Insns that can be encoded as
  16+16-bit decay to 32-bit if in uncompressed mode, or if, after a
  sequence thereof, a later insn forces a switch to 32-bit mode
  without an intervening switching insn. Still missing: the code
  to select what insns can be encoded in what modes. (A simplified
  sketch of this switching logic follows this entry.) (6:42)

* 532: Implemented a skeleton for compression ratio estimation,
  initially with the simpler mode switching of the 8-bit nop and
  odd-address 16-bit insns. Next, rewrite it for all the
  complexity of mode switching envisioned for the "attempt 1"
  proposal. (2:02)
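
  Below is a simplified Python sketch of the mode-switching size
  estimation described in the first item above; the simplifications
  (dropping the 16+16 forms and the switch-for-one-insn escape, and
  assuming both 10- and 16-bit forms occupy two bytes) are mine,
  and this is not the actual comp16 estimator:

    def estimate_size(insns):
        # `insns` is a list of sets; each set holds the widths
        # (10, 16, 32) the corresponding insn can be encoded in.
        # Returns the estimated size in bytes under a greedy
        # mode-switching policy.
        size = 0
        compressed = False
        for forms in insns:
            if compressed:
                if forms & {10, 16}:
                    # Stay in compressed mode; insns that would fit
                    # in 10 bits are encoded as 16-bit here, since
                    # that keeps later switching options open.
                    size += 2
                else:
                    # Decay to a 32-bit insn; the preceding 16-bit
                    # insn is assumed to carry the switch back to
                    # uncompressed mode.
                    size += 4
                    compressed = False
            else:
                if 10 in forms:
                    size += 2      # 10-bit form, valid in uncompressed mode
                elif 16 in forms:
                    # Spend a 10-bit nop to tentatively switch into
                    # compressed mode, then emit the 16-bit insn.
                    size += 2 + 2
                    compressed = True
                else:
                    size += 4      # plain 32-bit insn
        return size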

2020-11-23

* 238: Debating various possibilities of 16-bit encoding. (5:20)

* 532: Wrote a histogram Python script that breaks counts down per
  opcode and, within each opcode, by operands. (2:05)
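
  A rough Python sketch of what such a histogram script can look
  like, here reading objdump -d output; the real script's input
  format and breakdown may differ:

    import collections
    import re
    import subprocess
    import sys

    def insn_histogram(objfile):
        # Count insns per mnemonic and, within each mnemonic, per
        # operand combination, from objdump disassembly.
        by_opcode = collections.Counter()
        by_operands = collections.defaultdict(collections.Counter)
        dump = subprocess.run(["objdump", "-d", objfile],
                              capture_output=True, text=True,
                              check=True).stdout
        for line in dump.splitlines():
            m = re.match(r"\s*[0-9a-f]+:\s+(?:[0-9a-f]{2} )+\s*"
                         r"([a-z0-9.+-]+)\s*(.*)", line)
            if not m:
                continue
            mnemonic, operands = m.group(1), m.group(2).strip()
            by_opcode[mnemonic] += 1
            by_operands[mnemonic][operands] += 1
        return by_opcode, by_operands

    if __name__ == "__main__":
        by_opcode, by_operands = insn_histogram(sys.argv[1])
        for mnemonic, count in by_opcode.most_common():
            print(f"{count:8} {mnemonic}")
            for ops, n in by_operands[mnemonic].most_common(5):
                print(f"         {n:6} {ops}")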

2020-11-22

* 529: Brought up the possibilities of using 8-bit nops to switch
  between modes, so that 16-bit insns would be at odd addresses, so
  that we could use the full 16 bits; and of using 2-operand insns
  instead of 3-operand ones for 16-bit mode, so as to increase the
  coverage of the compact encoding.
* 238: Luke moved the comment above here, where it belonged.
* 529: Elaborated on how using actual odd addresses for 16-bit
  insns would be dealt with WRT endianness. Prompted by Luke,
  added it to the wiki.
* Wiki: Added self to team. (11:50)

2020-11-21

* 532: Wrote patch for binutils to print insn histogram.
* Mission: Restated the proposal of adding "and users" to the
  mission statement, next to customers, as those we wish to enable
  to trust our products. (6:48)

2020-11-20

* Reposted join message to the correct list.
* 238: Started looking into it, from
  https://libre-soc.org/openpower/sv/16_bit_compressed/

2020-11-19

* Joined.