lxo/ChangeLog

   1 2021-02-20
   2
   3         * GCC: Lowering DWARF_FRAME_REGISTERS, once I rebuilt the
   4         compiler, libgcc, and the test program, avoided the problem.  That
   5         didn't make much sense, so I reversed that change and got back to
   6         debugging.  The signal frame seemed to be unwound correctly, but
   7         instead of using the linux-unwind fallback frame stuff, that I'd
   8         messed with a week before, I noticed it was using frame info from
   9         the __kernel_sigtramp_rt64 entry point in the kernel-supplied vdso.
  10         Though I'm pretty sure that changing that file got me some
  11         different results the week before, with vdso it couldn't possibly
  12         be where things got wrong.  So I proceeded to unwinding the frame
  13         until we hit the caller of the infinitely-recursive function, and
  14         found we got to the end of the stack before reaching it.  Huh?  A
  15         GDB stack trace also hit the same problem.  Oh, maybe there was
  16         something wrong with the frame info for those early calls in the
  17         thread.  But the stack trace only stopped at the third or fourth
  18         recursive call.  That seemed fishy, so I started the program over,
  19         and checked the stack trace at the point of the signal delivery,
  20         and found it was fine.  I stepped into the signal handler, and
  21         into the exception raising machinery, and it was still fine.  Only
  22         after we started the unwinding did it get corrupted.  At first I
  23         suspected something going wrong because of out-of-range accesses
  24         to the regs array, recompiled compiler and library and program
  25         just to be sure, and still the same issue.  Finally, it occurred
  26         to me to check where the alternate stack, on which the stack
  27         overflow signal was handled, was located, and I found it running
  28         into the other end of the task's stack.  Turns out the Ada
  29         runtime, when starting a task, allocates an alt stack to handle
  30         stack overflows out of the stack itself.  With the larger register
  31         file, unwinding was taking up more of the alt stack space,
  32         overflowing it and thus overwriting part of the task's call stack,
  33         corrupting it to the point that the unwinder could no longer reach
  34         the exception handler in the task setup code, supposed to catch an
  35         escaping exception for the task parent to analyze/reraise.
  36         Growing the alt stack size in the Ada runtime fixes the problem,
  37         but since this explains why lowering DWARF_FRAME_REGISTERS avoided
  38         the problem, I'm now happy to have it set to the lower value, at
  39         least until call-saved SVP64 regs are needed.  Adjusted other
  40         references to ARG_POINTER_REGNUM in libgcc to use a fixed index.
  41         Wrote a blog post about this, while regstrapping the fix.
  42         https://www.fsfla.org/blogs/lxo/2021-02-20-longest-debugging-session.en.html
  43         Success, no regressions.  (9:09)
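
A toy model of why a larger register file can overflow a fixed-size alt stack, assuming the unwinder keeps a fixed amount of state per DWARF frame register on the stack.  Every number below (register counts, per-register bytes, alt stack size, other usage) is hypothetical and chosen only for illustration; none is a measured libgcc or Ada runtime value.

```python
# Toy model only: all sizes and register counts are hypothetical.
POINTER_BYTES = 8


def unwind_state_bytes(n_regs, per_reg_bytes=3 * POINTER_BYTES, fixed_bytes=256):
    """Rough stack bytes for unwinder state that scales with register count."""
    return fixed_bytes + (n_regs + 1) * per_reg_bytes


def fits(alt_stack_bytes, n_regs, other_use_bytes):
    """True if the unwinder state plus other usage fits in the alt stack."""
    return unwind_state_bytes(n_regs) + other_use_bytes <= alt_stack_bytes


ALT_STACK = 16 * 1024      # hypothetical alt stack size
OTHER = 9 * 1024           # hypothetical handler and runtime usage
SMALL_REGS, LARGE_REGS = 120, 320  # hypothetical counts before/after growth

print(fits(ALT_STACK, SMALL_REGS, OTHER))
print(fits(ALT_STACK, LARGE_REGS, OTHER))
```

In this model either remedy from the entry works: lowering the register count or growing the alt stack brings the total back under budget.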
  44
  45 2021-02-13
  46
  47         * GCC: Found libgcc/config/linux-unwind.h using GCC's internal
  48         register numbers, and thus in need of renumbering as well.  Alas,
  49         the right fix didn't jump at me.  There's some confusion about
  50         using mapped register numbers or not.  Using the pristine
  51         libgcc_eh.a to link the program built with the new compiler, using
  52         newly-built libraries, it works, but with the new libgcc_eh.a, it
  53         fails, whether using 291 or 99 or 67 for R_AR, that used to be
  54         ARG_POINTER_REGNUM.  Changing R_AR and rebuilding doesn't alter
  55         anything within gcc/ada, so it's not the Ada runtime.  I guess
  56         I may have to go back to debugging, as it's not clear whether GCC
  57         is losing track of the frames or not finding the handler that
  58         would propagate the EH to the thread that activated the task.
  59         Tried experimenting with overriding DWARF_FRAME_REGISTERS to its
  60         original value.  (6:13)
  61
  62 2021-02-10
  63
  64         * MW (0:48)
  65
  66 2021-02-09
  67
  68         * VC (1:59)
  69
  70 2021-02-05
  71
  72         * GCC: Started investigating the remaining regressions, all in
  73         Ada.  They all turn out to be -fstack-check tests.  (0:40)
  74
  75 2021-01-31
  76
  77         * GCC: Started debugging regressions in the stage1 non-svp64
  78         compiler.  Noticed that the renaming of mov to @altivec_mov
  79         removed expanders for some modes used by altivec but not by svp64.
  80         Reintroduced them, and added floating-point svp64 mov patterns.
  81         Split out of the main patch a preparation patch that could be
  82         submitted upstream right away, for it just prepares for register
  83         renumbering.  Fixed conditional register usage that, when svp64
  84         was not enabled, caused the LAST/MAX_* non-SVP64 registers to be
  85         marked as fixed, which caused the frame pointer to not be
  86         preserved across calls.  Fixed the *logue routines to account for
  87         the register renumbering.  Fixed the svp64 add expander to use a
  88         correct expansion of <VI_unit>, covering V2DI at power8.  With
  89         that, we're down to a single C regression, when not enabling
  90         svp64.  While the expected behavior is for the compiler to
  91         optimize gcc.target/powerpc/dform-3.c's gpr function so that p->c
  92         is (re)loaded to e.g. r10&r11 with a vsx_movv2df_64bit, because
  93         the MEM cost for its reload is negative, the svp64-modified
  94         compiler keeps both such instructions, first
  95         loading to a VSX reg, then splitting it into a pair of GPRs.  It
  96         is a performance bug, but the generated code works.  Trying a
  97         bootstrap!  Stage2 wouldn't build because of /* within comments in
  98         rs6000-modes.def; adjusted the commented-out entries I'd put in to
  99         avoid that.  The memory move costs were off because of the use of
 100         literal regno 32 when computing the costs for FPR classes.
 101         Regstrapped the prepping patch successfully, then went back to the
 102         patch that introduces svp64 support, still disabled.  -msvp64 is
 103         still slightly broken, but without enabling it, we may be down to
 104         no regressions; testing should confirm.  (14:37)
 105
 106 2021-01-30
 107
 108         * GCC: Fixed the boundaries of the loops that disable SVP64
 109         registers when SVP64 is not enabled.  Fixed macros used for
 110         parameter and return value assignment to reflect the new FP
 111         numbers.  Required at least one register operand for the svp64
 112         vector mov pattern.  Emit the altivec insn when not using svp64.
 113         Introduced a first svp64 reload change for preferred_reload, to
 114         avoid trying to reload constants into altivec registers.  A lot
 115         more work will be needed for svp64 reloads.  A non-svp64 native
 116         compiler builds stage1, but the compiler is still pretty broken,
 117         with thousands of regressions.  An svp one builds stage1 and fails
 118         in libgcc, with a bunch of asm failures because of (unsupported)
 119         sv.* opcodes, and one reload failure in decContext, that I started
 120         investigating.  (8:22)
 121
 122 2021-01-27
 123
 124         * µW (1:10)
 125
 126 2021-01-26
 127
 128         * VCoffee (1:41)
 129
 130 2021-01-24
 131
 132         * GCC: Introduced vector modes, registers, classes, constraints,
 133         renumbered and remapped registers, went over literals referring to
 134         register numbers, and started implementation of move/load/store
 135         and add for the V*DI integral types.  Still have to test that the
 136         compiler still works after the renumbering.  The new insns are not
 137         generated yet, I haven't made the new registers usable for
 138         anything yet.  (12:13)
 139
 140 2021-01-22
 141
 142         * 578: Specifying and debating the task with Luke and, later,
 143         Jacob.  Difficulties in conveying the requirements and overcoming
 144         the complexities involved in figuring out how to parse each asm
 145         operand in Python, underspecification of the input language,
 146         disagreement as to the complexity and the amount of work required
 147         to duplicate existing binutils functionality in Python, and then
 148         duplicate this work one more time into binutils later, led Luke to
 149         take it upon himself.
 150         * 579: Talked to Jacob a bit about potential implementation
 151         strategies.  The need to build an immediate constant to use as the
 152         operand to .long/svp64 makes for plenty of complexity, even in
 153         C++.  I'm again unhappy with a plan that involves so much
 154         intentional waste of effort.  I'm also very surprised with the
 155         estimated amount of work involved in this task, compared with
 156         578, that is a much bigger one with all the rewriting of an asm
 157         parser, and likely more rewriting as the extended asm syntax
 158         evolves.  And thus pretty much a full workday ends up wasted,
 159         most of it complaining about planning to waste work.  (8:29)
 160
 161 2021-01-19
 162
 163         * Virtual Coffee (1:39)
 164
 165 2021-01-13
 166
 167         * Microwatts meeting (1:08)
 168
 169 2021-01-07
 170
 171         * 572: New, split out of 570, on what .[sv], elwidth, subvl
 172         affect in load/store ops: the address [vector] or the in-memory
 173         [vector]?
 174
 175 2021-01-06
 176
 177         * 570: New.  It's not specified whether elwidth-selected
 178         sub-dword bytes get byte-reversed into LE before or after the
 179         selection.  The specs say we convert loaded words to LE as quickly
 180         as possible, so that all internal operations are LE, but this
 181         would lead to reversal of sub-register vector elements when
 182         loading, even when using svp64 loads with the correct elwidth_src.
 183         * 569: New.  Also concerned about how to get bit arrays properly
 184         loaded into predicate registers so that the *bits* are reversed to
 185         match LE requirements.
 186         * 568: New, after getting clarification from Jacob about setvl's
 187         behavior: VL gets set to MIN(VL, MAXVL); you can count on its not
 188         being a smaller value.  This is documented only in pseudocode; it
 189         could be made more self-evident.  (3:13)
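
A minimal sketch of the setvl rule noted in 568, plus the strip-mining loop that relies on it.  The function names are made up for illustration; only the MIN(VL, MAXVL) rule comes from the entry above.

```python
def setvl(requested_vl, maxvl):
    """Return the VL actually set: MIN(requested VL, MAXVL)."""
    if requested_vl < 0 or maxvl < 0:
        raise ValueError("vector lengths must be non-negative")
    return min(requested_vl, maxvl)


def strip_mine(n, maxvl):
    """Yield (start, vl) chunks covering n elements, at most maxvl each."""
    if maxvl <= 0:
        raise ValueError("maxvl must be positive to make progress")
    start = 0
    while start < n:
        vl = setvl(n - start, maxvl)
        yield start, vl
        start += vl


print(list(strip_mine(10, 4)))
```

Because VL is never smaller than MIN(requested, MAXVL), the loop can count on every chunk but the last being full.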
 190
 191 2021-01-05
 192
 193         * 567: Cesar filed it for me; I clarified it a bit further.
 194
 195 2021-01-04
 196
 197         * 560: Tried to show I understand the effects of loads and
 198         byte-swapping loads in both endiannesses, and restated my
 199         suggestion of iteration order matching the natural memory layout
 200         of arrays/vectors.  (1:46)
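
A small sketch of the iteration-order concern, assuming sub-register elements are iterated from the least significant bits upward (my reading of "favoring LE"; the helper names are hypothetical).  It loads the same four 16-bit elements from an LE and a BE memory image with the matching dword load and compares the order seen in the register with the array order in memory.

```python
import struct

ELEMS = [0x0102, 0x0304, 0x0506, 0x0708]  # array elements in memory order


def load_dword(mem, byteorder):
    """Load 8 bytes as one 64-bit register value with the given endianness."""
    return int.from_bytes(mem[:8], byteorder)


def elements_low_first(reg, width=16, count=4):
    """Split a register into elements, least significant element first."""
    mask = (1 << width) - 1
    return [(reg >> (width * i)) & mask for i in range(count)]


le_mem = struct.pack("<4H", *ELEMS)   # array as laid out by an LE program
be_mem = struct.pack(">4H", *ELEMS)   # array as laid out by a BE program

le_reg = load_dword(le_mem, "little")
be_reg = load_dword(be_mem, "big")

print(elements_low_first(le_reg) == ELEMS)        # memory order preserved
print(elements_low_first(be_reg) == ELEMS[::-1])  # memory order reversed
```

With low-first iteration, the LE case walks the array in memory order, while the BE case walks it backwards unless the iteration order follows the memory layout instead.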
 201
 202 2021-01-03
 203
 204         * 560: Pointed out the circular reasoning in assuming LE in
 205         showing it works for LE and BE, stated the problem with BE and how
 206         the current BE status is incompatible with both PPC vectors and
 207         with how svp64 vectors are said to be expected to work.
 208         Recommended ruling BE out entirely for now: if the approach is to
 209         not look into the problems, this will result in broken,
 210         self-inconsistent specs that we'll either have to discontinue or
 211         carry indefinitely.
 212         * 558: Looked at the riscv implementation, particularly commit
 213         4922a0bed80f8fa1b7d507eee6f94fb9c34bfc32, the testcases in
 214         299ed9a4eaa569a5fc2b2291911ebf55318e85e4, and the reduction of
 215         redundant setvli in e71a47e3cd553cec24afbc752df2dc2214dd3850, and
 216         5fa22b2687b1f6ca1558fb487fc07e83ffe95707 that enables vl to not be
 217         a power of two.
 218         * 560: Wrote up about significance, ordering, endianness and such
 219         conventions.  (6:21)
 220
 221 2020-12-30
 222
 223         * 559: Luke split out the issue of whether we should have
 224         automatic detection and reversal of direction of vectors, so that
 225         they always behave as if parallel, even if implemented as
 226         sequential.  Jacob pointed out that reversal is not enough for
 227         some 3-operand cases.
 228         * SVP64: Second review call.
 229         * 562: Filed, on elwidth encoding.
 230         * 558: Raised the need for the compiler to be able to save and
 231         restore VL, if it's exposed separately from maxvl; also brought up
 232         calling conventions.
 233         * 560: Commented on potential endianness issue: identity of
 234         register as scalar and of first element of vector starting at that
 235         register.  More questions on issues that arise in big endian mode,
 236         and compatibility we may wish to aim for.  Some difficulties in
 237         getting as much as a conversation going on endianness-influenced
 238         sub-register iteration order; presented a simple scenario that
 239         demonstrates the fundamental programming problems that will arise
 240         out of favoring LE as we seem to.
 241         * 558: Explained why disregarding things the compiler will do on
 242         its own and arguing it shouldn't do that doesn't make the initial
 243         project simpler, but harder, and also more fragile and likely to
 244         be throw-away code in the end.  Argued in favor of seeing
 245         where we want to get to in the end, and then mapping out what it
 246         takes to get features we want for the first stage so that it's a
 247         step in the general direction of the end goal.  (6:43)
 248
 249 2020-12-28
 250
 251         * 558: Commented on vector modes, insns, regalloc, scheduling,
 252         auto vectorization, intrinsics, and the possibilities of vector
 253         length and component modes as parameters to template insns and
 254         intrinsics, and of mechanical generation thereof.  (2:22)
 255
 256 2020-12-26
 257
 258         * SVP64: Reviewed overview and proposed encoding, posted more
 259         questions.  (2:30)
 260
 261 2020-12-25
 262
 263         * Email backlog.
 264         * SVP64: More studying, more making sense.  Asked about
 265         parallelism vs dependencies.  (3:02)
 266
 267         * 550: Implemented the first cut at svp64 prefix in the assembler,
 268         namely, a 32-bit pseudo-insn that takes a 24-bit immediate
 269         operand, encoding it as an insn with EXT01 as the major opcode,
 270         MSB0 bits 7 and 9 also set, and the top two bits of the immediate
 271         shuffled into bits 6 and 8.  Added patch to bugzilla and to the
 272         wiki.  Updated status.  (1:41)
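
A sketch of the prefix encoding described in 550, not the binutils patch itself.  Assumptions not stated in the entry: the topmost immediate bit goes to MSB0 bit 6 and the next one to bit 8, and the remaining 22 bits land unchanged in MSB0 bits 10 to 31.  The function names are made up.

```python
EXT01 = 1  # major opcode, held in MSB0 bits 0..5


def _msb0(bit):
    """Mask for an MSB0-numbered bit of a 32-bit word."""
    return 1 << (31 - bit)


def encode_prefix(imm24):
    """Build the 32-bit prefix word from a 24-bit immediate."""
    if not 0 <= imm24 < (1 << 24):
        raise ValueError("immediate must fit in 24 bits")
    word = EXT01 << 26               # major opcode in bits 0..5
    word |= _msb0(7) | _msb0(9)      # bits 7 and 9 always set
    if (imm24 >> 23) & 1:
        word |= _msb0(6)             # top immediate bit (assumed order)
    if (imm24 >> 22) & 1:
        word |= _msb0(8)             # second immediate bit (assumed order)
    word |= imm24 & 0x3FFFFF         # remaining 22 bits in bits 10..31
    return word


def decode_prefix(word):
    """Recover the 24-bit immediate from a prefix word."""
    top = 1 if word & _msb0(6) else 0
    nxt = 1 if word & _msb0(8) else 0
    return (top << 23) | (nxt << 22) | (word & 0x3FFFFF)


print(hex(encode_prefix(0)))
print(hex(encode_prefix(0xFFFFFF)))
```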
 273
 274 2020-12-23
 275
 276         * SVP64: Review meeting.
 277         * 555: Reduce flag/s for fma.  Commented on the possibilities.
 278         (1:26)
 279
 280 2020-12-20
 281
 282         * 532: Implemented logic for mode-switching 32-bit insns with 6
 283         bits for the opcode, a 16-bit embedded compressed insn, and 10
 284         bits corresponding to subsequent insns, to tell whether or not
 285         each of them is compressed.  This nearly doubled the compression
 286         rate, using one such mode-switching insn per 3 compressed insns.
 287         (1:48)
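
A sketch of packing such a mode-switching insn, assuming a field order of opcode, embedded 16-bit insn, then the 10 flags, with flag i in bit i of the low field.  The entry only gives the field widths; the layout and the names are my illustration.

```python
def pack_mode_switch(opcode, insn16, compressed_flags):
    """Pack a 6-bit opcode, a 16-bit compressed insn and 10 per-insn flags."""
    if not 0 <= opcode < (1 << 6):
        raise ValueError("opcode must fit in 6 bits")
    if not 0 <= insn16 < (1 << 16):
        raise ValueError("embedded insn must fit in 16 bits")
    if len(compressed_flags) != 10:
        raise ValueError("exactly 10 flags are required")
    flags = sum(1 << i for i, f in enumerate(compressed_flags) if f)
    return (opcode << 26) | (insn16 << 10) | flags


def covered_bits(compressed_flags):
    """Total bits used by the mode-switching insn and the 10 insns it covers."""
    return 32 + sum(16 if f else 32 for f in compressed_flags)


all_compressed = [True] * 10
print(hex(pack_mode_switch(0x3F, 0xFFFF, all_compressed)))
print(covered_bits(all_compressed), "bits versus", 11 * 32, "uncompressed")
```

In the best case the 11 insns (one embedded, ten flagged) take 192 bits instead of 352.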
 288
 289 2020-12-14
 290
 291         * 532: Reported on compression ratio findings and analyses.
 292         (1:06)
 293
 294 2020-12-13
 295
 296         * 532: Questioned some bullets under 16-imm opcodes.  Implemented
 297         condition register and system opcodes, 16-imm opcodes, extended
 298         load and store to cover 16-imm modes, condition bit expression
 299         parsing and finally bc 16-imm and bclr 10- and 16-bit opcodes.
 300         Tested a bit by visual inspection, introduced logic to backtrack
 301         into 32-bit and count such pairs as 10-bit nop + 16-imm insn,
 302         followed by 32-bit.  Fixed size estimation: count[2] was still
 303         counted as 16+16-imm, rather than a single 16-imm.  (5:30)
 304
 305 2020-12-06
 306
 307         * 532: Adjusted the logic in comp16-v1-skel.py for 16-bit 16-imm
 308         rather than the 16+16 I'd invented.  Implemented the most relevant
 309         opcodes for 10-bit, and many of the 16-bit ones too.  Not yet
 310         implemented are conditional branches, Immediate, CR and System
 311         opcodes.  With all of nop, unconditional branch, ld/st,
 312         arithmetic, logical and floating-point, we get less than 3%
 313         compression in GCC, with not-entirely-unreasonable reg subsets.
 314         It's not looking good.  (8:27)
 315
 316 2020-12-02
 317
 318         * Microwatts meeting.
 319         * 238: Added some thoughts on bl and blr, and implications about
 320         modes.  Also detailed my worries about how to preserve dynamic
 321         state, specifically switch-back-to-compressed-after-insn, across
 322         interrupts.  (1:44)
 323
 324 2020-11-30
 325
 326         * 238: Settled the N-without-M issue, it was likely an error in
 327         the tables.  Raised an inconsistency in decoder pseudocode's
 328         reversal of M and N.  Returned to the uncertainty and need for
 329         specifying how to handle conflicts between
 330         standard-then-compressed followed by 10-bit with M=0.  Raised
 331         issue of missing documentation that branch targets are always
 332         uncompressed, not just 32-bit aligned.  Raised issue of the
 333         purpose of M and N bits, particularly in unconditional branches.
 334         Explained why I believe phase 1 decoder has to look at Cmaj.m bits
 335         to tell whether or not N is there, brought crnand and crand
 336         encodings as examples, and asked whether crand with M=0 should
 337         switch to 32-bit mode for only one insn, because the bit that
 338         usually holds N is 1, or permanently, because there's no N field
 339         in the applicable encoding.  (2:33)
 340
 341         * 238: Detailed the motivations for my proposal of bit-shuffling
 342         in the 16-bit encoding, to reduce wires and selections in the
 343         realigning muxer.  Restated my question on N without M as I can't
 344         relate the answer with the question, it appears to have been
 345         misunderstood.  Further expanded on the advantages of moving the
 346         Cmaj.m and M bits as suggested, even going as far as enabling an
 347         extended compressed opcode reusing the bit that signals a match
 348         for a 10-bit insn in uncompressed mode.  (3:29)
 349
 350 2020-11-29
 351
 352         * 238: Noted some apparent contradictions in the rejection of
 353         extended 16-bit insns in the face of 16+16-bit insns.  Luke hit me
 354         with clarification that there's no such thing as a 16+16-bit insn
 355         in compressed mode, and I could see how I'd totally made it up by
 356         myself by reviewing the proposal.  Hit and asked other questions:
 357         what's the N for when there's no M, and what are the SV prefixes
 358         mentioned there, now that I no longer assume them to be something
 359         like extend-next.  Then I recorded some thoughts on minimizing the
 360         bits the muxer has to look into by mapping the bits that encode N,
 361         Cmaj.m and M onto the same bits that, in traditional mode, encode
 362         the primary opcode.  Next, I was hit by the realization that,
 363         if we change the perspective from "uncompressed insns used to be
 364         32-bit only" to "uncompressed can be 32- or 16-bit depending on
 365         the opcode", on account of the 10-bit insns, we already have to
 366         take the opcode into account to tell a 16- from a 32-bit insn,
 367         so why is it ok there, but not ok in compressed mode?
 368         Finally, I proposed an encoding scheme that encodes lengths of
 369         subsequent insns in an early insn, achieving more coverage for
 370         16-bit insns, better limit compression, far more flexible mode
 371         switching, enabling savings at far more sparse settings, and
 372         without eating up a pair of primary opcodes: the 32-bit
 373         mode-switching insn could even be an extended opcode, though it
 374         would probably not have as many pre-length encoding bits then.  It
 375         would fit an entire 16-bit insn, which could do useful work, or
 376         queue up further pre-length bits, that correspond to static
 377         upcoming insns and tell whether to decode them as 32-bit or as
 378         (pairs of?) 16-bit ones.  Compared max ratio, representation
 379         overhead, and break-even density.  Shared some more thoughts on
 380         48- and 64-bit insns.  (7:39)
 381
 382         * 532: Got a little confused about some encodings; it's not clear
 383         whether the N and M bits in 16-bit instructions have uniform
 384         interpretation, or whether some proposed opcodes are repurposing
 385         them.  I'm surprised with such short immediate operands in the
 386         immediate instructions, if they don't get a 16-bit extension, or
 387         otherwise with the apparent requirement for an extended 16-bit
 388         immediate for something as simple as an mr encoded as addi.  Asked
 389         for clarification.  Not sure about how to proceed before I get it;
 390         the logic of the estimator would be too significantly impacted.
 391         (2:48)
 392
 393 2020-11-28
 394
 395         * 532: Figured out and implemented the logic to infer mode
 396         switching for best compression under attempt 1 proposed encoding,
 397         namely with 10-bit insns, 16-bit insns, 16+16-bit insns, and
 398         32-bit insns.  10-bit insns appear in uncompressed mode, and can
 399         be followed by insns in either mode; 16-bit ones appear in
 400         compressed mode, and can remain in compressed mode, or switch to
 401         uncompressed mode for 1 insn or for good; 16+16-bit ones appear in
 402         compressed mode, and cannot switch modes; 32-bit ones appear only
 403         in uncompressed mode, or in the single-insn slot after a 16-bit
 404         that requests it.  If we find a 16-bit insn while we're in
 405         uncompressed mode, use a 10-bit nop to tentatively switch.  Insns
 406         that can be encoded in 10-bits, but appear in compressed mode, had
 407         better be encoded in 16-bits, for that offers further subsequent
 408         encoding options, without downsides for size estimation.  Insns
 409         that can be encoded as 16+16-bit decay to 32-bit if in
 410         uncompressed mode, or if, after a sequence thereof, a later insn
 411         forces a switch to 32-bit mode without an intervening switching
 412         insn.  Still missing: the code to select what insns can be encoded
 413         in what modes.  (6:42)
 414
 415         * 532: Implemented a skeleton for compression ratio estimation,
 416         initially with the simpler mode switching of the 8-bit nop,
 417         odd-address 16-bit insns.  Next, rewrite it for all the complexity
 418         of mode switching envisioned for the "attempt 1" proposal.  (2:02)
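
A minimal sketch of such an estimator for the simpler 8-bit-nop scheme.  This is not comp16-v1-skel.py: the function name and the per-insn "compressible" flag are made up, and it assumes each mode switch costs one 8-bit nop, uncompressed insns take 32 bits, compressed ones 16, the stream starts in uncompressed mode, and it may end in either mode.

```python
def estimate_bits(compressible, nop_bits=8):
    """Minimal encoded size, in bits, of an insn stream.

    compressible[i] says whether insn i has a 16-bit encoding.  A small
    dynamic program tracks the cheapest size ending in each mode.
    """
    inf = float("inf")
    unc, comp = 0, inf  # best sizes ending in uncompressed/compressed mode
    for can_compress in compressible:
        new_unc = min(unc, comp + nop_bits) + 32
        new_comp = min(comp, unc + nop_bits) + 16 if can_compress else inf
        unc, comp = new_unc, new_comp
    return min(unc, comp)


stream = [True, True, False]
print(estimate_bits(stream), "bits versus", 32 * len(stream), "uncompressed")
```

The dynamic program only switches modes when the nops pay for themselves, so a lone compressible insn between uncompressible ones stays at 32 bits.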
 419
 420 2020-11-23
 421
 422         * 238: Debating various possibilities of 16-bit encoding.  (5:20)
 423
 424         * 532: Wrote a histogram python script, that breaks counts down
 425         per opcode, and within them, by operands.  (2:05)
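
A sketch of this kind of histogram, assuming input lines of the form "opcode operands" (for instance, disassembly with addresses and raw bytes already stripped).  It is not the original script; the names and the input format are assumptions.

```python
from collections import Counter, defaultdict


def histogram(lines):
    """Count insns per opcode and, within each opcode, per operand string."""
    by_opcode = Counter()
    by_operands = defaultdict(Counter)
    for line in lines:
        parts = line.split(None, 1)
        if not parts:
            continue  # skip blank lines
        opcode = parts[0]
        # Drop all whitespace so "r3, r3, 1" and "r3,r3,1" count together.
        operands = "".join(parts[1].split()) if len(parts) > 1 else ""
        by_opcode[opcode] += 1
        by_operands[opcode][operands] += 1
    return by_opcode, by_operands


sample = ["addi r3,r3,1", "addi r3, r3, 1", "mr r4,r3", "nop", ""]
opcodes, operands = histogram(sample)
for op, count in opcodes.most_common():
    print(count, op, dict(operands[op]))
```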
 426
 427 2020-11-22
 428
 429         * 529: Brought up the possibilities of using 8-bit nops to switch
 430         between modes, so that 16-bit insns would be at odd addresses, so
 431         that we could use the full 16-bits; of using 2-operand insns
 432         instead of 3- for 16-bit mode so as to increase the coverage of
 433         the compact encoding.
 434         * 238: Luke moved the comment above here, where it belonged.
 435         * 529: Elaborated how using actual odd-addresses for 16-bit insns
 436         would be dealt with WRT endianness.  Prompted by Luke, added it to
 437         the wiki.
 438         * Wiki: Added self to team.  (11:50)
 439
 440 2020-11-21
 441
 442         * 532: Wrote patch for binutils to print insn histogram.
 443         * Mission: Restated the proposal of adding "and users" to the
 444         mission statement, next to customers, as those we wish to enable
 445         to trust our products.  (6:48)
 446
 447 2020-11-20
 448
 449         Reposted join message to the correct list.
 450         * 238: Started looking into it, from
 451         https://libre-soc.org/openpower/sv/16_bit_compressed/
 452
 453 2020-11-19
 454
 455         Joined.