3d_gpu/architecture/inorder_model.mdwn

   1 # Single-Issue, In-Order Processor Core
   2
   3 note: as of the time of writing, this task is 95-98% completed and requires
   4 approximately 10-15 lines of python code to get it actually running a first unit test.
   5
   6 * First steps for a newbie developer [[docs/firststeps]]
   7 * bugreport <http://bugs.libre-riscv.org/show_bug.cgi?id=1039>
   8
   9 The Libre-SOC TestIssuer core
  10 utilises a Finite-State Machine (FSM) to control the fetch/dec/issue/exec
  11 Computational Units, with only one such CompUnit (a FSM or a pipeline) being active at any given time. This is good
  12 for debugging the HDL, but severly restricts performance as a single
  13 instruction will take tens of clock cycles to complete.  In-development
  14 (Andrey to research and link to the relevant bugreport) is an in-order
  15 core and following on from that will be an out-of-order core.
  16
  17 A Single-Issue In-Order control unit (written 12+ months ago) will allow every pipepline to be active,
  18 and raises the ideal maximum throughput to 1 instruction per clock cycle,
  19 bearing any register hazards.
  20
  21 This control unit has not been written in HDL yet (incorrect: the first version was written 12+ months ago, and is in soc/ and there are options in the Makefile to enable it), however there's currently a
  22 task to develop the model for the simulator first. The model will be used to
  23 determine performance.
  24
  25 Diagram that Luke drew comparing pipelines and fsms which allows for a transition from FSM to in-order to out-of-order and also allows "Micro-Coding".
  26
  27 [[!img /3d_gpu/pipeline_vs_fsms.jpg size="600x"]]
  28
  29 # The Model
  30 ## Brief
  31
  32 * [Bug description](https://bugs.libre-soc.org/show_bug.cgi?id=1039)
  33
  34 The model for the Single-Issue In-Order core needs to be added to the in-house
  35 Python simulator (`ISACaller`, called by `pypowersim`), which will allow basic
  36 *performance estimates*.  INCORRECT - pypowersim *outputs an execution trace log*
  37 which **after the fact** may be passed to **any** model of which the in-order
  38 model is **just the very first**.
  39
  40 For now, this model resides outside the simulator, and
  41 is *completely standalone* **and will ALWAYS remain standalone**
  42
  43 A subtask to be carried out **as incremental development**
  44 is that avatools source code will need to be studied to extract
  45 power consumption estimation and add that into the inorder model
  46
  47
  48 ## Task given
  49
  50 * [Bug comment #1](https://bugs.libre-soc.org/show_bug.cgi?id=1039#c1)
  51 * [IRC log](https://libre-soc.org/irclog/%23libre-soc.2023-05-02.log.html#t2023-05-02T10:51:45)
  52
  53 The offline instruction ordering analyser need to be **COMPLETED**
  54 (it is currently 98% complete) that models a
  55 (simple, initially V3.0-only) **in-order core** and gives an estimate of
  56 instructions per clock (IPC).
  57
  58 Hazard Protection **WHICH IS ALREADY COMPLETED** is a straightforward, simple bit vector
  59 (WRONG it is a "length of pipeline countdown until result is ready" which models the
  60 clock cycles needed in the ACTUAL pipeline(s)? the "bit" you refer to is
  61 "is there an entry in the python set() for this register yes-or-no")
  62
  63 - Take the write result register number: set bit WRONG "add num-cycles-until-ready to the set()"
  64 - For all read registers, check corresponding bit WRONG call the function that checks if there is an entry in the "python set() of expected outstanding results to be written" . If bit is set, STALL (fake/
  65 model-stall)
  66
  67 A stall is defined as a delay in execution of an instruction in order to
  68 resolve a hazard (i.e. trying to read a register while it is being written to).
  69 See the [wikipedia article on Pipeline Stall](https://en.wikipedia.org/wiki/Pipeline_stall)
  70
  71 Input **IS** (98% completed, remember?):
  72
  73 - Instruction with its operands (as assembler listing)
  74 - plus an optional memory-address and whether it is read or written.
  75
  76 The input will come as a trace output from the ISACaller simulator,
  77 [see bug comments #7-#16](https://bugs.libre-soc.org/show_bug.cgi?id=1039#c7)
  78
  79 Some classes needed (WRONG: ALREADY WRITTEN) which "model" pipeline stages: fetch, decode, issue,
  80 execute.
  81
  82 One global "STALL" flag will cause all buses to stop:
  83
  84 - Tells fetch to stop fetching
  85 - Decode stops (either because empty, or has instrution whose read reg's and
  86 being written to).
  87 - Issue stops.
  88 - Execute (pipelines) run as an empty slot (except for the initial instruction
  89  causing the stall)
  90
  91 Example (PC chosen arbitrarily):
  92
  93     addi 3, 4, 5    #PC=8
  94     cmpi 1, 0, 3, 4 #PC=12
  95     ld   1, 2(3)    #PC=16 EA=0x12345678
  96
  97 The third operand of `cmpi` is the register which to use in comparison, so
  98 register 3 needs to be read. However, `addi` will be writing to this register,
  99 and thus a STALL will occur when `cmpi` is in the decode phase.
 100
 101 The output diagram will look like this:
 102
 103 TODO, move this to a separate file then *include it twice*, once with triple-quotes
 104 and once without.  grep "inline raw=yes" for examples on how to include in mdwn
 105
 106 ```
 107 | clk # |    fetch     |    decode    |   issue      |   execute    |
 108 |:-----:|:------------:|:------------:|:------------:|:------------:|
 109 |   1   | addi 3,4,5   |              |              |              |
 110 |   2   | cmpi 1,0,3,4 | addi 3,4,5   |              |              |
 111 |   3   | STALL        | cmpi 1,0,3,4 | addi 3,4,5   |              |
 112 |   4   | STALL        | cmpi 1,0,3,4 |              | addi 3,4,5   |
 113 |   5   | ld 1,2(3)    |              | cmpi 1,0,3,4 |              |
 114 |   6   |              | ld 1,2(3)    |              | cmpi 1,0,3,4 |
 115 |   7   |              |              | ld 1,2(3)    |              |
 116 |   8   |              |              |              | ld 1,2(3)    |
 117 ```
 118
 119 Explanation:
 120
 121     1: Fetched addi.
 122     2: Decoded addi, fetched cmpi.
 123     3: Issued addi, decoded cmpi, must stall decode phase, stop fetching.
 124     4: Executed addi, everything else stalled.
 125     5: Issued cmpi, fetched ld.
 126     6: Executed cmpi, decoded ld.
 127     7: Issued ld.
 128     8: Executed ld.
 129
 130 For this initial model, it is assumed that all instructions take one cycle to
 131 execute (not the case for mul/div etc., but will be dealt with later.
 132
 133 **In-progress TODO**
 134
 135 # Code Explanation - *IN PROGRESS*
 136
 137 *(Not all of the code has been explained, just the general classes.)*
 138
 139 Source code: <https://git.libre-soc.org/?p=openpower-isa.git;a=tree;f=src/openpower/cyclemodel>
 140
 141 ## `Hazard` namedtuple data structure
 142
 143 A `namedtuple` object stores the attributes of the register access. The
 144 python `namedtuple` is immutable (like a normal tuple), while also allowing to
 145 access elements by predefined names. Immutability is great because the register
 146 access attributes won't change from fetch to execution stages, which is why a
 147 normal `list` or `dict` wouldn't be appropriate.
 148
 149 Unlike a normal dictionary, a `namedtuple` is also ordered (so the initially
 150 defined order is preserved). See the
 151 [python wiki on `namedtuple`](https://docs.python.org/3.7/library/collections.html#collections.namedtuple),
 152 [online namedtuple tutorial](https://realpython.com/python-namedtuple/),
 153 [sta].
 154
 155 `namedtuple` instances can also be stored in sets, which is exactly how it is
 156 used with the `RegisterWrite` class. One instruction trace may contain zero or
 157 more `Hazard` register access objects (depending on whether registers are
 158 needed for the instruction).
 159
 160 ## `HazardProfiles`
 161
 162 A dictionary of currently supported register file types. Each entry (register
 163 file type) defines the number of read and write ports, written as a tuple, with
 164 the first entry being the number of read ports, and second entry being the
 165 number of write ports.
 166
 167 Having multiple read and/or write ports means that multiple **different**
 168 entries in the same register file can be read from and/or written to in the
 169 same clock cycle.
 170 This doesn't prevent a stall if the same register entry is used
 171 by a consecutive instruction, even if a spare port is available
 172 (Read-after-Write hazard).
 173
 174 ## Parsing trace file dump using `read_file` function
 175
 176 The `CPU` model class takes as input, a single instruction trace `list` object.
 177
 178 This trace `list` object, is produced by the function
 179 `read_file` which itself reads an instruction trace file from modified
 180 `ISACaller` ([link to code needed](LINK)).
 181 From now on, the trace `list` object will simply be referred to as `trace`.
 182
 183 Each line of the trace dump is of the form
 184 `[{rw}:FILE:regnum:offset:width]* # insn` where:
 185
 186 - `rw` is the register to be used for reading (operands), or writing
 187 (to store result, condition codes, etc.).
 188 - `FILE` is the register file type (GPR/integer, FPR/floating-point, etc. see
 189 Additional Information section at the end of this page).
 190 *(TODO: use section reference link instead)*.
 191 - `regnum` is the register number
 192 - `offset` *TODO: Perhaps the offset of data in bytes??? no idea (right now not
 193 important, as examples all show 0 offset)*
 194 - `width` is the length of the data in bits to be accessed from the register.
 195 - `insn` is the full instruction written in PowerISA assembler.
 196
 197 The block `[{rw}:FILE:regnum:offset:width]` is used zero or more times,
 198 based on the total number of read and write registers used for the instruction.
 199
 200 Example trace file with three instructions:
 201
 202     r:GPR:0:0:64 w:GPR:1:0:64              # addi 1, 0, 0x0010
 203     r:GPR:0:0:64 w:GPR:2:0:64              # addi 2, 0, 0x1234
 204     r:GPR:1:0:64 r:GPR:2:0:64              # stw 2, 0(1)
 205
 206 The instruction trace file is processed line by line, where each line split into
 207 the register access atributes (from which a new namedtuple is created using
 208 `_make()` and the `Hazard` definition; see
 209 [python wiki on _make() method](https://docs.python.org/3.7/library/collections.html#collections.somenamedtuple._make)).
 210
 211 Each line is converted to a `trace` object of the form:
 212 `[insn, Hazard(...), Hazard(...), ...]`. An example trace looks like this:
 213
 214     ['addi 1, 0, 0x0010',
 215      Hazard(action='r', target='GPR', ident='0', offs='0',elwid='64'),
 216      Hazard(action='w', target='GPR', ident='1', offs='0', elwid='64')]
 217
 218 The function `read_file` yields (see [python wiki on yield]()) a single `trace`
 219 for each line of the trace file. To produces a full list of
 220 traces all the user needs to do is to call `read_file` with the filename of the
 221 `ISACaller` instruction trace dump, and assign to a new variable (which will
 222 end up being a list of `trace` objects, ready to be iterated over for the CPU
 223 model).
 224
 225 ## RegisterWrite
 226
 227 A class which is based on a Python set, and is used to keep track of current
 228 registers used for writing (for detecting Read-after-Write Hazards).
 229
 230 A [python wiki on sets](https://docs.python.org/3.7/tutorial/datastructures.html#sets)
 231 is an unordered collection with **no duplicate elements**.
 232
 233 By checking if next instruction's read registers match any of the write
 234 registers in the RegWrite set, the model can raise a STALL.
 235
 236 Anything in the set **MUST STALL** at the Decode phase because the
 237 currently issued/executed instruction's result has not been written to the
 238 register/s needed for the consecutive instruction.
 239
 240 ### Methods
 241
 242     def __init__(self):
 243         self.storage = set()
 244
 245 Initialise `RegisterWrite` set.
 246
 247     def expect_write(self, regs):
 248         return self.storage.update(regs)
 249
 250 If there are new registers to be written to, add them to the current
 251 `RegisterWrite` set.
 252
 253     def write_expected(self, regs):
 254         return (len(self.storage.intersection(regs)) != 0)
 255
 256 Boolean flag which is true if no read registers need to be written to (by
 257 previous instruction).
 258
 259     def retire_write(self, regs):
 260         return self.storage.difference_update(regs)
 261
 262 Remove write registers from `RegisterWrite` set if they match the given read
 263 registers.
 264
 265 ## `get_input_regs` and `get_output_regs` functions
 266
 267
 268
 269 ## CPU class
 270
 271 The `CPU` class models the in-order, single-issue core. Contains the
 272 `RegisterWrite` set for tracking Read-after-Write Hazards, fetch, decode, issue,
 273 and execute stages, as well as a `stall` flag for indicating if the CPU is
 274 currently stalled.
 275
 276 The input to the model is a trace `list` object.
 277
 278 The main methods used during the running of the model is
 279 `process_instructions()`, which is called every time an instruction trace
 280 `list` object is read from a trace file.
 281
 282 ### Methods
 283
 284     def __init__(self):
 285         self.regs = RegisterWrite()
 286         self.fetch = Fetch(self)
 287         self.decode = Decode(self)
 288         self.issue = Issue(self)
 289         self.exe = Execute(self)
 290         self.stall = False
 291
 292     def reads_possible(self, regs):
 293         # TODO: subdivide this down by GPR FPR CR-field.
 294         # currently assumes total of 3 regs are readable at one time
 295         possible = set()
 296         r = regs.copy()
 297         while len(possible) < 3 and len(r) > 0:
 298             possible.add(r.pop())
 299         return possible
 300
 301     def writes_possible(self, regs):
 302         # TODO: subdivide this down by GPR FPR CR-field.
 303         # currently assumes total of 1 reg is possible regardless of what it is
 304         possible = set()
 305         r = regs.copy()
 306         while len(possible) < 1 and len(r) > 0:
 307             possible.add(r.pop())
 308         return possible
 309
 310     def process_instructions(self):
 311         stall = self.stall
 312         stall = self.fetch.process_instructions(stall)
 313         stall = self.decode.process_instructions(stall)
 314         stall = self.issue.process_instructions(stall)
 315         stall = self.exe.process_instructions(stall)
 316         self.stall = stall
 317         if not stall:
 318             self.fetch.tick()
 319             self.decode.tick()
 320             self.issue.tick()
 321             self.exe.tick()
 322
 323 ## Execute class
 324
 325 The `Execute` class models the execute phase of the processor.
 326 Contains a list
 327
 328 ### Methods
 329
 330     def __init__(self, cpu):
 331         self.stages = []
 332         self.cpu = cpu
 333
 334     def add_stage(self, cycles_away, stage):
 335         while cycles_away > len(self.stages):
 336             self.stages.append([])
 337         self.stages[cycles_away].append(stage)
 338
 339     def add_instruction(self, insn, writeregs):
 340         self.add_stage(2, {'insn': insn, 'writes': writeregs})
 341
 342     def tick(self):
 343         self.stages.pop(0) # tick drops anything at time "zero"
 344
 345     def process_instructions(self, stall):
 346         instructions = self.stages[0] # get list of instructions
 347         to_write = set()              # need to know total writes
 348         for instruction in instructions:
 349             to_write.update(instruction['writes'])
 350         # see if all writes can be done, otherwise stall
 351         writes_possible = self.cpu.writes_possible(to_write)
 352         if writes_possible != to_write:
 353             stall = True
 354         # retire the writes that are possible in this cycle (regfile writes)
 355         self.cpu.regs.retire_write(writes_possible)
 356         # and now go through the instructions, removing those regs written
 357         for instruction in instructions:
 358             instruction['writes'].difference_update(writes_possible)
 359         return stall
 360
 361 # Additional Information
 362
 363 ## On register file types
 364
 365 Currently (20th Aug 2023), the following register files are included in the CPU
 366 model:
 367
 368 - General Purpose Registers (GPR) - stores integers (0-31 in default PowerISA,
 369 0-127 for Libre-SOC with SVP64)
 370 - Floating Point Registers (FPR) - stores floating-point numbers
 371 - Condition Register (CR) - broken up into 4-bit fields
 372 - Condition Register Fields (CRf) - stores arithmetic condition of an operation
 373 (less than, greater than, equal to zero, overflow)
 374 - Fixed-Point Exception Register (XER)
 375 - Machine State Register (MSR)
 376 - Floating-Point Status and Control Register (FPSCR)
 377 - Program Counter (PC); PowerISA spec primarilly calls this *Current
 378 Instruction Address (CIA)*. See PowerISA v3.1, section 1.3.4 Description of
 379 Instruction Operation
 380 - Slow Special Purpose Registers (SPRs)
 381 - Fast SPR (SPRf)
 382
 383 *TODO: Special Purpose Registers and fields need better explation. The initial
 384 writer of this page (Andrey) has very little understanding of whether SPR is
 385 actually a register, or if it's just a category of registers (XER, etc.)*
 386
 387 See the [PowerISA 3.1 spec](LINK) for detailed information on register files
 388 (Book I, Chapters 1.3.4, 2.3, 3.2, 4.2, 5.2, 5.3).