inorder_model: Documentation of the code so far.
[libreriscv.git] / 3d_gpu / architecture / inorder_model.mdwn
1 # Single-Issue, In-Order Processor Core
2
3 note: as of the time of writing, this task is 95-98% completed and requires
4 approximately 10-15 lines of python code to get it actually running a first unit test.
5
6 * First steps for a newbie developer [[docs/firststeps]]
7 * bugreport <http://bugs.libre-riscv.org/show_bug.cgi?id=1039>
8
9 The Libre-SOC TestIssuer core
10 utilises a Finite-State Machine (FSM) to control the fetch/dec/issue/exec
11 Computational Units, with only one such CompUnit (a FSM or a pipeline) being active at any given time. This is good
12 for debugging the HDL, but severly restricts performance as a single
13 instruction will take tens of clock cycles to complete. In-development
14 (Andrey to research and link to the relevant bugreport) is an in-order
15 core and following on from that will be an out-of-order core.
16
17 A Single-Issue In-Order control unit (written 12+ months ago) will allow every pipepline to be active,
18 and raises the ideal maximum throughput to 1 instruction per clock cycle,
19 bearing any register hazards.
20
21 This control unit has not been written in HDL yet (incorrect: the first version was written 12+ months ago, and is in soc/ and there are options in the Makefile to enable it), however there's currently a
22 task to develop the model for the simulator first. The model will be used to
23 determine performance.
24
25 Diagram that Luke drew comparing pipelines and fsms which allows for a transition from FSM to in-order to out-of-order and also allows "Micro-Coding".
26
27 [[!img /3d_gpu/pipeline_vs_fsms.jpg size="600x"]]
28
29 # The Model
30 ## Brief
31
32 * [Bug description](https://bugs.libre-soc.org/show_bug.cgi?id=1039)
33
34 The model for the Single-Issue In-Order core needs to be added to the in-house
35 Python simulator (`ISACaller`, called by `pypowersim`), which will allow basic
36 *performance estimates*. INCORRECT - pypowersim *outputs an execution trace log*
37 which **after the fact** may be passed to **any** model of which the in-order
38 model is **just the very first**.
39
40 For now, this model resides outside the simulator, and
41 is *completely standalone* **and will ALWAYS remain standalone**
42
43 A subtask to be carried out **as incremental development**
44 is that avatools source code will need to be studied to extract
45 power consumption estimation and add that into the inorder model
46
47
48 ## Task given
49
50 * [Bug comment #1](https://bugs.libre-soc.org/show_bug.cgi?id=1039#c1)
51 * [IRC log](https://libre-soc.org/irclog/%23libre-soc.2023-05-02.log.html#t2023-05-02T10:51:45)
52
53 The offline instruction ordering analyser need to be **COMPLETED**
54 (it is currently 98% complete) that models a
55 (simple, initially V3.0-only) **in-order core** and gives an estimate of
56 instructions per clock (IPC).
57
58 Hazard Protection **WHICH IS ALREADY COMPLETED** is a straightforward, simple bit vector
59 (WRONG it is a "length of pipeline countdown until result is ready" which models the
60 clock cycles needed in the ACTUAL pipeline(s)? the "bit" you refer to is
61 "is there an entry in the python set() for this register yes-or-no")
62
63 - Take the write result register number: set bit WRONG "add num-cycles-until-ready to the set()"
64 - For all read registers, check corresponding bit WRONG call the function that checks if there is an entry in the "python set() of expected outstanding results to be written" . If bit is set, STALL (fake/
65 model-stall)
66
67 A stall is defined as a delay in execution of an instruction in order to
68 resolve a hazard (i.e. trying to read a register while it is being written to).
69 See the [wikipedia article on Pipeline Stall](https://en.wikipedia.org/wiki/Pipeline_stall)
70
71 Input **IS** (98% completed, remember?):
72
73 - Instruction with its operands (as assembler listing)
74 - plus an optional memory-address and whether it is read or written.
75
76 The input will come as a trace output from the ISACaller simulator,
77 [see bug comments #7-#16](https://bugs.libre-soc.org/show_bug.cgi?id=1039#c7)
78
79 Some classes needed (WRONG: ALREADY WRITTEN) which "model" pipeline stages: fetch, decode, issue,
80 execute.
81
82 One global "STALL" flag will cause all buses to stop:
83
84 - Tells fetch to stop fetching
85 - Decode stops (either because empty, or has instrution whose read reg's and
86 being written to).
87 - Issue stops.
88 - Execute (pipelines) run as an empty slot (except for the initial instruction
89 causing the stall)
90
91 Example (PC chosen arbitrarily):
92
93 addi 3, 4, 5 #PC=8
94 cmpi 1, 0, 3, 4 #PC=12
95 ld 1, 2(3) #PC=16 EA=0x12345678
96
97 The third operand of `cmpi` is the register which to use in comparison, so
98 register 3 needs to be read. However, `addi` will be writing to this register,
99 and thus a STALL will occur when `cmpi` is in the decode phase.
100
101 The output diagram will look like this:
102
103 TODO, move this to a separate file then *include it twice*, once with triple-quotes
104 and once without. grep "inline raw=yes" for examples on how to include in mdwn
105
106 ```
107 | clk # | fetch | decode | issue | execute |
108 |:-----:|:------------:|:------------:|:------------:|:------------:|
109 | 1 | addi 3,4,5 | | | |
110 | 2 | cmpi 1,0,3,4 | addi 3,4,5 | | |
111 | 3 | STALL | cmpi 1,0,3,4 | addi 3,4,5 | |
112 | 4 | STALL | cmpi 1,0,3,4 | | addi 3,4,5 |
113 | 5 | ld 1,2(3) | | cmpi 1,0,3,4 | |
114 | 6 | | ld 1,2(3) | | cmpi 1,0,3,4 |
115 | 7 | | | ld 1,2(3) | |
116 | 8 | | | | ld 1,2(3) |
117 ```
118
119 Explanation:
120
121 1: Fetched addi.
122 2: Decoded addi, fetched cmpi.
123 3: Issued addi, decoded cmpi, must stall decode phase, stop fetching.
124 4: Executed addi, everything else stalled.
125 5: Issued cmpi, fetched ld.
126 6: Executed cmpi, decoded ld.
127 7: Issued ld.
128 8: Executed ld.
129
130 For this initial model, it is assumed that all instructions take one cycle to
131 execute (not the case for mul/div etc., but will be dealt with later.
132
133 **In-progress TODO**
134
135 # Code Explanation - *IN PROGRESS*
136
137 *(Not all of the code has been explained, just the general classes.)*
138
139 Source code: <https://git.libre-soc.org/?p=openpower-isa.git;a=tree;f=src/openpower/cyclemodel>
140
141 ## `Hazard` namedtuple data structure
142
143 A `namedtuple` object stores the attributes of the register access. The
144 python `namedtuple` is immutable (like a normal tuple), while also allowing to
145 access elements by predefined names. Immutability is great because the register
146 access attributes won't change from fetch to execution stages, which is why a
147 normal `list` or `dict` wouldn't be appropriate.
148
149 Unlike a normal dictionary, a `namedtuple` is also ordered (so the initially
150 defined order is preserved). See the
151 [python wiki on `namedtuple`](https://docs.python.org/3.7/library/collections.html#collections.namedtuple),
152 [online namedtuple tutorial](https://realpython.com/python-namedtuple/),
153 [sta].
154
155 `namedtuple` instances can also be stored in sets, which is exactly how it is
156 used with the `RegisterWrite` class. One instruction trace may contain zero or
157 more `Hazard` register access objects (depending on whether registers are
158 needed for the instruction).
159
160 ## `HazardProfiles`
161
162 A dictionary of currently supported register file types. Each entry (register
163 file type) defines the number of read and write ports, written as a tuple, with
164 the first entry being the number of read ports, and second entry being the
165 number of write ports.
166
167 Having multiple read and/or write ports means that multiple **different**
168 entries in the same register file can be read from and/or written to in the
169 same clock cycle.
170 This doesn't prevent a stall if the same register entry is used
171 by a consecutive instruction, even if a spare port is available
172 (Read-after-Write hazard).
173
174 ## Parsing trace file dump using `read_file` function
175
176 The `CPU` model class takes as input, a single instruction trace `list` object.
177
178 This trace `list` object, is produced by the function
179 `read_file` which itself reads an instruction trace file from modified
180 `ISACaller` ([link to code needed](LINK)).
181 From now on, the trace `list` object will simply be referred to as `trace`.
182
183 Each line of the trace dump is of the form
184 `[{rw}:FILE:regnum:offset:width]* # insn` where:
185
186 - `rw` is the register to be used for reading (operands), or writing
187 (to store result, condition codes, etc.).
188 - `FILE` is the register file type (GPR/integer, FPR/floating-point, etc. see
189 Additional Information section at the end of this page).
190 *(TODO: use section reference link instead)*.
191 - `regnum` is the register number
192 - `offset` *TODO: Perhaps the offset of data in bytes??? no idea (right now not
193 important, as examples all show 0 offset)*
194 - `width` is the length of the data in bits to be accessed from the register.
195 - `insn` is the full instruction written in PowerISA assembler.
196
197 The block `[{rw}:FILE:regnum:offset:width]` is used zero or more times,
198 based on the total number of read and write registers used for the instruction.
199
200 Example trace file with three instructions:
201
202 r:GPR:0:0:64 w:GPR:1:0:64 # addi 1, 0, 0x0010
203 r:GPR:0:0:64 w:GPR:2:0:64 # addi 2, 0, 0x1234
204 r:GPR:1:0:64 r:GPR:2:0:64 # stw 2, 0(1)
205
206 The instruction trace file is processed line by line, where each line split into
207 the register access atributes (from which a new namedtuple is created using
208 `_make()` and the `Hazard` definition; see
209 [python wiki on _make() method](https://docs.python.org/3.7/library/collections.html#collections.somenamedtuple._make)).
210
211 Each line is converted to a `trace` object of the form:
212 `[insn, Hazard(...), Hazard(...), ...]`. An example trace looks like this:
213
214 ['addi 1, 0, 0x0010',
215 Hazard(action='r', target='GPR', ident='0', offs='0',elwid='64'),
216 Hazard(action='w', target='GPR', ident='1', offs='0', elwid='64')]
217
218 The function `read_file` yields (see [python wiki on yield]()) a single `trace`
219 for each line of the trace file. To produces a full list of
220 traces all the user needs to do is to call `read_file` with the filename of the
221 `ISACaller` instruction trace dump, and assign to a new variable (which will
222 end up being a list of `trace` objects, ready to be iterated over for the CPU
223 model).
224
225 ## RegisterWrite
226
227 A class which is based on a Python set, and is used to keep track of current
228 registers used for writing (for detecting Read-after-Write Hazards).
229
230 A [python wiki on sets](https://docs.python.org/3.7/tutorial/datastructures.html#sets)
231 is an unordered collection with **no duplicate elements**.
232
233 By checking if next instruction's read registers match any of the write
234 registers in the RegWrite set, the model can raise a STALL.
235
236 Anything in the set **MUST STALL** at the Decode phase because the
237 currently issued/executed instruction's result has not been written to the
238 register/s needed for the consecutive instruction.
239
240 ### Methods
241
242 def __init__(self):
243 self.storage = set()
244
245 Initialise `RegisterWrite` set.
246
247 def expect_write(self, regs):
248 return self.storage.update(regs)
249
250 If there are new registers to be written to, add them to the current
251 `RegisterWrite` set.
252
253 def write_expected(self, regs):
254 return (len(self.storage.intersection(regs)) != 0)
255
256 Boolean flag which is true if no read registers need to be written to (by
257 previous instruction).
258
259 def retire_write(self, regs):
260 return self.storage.difference_update(regs)
261
262 Remove write registers from `RegisterWrite` set if they match the given read
263 registers.
264
265 ## `get_input_regs` and `get_output_regs` functions
266
267
268
269 ## CPU class
270
271 The `CPU` class models the in-order, single-issue core. Contains the
272 `RegisterWrite` set for tracking Read-after-Write Hazards, fetch, decode, issue,
273 and execute stages, as well as a `stall` flag for indicating if the CPU is
274 currently stalled.
275
276 The input to the model is a trace `list` object.
277
278 The main methods used during the running of the model is
279 `process_instructions()`, which is called every time an instruction trace
280 `list` object is read from a trace file.
281
282 ### Methods
283
284 def __init__(self):
285 self.regs = RegisterWrite()
286 self.fetch = Fetch(self)
287 self.decode = Decode(self)
288 self.issue = Issue(self)
289 self.exe = Execute(self)
290 self.stall = False
291
292 def reads_possible(self, regs):
293 # TODO: subdivide this down by GPR FPR CR-field.
294 # currently assumes total of 3 regs are readable at one time
295 possible = set()
296 r = regs.copy()
297 while len(possible) < 3 and len(r) > 0:
298 possible.add(r.pop())
299 return possible
300
301 def writes_possible(self, regs):
302 # TODO: subdivide this down by GPR FPR CR-field.
303 # currently assumes total of 1 reg is possible regardless of what it is
304 possible = set()
305 r = regs.copy()
306 while len(possible) < 1 and len(r) > 0:
307 possible.add(r.pop())
308 return possible
309
310 def process_instructions(self):
311 stall = self.stall
312 stall = self.fetch.process_instructions(stall)
313 stall = self.decode.process_instructions(stall)
314 stall = self.issue.process_instructions(stall)
315 stall = self.exe.process_instructions(stall)
316 self.stall = stall
317 if not stall:
318 self.fetch.tick()
319 self.decode.tick()
320 self.issue.tick()
321 self.exe.tick()
322
323 ## Execute class
324
325 The `Execute` class models the execute phase of the processor.
326 Contains a list
327
328 ### Methods
329
330 def __init__(self, cpu):
331 self.stages = []
332 self.cpu = cpu
333
334 def add_stage(self, cycles_away, stage):
335 while cycles_away > len(self.stages):
336 self.stages.append([])
337 self.stages[cycles_away].append(stage)
338
339 def add_instruction(self, insn, writeregs):
340 self.add_stage(2, {'insn': insn, 'writes': writeregs})
341
342 def tick(self):
343 self.stages.pop(0) # tick drops anything at time "zero"
344
345 def process_instructions(self, stall):
346 instructions = self.stages[0] # get list of instructions
347 to_write = set() # need to know total writes
348 for instruction in instructions:
349 to_write.update(instruction['writes'])
350 # see if all writes can be done, otherwise stall
351 writes_possible = self.cpu.writes_possible(to_write)
352 if writes_possible != to_write:
353 stall = True
354 # retire the writes that are possible in this cycle (regfile writes)
355 self.cpu.regs.retire_write(writes_possible)
356 # and now go through the instructions, removing those regs written
357 for instruction in instructions:
358 instruction['writes'].difference_update(writes_possible)
359 return stall
360
361 # Additional Information
362
363 ## On register file types
364
365 Currently (20th Aug 2023), the following register files are included in the CPU
366 model:
367
368 - General Purpose Registers (GPR) - stores integers (0-31 in default PowerISA,
369 0-127 for Libre-SOC with SVP64)
370 - Floating Point Registers (FPR) - stores floating-point numbers
371 - Condition Register (CR) - broken up into 4-bit fields
372 - Condition Register Fields (CRf) - stores arithmetic condition of an operation
373 (less than, greater than, equal to zero, overflow)
374 - Fixed-Point Exception Register (XER)
375 - Machine State Register (MSR)
376 - Floating-Point Status and Control Register (FPSCR)
377 - Program Counter (PC); PowerISA spec primarilly calls this *Current
378 Instruction Address (CIA)*. See PowerISA v3.1, section 1.3.4 Description of
379 Instruction Operation
380 - Slow Special Purpose Registers (SPRs)
381 - Fast SPR (SPRf)
382
383 *TODO: Special Purpose Registers and fields need better explation. The initial
384 writer of this page (Andrey) has very little understanding of whether SPR is
385 actually a register, or if it's just a category of registers (XER, etc.)*
386
387 See the [PowerISA 3.1 spec](LINK) for detailed information on register files
388 (Book I, Chapters 1.3.4, 2.3, 3.2, 4.2, 5.2, 5.3).