From: Staf Verhaegen
To: libre-riscv-dev@lists.libre-riscv.org
Date: Sat, 28 Mar 2020 15:08:25 +0100
Subject: [libre-riscv-dev] Clock Gating (was cache SRAM organisation)
Organization: FibraServi bvba
Message-ID: <0d35e45bd81eeaecedeb64dc5061c1e33c89630c.camel@fibraservi.eu>
References: <29b1a9ecedda151dc9c8da6516c3691dfede62ef.camel@fibraservi.eu>
 <6fa40cb78b3f8c013ca4953ccb4daa5c23e3b501.camel@fibraservi.eu>
 <6fbfb2a3258be77f4fce69661b283dc31a683f7b.camel@fibraservi.eu>
 <9e44930a0332eff507661e617796b9d0674b0e05.camel@fibraservi.eu>

Luke Kenneth Casson
Leighton wrote on Fri 27-03-2020 at 10:59 [+0000]:
> On Fri, Mar 27, 2020 at 10:36 AM Staf Verhaegen wrote:
> > Yes and no, it is the basic functionality of a pipeline :(
>
> yes.
>
> > You have the same latency but can have double the number of
> > operations in flight.
>
> yes.  hence why it is so important to have, because double the number
> of operations means that we need double the number of Function Units
> in the Dependency Matrix in order to keep the entire out-of-order
> engine occupied.
>
> also, double the number of operations in flight means that we need
> double the number of Branch Prediction Units, and much more complex
> BPUs at that, just to deal with the (now very likely) scenario of
> having far more overlapping inner loops "in flight".
>
> all this from just extending the pipeline length(s) from 5 to 10.  so
> it's not just a "nice-to-have" feature, it's actually really important
> to keeping the overall size of the chip down.

There is an (IMO better) alternative to what you are doing with your
pass-through registers, and that is clock gating (wikipedia,
allaboutcircuits).

The principle is that you save power by not clocking the parts of the
circuit that don't have to do any computing. I think this could be a
more general way to enable only the stages in your pipeline that are
actually doing computation.

In the above example you would always use a 10-stage pipeline running at
1600MHz, but to mimic the 5-stage pipeline you only submit an operation
every other clock cycle and alternately enable the odd and even stages
in your pipeline. This way the MUXes are removed from the computation
path.

Using a shift register, it could easily be generalized to enable only
the stages for which there is an operation going through the pipeline.
When an operation is submitted you set the first bit in the shift
register to enable the first stage in the pipeline.
With each cycle you then shift this bit, so that the stage needed for
the execution of that operation is active.

This is a generalized power optimization, because it means that if you
are running a program that only uses integer operations, your FPU and
GPU will use almost no power.

The way to implement it is using EnableInserter. Some untested code
showing how I think it can be done:

    from nmigen import Cat, EnableInserter, Signal

    # one enable bit per pipeline stage
    stages_en = Signal(10)
    stage1 = EnableInserter(stages_en[0])(Stage1())
    stage2 = EnableInserter(stages_en[1])(Stage2())
    ...
    # shift newop in at bit 0; each enable bit moves up one stage per clock
    m.d.sync += stages_en.eq(Cat(newop, stages_en[0:9]))

That said, I think this feature does not fit in the MVP scope of the
October prototype, so that chip should IMO use neither clock gating nor
the pass-through register feature from the original discussion. The
reason is that implementing it is easier said than done. Several things
need to be done:

- You first need a clock gating cell. This is not available in nsxlib
  and is currently not planned to be implemented. I don't want to commit
  to something extra for the May test chip tape-out either.
- nmigen/yosys needs to properly support clock gating for ASICs. Likely
  this means work in yosys to insert the clock gates from if clauses in
  the RTL.
- Your P&R tool (e.g. Coriolis) needs to support the clock gates. It
  means your clock tree synthesis (CTS) needs to support more than just
  buffers in the clock tree. This is not a simple task and has to be
  discussed with Jean-Paul & co.

greets,
Staf.
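
A minimal behavioural sketch of the shift-register enable scheme described above, written in plain Python rather than nmigen so that it runs without any HDL toolchain. It only models the value of the 10-bit enable register from cycle to cycle; it does not model clock gating cells or the pipeline stages themselves. The function names `shift_enables` and `simulate_stage_enables` are hypothetical and do not come from the original message.

```python
def shift_enables(stages_en, newop, n_stages=10):
    """One clock edge of the enable shift register.

    Intended to mirror Cat(newop, stages_en[0:n_stages-1]) from the
    message: newop becomes bit 0 and every existing enable bit moves up
    one position. The top bit falls off the end.
    """
    mask = (1 << n_stages) - 1
    return ((stages_en << 1) | (newop & 1)) & mask


def simulate_stage_enables(new_ops, n_stages=10):
    """Return the enable register value after each clock edge.

    new_ops[t] is 1 if an operation is submitted in cycle t, else 0.
    Bit i of each returned value is the enable for stage i + 1.
    (Assumption: the register starts at 0, i.e. all stages disabled.)
    """
    stages_en = 0
    history = []
    for newop in new_ops:
        stages_en = shift_enables(stages_en, newop, n_stages)
        history.append(stages_en)
    return history


def active_stages(stages_en):
    """Number of stages that would be clocked for this register value."""
    return bin(stages_en).count("1")
```

With one operation submitted every other cycle, `simulate_stage_enables([1, 0] * 5)` settles into the alternating patterns `0b0101010101` and `0b1010101010`, so at most five of the ten stages are enabled in any cycle, which is the odd/even behaviour described for mimicking the 5-stage pipeline.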