Envelope-to: publicinbox@libre-riscv.org
Delivery-date: Sat, 28 Mar 2020 14:08:34 +0000
Message-ID: <0d35e45bd81eeaecedeb64dc5061c1e33c89630c.camel@fibraservi.eu>
From: Staf Verhaegen <staf@fibraservi.eu>
To: libre-riscv-dev@lists.libre-riscv.org
Date: Sat, 28 Mar 2020 15:08:25 +0100
In-Reply-To: <CAPweEDzAtWoU+wc6MTayF1vtKJvrxLLfP-Q1Czea+NX5MgOrfg@mail.gmail.com>
References: <CAPweEDx5QCCKxSr1gfuyuw_2D68Ld8fK85bEmmMTZi8S3w2E9g@mail.gmail.com>
 <29b1a9ecedda151dc9c8da6516c3691dfede62ef.camel@fibraservi.eu>
 <CAPweEDwfqMczPjg=5Fvt1J_S8nx1YK44XhyBY8H1abuTNF6=xg@mail.gmail.com>
 <6fa40cb78b3f8c013ca4953ccb4daa5c23e3b501.camel@fibraservi.eu>
 <CAPweEDxiyTEsneXN65Kq0HsEsdL3wdY=NYayq2tz5egXJNCVfg@mail.gmail.com>
 <e430ea6587d292166fd58460adf4dfebfad20c6d.camel@fibraservi.eu>
 <CAPweEDzEvtPYGKvGMvebmQzhJDhSgfvUOVZvB2WXxSbv_ebE8A@mail.gmail.com>
 <b18283c7e7a93fa8afdef2f0a8679b26e4569528.camel@fibraservi.eu>
 <CAPweEDwznLD5o6rHfWsSXR-8e1hbAfAB04f5O+YkL6pCwGsNfQ@mail.gmail.com>
 <6fbfb2a3258be77f4fce69661b283dc31a683f7b.camel@fibraservi.eu>
 <CAPweEDwf7s=r6bhq6N=VG7QQ1iD4jHYEG6mGvtxL32Uxnhzqwg@mail.gmail.com>
 <9e44930a0332eff507661e617796b9d0674b0e05.camel@fibraservi.eu>
 <CAPweEDzAtWoU+wc6MTayF1vtKJvrxLLfP-Q1Czea+NX5MgOrfg@mail.gmail.com>
Organization: FibraServi bvba
Mime-Version: 1.0
Subject: [libre-riscv-dev] Clock Gating (was  cache SRAM organisation)
Precedence: list
Reply-To: Libre-RISCV General Development
 <libre-riscv-dev@lists.libre-riscv.org>
Content-Type: multipart/mixed; boundary="===============5623462699142308802=="
Errors-To: libre-riscv-dev-bounces@lists.libre-riscv.org
Sender: "libre-riscv-dev" <libre-riscv-dev-bounces@lists.libre-riscv.org>


--===============5623462699142308802==
Content-Type: multipart/signed; micalg="pgp-sha1"; protocol="application/pgp-signature";
	boundary="=-FBhRFmonkZ2xasih2XbI"


--=-FBhRFmonkZ2xasih2XbI
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Luke Kenneth Casson Leighton wrote on Fri 27-03-2020 at 10:59 [+0000]:
> On Fri, Mar 27, 2020 at 10:36 AM Staf Verhaegen <staf@fibraservi.eu> wrote:
> > Yes and no, it is the basic functionality of a pipeline :(
>
> yes.
> > You have the same latency but can have double the number of operations in flight.
>
> yes.  hence why it is so important to have, because double the number of operations means that we need double the number of Function Units in the Dependency Matrix in order to keep the entire out-of-order engine occupied.
> also, double the number of operations in flight means that we need double the number of Branch Prediction Units, and much more complex BPUs at that, just to deal with the (now very likely) scenario of having far more overlapping inner loops "in flight".
> all this from just extending the pipeline length(s) from 5 to 10.  so it's not just a "nice-to-have" feature, it's actually really important to keeping the overall size of the chip down.

There is an (IMO better) alternative to what you are doing with your pass-through registers, and that is clock gating (wikipedia, allaboutcircuits).
The principle is that you save power by not clocking the parts of the circuit that don't have to do any computing. I think this could be a more general way to enable only the stages in your pipeline that are actually doing computation.
In the above example you would always use a 10-stage pipeline running at 1600MHz, but to mimic the 5-stage pipeline you submit an operation only every other clock cycle and alternately enable the odd and even stages in your pipeline. This way the MUXes are removed from the computation path.
Using a shift register, this could easily be generalized to enable only the stages that have an operation going through them. When an operation is submitted you set the first bit in the shift register to enable the first stage in the pipeline. On each cycle you then shift this bit along, so that the stage currently needed for that operation is the one that is active.
This is a general power optimization: if you are running a program that only uses integer operations, your FPU and GPU will use almost no power.
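
To make the shift-register scheme concrete, here is a minimal behavioural sketch in plain Python (not nmigen, and not synthesizable). It only tracks which stage enables are set on each cycle and does not model the stage logic itself. The names N_STAGES, step and simulate are made up for this illustration; the depth of 10 is taken from the example above.

```python
# Behavioural model only: one enable bit per pipeline stage.
N_STAGES = 10  # pipeline depth used in the example above


def step(stages_en, newop):
    """Advance the enable shift register by one clock cycle.

    Bit 0 takes the value of newop; every other bit takes the value
    that the bit before it held on the previous cycle.
    """
    return [newop] + stages_en[:-1]


def simulate(issue_pattern):
    """Return the enable vector after each cycle of the issue pattern."""
    stages_en = [0] * N_STAGES
    history = []
    for newop in issue_pattern:
        stages_en = step(stages_en, newop)
        history.append(stages_en)
    return history


# Issue an operation every other cycle, as in the "mimic 5 stages" case.
history = simulate([1, 0] * 10)
for cycle, en in enumerate(history):
    print(cycle, "".join(str(bit) for bit in en))

# Number of stages that receive a clock on each cycle.
active_per_cycle = [sum(en) for en in history]
```

In this model, once the pipeline has filled, five of the ten stages are enabled on any given cycle, alternating between the even-numbered and odd-numbered ones. With a single issued operation, exactly one stage is enabled per cycle until the operation leaves the pipeline.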

One way to implement it is with EnableInserter. Some untested code showing how I think it can be done:

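	from nmigen import Cat, EnableInserter, Signal

	# one enable bit per pipeline stage
	stages_en = Signal(10)
	stage1 = EnableInserter(stages_en[0])(Stage1())
	stage2 = EnableInserter(stages_en[1])(Stage2())
	...

	# shift the enable bits along by one stage each cycle; newop enters at bit 0
	m.d.sync += stages_en.eq(Cat(newop, stages_en[0:9]))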

That said, I think this feature does not fit in the MVP scope of the October prototype, so that chip should IMO use neither clock gating nor the pass-through register feature from the original discussion. The reason is that implementing it is easier said than done. Several things need to be done:
- You first need a clock gating cell. This is not available in nsxlib and is currently not planned to be implemented. I don't want to commit to something extra for the May test chip tape-out either.
- nmigen/yosys needs to properly support clock gating for ASICs. Likely this means work in yosys to insert the clock gates from if clauses in the RTL.
- Your P&R tool (e.g. Coriolis) needs to support the clock gates. This means your clock tree synthesis (CTS) needs to support more than just buffers in the clock tree. This is not a simple task and has to be discussed with Jean-Paul & co.
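
As background for the first point, the sketch below models in plain Python why a dedicated clock gating cell is used rather than a bare AND gate on the clock. It assumes the common latch-based structure (enable is latched while the clock is low, then ANDed with the clock); that is an assumption about how such a cell is typically built, not a description of any existing or planned nsxlib cell. The function names and the sample waveforms are invented for the illustration, with two samples per clock phase.

```python
def latch_based_gate(clk_samples, en_samples):
    """Latch EN while CLK is low, then AND the latched value with CLK."""
    latched_en = 0
    gated = []
    for clk, en in zip(clk_samples, en_samples):
        if clk == 0:
            latched_en = en  # latch is transparent during the low phase
        gated.append(clk & latched_en)
    return gated


def naive_and_gate(clk_samples, en_samples):
    """Plain AND of CLK and EN, with no latch on the enable."""
    return [clk & en for clk, en in zip(clk_samples, en_samples)]


# Two samples per clock phase; EN changes while CLK is high.
clk = [0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0]
en = [0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

gated = latch_based_gate(clk, en)
naive = naive_and_gate(clk, en)
print("latched:", gated)
print("naive:  ", naive)
```

With these waveforms the latched version passes one full-width clock pulse, while the plain AND produces two half-width pulses because the enable changes while the clock is high. Avoiding such runt pulses is the job of the clock gating cell, and the CTS has to treat that cell as part of the clock tree.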

greets,
Staf.

--=-FBhRFmonkZ2xasih2XbI--


--===============5623462699142308802==--