

# Digital Design & Computer Architecture

Abdul Muizz



#### General Updates

- Albany Nanotech Complex Tours
- Mini-Colloquium on Advanced Packaging and Heterogeneous Integration

## Albany Nanotech Complex Tour

- Albany Nanotech Complex
  - A state-of-the-art campus that brings together industry leaders, academia and international partners to develop next-generation chips and chip fabrication processes.
  - Major Companies:
    - IBM, Applied Materials, Tokyo Electron, Wolfspeed
  - Tour Dates:
    - February 26th (Wednesday), 11:30 AM to 1:30 PM
    - March 12th (Wednesday), 11:30 AM to 1:30 PM
- More tours are coming in the future!
- Tour attendees must be IEEE EPS members!





- Mini-Colloquium Details
  - Located in Albany Nanotech Complex
  - Feb 18th (Tuesday) from 12 pm to 3 pm
  - Transportation provided
  - Open to all students
- Schedule:
  - 12:00 pm Pizza Lunch
  - 1:00 pm Welcome/EPS Overview
  - 1:10 pm John Lau, Unimicron Technologies
  - 1:55 pm Break
  - 2:00 pm Prof. Inoue, Yokohama National University
  - 2:45 pm Closing Comments
- Please register on VTools!

IEEE EPS Mini Colloquium 2025 Innovations in Wafer Bonding and Advanced Substrates for HI Applications

#### Presented by 2025 IEEE EPS Mid-Hudson Valley Chapter

- Date: 18 Feb 2025
- Time: 12 pm to 3 pm
- Location: NFS Auditorium, Albany NanoTech Complex

Free Registration for ALL!

Free Pizza, drinks and desserts served to all registered attendees!

To reserve your spot, please REGISTER at: https://events.vtools.ieee.org/m/455513




- "Advanced Substrates for Chiplets and Heterogeneous Integration": John Lau, Senior Special Project Assistant, Unimicron
- "Key Technologies and Mechanism Analysis for Next-Generation Hybrid/Fusion Bonding": Fumihiro Inoue, Associate Professor and Vice-Director, Semiconductor and Quantum Integrated Electronics Research Center, Yokohama National University



#### What are we covering today? - Overview

- Most of the topics covered in EPS focus on the transistor level and device packaging but rarely move to more abstract topics, such as computer architecture or digital logic.
- How do processors build upon transistor-based circuits to execute complex tasks? Where is the connection between device physics and computer science?
- This lecture will delve into these topics, unveiling the immense complexity of our highly organized electronic devices.



#### What are we covering today? - Meet the Abstraction Layers

- The computer abstraction layer diagram shows the sophisticated path it takes to get from physics to application software.
- Starting from the bottom, we will move up the layers of abstraction.



The higher you go, the more "abstract" each layer becomes, i.e., rooted further away from device physics.

#### Physics & Devices



#### **Review – CMOS Transistors**

- CMOS (Complementary Metal-Oxide Semiconductor) is a type of MOSFET technology that uses a complementary pair of p-type (PMOS) and n-type (NMOS) MOSFETs.
- A transistor regulates the current between its source and drain by using a "gate". Depending on the gate input, the current is switched on or off.





#### NMOS and PMOS Behavior



#### Analog & Digital Circuits



#### Linked Transistors can form Digital Logic Gates





| Inverter Input | Inverter Output |
|----------------|-----------------|
| 0              | 1               |
| 1              | 0               |







Note the Pull-Up Network (PUN) and Pull-Down Network (PDN)

## Understanding the Analog in Digital

- Moving up the abstraction chart, these gates are viewed simply as digital logic blocks (1s and 0s).
- Under the hood, these are still analog circuits. Rise and fall times are influenced by parasitic capacitances and noise.
- Putting these gates together, along with registers and busses, can create functional units, logic blocks, and control systems.
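To get a feel for how load capacitance stretches an output edge, a first-order RC model can be sketched as follows. The effective drive resistance of 10 kΩ is an assumed, illustrative value (it is not taken from the slides), and the function name is hypothetical. The two capacitances match the 0.01 pF and 0.5 pF load cases shown in this section.

```python
import math

def rise_time_10_90(r_ohms: float, c_farads: float) -> float:
    """10%-90% rise time of a first-order RC stage: t = ln(9) * R * C."""
    return math.log(9) * r_ohms * c_farads

R_EFF = 10e3  # assumed effective pull-up resistance in ohms (hypothetical value)

for c_load in (0.01e-12, 0.5e-12):
    t = rise_time_10_90(R_EFF, c_load)
    print(f"C = {c_load * 1e12:.2f} pF -> rise time ~ {t * 1e9:.3f} ns")
```

In this simple model the rise time scales linearly with the load, so the 0.5 pF case is about 50 times slower than the 0.01 pF case regardless of the resistance chosen.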





#### Inverter Output with 0.01 pF Load Parasitic



Inverter Output with 0.5 pF Load Parasitic

## Logic



#### Full Adder

- A full adder takes three inputs, A, B, and Cin (carry-in), and produces two outputs, S (sum) and Cout (carry-out).
- It extends the half adder (2 inputs, 2 outputs: sum and carry) with a carry-in, so full adders can be chained to add multi-bit numbers.

| A | B | Cin | Sum (S) | Cout |
|---|---|-----|---------|------|
| 0 | 0 | 0   | 0       | 0    |
| 0 | 0 | 1   | 1       | 0    |
| 0 | 1 | 0   | 1       | 0    |
| 0 | 1 | 1   | 0       | 1    |
| 1 | 0 | 0   | 1       | 0    |
| 1 | 0 | 1   | 0       | 1    |
| 1 | 1 | 0   | 0       | 1    |
| 1 | 1 | 1   | 1       | 1    |
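The truth table above can be expressed with two Boolean equations, and chaining the carry-out of one stage into the carry-in of the next gives a ripple-carry adder. The following is a minimal sketch; the function names and the 4-bit default width are illustrative choices, not something specified in the slides.

```python
def full_adder(a: int, b: int, cin: int) -> tuple[int, int]:
    """Return (sum, carry_out) for three 1-bit inputs."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout

def ripple_carry_add(x: int, y: int, width: int = 4) -> tuple[int, int]:
    """Add two unsigned integers bit by bit using chained full adders."""
    carry = 0
    result = 0
    for i in range(width):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result, carry  # carry is the final carry-out (overflow bit)

print(ripple_carry_add(5, 6))   # (11, 0)
print(ripple_carry_add(15, 1))  # (0, 1): the 4-bit result wraps and carries out
```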



#### **Full Subtractor**

- Similarly, there is a full subtractor for subtracting bit values. It takes A, B, and a borrow-in, and produces a difference and a borrow-out.
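As a rough sketch of the same idea for subtraction, the difference bit uses the same XOR as the adder's sum, while the borrow-out replaces the carry-out. The function name and the `bin_` parameter name (borrow-in) are hypothetical.

```python
def full_subtractor(a: int, b: int, bin_: int) -> tuple[int, int]:
    """Return (difference, borrow_out) for a - b - bin_ on 1-bit inputs."""
    d = a ^ b ^ bin_
    # Borrow when a < b, or when a == b and a borrow came in.
    bout = ((1 - a) & b) | ((1 - (a ^ b)) & bin_)
    return d, bout

print(full_subtractor(0, 1, 0))  # (1, 1): 0 - 1 needs a borrow
print(full_subtractor(1, 0, 1))  # (0, 0): 1 - 0 - 1 = 0
```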





#### Micro-Architecture



- By combining several arithmetic circuits (such as adders and subtractors), logic units, an accumulator, several registers, busses, and control systems, you can create an arithmetic logic unit (ALU).
- The ALU is the heart of a CPU and performs its mathematical and logical operations.
- How do we interface with the ALU to execute a set of instructions?
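One way to picture an ALU is as a block that takes two operands plus an operation select signal and returns a result and status flags. The sketch below is a simplified model under assumptions: the operation names, the 32-bit width, and the single zero flag are illustrative choices, not a description of any particular design.

```python
MASK32 = 0xFFFFFFFF  # keep results within 32 bits, like a 32-bit datapath

def alu(op: str, a: int, b: int) -> tuple[int, bool]:
    """Return (result, zero_flag) for a small set of hypothetical operations."""
    if op == "add":
        result = a + b
    elif op == "sub":
        result = a - b
    elif op == "and":
        result = a & b
    elif op == "or":
        result = a | b
    else:
        raise ValueError(f"unsupported operation: {op}")
    result &= MASK32          # wrap around on overflow or negative values
    return result, result == 0

print(alu("add", 7, 5))           # (12, False)
print(alu("sub", 5, 5))           # (0, True)
print(hex(alu("sub", 5, 7)[0]))   # 0xfffffffe (two's complement of -2)
```

In hardware, the control unit would drive the select signal; here a string stands in for it.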



#### Von Neumann Architecture

- A computer design philosophy that calls for storing both instructions and data in memory and executing the instructions sequentially.
- This design includes components such as an arithmetic processing unit, a control unit, a memory for instructions, external mass storage, and I/O mechanisms.
- Proposed in 1945 by Hungarian-American physicist and mathematician John von Neumann.
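The stored-program idea can be illustrated with a toy machine in which one memory holds both the instructions and the data they operate on. The four-instruction set (`LOAD`, `ADD`, `STORE`, `HALT`) and the accumulator model below are invented for illustration only.

```python
def run(memory: list) -> list:
    """Fetch, decode, and execute instructions sequentially from shared memory."""
    pc = 0    # program counter
    acc = 0   # accumulator register
    while True:
        op, *args = memory[pc]  # fetch and decode
        pc += 1                 # default: move to the next instruction
        if op == "LOAD":
            acc = memory[args[0]]
        elif op == "ADD":
            acc += memory[args[0]]
        elif op == "STORE":
            memory[args[0]] = acc
        elif op == "HALT":
            return memory
        else:
            raise ValueError(f"unknown instruction: {op}")

# Addresses 0-3 hold instructions, addresses 4-6 hold data.
program = [("LOAD", 4), ("ADD", 5), ("STORE", 6), ("HALT",), 2, 3, 0]
print(run(program)[6])  # 5
```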



#### The RISC-V Single Cycle Processor – Data path

- RISC-V is an open-standard instruction set architecture (ISA) that's used to design processors.
- The data path illustrates how a 32-bit instruction is fetched from memory, broken down into control signals, passed through the ALU and the write module to modify data, and followed by a fetch of the next instruction.



| Type                     | Description |
|--------------------------|-------------|
| R-type (Register)        | Perform arithmetic and logical operations that work entirely on registers (Ex: add, sub, and, or) |
| I-type (Immediate)       | Handle operations that use an immediate (constant) value along with a register. They are also used for load instructions (Ex: addi, lb, lw) |
| S-type (Store)           | Used for storing data from a register into memory (Ex: sb, sh, sw) |
| B-type (Branch)          | Enable conditional branching based on a comparison between registers (Ex: beq, bne) |
| U-type (Upper Immediate) | Load a 20-bit immediate into the upper 20 bits of a register. This is useful for constructing larger constants or addresses (Ex: lui, auipc) |
| J-type (Jump)            | Facilitate jump operations, where a larger immediate value is needed to compute a jump target relative to the current program counter (Ex: jal) |

#### Short Clip following the data path for an R-Type Instruction (5:19-10:00)



- A multi-cycle design breaks an instruction down into multiple steps for execution.
- Useful when different stages of an instruction have different latencies.
- Beneficial for shortening the clock period and performing instructions incrementally across multiple cycles.
- Provides some performance increase, but is not typically used today.



## Single Cycle is slow, can we speed it up? - Pipelining

- Yes! Through pipelining!
- Laundry analogy: rather than finishing every step (Sort, Wash, Dry, Fold) for one load before starting the next, we can speed things up by starting the next load while the previous one is still drying.
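The speed-up in the laundry analogy can be estimated with a little arithmetic. The sketch below assumes an idealized pipeline in which every stage takes the same time and nothing ever stalls; the function names are hypothetical.

```python
def sequential_time(loads: int, stages: int, stage_time: float = 1.0) -> float:
    """Each load finishes all stages before the next load starts."""
    return loads * stages * stage_time

def pipelined_time(loads: int, stages: int, stage_time: float = 1.0) -> float:
    """A new load enters the pipeline every stage_time once the first has started."""
    if loads == 0:
        return 0.0
    return (stages + loads - 1) * stage_time

# Four loads through Sort, Wash, Dry, Fold (4 stages), one time unit per stage.
print(sequential_time(4, 4))  # 16.0
print(pipelined_time(4, 4))   # 7.0
```

As the number of loads grows, the ideal speed-up approaches the number of stages, which is why real pipelines lose performance whenever hazards force stalls or flushes.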







Each pipeline stall segment requires some form of control logic to prevent pipelining hazards



#### 1. Structural Hazard

- Structural hazards occur when two or more instructions in different pipeline stages need the same hardware resource in the same cycle and the resource cannot serve them all at once. They can be addressed with scheduling.
  - Ex: Fetch and Data Memory stages may need to access memory concurrently.



#### 2. Data Hazard

- Data hazards occur when an instruction depends on the result of an earlier instruction that has not yet completed its execution in the pipeline. They can be addressed with stalling.
  - Ex: Read after write (RAW), write after read (WAR), or write after write (WAW).
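A read-after-write dependency can be spotted by comparing each instruction's destination register with the source registers of the instructions right behind it. The sketch below is a simplified illustration: the `Instr` record, the `window` parameter (how many following instructions are still close enough to conflict), and the hand-written register lists are all assumptions rather than part of a real pipeline model.

```python
from typing import List, NamedTuple, Optional, Tuple

class Instr(NamedTuple):
    text: str               # assembly text, for display only
    dest: Optional[str]     # register written, or None (e.g., a branch)
    srcs: Tuple[str, ...]   # registers read

def raw_hazards(program: List[Instr], window: int = 1) -> List[Tuple[int, int, str]]:
    """Return (producer_index, consumer_index, register) for each RAW dependency."""
    hazards = []
    for i, producer in enumerate(program):
        if producer.dest is None or producer.dest == "zero":
            continue  # nothing is written, or the write is discarded
        for j in range(i + 1, min(i + 1 + window, len(program))):
            if producer.dest in program[j].srcs:
                hazards.append((i, j, producer.dest))
    return hazards

program = [
    Instr("slti t2, t1, 5", "t2", ("t1",)),
    Instr("beq t2, zero, end", None, ("t2", "zero")),
    Instr("addi t0, t0, 10", "t0", ("t0",)),
]
print(raw_hazards(program))  # [(0, 1, 't2')]
```

Here the branch reads `t2` immediately after `slti` writes it, so a pipeline would need to stall (or forward the value) before the branch can be resolved.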



#### 3. Control Hazard (aka Branch Hazard)

- Control hazards (aka branch hazards) occur when the pipeline makes a wrong assumption about the path of a branch or jump instruction. Until the branch outcome is determined, the pipeline may have already fetched incorrect instructions. They can be addressed by discarding the wrongly fetched instructions in the pipeline (flushing).
  - Ex: When reaching a conditional branch (if, for, while), a wrong prediction could lead to incorrect execution unless the wrongly fetched instructions are flushed.



## Why are GPUs better at Pipelining than CPUs?

- GPUs
  - have many more cores and threads,
  - can split instructions onto multiple threads,
  - have simplified execution units and specialized pipelines,
  - are optimized for throughput over latency.

Other ways to improve performance outside of pipelining include out-of-order execution, forwarding, and branch prediction.



#### Architecture



#### **RISC-V Instruction Set**

#### **Core Instruction Formats**

| Format | Layout (field @ instruction bits, most significant first) |
|--------|-----------------------------------------------------------|
| R-type | funct7 @ 31:25, rs2 @ 24:20, rs1 @ 19:15, funct3 @ 14:12, rd @ 11:7, opcode @ 6:0 |
| I-type | imm[11:0] @ 31:20, rs1 @ 19:15, funct3 @ 14:12, rd @ 11:7, opcode @ 6:0 |
| S-type | imm[11:5] @ 31:25, rs2 @ 24:20, rs1 @ 19:15, funct3 @ 14:12, imm[4:0] @ 11:7, opcode @ 6:0 |
| B-type | imm[12\|10:5] @ 31:25, rs2 @ 24:20, rs1 @ 19:15, funct3 @ 14:12, imm[4:1\|11] @ 11:7, opcode @ 6:0 |
| U-type | imm[31:12] @ 31:12, rd @ 11:7, opcode @ 6:0 |
| J-type | imm[20\|10:1\|11\|19:12] @ 31:12, rd @ 11:7, opcode @ 6:0 |

Full reference card for the RISC-V ISA: https://www.cs.sfu.ca/~ashriram/Courses/CS295/assets/notebooks/RISCV/RISCV_CARD.pdf

- ISA needs to be consistent and organized in order to support a wide variety of instructions.
- The 32-bit instruction can be broken down into several critical pieces (Ex: for an R-type Instruction).

| funct7 | rs2 (Src. Reg. 2) | rs1 (Src. Reg. 1) | funct3 | rd (Dest. Reg.) | Opcode |
|--------|-------------------|-------------------|--------|-----------------|--------|
| 7 bits, [31:25] | 5 bits, [24:20] | 5 bits, [19:15] | 3 bits, [14:12] | 5 bits, [11:7] | 7 bits, [6:0] |
| This field is used to further distinguish between variants of an instruction. | This field indicates the second source register operand for operations that require two registers (like many R-type operations). | The value from this register is typically one of the inputs to the operation. | This secondary opcode field further refines the operation defined by the primary opcode. | In instructions that write results to a register (like R-type and I-type instructions), these bits specify the destination register (rd). | This field identifies the broad class of the instruction (for example, whether it's an arithmetic operation, a load/store, or a branch). |
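The field boundaries in the table translate directly into shifts and masks. As a small sketch, the function below (a hypothetical helper, not part of any RISC-V tool) splits a 32-bit R-type word into its six fields, using the machine code `0x00628e33` for `add x28 x5 x6` from the simulator example in this deck.

```python
def decode_r_type(word: int) -> dict:
    """Split a 32-bit R-type instruction word into its fields."""
    return {
        "funct7": (word >> 25) & 0x7F,  # bits [31:25]
        "rs2":    (word >> 20) & 0x1F,  # bits [24:20]
        "rs1":    (word >> 15) & 0x1F,  # bits [19:15]
        "funct3": (word >> 12) & 0x07,  # bits [14:12]
        "rd":     (word >> 7) & 0x1F,   # bits [11:7]
        "opcode": word & 0x7F,          # bits [6:0]
    }

fields = decode_r_type(0x00628E33)  # add x28, x5, x6
print(fields)
# {'funct7': 0, 'rs2': 6, 'rs1': 5, 'funct3': 0, 'rd': 28, 'opcode': 51}
```

The decoded register numbers (rs1 = 5, rs2 = 6, rd = 28) correspond to t0, t1, and t3, and opcode 51 (0x33) with funct3 = 0 and funct7 = 0 selects `add`.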



A good start, but a processor still can't read this. Let's visit: https://venus.kvakil.me/

#### Let's convert a simple program into a set of instructions (machine code)

Program as shown in the Venus simulator:

| Machine Code | Basic Code    | Original Code                                                    |
|--------------|---------------|------------------------------------------------------------------|
| 0x00500293   | addi x5 x0 5  | addi t0, zero, 5 # X = 5                                         |
| 0x00000313   | addi x6 x0 0  | addi t1, zero, 0 # Y = 0                                         |
| 0x00532393   | slti x7 x6 5  | loop: slti t2, t1, 5 # Set t2 = 1 if Y < 5, else 0               |
| 0x00038863   | beq x7 x0 16  | beq t2, zero, end # If t2 == 0 (i.e. Y >= 5), exit the loop      |
| 0x00a28293   | addi x5 x5 10 | addi t0, t0, 10 # X = X + 10                                     |
| 0x00130313   | addi x6 x6 1  | addi t1, t1, 1 # Y = Y + 1                                       |
| 0xff1ff06f   | jal x0 -16    | j loop # Jump back to the start of the loop                      |
| 0x00628e33   | add x28 x5 x6 | add t3, t0, t1 # Z = X + Y                                       |

Register values after running (decimal):

| Register      | Value |
|---------------|-------|
| t0 (x5)       | 55    |
| t1 (x6)       | 5     |
| t2 (x7)       | 0     |
| t3 (x28)      | 60    |
| s0-s11, a0-a7 | 0     |
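The machine code column can be reproduced by packing the instruction fields into a 32-bit word. The sketch below does this for I-type instructions only; the helper name and the constants are illustrative, with the opcode and funct3 values taken from the RISC-V base ISA.

```python
OP_IMM = 0x13        # opcode shared by addi, slti, and other register-immediate ops
FUNCT3_ADDI = 0b000
FUNCT3_SLTI = 0b010

def encode_i_type(imm: int, rs1: int, funct3: int, rd: int, opcode: int) -> int:
    """Pack I-type fields: imm[11:0] | rs1 | funct3 | rd | opcode."""
    return ((imm & 0xFFF) << 20) | (rs1 << 15) | (funct3 << 12) | (rd << 7) | opcode

# addi t0, zero, 5  ->  addi x5, x0, 5
print(f"0x{encode_i_type(5, 0, FUNCT3_ADDI, 5, OP_IMM):08x}")   # 0x00500293
# slti t2, t1, 5    ->  slti x7, x6, 5
print(f"0x{encode_i_type(5, 6, FUNCT3_SLTI, 7, OP_IMM):08x}")   # 0x00532393
# addi t0, t0, 10   ->  addi x5, x5, 10
print(f"0x{encode_i_type(10, 5, FUNCT3_ADDI, 5, OP_IMM):08x}")  # 0x00a28293
```

Masking the immediate with `0xFFF` keeps negative constants in 12-bit two's complement form, which is how the hardware expects them.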

#### Other ISAs

- x86: an incredibly widespread ISA used in almost all Intel and AMD processors.
- ARM: another "reduced instruction set" ISA, but proprietary and licensed rather than open like RISC-V.
- Apple switched to an ARM-based ISA in 2020 (the switch from Intel to Apple Silicon in Macs).







#### x86

- Higher raw performance for intensive tasks (video editing, gaming, data analysis)
- Wide software compatibility (large software ecosystem)
- Wide and flexible instruction sets, allowing for greater customization and optimization

#### ARM

- Greater power efficiency (ideal for mobile devices and embedded systems with limited battery life)
- Cost-effective

#### Apple Rosetta

- A dynamic binary translator for macOS.
- First released in 2006 with the new Intel Macs so that applications written for earlier PowerPC-based Macs could still run (dropped in 2011).
- Rosetta 2 was released in 2020 with Apple Silicon, allowing Intel applications to run on Apple Silicon-based Macs.
- Uses Ahead-of-Time (AOT) translation to pre-translate parts of the application into Arm code (at the cost of some up-front time). Also uses Just-In-Time (JIT) dynamic translation on the fly (efficient enough that the performance degradation is usually not noticeable).
- Low-level or kernel-level features might still encounter issues, and translation still comes with a performance cost.
- "Whisky" is similar in concept and allows Windows games to run on macOS.







#### **Operating Systems & Application Software**



#### **Operating Systems**

- Operating systems exist to hide the complexities of the underlying hardware (e.g., registers, memory management, I/O operations).
- They provide a simpler interface for applications.
- An operating system handles process and thread management, memory management, the file system and storage management, and I/O devices.
- A modern OS also handles networking protocols, security and user management, and virtualization (multiple instances on one machine).
- Current OSes include Windows (which originally ran on top of MS-DOS), Linux, macOS, and ChromeOS. Mobile operating systems include Android and iOS.





#### **Application Software**

- Application software is written in programming languages that vary in complexity and features.
- High-level languages (C, Java, Python, Rust) allow developers to write code without worrying about the intricate details of the underlying hardware. Depending on the language, features include structured programming, object-oriented programming, garbage collection, and standard libraries.
- A compiler translates high-level code (C, Rust) into low-level code (assembly). An assembler can then translate assembly into machine code (binary), which the computer can execute.
- Interpreted languages (Python) use an interpreter rather than following the compiling chain, meaning source code is interpreted at runtime rather than being pre-compiled.
- Java is compiled to bytecode, and modern Java environments use Just-In-Time (JIT) compilation, which compiles that bytecode to machine code on the fly during execution.
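To make the "interpreted at runtime" point concrete: in CPython, the reference Python implementation, the interpreter first compiles source text into bytecode and then executes that bytecode, all while the program is running. The short sketch below uses only the built-in `compile` and `exec` functions and the standard `dis` module; the helper name `run_source` is hypothetical.

```python
import dis

def run_source(source: str) -> dict:
    """Compile source text to bytecode at runtime, execute it, and return its variables."""
    code = compile(source, "<demo>", "exec")  # source text -> bytecode object
    namespace: dict = {}
    exec(code, namespace)                     # the interpreter runs the bytecode
    return namespace

result = run_source("x = 5 + 10")
print(result["x"])  # 15

# Show the bytecode instructions the interpreter executes (output varies by Python version).
dis.dis(compile("x = 5 + 10", "<demo>", "exec"))
```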







# The End

Thanks for listening!

| Abstraction Layer    | Examples                        |
|----------------------|---------------------------------|
| Application Software | Programs (e.g., "hello world!") |
| Operating Systems    | Device drivers                  |
| Architecture         | Instructions, registers         |
| Micro-architecture   | Datapaths, controllers          |
| Logic                | Adders, memories                |
| Digital Circuits     | AND gates, NOT gates            |
| Analog Circuits      | Amplifiers, filters             |
| Devices              | Transistors, diodes             |
| Physics              | Electrons                       |

# GBM Attendance