How to design your own CPU from scratch – Part 1

I’m planning to make this a larger series of articles and videos about how to build your completely custom computer from scratch. I also plan to write an OS for the machine later (planned for summer 2018).This series will consist of some articles that I structured into the following topics:

How to design your own CPU from scratch (5 parts)
Here I plan to design a really simple 8-Bit CPU from scratch, only using low-level logic and low-level components. I’ll also discuss the main parts of a CPU and maybe I’ll discuss extended features and pipelining at the end of the series or in a future article.

How to build a custom computer from scratch (5 parts)
Here I plan to build everything around the CPU that’s needed. In this series I’ll design my own mainboard with I/O and a simple graphics chip that outputs VGA. I’ll talk about all the important topics in their respective articles of this series.

How to design and code a custom operating system
This is something that I plan to do in the future and it’s something that I wanted to do, since I first started programming. This is really exciting for me and I plan to write this series in the next year.

What does a CPU consist of?

That’s not that easy to answer, especially if you look at modern-day CPUs that have so many different features that one could write a series of books about each of them. However I want to focus on the most simple CPU design, I can think of: An 8-Bit CPU without pipelining and without any extended features. It’s going to be a very inefficient, but easy to understand design. I want to keep it as simple as possible, so that everybody (that’s interested in the topic) can understand the core concepts. However I’ll try to sum up pipelining and very simple optimizations at the end of this series.

This is the part in a CPU that takes care of calculations. The acronym stands for “Arithmetic-Logic-Unit”. It is an electronic circuit that combines different arithmetic and bitwise logic operations on two binary numbers. The very basic functionality, that each ALU has to have (otherwise it wouldn’t make any sense), is the arithmetic addition and the logic-functions AND and NOT or OR and NOT, as they are complete, which means that you can derive every other logic function by using these two. By the way, the combined functions (NAND and NOR) and complete as well, however implementing a NAND function only for example wouldn’t necessarily be better than implementing the AND and NOT functions, because then you’d have to implement the AND and NOT functions each time in every program you write. However you can use NAND-Gates only when you design and build your ALU, to reduce costs and space needed for the circuit.

We’ll look at this topic in more detail later in this article.

Data/Instruction memory
Of course there needs to be some way to store results from previous calculations. Usually in a microprocessor this happens in registers (For an example see this article, to be more precise: the PRU program portion). These often have the size of one word and can therefore store one value. In our case one word consists of 8-Bit and one register in this CPU can store 8-Bit. However this doesn’t necessary has to be the case. You could also design a CPU with an 64-Bit ALU and store 128-Bit values in your registers, this would make sense if you natively support multiplications, where the result can be the 2 times the input length.

In my design the data and instructions are stored separately. This architecture is known as the “Harvard architecture“. Combined storage of instructions and data in one storage is also referred to as the “Von-Neumann architecture“.

However, registers are not meant to store data permanently. They are only used for storing results from previous calculations to use them in future calculations. To store values for a longer time, RAM is the way to go. To store them permanently, even after powering the system off, other methods, like hard-disk-drives, are used.

I plan to use 16 registers, from which only 13 can be used by the application programmer. The three registers, that are read-only, will hold the values for -1, 0 and 1, so in binary (1111 1111), (0000 0000) and (0000 0001), to make easy in- and de- crements possible.

If you look at all the registers of the CPU, the resulting table is often referred to as the “register-file“. Sometimes it’s also called “scratchpad”.

A shifter is usually also included in CPUs. The simplest possible shifter allows the programmer to shift the result of the last calculation by 1-bit, either to the left or to the right. It’s also possible not to shift at all. More advanced CPUs allow the programmer to shift multiple positions or to rotate binary values.

My CPU will only allow the user to shift the result by one position (left or right) or it will just pass the value through without doing anything at all.

A clock
Obviously the complete circuit (the CPU) will need to have a signal that tells all the individual components and sections when they should do their work. As this CPU will won’t have any pipelining, we’ll need to delay the clock for each individual section, so it each section will know when it has to work.

If you use pipelining, the sections you usually have are: Instruction fetch, instruction decode (register values are obtained here), execute, store and write-back. So we’ll need similar sections that I’ll discuss in more detail later, however as you can see, we’ll also need to load the next instruction from our program memory, we’ll also need to get the values from the registers, obviously we’ll also need to calculate something and at the end we should save the result somewhere.

We’ll have to load and store a lot of bits during our calculations, so we’ll need some lines where the information can travel. Such a line is called a bus. We could either load all bits simultaneously (parallel) or one bit after another (serial). Each of these methods has its pros and cons, however to keep the whole thing simple and easier to understand and build I’ll only use parallel busses.

The main busses will be between the register-file and the ALU, the ALU and the program-memory and the whole CPU and an external memory (or memory controller to be more precise), for storing data in RAM or persistent memory.

I guess I described everything a very simple CPU has to consist of, to make any sense. It consists of a unit that calculates results, data storage and busses that connect the components and transfer data. Everything is then controlled by a clock signal.

Like I said above, this is by no means a complete list. These are the very basic components of the CPU I want to design. I linked further resources at the relevant positions. I’ll also explain the parts in more detail when I show you the individual designs.

The ALU in more detail

Like I stated above, this part is responsible for all the calculations made. Usually the ALU is symbolised by a v-shaped symbol with its in- and output-lines:

Figure 1: Typical representation of an ALU

So a, b and f are the inputs. A and B are input-data words that are going to be used in the next calculation, F represents input flags. It’s used for several things, but for example it let’s the ALU-circuit know what the next instruction is and therefore how to calculate the result (R).

The outputs are the result (R) and output-flags (D), which can be used by the application programmer. For example: The ALU might have a line that gets high, when the last result was 0, or when it was negative. There are loads of more things that can be output here.

Arithmetic functions in the ALU

In this section I’ll only look at different types of binary addition circuits, because that’s the only function my ALU will support, because all the other simple arithmetic operations can be derived from the addition.

I’ll start with the simplest of all possibilities. Just imagine that you only have two one-bit words you want to calculate. The output also consists of one bit.

The half-adder adds two one-bit binary digits together without considering a previous carry, that might have occurred during a previous addition. So it has two inputs: A and B and two outputs: S and C which stands for sum and carry.

The following table shows you, what the results of this adder look like:

0 0 0 0
0 1 1 0
1 0 1 0
1 1 0 1

This adder can be built by using a XOR and an AND logic gate and it looks like this:

Figure 2: The half-adder

Try it here:

The full-adder is very similar to the half-adder from above. Actually the only difference here is, that it has an additional input, often named with Cn-1 or Ci, which stands for a carry-in-bit from a previous addition.

The full-adder logic looks like this:

Figure 3: The full-adder. It’s made up from two half-adders and an OR-Gate

Try it here:

n-Bit addition
As stated above, this CPU will work with 8-Bit long data words. Therefore the ALU will have to be able to add two 8-Bit numbers together. So we’ll have to combine the one bit adders from above to a larger network, to enable the addition of more than one bit at once. There are some popular methods but I only want to discuss two of them really quickly, to keep the article short, because it’s getting too long already:

Carry-ripple adder
I will start with the easiest to understand. It can start with a full-adder but I’ll start with a half-adder, because for my CPU it’s not relevant whether there was a carry-out in the last addition made. The rest of the carry-ripple-adder is made up by full adders.

The inputs A and B are the respective positions of the numbers you want to add and the carry-in of one adder is connected to the carry-out of the previous one. The last FA’s carry-out is the carry-out of the whole carry-ripple-adder. It can be used to tell the application programmer that an overflow occurred during the last addition by setting an output line of the ALU high.

So let’s take a look at an 4-Bit carry-ripple-adder (of course the 8-Bit one would have 16 inputs and 8 adders in total, but this is enough to show you the functionality):

Figure 4: carry-ripple-adder

Try it here: (pre-made example)

A good thing about this circuit is, that it’s simple to build and easy to understand. One negative side effect is, that it’s relatively slow.

By the way I built this adder in LBP a while ago. Here’s a quick demo of how full-adders and the carry-ripple adder works:

Carry-lookahead adder
I would consider this the opposite from the carry-ripple adder shown above: It’s much more complex and therefore harder to think through but it’s quicker. It looks like this:

Figure 5: Carry-lookahead adder in full detail

The idea behind this method is, that you don’t have to wait for the results to pass through all the adders, before you can calculate your final result. I guess it might not be that dramatic in the example shown above, because we only have two 4-Bit numbers. But what if you have two 32-Bit numbers? (Hint: it get’s really slow, especially when the carry has to travel through all the adders)

So in the carry-lookahead adder we determine whether a bit is being carried out or not after a specific position and then we can calculate different parts of the numbers at once. For example if there are two 4-Bit numbers put into the adder we determine wether the last two bits will generate a carry-out and then use this information to calculate the sum of the first two bits simultaneously with the sum of the last two bits and output all the results. That’s why this adder is also sometimes referred to as a ‘parallel-adder’ or ‘PA with lookahead’.

A really good and more detailed explanation can be found in this article.

Block diagram

The following block diagram shows the ALU and it’s in- and outputs. Also it shows the clock. I decided to go with the carry-ripple adder, simply because it’s the easiest to implement and to build and I don’t care about performance too much in this case, it won’t be very good anyways and I’ll take a look at some techniques to improve the performance in the last part of the series.

However, here’s the ALU in all it’s glory:

Figure 6: The ALU in a logic simulator

Try it here:

In the next part…

… I want to talk about how the ALU is controlled, where data is stored and how it is transferred into the ALU and back to the memory.


Table of contents

Part 1 – Basics and the ALU (You are here)
Part 2 – Registers and memory
Part 3 – Applications
Part 4 – The completed CPU
Part 5 – Further improvements and advanced topics (Not published yet)

Further resources

Adders, Wikipedia
Online logic editor and simulator,


[Fig. 1] An ALU, Wikipedia
[Fig. 5] CLA, Wikipedia


One thought on “How to design your own CPU from scratch – Part 1

Leave your two cents, comment here!

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s