![](https://crypto4nerd.com/wp-content/uploads/2023/07/1vxlcZU8pQCnTObszkPm3Mg-1024x568.png)
There are multiple layers to the codesign paradigm, as shown in the figure above.
Hardware
What are Google TPUs?
Google TPUs are co-processors focused on accelerating ML and AI workloads.
TPUs were first introduced in 2015. At the time, ML and AI workloads were doubling every 3.5 months, which pushed Google to design its first Tensor Processing Unit (TPU v1) around performance and power efficiency.
Imagine you had to design for a new workload. There are two approaches. The first goes through traditional processors: take a general-purpose processor, extend its instruction set architecture, and add new instructions that make the workload run faster. The problem with this approach is instruction latency. A general-purpose processor is latency-oriented: it has to be good at issuing and executing a constantly changing stream of instructions, even when a workload like ML repeats the same instruction over and over. So if we bring an ML workload to a general-purpose processor and extend its instruction set, the per-instruction issue overhead still dominates.
The second approach is a co-processor, or specialized processor, typically fabricated as an ASIC. At Google, these are the TPUs. TPUs have been developed over many generations; the current one is v4. Today, workloads such as Translate and AlphaGo run on these TPU processors.
Why are TPUs important?
GPUs were previously slow to catch up on these workloads because they target a wide variety of workloads, whereas a TPU, narrowed to a single domain, can be tuned precisely for it. TPU v1 delivered about 16 times better performance per watt than contemporary GPUs. GPUs have improved drastically since then, but TPU v4 still comes out ahead.
In a simple ML application, for example image classification, a neural network learns the characteristics of an image and predicts whether it is a cat or not.
Under the hood, the computation is dominated by matrix multiplications performed billions or trillions of times. These matrix multiplications are mapped onto the TPU, whose systolic array replaces the traditional way of executing them.
Systolic arrays were introduced in 1970s computer architecture [Kung and Leiserson]. Systolic arrays excel at matrix multiplication by treating the matrices as data flowing through a 2D grid of processing elements, with pipelined, nearest-neighbor exchanges.
The process works as follows: the weights stay local to each processing element while the inputs stream through, and partial sums are passed between neighbors. As the computation proceeds, the multiplies and accumulations flow across the 2D grid, and the final result is identical to that of the conventional algorithm.
The systolic array converts loads and stores from SRAM into near-neighbor communication between processors. That is why it achieves much better performance, and also much better power efficiency, than traditional executions of matrix multiplication.
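To make the dataflow concrete, here is a minimal simulation of an output-stationary systolic array (a sketch for illustration, not Google's exact microarchitecture; the function name and sizes are made up). Each processing element (PE) only ever touches values handed over by its neighbors, which is exactly what replaces the loads and stores of a conventional matmul loop.

```python
# Sketch: output-stationary systolic matrix multiplication. PE (i, j)
# accumulates C[i][j]; A-values flow left-to-right, B-values flow
# top-to-bottom, and the skewed schedule means PE (i, j) performs its
# k-th multiply-accumulate at cycle t = i + j + k.
import numpy as np

def systolic_matmul(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N))
    # Each iteration of t is one clock cycle of the whole 2D grid.
    for t in range(M + N + K - 2):
        for i in range(M):
            for j in range(N):
                k = t - i - j  # which operand pair reaches PE (i, j) now
                if 0 <= k < K:
                    # The MAC every PE does each cycle: no main-memory
                    # loads or stores, just data from its neighbors.
                    C[i, j] += A[i, k] * B[k, j]
    return C

A = np.random.rand(3, 4)
B = np.random.rand(4, 5)
assert np.allclose(systolic_matmul(A, B), A @ B)
```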
Google deploys many MEMS-based optical circuit switches (OCS) alongside standard datacenter networking routers and routing hardware. The same switches are useful for scaling TPU supercomputers up to many thousands of chips.
On the application end, ML applications need scale and throughput to be efficient. With hundreds of billions of parameters, we need scale just to train and serve a large model, which means packing many chips together; that in turn requires a communication fabric that is lightweight, energy efficient, and cost effective.
OCS provides good energy efficiency, low latency, and high bandwidth. An OCS is essentially an array of mirrors controlled electromechanically: light enters on input fibers, and by tilting a mirror you route it to a chosen output fiber. This is energy efficient compared to a packet-switched electrical network, where switches may have to make a decision for every single packet, depending on the application. You get speed-of-light propagation delay and low power consumption, because nothing electrical is switching along the path: once the MEMS mirror array is set, light just flows through without consuming additional power.
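As a toy model of the idea (all names here are hypothetical, not a real API), an OCS behaves like a statically configured permutation of ports: configuring the mirrors is a one-time electromechanical action, and routing afterwards involves no per-packet work.

```python
# Toy model of an optical circuit switch. Each MEMS mirror statically
# maps one input fiber to one output fiber; once configured, light
# passes through with no per-packet switching, which is where the
# energy savings come from.
class OpticalCircuitSwitch:
    def __init__(self, num_ports: int):
        self.num_ports = num_ports
        self.mirror = {}  # input port -> output port

    def configure(self, mapping: dict[int, int]):
        """Point the mirrors: a one-time, electromechanical operation."""
        assert len(set(mapping.values())) == len(mapping), "outputs must be unique"
        self.mirror = dict(mapping)

    def route(self, in_port: int) -> int:
        """Light just follows the mirror; no decision logic per 'packet'."""
        return self.mirror[in_port]

ocs = OpticalCircuitSwitch(num_ports=4)
ocs.configure({0: 2, 1: 3, 2: 0, 3: 1})
assert ocs.route(0) == 2
```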
Systems
How do we use OCS at large scale?
Training datasets are huge and still growing, and algorithms and model structures change rapidly, so more scaling is needed. Datacenters have scaled up to about 4,000 chips in TPU v4, and the OCS makes it possible to reconfigure the supercomputer's network.
The first thing to consider at large scale is availability. Any individual chip might fail, and once you pack thousands of chips together, the chance that at least one of them fails at any given time is high; the system has to keep running anyway.
Google builds the machine out of 4x4x4 chip blocks, with the OCS linking the blocks. If a chip goes down, the system detects it (within about a hundred milliseconds), finds a spare block in the datacenter, reconfigures the OCS, and routes to the replacement.
The OCS is also bandwidth independent: because it merely redirects light, you can attach more or faster fibers to the optical switch without replacing it, as long as the endpoints on either side agree on the data rate.
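Putting these ideas together, here is a sketch of failure-driven rescheduling (block IDs, ports, and the health check are all hypothetical; the OCS is modeled as a plain input-to-output port mapping, as in the toy model earlier): swap the failed block for a healthy spare, then re-point the mirrors in one reconfiguration.

```python
def reschedule(mirror, slot_port, block_port, assignment, healthy, spares):
    """Swap failed blocks for healthy spares, then re-point the mirrors
    so each job slot's fiber reaches its (possibly new) block."""
    fresh = iter(b for b in spares if b in healthy)
    for slot, block in assignment.items():
        if block not in healthy:          # failure detected (~100 ms)
            assignment[slot] = next(fresh)
    # One OCS reconfiguration restores connectivity for the whole job.
    mirror.clear()
    mirror.update({slot_port[s]: block_port[b] for s, b in assignment.items()})
    return assignment

ports = {"A": 0, "B": 1, "spare": 2}
mirror = {}
plan = reschedule(mirror, slot_port={0: 3, 1: 4}, block_port=ports,
                  assignment={0: "A", 1: "B"},      # block B has failed
                  healthy={"A", "spare"}, spares=["spare"])
assert plan[1] == "spare" and mirror[4] == ports["spare"]
```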
Software
On the programming model side, we will focus on two high-level optimizations, DeepFusion and flexible parallelization. Both are critical for workload performance.
DeepFusion
This is how a traditional, unfused convolution works: it reads its inputs from slow memory (SRAM), writes its output back to slow memory, and the next layer reads that same output again. Memory transfer becomes the bottleneck: moving roughly 2 bytes per flop corresponds to an operational intensity of only about 0.5 flops per byte.
Machine and loop balance
Therefore, we should consider machine balance and loop balance, introduced by Carr and Kennedy before roofline analysis. Machine balance is the theoretical peak number of flops that can be performed per byte read from memory. Loop balance is the same ratio for the loops the compiler actually generates.
If the machine balance says we can do 4 flops for every byte read from memory, an ideal compiler would generate loops whose balance is also four. Google tries to get loop balance as close to machine balance as possible. One way is fusing three layers together: this DeepFusion triples the operational intensity compared to the unfused version.
The key gains are better hardware utilization and better memory utilization, from avoiding the transfers of intermediate tensors; the tradeoff is that intermediate outputs no longer live in slow memory and must fit on chip. The main result is parity between machine balance and loop balance: the fused loop hits the theoretical roofline, measured in flops per byte consumed from SRAM. A few examples of related research follow.
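A back-of-the-envelope model makes the 3x claim concrete. Assume three elementwise layers over an fp32 tensor, with one flop per element per layer (sizes are illustrative):

```python
# Operational intensity of an unfused vs. fused 3-layer elementwise
# pipeline (hypothetical sizes; fp32, so 4 bytes per element).
n = 1_000_000          # elements in the tensor
bytes_per_elem = 4
flops = 3 * n          # one op per element in each of the 3 layers

# Unfused: every layer reads its input from slow memory and writes its
# output back, so each layer moves 2*n elements.
unfused_bytes = 3 * 2 * n * bytes_per_elem
# Fused: read the input once, keep intermediates in fast on-chip
# storage, write the final output once.
fused_bytes = 2 * n * bytes_per_elem

print(flops / unfused_bytes)  # 0.125 flops/byte
print(flops / fused_bytes)    # 0.375 flops/byte -> 3x the intensity
```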
Examples
- FlashAttention: fusion and recomputation, techniques already developed in the TPU compiler.
- Deeper multi-operator fusion is still in development; work from PLDI'21 argues that matrix-matrix fusions are unprofitable and more complex to orchestrate.
Fusion is a first-class primitive in the TPU compiler, and fusions execute performantly on the TPU.
In the attention mechanism, FLAT fuses the attention logits and the softmax together. To overcome both compute-bound and memory-bound bottlenecks, it targets the pattern of a small input tensor feeding a large intermediate tensor: by fusing the three operators, the large intermediate never has to be fully materialized.
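The enabling trick can be sketched generically with an online softmax (this is the flavor of FLAT and FlashAttention, not either paper's exact algorithm): by streaming over key/value blocks and rescaling running partial results, the full queries-by-keys logit matrix is never materialized.

```python
# Online-softmax attention for one query vector: only a small block of
# logits exists at any time, fusing logit computation and softmax.
import numpy as np

def fused_attention_row(q, K, V, block=128):
    m = -np.inf                  # running max of logits (for stability)
    denom = 0.0                  # running softmax denominator
    acc = np.zeros(V.shape[1])   # running weighted sum of value rows
    for start in range(0, K.shape[0], block):
        logits = K[start:start + block] @ q     # small logit block only
        m_new = max(m, logits.max())
        scale = np.exp(m - m_new)               # rescale older partials
        w = np.exp(logits - m_new)              # softmax numerators
        denom = denom * scale + w.sum()
        acc = acc * scale + w @ V[start:start + block]
        m = m_new
    return acc / denom

q = np.random.rand(64)
K = np.random.rand(1000, 64)
V = np.random.rand(1000, 32)
logits = K @ q
ref = (np.exp(logits - logits.max()) / np.exp(logits - logits.max()).sum()) @ V
assert np.allclose(fused_attention_row(q, K, V), ref)
```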
Flexible parallelization
Depending on the model being benchmarked, for example an LLM, different parallelization strategies apply. The first ingredient of flexible parallelization is GSPMD, which parallelizes the computation across chips. The second is the reconfigurable OCS: by looking at the workload, Google can configure the network topology to match what the workload needs.
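A rough cost model shows why the right strategy, and hence the right topology, is workload-dependent (all numbers are illustrative; GSPMD derives shardings like these from user annotations):

```python
# Rough per-chip cost of two parallelization strategies for one
# [batch, d] x [d, d] layer spread across P chips.
batch, d, P = 4096, 8192, 64
flops = 2 * batch * d * d           # total multiply-adds for the layer

# Data parallel: shard the batch; every chip holds the full weights and
# must all-reduce weight gradients (~2*d*d values) each step.
dp_flops_per_chip = flops / P
dp_comm_per_chip = 2 * d * d        # gradient all-reduce volume (elems)

# Tensor (model) parallel: shard the weights; activations of the batch
# (~batch*d values) must be exchanged across chips instead.
tp_flops_per_chip = flops / P
tp_comm_per_chip = batch * d        # activation exchange volume (elems)

print(dp_comm_per_chip, tp_comm_per_chip)
# Which strategy wins depends on batch vs. model size -- which is why a
# reconfigurable topology that can match the chosen sharding helps.
```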
Math
Two main topics are in focus here: numerics and sparsity.
Numerics
In general, we represent numbers in floating point, which works like scientific notation (e.g., 6.02e23).
A change in numeric type can affect the quality of the model. Google discovered that what models wanted was slightly different from the standardized formats: models want more exponent bits relative to mantissa bits, which led to bfloat16. Compared to fp16, bfloat16 has more exponent bits (8 versus 5) but significantly fewer mantissa bits (7 versus 10). Multiplier area grows roughly quadratically with mantissa width, so fewer mantissa bits improve the performance and energy efficiency of the hardware.
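The format is easy to see at the bit level: bfloat16 is simply the top 16 bits of an IEEE float32 (1 sign, 8 exponent, 7 mantissa bits), so it keeps fp32's dynamic range while giving up precision. A minimal sketch (round-to-nearest-even; NaN and overflow handling omitted for brevity):

```python
# Truncate a float32 to bfloat16 by keeping its top 16 bits.
import struct

def to_bfloat16(x: float) -> float:
    bits = struct.unpack("<I", struct.pack("<f", x))[0]   # float32 bits
    lsb = (bits >> 16) & 1                                # tie-break bit
    rounded = (bits + 0x7FFF + lsb) & 0xFFFF0000          # keep top 16
    return struct.unpack("<f", struct.pack("<I", rounded))[0]

print(to_bfloat16(3.141592653589793))  # 3.140625: ~3 decimal digits left
print(to_bfloat16(1e38))               # huge values remain representable
```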
Sparsity
Sparsity can come in many forms, including weight, activation, and semantic sparsity. Different types of sparsity come at different granularities, from fine-grained (individual zero values) to coarse-grained (whole zero blocks). The right structure has to be chosen for the specific domain to arrive at a strategy that is actually profitable for the system.
Examples
- SparseCore in TPU v4: recent work on a combined software/hardware system solution that accelerates sparse computation for applications like CNNs and GNNs.
- Hessian-based augmentation for mixed-precision training.
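To illustrate the granularity spectrum described above, here are two simplified matrix-vector kernels (formats and sizes made up for illustration): a fine-grained CSR-style kernel that touches individual nonzeros, and a coarse-grained kernel that skips whole zero blocks.

```python
import numpy as np

def csr_matvec(values, col_idx, row_ptr, x):
    """Fine-grained: store and touch only individual nonzero entries."""
    y = np.zeros(len(row_ptr) - 1)
    for i in range(len(y)):
        for p in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[p] * x[col_idx[p]]
    return y

def block_matvec(dense, block_mask, x, bs=4):
    """Coarse-grained: skip whole bs x bs blocks that are all zero."""
    y = np.zeros(dense.shape[0])
    for bi in range(dense.shape[0] // bs):
        for bj in range(dense.shape[1] // bs):
            if block_mask[bi, bj]:  # compute only on nonzero blocks
                rows = slice(bi * bs, (bi + 1) * bs)
                cols = slice(bj * bs, (bj + 1) * bs)
                y[rows] += dense[rows, cols] @ x[cols]
    return y

A = np.zeros((8, 8)); A[0, 3] = 2.0; A[5, 1] = -1.0
x = np.arange(8.0)
mask = np.array([[A[i*4:(i+1)*4, j*4:(j+1)*4].any() for j in range(2)]
                 for i in range(2)])
vals, cols, rptr = [2.0, -1.0], [3, 1], [0, 1, 1, 1, 1, 1, 2, 2, 2]
assert np.allclose(csr_matvec(vals, cols, rptr, x), A @ x)
assert np.allclose(block_matvec(A, mask, x), A @ x)
```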
Application and Infrastructure
There has been concern about the power consumption of AI workloads. Google has been thinking about power and energy efficiency for a very long time, moving datacenters to locations with access to clean energy.
The cost of training a large model is high. Responsible AI is a broad topic; one focus is the carbon emissions of ML training, which can be measured in the datacenter.
The 4Ms are the strategy for reducing energy use and carbon emissions: model, machine, mechanization, and map. The model can change, for example from Transformer to Primer. The machine can improve: TPU v4 is about 14x more efficient than the P100. Mechanization and map concern the datacenter itself and where it sits relative to energy resources.
In short, Google tries to site its datacenters where they can access clean energy sources.