TPU: the Tensor Processing Unit is highly optimized for large batches and CNNs and has the highest training throughput. TPUs are well suited to deep learning workloads built on TensorFlow, while GPUs are more general-purpose, flexible, massively parallel processors.
The compute primitive (smallest unit of work) differs between CPU, GPU and TPU, and so does the dimensionality of the data each operates on: a CPU works on scalars, a GPU on vectors, and a TPU on matrices. As a comparison: 1. A CPU can handle tens of operations per cycle. 2. A GPU can handle tens of thousands of operations per cycle. 3. A TPU can handle on the order of a hundred thousand operations per cycle. For the power-density metric, lower values are better; the TPU has the lowest, and thus the best, power density. A TPU computes a large matrix multiplication by splitting it into many smaller 128×128 matrix multiplications.
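The splitting described above is just a blocked (tiled) matrix multiplication. The sketch below is an illustrative NumPy version, not TPU code; the 128×128 tile size mirrors the dimensions of the TPU's matrix unit, and the function name is my own.

```python
import numpy as np

def tiled_matmul(A, B, tile=128):
    """Blocked matrix multiply: accumulate tile-by-tile partial products,
    mimicking how a TPU consumes a large matmul as many 128x128 sub-matmuls."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((n, m), dtype=A.dtype)
    for i in range(0, n, tile):          # rows of the output tile
        for j in range(0, m, tile):      # columns of the output tile
            for p in range(0, k, tile):  # walk the shared inner dimension
                C[i:i+tile, j:j+tile] += (
                    A[i:i+tile, p:p+tile] @ B[p:p+tile, j:j+tile]
                )
    return C

A = np.random.rand(256, 384).astype(np.float32)
B = np.random.rand(384, 256).astype(np.float32)
C = tiled_matmul(A, B)
```

The result is identical (up to floating-point rounding) to `A @ B`; the point is that each 128×128 sub-problem fits the hardware's fixed-size compute unit.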
For a GPU the process is the same, but with smaller tiles and more processors. Similarly to the TPU, we use two loads in parallel to hide memory latency. For GPUs, however, the tile size would be 96×96 for 16-bit data. On a Tesla V100 GPU, many of these tile multiplications can run in parallel across its SMs at full bandwidth with low memory latency, with each tile handled by its own thread block.
Note that all models are wrong, but some are useful. I would expect this bandwidth model to come within a reasonable margin of the correct runtime values for TPU vs GPU. The biggest limitation is that these calculations hold only for specific matrix sizes.
Computational differences can be amplified for certain sizes. For example, if your batch size is 12 there is a slight speedup for GPUs compared to TPUs, and decreasing the size of matrix B also improves the relative performance of GPUs. So this comparison might favor TPUs slightly.
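The bandwidth model referred to above can be approximated with a simple roofline-style estimate: runtime is bounded by whichever is slower, moving the bytes or doing the math. The function and all numbers below are illustrative placeholders of my own, not measured figures for any specific device.

```python
def estimated_runtime_s(flops, bytes_moved, peak_flops, bandwidth_bytes_s):
    """Roofline-style estimate: runtime is the slower of the compute time
    and the memory-transfer time for the given operation."""
    compute_time = flops / peak_flops
    memory_time = bytes_moved / bandwidth_bytes_s
    return max(compute_time, memory_time)

# Hypothetical example: an n x n x n 16-bit matmul on a device with
# 100 TFLOPS peak compute and 900 GB/s memory bandwidth (placeholder values).
n = 4096
flops = 2 * n**3                # one multiply + one add per inner-product term
bytes_moved = 3 * n * n * 2     # read A and B, write C, 2 bytes per element
t = estimated_runtime_s(flops, bytes_moved, 100e12, 900e9)
```

For these placeholder numbers the matmul is compute-bound; shrinking the matrices quickly tips it into the memory-bound regime, which is exactly why the comparison depends on matrix size.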
Further direct limitations include fused operations: the TPU can compute additional element-wise operations, such as a non-linear activation function or a bias, on the fly within a matrix multiplication. If we repeat the same calculations from above for 32-bit values (64×64 tiles), the numbers shift by roughly a factor of five. So the datatype size has a much larger effect than switching from TPU to GPU and vice versa. TPUs do not support 8-bit training, but Turing GPUs do.
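What "fused on the fly" means can be sketched as an epilogue applied per output tile. This NumPy version is only conceptual (NumPy itself does not fuse); the point it illustrates is that the bias add and ReLU happen while the tile is still "on chip", instead of writing the full product out and re-reading it for a separate element-wise pass.

```python
import numpy as np

def matmul_bias_relu_fused(A, B, bias, tile=128):
    """Conceptual fused epilogue: element-wise bias and ReLU are applied
    to each output tile immediately after it is computed."""
    n, m = A.shape[0], B.shape[1]
    C = np.empty((n, m), dtype=np.float32)
    for i in range(0, n, tile):
        for j in range(0, m, tile):
            acc = A[i:i+tile, :] @ B[:, j:j+tile]       # matrix part
            acc += bias[j:j+tile]                       # fused bias add
            C[i:i+tile, j:j+tile] = np.maximum(acc, 0)  # fused ReLU
    return C

A = np.random.rand(256, 64).astype(np.float32) - 0.5
B = np.random.rand(64, 256).astype(np.float32) - 0.5
bias = np.random.rand(256).astype(np.float32)
C = matmul_bias_relu_fused(A, B, bias)
```

The output matches the unfused `np.maximum(A @ B + bias, 0)`; the win on real hardware is the saved memory round-trip, not the arithmetic.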
So we can also have a look at how 8-bit matrix multiplication would impact performance. I published research on 8-bit models, and it is not too difficult to train them with 8-bit alone; in fact, the literature on low-bit computing is quite rich. With 32-bit accumulation, as supported by Turing GPUs, 8-bit training should be even easier. If we can make 8-bit computing work for general models, this would entail huge speedups for transformers.
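The key property of 8-bit matmul with 32-bit accumulation is that inputs stay small (int8) while partial sums grow into a wider type, so they do not overflow. A minimal NumPy sketch of that numeric behavior (not a hardware kernel):

```python
import numpy as np

def int8_matmul_i32_accum(A8, B8):
    """8-bit matmul with 32-bit accumulation: int8 inputs are widened and
    the inner-product sums are carried in int32, avoiding int8 overflow."""
    return A8.astype(np.int32) @ B8.astype(np.int32)

rng = np.random.default_rng(0)
A8 = rng.integers(-128, 128, size=(64, 64), dtype=np.int8)
B8 = rng.integers(-128, 128, size=(64, 64), dtype=np.int8)
C32 = int8_matmul_i32_accum(A8, B8)  # exact int32 result
```

With 64-element inner products of int8 values, the worst-case magnitude is about 64 × 128 × 128 ≈ 1.0e6, comfortably inside int32 range, so the accumulation is exact.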
All of this makes 16-bit computation the sensible baseline. TPUs are faster for training BERT-like models. On a standard, affordable multi-GPU machine one can expect to train BERT base in a matter of days using 16-bit compute, and less using 8-bit. A TPU, or Tensor Processing Unit, on the other hand, processes tensors: geometric objects that describe linear relations between vectors, scalars, and other tensors. If you are trying to optimize for cost, then it makes sense to use a TPU only if it trains your model at least several times as fast as the same model would train on a GPU.
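That cost rule of thumb can be written down as a one-line comparison: total cost is price times hours, so the TPU wins on cost only when its speedup exceeds the price ratio. The helper and the hourly prices below are hypothetical placeholders, not real cloud pricing.

```python
def cheaper_on_tpu(tpu_speedup, tpu_price_per_h, gpu_price_per_h):
    """Cost rule of thumb: with total cost = price * hours, the TPU is
    cheaper exactly when its speedup exceeds the hourly price ratio."""
    return tpu_speedup > tpu_price_per_h / gpu_price_per_h

# Hypothetical prices: $8/h for a TPU vs $3/h for a GPU machine.
# The TPU must then be more than 8/3 (about 2.7x) faster to win on cost.
cheaper_on_tpu(3.0, 8.0, 3.0)  # True: 3x speedup beats the 2.7x price ratio
```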
We have compared these with respect to memory-subsystem architecture, compute primitive, performance, purpose, usage and manufacturers. You can use a GPU to run PUBG at 4K, but a TPU sticks to neural networks. Graphics Processing Units (GPUs), Tensor Processing Units (TPUs) and Field-Programmable Gate Arrays (FPGAs) are processors with a specialized purpose and architecture.
While you can buy a GPU with the system you buy, TPUs are only accessible in the cloud (for now)!