PetaFLOPS Inference Era: 1 PFLOPS Attention, and Preliminary End-to-End Results

HippoML Blog
4 min read · Feb 7, 2024

Achieving 1 PFLOPS Attention on a Single H100 SXM

In our recent blog, we introduced HippoAttention, the industry’s first FP8 fused attention. After further optimization, we are thrilled to announce that HippoAttention has reached 1 PFLOPS on a single H100 SXM GPU. This is a significant milestone for the PetaFLOPS inference era: it demonstrates that the major building blocks of modern AI models can exceed 1 PFLOPS.

Benchmark configuration: causal=False, headdim=256, nhead=8, seqlen=8192
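
To make the FLOP accounting concrete, here is a hedged sketch that times PyTorch’s stock FP16 scaled_dot_product_attention at the same shape and converts the result to PFLOPS. The batch size is an assumption (it is not stated above), and this uses the built-in FP16 kernel rather than HippoAttention’s FP8 kernel, so it only illustrates how the throughput number is computed, not how it is achieved.

```python
import torch
import torch.nn.functional as F

# Shape from the benchmark configuration above; batch size is an assumption.
batch, nhead, seqlen, headdim = 4, 8, 8192, 256
q, k, v = (torch.randn(batch, nhead, seqlen, headdim,
                       device="cuda", dtype=torch.float16) for _ in range(3))

# Warm up, then time the fused attention kernel with CUDA events.
for _ in range(3):
    F.scaled_dot_product_attention(q, k, v, is_causal=False)
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
iters = 10
start.record()
for _ in range(iters):
    F.scaled_dot_product_attention(q, k, v, is_causal=False)
end.record()
torch.cuda.synchronize()
seconds = start.elapsed_time(end) / 1e3 / iters

# Non-causal attention does two matmuls per head, QK^T and PV,
# each costing 2 * seqlen * seqlen * headdim FLOPs.
flops = 4 * batch * nhead * seqlen * seqlen * headdim
print(f"{flops / seconds / 1e15:.3f} PFLOPS")
```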

Challenges for a PetaFLOPS FP8 Inference Engine

Using float8 (either E4M3 or E5M2) is different from using float16 or bfloat16, where a direct cast from float32 usually works without any loss of accuracy. Directly casting float16/bfloat16 values to float8 degrades accuracy too much: with fewer exponent and mantissa bits, float8 has both a smaller dynamic range and lower precision. To avoid catastrophic accuracy degradation, activations and weights in float8 need to be scaled, similar to int8 quantization. For inference, E4M3 is the more useful format because of its higher precision, and the scaling makes up for its smaller dynamic range.
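
To illustrate the scaling idea, here is a minimal sketch (not HippoML’s actual scheme; the function names are ours) of a per-tensor scaled cast to torch.float8_e4m3fn, compared against a naive direct cast:

```python
import torch

E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn

def quantize_to_fp8_e4m3(x: torch.Tensor):
    """Per-tensor scaled cast to FP8 E4M3 (illustrative sketch).

    The scale maps the tensor's max magnitude onto the E4M3 range,
    so the limited dynamic range is fully used before the cast.
    """
    amax = x.abs().amax().float().clamp(min=1e-12)
    scale = E4M3_MAX / amax
    x_fp8 = (x.float() * scale).to(torch.float8_e4m3fn)
    return x_fp8, scale

def dequantize_from_fp8(x_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximation of the original tensor."""
    return x_fp8.float() / scale

# A tensor with small magnitudes: the direct cast loses most values to
# underflow and coarse subnormal steps, while the scaled cast stays close.
x = torch.randn(4, 8, dtype=torch.float16) * 1e-3
naive = x.to(torch.float8_e4m3fn).float()
x_fp8, scale = quantize_to_fp8_e4m3(x)
scaled = dequantize_from_fp8(x_fp8, scale)
print((x - naive).abs().max().item(), (x - scaled).abs().max().item())
```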

Int8 quantization has been around for many years, and int8 tensor core throughput is twice that of float16/bfloat16. Yet int8 has never seen adoption as wide as fp16, because quantization-aware training and post-training calibration are either not well supported by popular frameworks like PyTorch, or hard to use where they are supported, as in existing inference engines. At HippoML, we have already demonstrated lossless int8 quantization without quantization-aware training or post-training calibration in PrivateCanvas. We applied a similar design to our fp8 solution to make it easy to use and more suitable for wide adoption.

Preliminary FP8 End-to-End Inference Results

We are still tuning our FP8 inference engine. At this stage, our primary focus is on verifying correctness and model coverage in FP8, so current performance is not yet optimal, and we expect further gains once the planned optimizations land. Even without them, we already observe significant speedups. E4M3 is used in all HippoEngine benchmarks.

Since FP8 inference is not fully supported by any other inference framework, we use a nightly build of torch.compile (FP16/BF16/INT8) as the reference.

SDXL

We test standard SDXL with batch=1, 30 steps, at 1024 x 1024 resolution. The baseline is obtained with the Diffusion Fast codebase, with all of its optimization options turned on.

| torch.compile (INT8) | HippoEngine FP8 |
|----------------------|-----------------|
| 1752 ms | 1189 ms |
  • Note: torch.compile takes about 10X longer to compile the model.

Sample FP8 output
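
For reference, below is a hedged sketch of how an SDXL latency measurement like the one above could be set up with Hugging Face diffusers and torch.compile. This is a simplified FP16 setup, not the Diffusion Fast INT8 harness, and the checkpoint name and prompt are assumptions for illustration.

```python
import time
import torch
from diffusers import StableDiffusionXLPipeline

# SDXL at batch=1, 30 steps, 1024x1024, fp16 weights (illustrative sketch).
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
pipe.unet = torch.compile(pipe.unet, mode="max-autotune")

prompt = "a photo of an astronaut riding a horse"
pipe(prompt, num_inference_steps=30, height=1024, width=1024)  # warm-up / compile

torch.cuda.synchronize()
t0 = time.perf_counter()
pipe(prompt, num_inference_steps=30, height=1024, width=1024)
torch.cuda.synchronize()
print(f"{(time.perf_counter() - t0) * 1e3:.0f} ms per image")
```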

Embedding Model (BERT-Style)

Although autoregressive decoding models, i.e. Large Language Models (LLMs), are the hottest topic, BERT-style models are still predominantly used for text embedding. We use BERT-Large as the benchmark here. There is no torch.compile INT8 recipe for this model.

| Batch Size | Sequence Length | torch.compile FP16 | HippoEngine FP8 |
|------------|-----------------|--------------------|-----------------|
| 64 | 128 | 16 ms | 8 ms |
| 512 | 128 | 120 ms | 61 ms |
| 64 | 256 | 37 ms | 17 ms |
| 512 | 256 | 287 ms | 125 ms |
| 64 | 512 | 86 ms | 34 ms |
| 512 | 512 | 763 ms | 263 ms |
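
As a rough illustration of how the FP16 torch.compile baseline above could be reproduced, here is a sketch under stated assumptions: the Hugging Face bert-large-uncased checkpoint and random token IDs. It is not our benchmark harness.

```python
import torch
from transformers import BertModel

# FP16 BERT-Large baseline with torch.compile (illustrative sketch).
model = BertModel.from_pretrained(
    "bert-large-uncased", torch_dtype=torch.float16
).to("cuda").eval()
model = torch.compile(model)

batch_size, seq_len = 64, 128  # one row of the table above
input_ids = torch.randint(0, model.config.vocab_size,
                          (batch_size, seq_len), device="cuda")
attention_mask = torch.ones_like(input_ids)

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
with torch.inference_mode():
    for _ in range(3):  # warm-up / compile
        model(input_ids=input_ids, attention_mask=attention_mask)
    start.record()
    for _ in range(10):
        model(input_ids=input_ids, attention_mask=attention_mask)
    end.record()
torch.cuda.synchronize()
print(f"{start.elapsed_time(end) / 10:.1f} ms per batch")
```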

ViT

We used the image encoder from BLIP-2, which is based on ViT-g/14 from EVA-CLIP but removes the last ViT layer and uses the second-to-last layer’s output features. There is no torch.compile INT8 recipe for this model.

| Batch Size | torch.compile FP16 | HippoEngine FP8 |
|------------|--------------------|-----------------|
| 16 | 35 ms | 15 ms |
| 32 | 56 ms | 27 ms |
| 64 | 106 ms | 53 ms |
| 128 | 208 ms | 104 ms |
| 256 | 411 ms | 205 ms |
| 512 | 821 ms | 407 ms |
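
The “use the second-to-last layer” detail can be expressed concisely with Hugging Face transformers. The sketch below is illustrative only: a plain ViT checkpoint stands in for the EVA-CLIP ViT-g/14 encoder, and it is not our benchmark code.

```python
import torch
from transformers import ViTModel, ViTImageProcessor
from PIL import Image

# A generic ViT stands in for the BLIP-2 / EVA-CLIP image encoder here;
# the checkpoint name is illustrative, not the BLIP-2 weights.
model = ViTModel.from_pretrained("google/vit-base-patch16-224").eval()
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")

image = Image.new("RGB", (224, 224))  # placeholder input image
inputs = processor(images=image, return_tensors="pt")

with torch.inference_mode():
    outputs = model(**inputs, output_hidden_states=True)

# hidden_states[-1] is the last transformer block's output; taking
# hidden_states[-2] effectively drops the last ViT layer, mirroring
# how BLIP-2 uses the second-to-last layer's features.
features = outputs.hidden_states[-2]
print(features.shape)  # (batch, num_patches + 1, hidden_size)
```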

Conclusion

The preliminary results of FP8 inference are promising. We are adding more optimizations and expanding model coverage, and we hope to push end-to-end model inference past 1 PFLOPS soon.
