PetaFLOPS Inference Era: 1 PFLOPS Attention, and Preliminary End-to-End Results
Achieve 1 PFLOPS Attention on a Single H100 SXM
Feb 7, 2024

8bit HippoAttention: Up to 3X Faster Compared to FlashAttentionV2
HippoML's Out-of-The-Box 8bit Inference
Jan 17, 2024

Unified DataCenter & Local Foundation Model Serving: Beyond the Docker Way
Is a generalized Docker solution good for Foundation Model serving?
Jan 8, 2024

Super AI Creativity App Run with Local GPU on Mac/Windows/Linux [Early Access]
Last year highlighted the limitations and challenges of centralized AI services, including the spike in demand and turmoil in AI company…
Jan 2, 2024

Up to 80X Speedup in Multi-Head Attention on Apple Silicon
What is Multi-Head Attention
Dec 12, 2023

Large Language Model Inference: from Datacenter to Edge
Just 6 months ago, we wouldn't have even thought to make such a disclosure. With the rapidly growing awareness of ChatGPT, we thought we'd…
May 30, 2023