News

NVIDIA's new H200 Tensor Core GPUs and TensorRT-LLM updates tackle soaring AI demands today

As the demand for increasingly sophisticated generative AI models continues to grow, so too do model sizes. Clearly, this is a problem for infrastructure service providers whose customers rely on their hardware solutions to train new models and deploy them effectively. To address the growing need for inference platforms today, NVIDIA is announcing that its new H200 Tensor Core GPUs are on their way to customers and will be available from all the usual OEMs and ODMs in the data server industry in Q2 2024 – which is just days away.

To recap, the H200 is essentially an H100, but now endowed with HBM3e memory in place of the original H100 configuration's HBM3. The result is a massive boost in memory capacity per GPU to 141GB (up from 80GB), with total memory bandwidth of 4.8TB/s (up from just over 3TB/s).

Tested on the latest MLPerf Inference benchmarks, which are developed by a consortium of AI leaders across academia, research and industry to provide an unbiased evaluation of the training and inference performance of hardware, software and services, the updated memory technology proves a massive boon to inference performance. The latest benchmark version incorporates even newer tests and workloads, such as the Llama 2 language model with 70 billion parameters, to better represent evolving inferencing needs. Setting new machine learning performance records with over 31,000 tokens/s on the MLPerf Llama 2 benchmark, the NVIDIA H200 Tensor Core GPUs are just what's needed at this point in time.

More so, the H200 is drop-in compatible with the H100 (similar physical connections and TDP requirements), so existing customers can swap them out without a penalty in power requirements for a big boost in inferencing performance. Consequently, systems built around the H200 also benefit from this update, with the same upswing in memory throughput to tackle these ever-growing workloads expeditiously.

Back when transformer deep learning models were new and unlocked notable performance in self-supervised learning for natural language processing (NLP) tasks such as summarisation and translation, as well as computer vision, BERT was one of the most prominently used models, at roughly 340 million parameters in size. Today, the largest models sit at a staggering 1.8 trillion parameters, and going by the trend line, that growth is not going to stop. While the Blackwell-based GPUs announced at GTC 2024 are designed to tackle trillion-parameter generative AI models, they will not be available to customers until at least the end of 2024 or even early 2025. For now, the H200 will be the highest-performing set of offerings from NVIDIA.

NVIDIA has shared that approximately 40% of its data centre revenue can be attributed to inferencing platform needs. This underscores the market demand for inference across various domains such as LLMs for textual manipulation, visual content generation and recommender systems. The latter's models can be exceptionally large in order to make accurate, fruitful and timely recommendations.

To recap, to better utilise the Tensor Cores within these modern GPUs that are responsible for AI processing tasks and workloads, TensorRT-LLM was developed as an open-source library to accelerate and optimise inference performance of the latest LLMs running on NVIDIA Tensor Core-equipped GPUs. First released in 2023, it has already delivered sizeable gains over running the same models without TensorRT-LLM.
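To put the memory figures quoted above into perspective, here is a rough back-of-envelope sketch. It is an unofficial simplification, not an NVIDIA calculation: it assumes weight-dominated, memory-bound decoding (every generated token re-reads all of the model weights), takes 3.35TB/s as the H100's "just over 3TB/s" bandwidth, and ignores the KV cache, batching and software optimisations.

```python
# Back-of-envelope sketch (unofficial simplification): how HBM capacity and
# bandwidth bound LLM inference. Assumes memory-bound decoding where every
# generated token re-reads all model weights; real throughput also depends on
# batching, KV-cache handling, compute and the software stack.

def weights_fit(params_billions: float, bytes_per_param: float, hbm_gb: float) -> bool:
    """Do the model weights alone fit in a single GPU's HBM?"""
    weight_gb = params_billions * bytes_per_param   # 1e9 params * 1 byte = 1 GB
    return weight_gb <= hbm_gb

def decode_ceiling_tok_s(params_billions: float, bytes_per_param: float,
                         bandwidth_tb_s: float) -> float:
    """Rough single-stream decode ceiling: bandwidth divided by weight bytes read per token."""
    weight_gb = params_billions * bytes_per_param
    return (bandwidth_tb_s * 1000.0) / weight_gb    # TB/s -> GB/s over GB of weights

# Llama 2 70B at FP8 (~1 byte per parameter), with the figures quoted in this article
for name, hbm_gb, bw_tb_s in [("H100", 80, 3.35), ("H200", 141, 4.8)]:
    print(f"{name}: 70B FP8 weights fit = {weights_fit(70, 1.0, hbm_gb)}, "
          f"single-stream ceiling ~ {decode_ceiling_tok_s(70, 1.0, bw_tb_s):.0f} tokens/s")
```

The batched MLPerf figure of over 31,000 tokens/s is far beyond that single-stream ceiling because many concurrent requests share each pass over the weights, which is exactly where the software stack comes in.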
Thankfully, NVIDIA continuously improves this inference compiler for its GPUs and keeps folding new optimisations into TensorRT-LLM, so it has more performance gains to show. Here are some of the ways TensorRT-LLM accelerates LLM inference:

- In-flight (continuous) batching, which lets new requests join a running batch as soon as earlier ones complete (a toy sketch follows at the end of this article)
- Paged KV-cache management to make better use of GPU memory during generation
- Quantisation support, including FP8 on Hopper GPUs, to shrink models and speed up compute
- Optimised attention kernels and multi-GPU tensor parallelism for the largest models

In just six months since its release, TensorRT-LLM has nearly tripled the inference performance of NVIDIA Hopper GPUs when tested with the GPT-J model. This means H100 and H200 GPUs are much more capable than they already are on their own, potentially delivering massive uplifts (subject to the tests and models used) while running on the same hardware. This goes a long way towards cementing NVIDIA as not only a hardware platform leader but, thanks to relentless updates by its software engineers, a platform its customers can count on to get even more value out of their investment and push the needle further in delivering the inferencing performance needed in this fast-evolving scene.
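To make one of those techniques concrete, here is a small, self-contained toy in Python. It is not TensorRT-LLM code and uses no NVIDIA APIs; the Request class and scheduler are invented purely for illustration, to show the core idea of in-flight batching: new requests join a running batch as soon as a finished sequence frees a slot, instead of waiting for the whole batch to drain.

```python
# Toy illustration of in-flight (continuous) batching, one of the TensorRT-LLM
# techniques listed above. NOT TensorRT-LLM code: a simplified pure-Python model
# of the scheduling idea only.

from collections import deque
from dataclasses import dataclass, field
import random

@dataclass
class Request:
    rid: int
    tokens_left: int                              # tokens this request still has to generate
    generated: list = field(default_factory=list)

def run_inflight_batching(requests, max_batch: int = 4) -> int:
    """Simulate decode steps; returns how many steps the whole workload took."""
    waiting = deque(requests)
    active = []
    step = 0
    while waiting or active:
        # Admit new work the moment a slot is free -- the key difference from
        # static batching, which only refills once every active request is done.
        while waiting and len(active) < max_batch:
            active.append(waiting.popleft())
        step += 1
        for req in active:                        # one decode step per active request
            req.generated.append(f"tok{step}")    # stand-in for a generated token
            req.tokens_left -= 1
        for req in [r for r in active if r.tokens_left == 0]:
            print(f"step {step}: request {req.rid} finished ({len(req.generated)} tokens)")
        active = [r for r in active if r.tokens_left > 0]
    return step

random.seed(0)
workload = [Request(i, random.randint(2, 8)) for i in range(8)]
print("total decode steps:", run_inflight_batching(workload))
```

A static scheduler would leave slots idle until the slowest sequence in each batch finished; keeping every slot busy is one of the ways the library extracts more throughput from the same silicon.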