Nvidia is stepping up its collaboration with Microsoft through an enhanced co-sell strategy. At the Ignite conference, Nvidia introduced an AI foundry service designed to help businesses and startups build custom AI applications on the Azure cloud, including models that ground their responses in enterprise data through retrieval-augmented generation (RAG).
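For readers unfamiliar with the pattern, here is a minimal sketch of what RAG does: retrieve the documents most relevant to a query, then fold them into the prompt so the model answers from that context. The keyword-overlap scoring and sample documents below are illustrative stand-ins; a production system like Nvidia's would use vector embeddings and a real retriever.

```python
# Minimal illustration of the RAG pattern: retrieve the enterprise
# documents most relevant to a query, then prepend them to the prompt
# so the model answers from that context. The scoring here is a toy
# keyword overlap; real systems use vector embeddings instead.

DOCUMENTS = [
    "Q3 revenue grew 12% year over year, driven by cloud services.",
    "The support SLA guarantees a first response within 4 business hours.",
    "Employees accrue 1.5 vacation days per month of service.",
]

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Rank documents by naive keyword overlap with the query."""
    terms = set(query.lower().split())
    ranked = sorted(
        docs,
        key=lambda d: len(terms & set(d.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(query: str) -> str:
    """Assemble a grounded prompt: retrieved context first, question last."""
    context = "\n".join(retrieve(query, DOCUMENTS))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

print(build_prompt("What is the support SLA response time?"))
```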
Nvidia’s new AI foundry service brings together generative AI model technologies, extensive training expertise, and a large-scale AI factory, all integrated within Microsoft Azure. This setup allows companies worldwide to link their custom models with Microsoft’s leading cloud services.
In addition, Nvidia revealed new 8-billion-parameter models that will be part of the foundry service and announced plans to add next-generation GPUs to Microsoft Azure in the near future.
So, how does the AI foundry service benefit users on Azure? Enterprises using Azure will have access to all the crucial components needed to develop customized, business-focused generative AI applications in one place. This includes everything from Nvidia AI foundation models and the NeMo framework to the Nvidia DGX Cloud AI supercomputing service.
This comprehensive setup means any customer can complete the entire enterprise generative AI workflow directly on Azure. They can easily obtain the necessary technology components within the Azure ecosystem. Simply put, it’s a joint effort between Nvidia and Microsoft to provide seamless AI development solutions.
Nvidia is also adding a new family of Nemotron-3 8B models to offer enterprises a broader selection of foundation models for use in Azure environments. These models support the creation of advanced enterprise chat and Q&A applications across various industries, including healthcare, telecommunications, and financial services. With multilingual capabilities, these models will be available through the Azure AI model catalog as well as via Hugging Face and the Nvidia NGC catalog.
Additional community foundation models in the Nvidia catalog include Llama 2 (soon to be available in the Azure AI catalog), Stable Diffusion XL, and Mistral 7B. Once users choose their preferred model, they can proceed to the training and deployment phases for custom applications using Nvidia DGX Cloud and AI Enterprise software, both accessible via the Azure marketplace. DGX Cloud offers rental instances that scale to thousands of Nvidia Tensor Core GPUs for training, along with the AI Enterprise toolkit, which includes the NeMo framework and Nvidia Triton Inference Server to enhance LLM customization.
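To make the starting point concrete, here is a minimal sketch of pulling one of those community checkpoints, Mistral 7B, from the Hugging Face Hub using the standard transformers API. The model ID is the public Hub identifier; the managed DGX Cloud and NeMo fine-tuning workflow that would follow this download is not shown, and device_map="auto" assumes the accelerate package is installed.

```python
# Sketch: fetching a community foundation model (Mistral 7B from the
# Hugging Face Hub) as the base checkpoint before customization. This
# uses the standard transformers API; the DGX Cloud / NeMo fine-tuning
# workflow itself is not shown here.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "mistralai/Mistral-7B-v0.1"  # public Hub identifier

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
# device_map="auto" spreads the weights across available devices;
# it requires the accelerate package to be installed.
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

inputs = tokenizer(
    "The key benefit of an AI foundry is", return_tensors="pt"
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```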
This toolkit is also available separately on the marketplace, and Nvidia noted that users can utilize their existing Microsoft Azure Consumption Commitment credits to speed up model development.
Last month, Nvidia also announced a comparable partnership with Oracle, allowing qualified enterprises to purchase the tools directly from the Oracle Cloud marketplace and train models for deployment on Oracle Cloud Infrastructure (OCI).
Currently, companies like SAP, Amdocs, and Getty Images are among the early adopters testing the foundry service on Azure to develop custom AI applications for various use cases.
In addition to the generative AI service, Nvidia and Microsoft expanded their partnership by introducing the latest Nvidia hardware. Specifically, Microsoft announced the new NC H100 v5 virtual machines for Azure, featuring PCIe-based H100 GPUs connected via Nvidia NVLink. These virtual machines offer nearly four petaflops of AI compute power and 188GB of faster HBM3 memory.
The Nvidia H100 NVL GPU can deliver up to 12 times higher performance on GPT-3 175B than the previous generation, making it well suited for both inference and mainstream training workloads. Furthermore, Nvidia plans to bring the new H200 Tensor Core GPU to Azure next year. This GPU offers 141GB of HBM3e memory (nearly 1.8 times its predecessor's capacity) and 4.8 TB/s of peak memory bandwidth (a roughly 1.4 times increase), making it purpose-built for the largest AI workloads, including generative AI training and inference.
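Those multipliers line up with the widely published H100 SXM baseline of 80GB of HBM3 and 3.35 TB/s of bandwidth. The quick check below uses those baseline figures, which come from public H100 specifications rather than this announcement.

```python
# Sanity-checking the H200 multipliers against the widely published
# H100 SXM baseline (80 GB HBM3, 3.35 TB/s). Baseline figures are from
# public H100 specs, not this announcement.
h100_mem_gb, h100_bw_tbs = 80, 3.35
h200_mem_gb, h200_bw_tbs = 141, 4.8

print(f"memory:    {h200_mem_gb / h100_mem_gb:.2f}x")  # ~1.76x, i.e. nearly 1.8x
print(f"bandwidth: {h200_bw_tbs / h100_bw_tbs:.2f}x")  # ~1.43x, i.e. roughly 1.4x
```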
This new GPU will join Microsoft’s Maia 100 AI accelerator, providing Azure users with multiple options for handling AI workloads.
To further enhance LLM work on Windows devices, Nvidia announced several updates, including one for TensorRT-LLM for Windows. Expected to ship later this month, the update adds support for new large language models such as Mistral 7B and Nemotron-3 8B and promises up to five times faster inference performance, making it easier to run these models on desktops and laptops equipped with GeForce RTX 30 Series and 40 Series GPUs with at least 8GB of VRAM.
Additionally, TensorRT-LLM for Windows will be compatible with OpenAI’s Chat API through a new wrapper, enabling numerous developer projects and applications to run locally on a Windows 11 PC with RTX, rather than relying on cloud-based solutions.
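In practice, that wrapper should let existing code built on the OpenAI Python client switch to a local endpoint by changing little more than the base URL. The sketch below assumes an OpenAI-compatible server listening on localhost; the port, API key handling, and model name are placeholder assumptions, not values from the announcement.

```python
# Sketch: pointing the standard OpenAI Python client at a local
# OpenAI-compatible endpoint instead of the hosted API. The base URL,
# port, and model name are placeholder assumptions; actual values
# depend on how the TensorRT-LLM wrapper is configured.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed local endpoint
    api_key="not-needed-locally",         # local servers typically ignore this
)

response = client.chat.completions.create(
    model="mistral-7b",  # placeholder model name
    messages=[{"role": "user", "content": "Summarize today's Ignite news."}],
)
print(response.choices[0].message.content)
```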