News Posts matching #LLM

Qualcomm Announces the Snapdragon 7+ Gen 3, Featuring Exceptional On-Device AI Capabilities

Qualcomm Technologies, Inc., today unveiled the Snapdragon 7+ Gen 3 Mobile Platform, bringing on-device generative AI to the Snapdragon 7 series. The platform supports a wide range of AI models, including large language models (LLMs) such as Baichuan-7B, Llama 2, and Gemini Nano. Fueling extraordinary entertainment capabilities, Snapdragon 7+ Gen 3 also brings select new Snapdragon Elite Gaming features to the 7-series, including the Game Post Processing Accelerator and Adreno Frame Motion Engine 2, enhancing game effects and upscaling gaming content for desktop-level visuals. The platform also brings top-notch photography features with Qualcomm's industry-leading 18-bit cognitive ISP.

"Today, we embark on the latest expansion in the 7-series to create new levels of entertainment for consumers - integrating next-generation technologies for richer experiences," said Chris Patrick, senior vice president and general manager of mobile handsets, Qualcomm Technologies, Inc. "Snapdragon 7+ Gen 3 is packed with support for incredible on-device generative AI features and provides incredible performance and power efficiency, while bringing Wi-Fi 7 to the Snapdragon 7 Series for the first time."

Samsung Roadmaps UFS 5.0 Storage Standard, Predicts Commercialization by 2027

Mobile tech tipster Revegnus has highlighted an interesting Samsung presentation slide—according to machine translation, the company's electronics division is already responding to anticipated growth in "client-side large language model" service development. This market trend will demand improved Universal Flash Storage (UFS) interface speeds—Samsung engineers are currently engaged in "developing a new product that uses UFS 4.0 technology, but increases the number of channels from the current 2 to 4." The upcoming "more advanced" UFS 4.0 storage chips could be beefy enough to be utilized alongside next-gen mobile processors in 2025. For example, ARM is gearing up "Blackhawk," the Cortex-X4's successor—industry watchers reckon that the semiconductor firm's new core is designed to deliver "great Large Language Model (LLM) performance" on future smartphones. Samsung's roadmap outlines another major R&D goal, but this prospect is far from finalization—the chart reveals an anticipated 2027 rollout. The slide's body text included a brief teaser: "at the same time, we are also actively participating in discussions on the UFS 5.0 standard."
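
For a rough sense of what a two-to-four channel jump could mean, here is a back-of-the-envelope estimate in Python. The 23.2 Gbps per-lane line rate is the published UFS 4.0 figure; the four-lane number is a simple linear extrapolation for illustration, not a Samsung specification.

```python
# Back-of-the-envelope UFS interface bandwidth, assuming linear scaling with lane count.
UFS4_LINE_RATE_GBPS = 23.2  # published per-lane line rate for UFS 4.0

def interface_bandwidth_gbs(lanes: int, line_rate_gbps: float = UFS4_LINE_RATE_GBPS) -> float:
    """Raw interface bandwidth in GB/s, ignoring protocol overhead."""
    return lanes * line_rate_gbps / 8  # bits to bytes

print(f"2-lane UFS 4.0:        ~{interface_bandwidth_gbs(2):.1f} GB/s")
print(f"4-lane 'advanced' UFS: ~{interface_bandwidth_gbs(4):.1f} GB/s")
```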

Tiny Corp. Pauses Development of AMD Radeon GPU-based Tinybox AI Cluster

George Hotz and his Tiny Corporation colleagues were pinning their hopes on AMD delivering some good news earlier this month. The development of the "TinyBox" AI compute cluster hit some major roadblocks a couple of weeks ago—at the time, Radeon RX 7900 XTX GPU firmware was not gelling with Tiny Corp.'s setup. Hotz expressed "70% confidence" in AMD approving the open-sourcing of certain bits of firmware. At the time of writing, this has not transpired—this week the Tiny Corp. social media account has, once again, switched to an "all guns blazing" mode. Hotz and Co. publicly disclosed a few weeks ago that they were dabbling with Intel Arc graphics cards. NVIDIA hardware is another possible route, according to freshly posted open thoughts.

Yesterday, it was confirmed that the young startup had paused its utilization of XFX Speedster MERC310 RX 7900 XTX graphics cards: "the driver is still very unstable, and when it crashes or hangs we have no way of debugging it. We have no way of dumping the state of a GPU. Apparently it isn't just the MES causing these issues, it's also the Command Processor (CP). After seeing how open Tenstorrent is, it's hard to deal with this. With Tenstorrent, I feel confident that if there's an issue, I can debug and fix it. With AMD, I don't." The $15,000 TinyBox system relies on "cheaper" gaming-oriented GPUs rather than traditional enterprise solutions—this oddball approach has attracted a number of customers, but the latest announcements likely signal another delay. The tweet went on to state: "we are exploring Intel, working on adding Level Zero support to tinygrad. We also added a $400 bounty for XMX support. We are also (sadly) exploring a 6x GeForce RTX 4090 GPU box. At least we know the software is good there. We will revisit AMD once we have an open and reproducible build process for the driver and firmware. We are willing to dive really deep into hardware to make it amazing. But without access, we can't."

Ubisoft Exploring Generative AI, Could Revolutionize NPC Narratives

Have you ever dreamed of having a real conversation with an NPC in a video game? Not just one gated within a dialogue tree of pre-determined answers, but an actual conversation, conducted through spontaneous action and reaction? Lately, a small R&D team at Ubisoft's Paris studio, working with Nvidia's Audio2Face application and Inworld's Large Language Model (LLM), has been experimenting with generative AI in an attempt to turn this dream into a reality. Their project, NEO NPC, uses GenAI to prod at the limits of how a player can interact with an NPC without breaking the authenticity of the situation they are in, or the character of the NPC itself.

Considering that word—authenticity—the project has had to be a hugely collaborative effort across artistic and scientific disciplines. Generative AI is a hot topic of conversation in the videogame industry, and Senior Vice President of Production Technology Guillemette Picard is keen to stress that the goal behind all genAI projects at Ubisoft is to bring value to the player; and that means continuing to focus on human creativity behind the scenes. "The way we worked on this project, is always with our players and our developers in mind," says Picard. "With the player in mind, we know that developers and their creativity must still drive our projects. Generative AI is only of value if it has value for them."

NVIDIA Digital Human Technologies Bring AI Characters to Life

NVIDIA announced today that leading AI application developers across a wide range of industries are using NVIDIA digital human technologies to create lifelike avatars for commercial applications and dynamic game characters. The results are on display at GTC, the global AI conference held this week in San Jose, Calif., and can be seen in technology demonstrations from Hippocratic AI, Inworld AI, UneeQ and more.

NVIDIA Avatar Cloud Engine (ACE) for speech and animation, NVIDIA NeMo for language, and NVIDIA RTX for ray-traced rendering are the building blocks that enable developers to create digital humans capable of AI-powered natural language interactions, making conversations more realistic and engaging.

NVIDIA Blackwell Platform Arrives to Power a New Era of Computing

Powering a new era of computing, NVIDIA today announced that the NVIDIA Blackwell platform has arrived—enabling organizations everywhere to build and run real-time generative AI on trillion-parameter large language models at up to 25x less cost and energy consumption than its predecessor.

The Blackwell GPU architecture features six transformative technologies for accelerated computing, which will help unlock breakthroughs in data processing, engineering simulation, electronic design automation, computer-aided drug design, quantum computing and generative AI—all emerging industry opportunities for NVIDIA.

Gigabyte Unveils Comprehensive and Powerful AI Platforms at NVIDIA GTC

GIGABYTE Technology and Giga Computing, a subsidiary of GIGABYTE and an industry leader in enterprise solutions, will showcase their solutions at GIGABYTE booth #1224 at NVIDIA GTC, a global AI developer conference running through March 21. The event gives GIGABYTE the chance to connect with its valued partners and customers, and to explore together what the future of computing holds.

The GIGABYTE booth will focus on GIGABYTE's enterprise products that demonstrate AI training and inference delivered by versatile computing platforms based on NVIDIA solutions, as well as direct liquid cooling (DLC) for improved compute density and energy efficiency. Also not to be missed at the NVIDIA booth is the MGX Pavilion, which features a rack of GIGABYTE servers for the NVIDIA GH200 Grace Hopper Superchip architecture.

Report: Apple to Use Google's Gemini AI for iPhones

In a world where the largest companies are riding the AI train, the biggest of them all—Apple—has seemed to stay quiet for a while. Even with many companies announcing their own systems and models, Apple has stayed relatively silent about the use of LLMs in its products. However, according to Bloomberg, Apple is not pushing out an AI model of its own; rather, it will license Google's leading Gemini models for its iPhone smartphones. Gemini is Google's flagship AI model family with three tiers: Gemini Nano 1/2, Gemini Pro, and Gemini Ultra. Gemini Nano 1 and Nano 2 are designed to run locally on hardware like smartphones, while Gemini Pro and Ultra run on Google's servers and are delivered to the device through an API over the internet.

Apple could use a local Gemini Nano for basic tasks while utilizing Gemini Pro or Ultra for more complex ones, with a router sending user input to whichever model is available. That way, users could use AI capabilities both online and offline. Since Apple is readying a suite of changes for iOS 18, backed by the Neural Engine inside its A-series Bionic chips, the LLM capabilities of Apple's iPhones could get a significant upgrade through the Google partnership. While we still don't know the size of the deal, it is surely a massive win for Google to tap into the millions of devices Apple ships every year, and for Apple to give its users a more optimized experience.
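
Bloomberg's report describes this routing only in broad strokes; purely as an illustration, a hypothetical dispatcher that prefers the on-device model and falls back to the cloud tier might look like the sketch below. Every name here is invented for the example: none of this is an Apple or Google API.

```python
# Hypothetical on-device vs. cloud routing sketch (illustrative only, not an Apple/Google API).
from dataclasses import dataclass

@dataclass
class Request:
    prompt: str
    complex_task: bool = False  # stand-in for whatever heuristic classifies the request

def run_on_device(req: Request) -> str:
    return f"[local Nano-class model] {req.prompt}"

def run_in_cloud(req: Request) -> str:
    return f"[cloud Pro/Ultra-class model] {req.prompt}"

def route(req: Request, online: bool) -> str:
    # Complex tasks go to the cloud when a connection exists;
    # everything else stays local, so it also works offline.
    if req.complex_task and online:
        return run_in_cloud(req)
    return run_on_device(req)

print(route(Request("Summarize this note"), online=False))
print(route(Request("Plan a two-week trip in detail", complex_task=True), online=True))
```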

MemVerge and Micron Boost NVIDIA GPU Utilization with CXL Memory

MemVerge, a leader in AI-first Big Memory Software, has joined forces with Micron to unveil a groundbreaking solution that leverages intelligent tiering of CXL memory, boosting the performance of large language models (LLMs) by offloading from GPU HBM to CXL memory. This innovative collaboration is being showcased in Micron booth #1030 at GTC, where attendees can witness firsthand the transformative impact of tiered memory on AI workloads.

Charles Fan, CEO and Co-founder of MemVerge, emphasized the critical importance of overcoming the bottleneck of HBM capacity. "Scaling LLM performance cost-effectively means keeping the GPUs fed with data," stated Fan. "Our demo at GTC demonstrates that pools of tiered memory not only drive performance higher but also maximize the utilization of precious GPU resources."
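
MemVerge and Micron have not detailed the exact offload mechanism in this announcement, but the capacity pressure itself is easy to illustrate: the key-value cache of a long-context LLM can outgrow HBM on its own. The model dimensions below are hypothetical, chosen only to show the arithmetic.

```python
# Rough KV-cache sizing for a hypothetical 70B-class decoder (illustrative numbers only).
layers, kv_heads, head_dim = 80, 8, 128
bytes_per_element = 2                # FP16
seq_len, batch = 32_768, 8

kv_bytes = 2 * layers * kv_heads * head_dim * bytes_per_element * seq_len * batch
print(f"KV cache: {kv_bytes / 2**30:.0f} GiB")
# ~80 GiB of cache alone would overflow a single 80 GB HBM GPU: the kind of
# spill-over that a tiered CXL memory pool is meant to absorb.
```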

MAINGEAR Introduces PRO AI Workstations Featuring aiDAPTIV+ For Cost-Effective Large Language Model Training

MAINGEAR, a leading provider of high-performance custom PC systems, and Phison, a global leader in NAND controllers and storage solutions, today unveiled groundbreaking MAINGEAR PRO AI workstations with Phison's aiDAPTIV+ technology. Specifically engineered to democratize Large Language Model (LLM) development and training for small and medium-sized businesses (SMBs), these ultra-powerful workstations incorporate aiDAPTIV+ technology to deliver supercomputer LLM training capabilities at a fraction of the cost of traditional AI training servers.

As the demand for large-scale generative AI models continues to surge and their complexity increases, the potential for LLMs also expands. However, this rapid advancement in LLM AI technology has led to a notable boost in hardware requirements, making model training cost-prohibitive and inaccessible for many small to medium businesses.

Cerebras & G42 Break Ground on Condor Galaxy 3 - an 8 exaFLOPs AI Supercomputer

Cerebras Systems, the pioneer in accelerating generative AI, and G42, the Abu Dhabi-based leading technology holding group, today announced the build of Condor Galaxy 3 (CG-3), the third cluster of their constellation of AI supercomputers, the Condor Galaxy. Featuring 64 of Cerebras' newly announced CS-3 systems - all powered by the industry's fastest AI chip, the Wafer-Scale Engine 3 (WSE-3) - Condor Galaxy 3 will deliver 8 exaFLOPs of AI with 58 million AI-optimized cores. The Cerebras and G42 strategic partnership already delivered 8 exaFLOPs of AI supercomputing performance via Condor Galaxy 1 and Condor Galaxy 2, each amongst the largest AI supercomputers in the world. Located in Dallas, Texas, Condor Galaxy 3 brings the current total of the Condor Galaxy network to 16 exaFLOPs.

"With Condor Galaxy 3, we continue to achieve our joint vision of transforming the worldwide inventory of AI compute through the development of the world's largest and fastest AI supercomputers," said Kiril Evtimov, Group CTO of G42. "The existing Condor Galaxy network has trained some of the leading open-source models in the industry, with tens of thousands of downloads. By doubling the capacity to 16exaFLOPs, we look forward to seeing the next wave of innovation Condor Galaxy supercomputers can enable." At the heart of Condor Galaxy 3 are 64 Cerebras CS-3 Systems. Each CS-3 is powered by the new 4 trillion transistor, 900,000 AI core WSE-3. Manufactured at TSMC at the 5-nanometer node, the WSE-3 delivers twice the performance at the same power and for the same price as the previous generation part. Purpose built for training the industry's largest AI models, WSE-3 delivers an astounding 125 petaflops of peak AI performance per chip.

ZOTAC Expands Computing Hardware with GPU Server Product Line for the AI-Bound Future

ZOTAC Technology Limited, a global leader in innovative technology solutions, expands its product portfolio with the GPU Server Series. The first series of products in ZOTAC's Enterprise lineup offers organizations affordable and high-performance computing solutions for a wide range of demanding applications, from core-to-edge inferencing and data visualization to model training, HPC modeling, and simulation.

The ZOTAC series of GPU Servers comes in a diverse range of form factors and configurations, featuring both Tower Workstations and Rack Mount Servers, as well as both Intel and AMD processor configurations. With support for up to 10 GPUs, modular design for easier access to internal hardware, a high space-to-performance ratio, and industry-standard features like redundant power supplies and extensive cooling options, ZOTAC's enterprise solutions can ensure optimal performance and durability, even under sustained intense workloads.

Cerebras Systems Unveils World's Fastest AI Chip with 4 Trillion Transistors and 900,000 AI cores

Cerebras Systems, the pioneer in accelerating generative AI, has doubled down on its existing world record for the fastest AI chip with the introduction of the Wafer Scale Engine 3. The WSE-3 delivers twice the performance of the previous record-holder, the Cerebras WSE-2, at the same power draw and for the same price. Purpose-built for training the industry's largest AI models, the 5 nm-based, 4-trillion-transistor WSE-3 powers the Cerebras CS-3 AI supercomputer, delivering 125 petaflops of peak AI performance through 900,000 AI-optimized compute cores.

Intel Gaudi2 Accelerator Beats NVIDIA H100 at Stable Diffusion 3 by 55%

Stability AI, the developers behind the popular Stable Diffusion generative AI model, have run some first-party performance benchmarks for Stable Diffusion 3 using popular data-center AI GPUs, including the NVIDIA H100 "Hopper" 80 GB, A100 "Ampere" 80 GB, and Intel's Gaudi2 96 GB accelerator. Unlike the H100, which is a super-scalar CUDA+Tensor core GPU, the Gaudi2 is purpose-built to accelerate generative AI and LLMs. Stability AI published its performance findings in a blog post, which reveals that the Intel Gaudi2 96 GB posts roughly 56% higher performance than the H100 80 GB.

With 2 nodes, 16 accelerators, and a constant batch size of 16 per accelerator (256 in all), the Intel Gaudi2 array is able to generate 927 images per second, compared to 595 images per second for the H100 array and 381 images per second for the A100 array, keeping accelerator and node counts constant. Scaling up to 32 nodes and 256 accelerators, at the same batch size of 16 per accelerator (total batch size of 4,096), the Gaudi2 array posts 12,654 images per second, or 49.4 images per second per device, compared to 3,992 images per second, or 15.6 images per second per device, for the older-gen A100 "Ampere" array.
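
The per-device figures follow directly from the array totals, which makes for a useful sanity check on Stability AI's published numbers:

```python
# Per-accelerator throughput derived from the array totals quoted above.
configs = {
    "Gaudi2, 16 accelerators":  (927, 16),
    "H100,   16 accelerators":  (595, 16),
    "A100,   16 accelerators":  (381, 16),
    "Gaudi2, 256 accelerators": (12_654, 256),
    "A100,   256 accelerators": (3_992, 256),
}
for name, (images_per_second, accelerators) in configs.items():
    print(f"{name}: {images_per_second / accelerators:.1f} images/s per device")

print(f"Gaudi2 vs H100 (16 accelerators): {927 / 595 - 1:.0%} faster")  # the ~56% figure above
```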

AMD Publishes User Guide for LM Studio - a Local AI Chatbot

AMD has caught up with NVIDIA and Intel in the race to get a locally run AI chatbot up and running on its respective hardware. Team Red's community hub welcomed a new blog entry on Wednesday—AI staffers published a handy "How to run a Large Language Model (LLM) on your AMD Ryzen AI PC or Radeon Graphics Card" step-by-step guide. Interested parties are best served by downloading the correct version of LM Studio. The CPU-bound Windows variant is designed for higher-end Phoenix and Hawk Point chips; compatible Ryzen AI PCs can deploy instances of a GPT-based, LLM-powered AI chatbot. The LM Studio ROCm technical preview functions similarly, but requires a Radeon RX 7000-series graphics card. Supported GPU targets include gfx1100, gfx1101, and gfx1102.

AMD believes that "AI assistants are quickly becoming essential resources to help increase productivity, efficiency or even brainstorm for ideas." The blog also puts a spotlight on LM Studio's offline functionality: "Not only does the local AI chatbot on your machine not require an internet connection—but your conversations stay on your local machine." The six-step guide invites curious members to experiment with a handful of large language models—most notably Mistral 7b and LLAMA v2 7b—and recommends selecting model options tagged "Q4 K M" (i.e., 4-bit quantization). You can learn about spooling up "your very own AI chatbot" in AMD's blog post.
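
AMD's guide stops at LM Studio's chat window, but recent LM Studio builds can also expose the loaded model through a local, OpenAI-compatible HTTP server (port 1234 by default; treat the exact endpoint as an assumption and check your version). A minimal sketch of querying it from Python:

```python
# Minimal sketch: query a model loaded in LM Studio via its local OpenAI-compatible server.
# Assumes the "Local Server" feature is enabled and listening on the default port.
import json
import urllib.request

payload = {
    "messages": [{"role": "user", "content": "Give me three uses for a local LLM."}],
    "temperature": 0.7,
}
request = urllib.request.Request(
    "http://localhost:1234/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    reply = json.load(response)

print(reply["choices"][0]["message"]["content"])
```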

Chinese Governing Bodies Reportedly Offering "Compute Vouchers" to AI Startups

Regional Chinese governments are attempting to prop up local AI startups with an intriguing "voucher" support system. A Financial Times article outlines "computing" support packages valued between "$140,000 to $280,000" for fledgling organizations involved in LLM training. Widespread shortages of AI chips and rising data center operating costs are cited as the main factors driving the rollout of strategic subsidies. The big three—Alibaba, Tencent, and ByteDance—are reportedly less willing to rent out their AI-crunching servers, as internal operations demand lengthy compute sessions. China's largest technology companies are believed to be hoarding the vast majority of NVIDIA AI hardware, while smaller competitors are believed to be fighting over table scraps. US trade restrictions have further escalated supply issues, with lower-performance, China-specific models being rejected—AMD's Instinct MI309 AI accelerator being the latest example.

The "computer voucher" initiative could be the first part of a wider scheme—reports suggest that regional governing bodies (including Shanghai) are devising another subsidy tier for domestic AI chips. Charlie Chai, an 86Research analyst, reckons that the initial support package is only a short-term solution. He shared this observation with FT: "the voucher is helpful to address the cost barrier, but it will not help with the scarcity of the resources." The Chinese government is reportedly looking into the creation of an alternative state-run system, that will become less reliant on a "Big Tech" data center model. A proposed "East Data West Computing" project could produce a more energy-efficient cluster of AI data centers, combined with a centralized management system.

IBM Announces Availability of Open-Source Mistral AI Model on watsonx

IBM announced the availability of the popular open-source Mixtral-8x7B large language model (LLM), developed by Mistral AI, on its watsonx AI and data platform, as it continues to expand capabilities to help clients innovate with IBM's own foundation models and those from a range of open-source providers. IBM offers an optimized version of Mixtral-8x7B that, in internal testing, was able to increase throughput—or the amount of data that can be processed in a given time period—by 50 percent when compared to the regular model. This could potentially cut latency by 35-75 percent, depending on batch size—speeding time to insights. This is achieved through a process called quantization, which reduces model size and memory requirements for LLMs and, in turn, can speed up processing to help lower costs and energy consumption.
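
The memory savings behind that claim are straightforward to estimate. Taking Mixtral-8x7B's roughly 46.7 billion total parameters (a commonly cited figure, used here as an approximation), weight storage alone shrinks as follows:

```python
# Approximate weight-memory footprint of Mixtral-8x7B at different precisions.
params = 46.7e9  # commonly cited total parameter count (approximate)

for label, bytes_per_param in [("FP16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    gib = params * bytes_per_param / 2**30
    print(f"{label}: ~{gib:.0f} GiB of weights")
# Smaller weights mean less data moved per token, which is where most of the
# throughput and latency gains from quantization come from.
```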

The addition of Mixtral-8x7B expands IBM's open, multi-model strategy to meet clients where they are and give them choice and flexibility to scale enterprise AI solutions across their businesses. Through decades-long AI research and development, open collaboration with Meta and Hugging Face, and partnerships with model leaders, IBM is expanding its watsonx.ai model catalog and bringing in new capabilities, languages, and modalities. IBM's enterprise-ready foundation model choices and its watsonx AI and data platform can empower clients to use generative AI to gain new insights and efficiencies, and create new business models based on principles of trust. IBM enables clients to select the right model for the right use cases and price-performance goals for targeted business domains like finance.

ServiceNow, Hugging Face & NVIDIA Release StarCoder2 - a New Open-Access LLM Family

ServiceNow, Hugging Face, and NVIDIA today announced the release of StarCoder2, a family of open-access large language models for code generation that sets new standards for performance, transparency, and cost-effectiveness. StarCoder2 was developed in partnership with the BigCode Community, managed by ServiceNow, the leading digital workflow company making the world work better for everyone, and Hugging Face, the most-used open-source platform, where the machine learning community collaborates on models, datasets, and applications. Trained on 619 programming languages, StarCoder2 can be further trained and embedded in enterprise applications to perform specialized tasks such as application source code generation, workflow generation, text summarization, and more. Developers can use its code completion, advanced code summarization, code snippets retrieval, and other capabilities to accelerate innovation and improve productivity.

StarCoder2 offers three model sizes: a 3-billion-parameter model trained by ServiceNow; a 7-billion-parameter model trained by Hugging Face; and a 15-billion-parameter model built by NVIDIA with NVIDIA NeMo and trained on NVIDIA accelerated infrastructure. The smaller variants provide powerful performance while saving on compute costs, as fewer parameters require less computing during inference. In fact, the new 3-billion-parameter model matches the performance of the original StarCoder 15-billion-parameter model. "StarCoder2 stands as a testament to the combined power of open scientific collaboration and responsible AI practices with an ethical data supply chain," emphasized Harm de Vries, lead of ServiceNow's StarCoder2 development team and co-lead of BigCode. "The state-of-the-art open-access model improves on prior generative AI performance to increase developer productivity and provides developers equal access to the benefits of code generation AI, which in turn enables organizations of any size to more easily meet their full business potential."
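
The models are published on Hugging Face under the BigCode organization; a minimal code-completion sketch with the 3-billion-parameter variant might look like the following (the model ID and generation settings are illustrative; check the model card for exact usage and license terms):

```python
# Minimal code-completion sketch with StarCoder2-3B via Hugging Face transformers.
# Model ID and settings are illustrative; consult the model card for exact usage.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bigcode/starcoder2-3b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = 'def read_csv_rows(path):\n    """Yield rows of a CSV file as dicts."""\n'
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```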

Supermicro Accelerates Performance of 5G and Telco Cloud Workloads with New and Expanded Portfolio of Infrastructure Solutions

Supermicro, Inc. (NASDAQ: SMCI), a Total IT Solution Provider for AI, Cloud, Storage, and 5G/Edge, delivers an expanded portfolio of purpose-built infrastructure solutions to accelerate performance and increase efficiency in 5G and telecom workloads. With one of the industry's most diverse offerings, Supermicro enables customers to expand public and private 5G infrastructures with improved performance per watt and support for new and innovative AI applications. As a long-term advocate of open networking platforms and a member of the O-RAN Alliance, Supermicro's portfolio incorporates systems featuring 5th Gen Intel Xeon processors, AMD EPYC 8004 Series processors, and the NVIDIA Grace Hopper Superchip.

"Supermicro is expanding our broad portfolio of sustainable and state-of-the-art servers to address the demanding requirements of 5G and telco markets and Edge AI," said Charles Liang, president and CEO of Supermicro. "Our products are not just about technology, they are about delivering tangible customer benefits. We quickly bring data center AI capabilities to the network's edge using our Building Block architecture. Our products enable operators to offer new capabilities to their customers with improved performance and lower energy consumption. Our edge servers contain up to 2 TB of high-speed DDR5 memory, 6 PCIe slots, and a range of networking options. These systems are designed for increased power efficiency and performance-per-watt, enabling operators to create high-performance, customized solutions for their unique requirements. This reassures our customers that they are investing in reliable and efficient solutions."

Intel Optimizes PyTorch for Llama 2 on Arc A770, Higher Precision FP16

Intel just announced optimizations for PyTorch (IPEX) to take advantage of the AI acceleration features of its Arc "Alchemist" GPUs. PyTorch is a popular machine learning library that is often associated with NVIDIA GPUs, but it is actually platform-agnostic. It can be run on a variety of hardware, including CPUs and GPUs. However, performance may not be optimal without specific optimizations. Intel offers such optimizations through the Intel Extension for PyTorch (IPEX), which extends PyTorch with optimizations specifically designed for Intel's compute hardware.

Intel released a blog post detailing how to run Meta AI's Llama 2 large language model on its Arc "Alchemist" A770 graphics card. The model requires 14 GB of GPU RAM, so a 16 GB version of the A770 is recommended. This development could be seen as a direct response to NVIDIA's Chat with RTX tool, which allows GeForce users with RTX 30-series "Ampere" and RTX 40-series "Ada" GPUs with more than 8 GB of VRAM to run PyTorch-LLM models on their graphics cards. NVIDIA achieves lower VRAM usage by distributing INT4-quantized versions of the models, while Intel uses a higher-precision FP16 version; in theory, this should not have a significant impact on the results. Intel's blog post provides instructions on how to set up Llama 2 inference with PyTorch (IPEX) on the A770.
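
The 14 GB figure lines up with simple weight-size arithmetic, assuming the 7-billion-parameter Llama 2 variant implied by that number (the parameter count is our assumption, not restated in the blog post):

```python
# Why an FP16 Llama 2 7B needs ~14 GB of VRAM while an INT4 build fits in far less.
params = 7e9  # assuming the 7B variant, consistent with the 14 GB figure above

for label, bytes_per_param in [("FP16 (Intel's approach)", 2.0),
                               ("INT4 (NVIDIA's Chat with RTX builds)", 0.5)]:
    print(f"{label}: ~{params * bytes_per_param / 1e9:.1f} GB of weights")
# Weight storage only; the KV cache and activations add further overhead on top.
```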

Google's Gemma Optimized to Run on NVIDIA GPUs, Gemma Coming to Chat with RTX

NVIDIA, in collaboration with Google, today launched optimizations across all NVIDIA AI platforms for Gemma—Google's state-of-the-art new lightweight 2 billion- and 7 billion-parameter open language models that can be run anywhere, reducing costs and speeding innovative work for domain-specific use cases.

Teams from the companies worked closely together to accelerate the performance of Gemma—built from the same research and technology used to create the Gemini models—with NVIDIA TensorRT-LLM, an open-source library for optimizing large language model inference, when running on NVIDIA GPUs in the data center, in the cloud and on PCs with NVIDIA RTX GPUs. This allows developers to target the installed base of over 100 million NVIDIA RTX GPUs available in high-performance AI PCs globally.

Groq LPU AI Inference Chip is Rivaling Major Players like NVIDIA, AMD, and Intel

AI workloads are split into two categories: training and inference. While training requires large compute and memory capacity, access speeds are not a significant contributor; inference is another story. During inference, the AI model must run extremely fast to serve the end user with as many tokens (words) as possible, answering their prompts faster. Groq, an AI chip startup that spent a long time in stealth mode, has been making major moves in providing ultra-fast inference speeds using its Language Processing Unit (LPU), designed for large language models (LLMs) like GPT, Llama, and Mistral. The Groq LPU is a single-core unit based on the Tensor-Streaming Processor (TSP) architecture, which achieves 750 TOPS at INT8 and 188 TeraFLOPS at FP16, with 320x320 fused dot-product matrix multiplication, in addition to 5,120 vector ALUs.

Alongside that massive concurrency, the Groq LPU offers 80 TB/s of bandwidth and 230 MB of local SRAM. All of this works together to deliver fantastic performance, which has been making waves online over the past few days. Serving the Mixtral 8x7B model at 480 tokens per second, the Groq LPU provides one of the leading inference numbers in the industry. In models like Llama 2 70B with a 4096-token context length, Groq can serve 300 tokens/s, while in the smaller Llama 2 7B with 2048 tokens of context, the Groq LPU can output 750 tokens/s. According to the LLMPerf Leaderboard, the Groq LPU is beating GPU-based cloud providers at inferencing Llama LLMs in configurations of anywhere from 7 to 70 billion parameters. In token throughput (output) and time to first token (latency), Groq leads the pack, achieving the highest throughput and second-lowest latency.
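
Turned around, those throughput figures translate into per-token generation times that help put the numbers in perspective:

```python
# Per-token generation time implied by Groq's quoted throughput figures.
quoted_tokens_per_second = {
    "Mixtral 8x7B": 480,
    "Llama 2 70B (4096-token context)": 300,
    "Llama 2 7B (2048-token context)": 750,
}
for model, tps in quoted_tokens_per_second.items():
    print(f"{model}: {tps} tok/s -> {1000 / tps:.1f} ms per token")
```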

Sony CEO Wants PlayStation Ecosystem to Expand into PC, AI & Cloud Territories

Kenichiro Yoshida—Sony Group Corporation Chairman, President and CEO—appeared as a guest on Norges Bank Investment Management's Good Company videocast late last year. News outlets have been slow to pick up on some interesting tidbits from the November 2023 interview—the Sony boss has discussed his gaming division's ambitions in the recent past, but host Nicolai Tangen managed to pry out a clearer picture of PlayStation's plans for the future. Yoshida-san has an all-encompassing vision for the brand: "In short, it will be ubiquitous. Wherever there is computing, users will be able to play their favorite games seamlessly; gamers will be able to find a place to play in different spaces. While PlayStation will remain our core product, we will expand our gaming experiences to PC, Mobile and Cloud." Gamers on the PC platform currently wait roughly two to three years for PlayStation-exclusive titles to break away from their home-console origins—it is encouraging to hear that a greater number of conversions could be in the pipeline (with shorter lead times, hopefully).

The discussion moved on to game subscription services—a hot topic as of late—and Yoshida seemed happy with his company's usual mode of operation: "Well, we do subscription business model. At the same time, people usually play one game at the time, so an all-you-can-eat type of many games may not be so valuable compared with video streaming services. We have kind of balanced a hybrid service on PlayStation Network: subscription as well as paid content." Microsoft is a market leader with its Xbox and PC Game Pass services, now bolstered by its takeover of Activision Blizzard—the Sony CEO remained calm regarding his firm's main rival: "Healthy competition is necessary for the Games Industry to grow and at Sony we believe it is important to provide gamers with different options to play so we will continue our efforts to achieve this."

AMD Instinct MI300X GPUs Featured in LaminiAI LLM Pods

LaminiAI appears to be one of AMD's first customers to receive a bulk order of Instinct MI300X GPUs—late last week, Sharon Zhou (CEO and co-founder) posted about the "next batch of LaminiAI LLM Pods" up and running with Team Red's cutting-edge CDNA 3 series accelerators inside. Her short post on social media stated: "rocm-smi...like freshly baked bread, 8x MI300X is online—if you're building on open LLMs and you're blocked on compute, lmk. Everyone should have access to this wizard technology called LLMs."

An attached screenshot of a ROCm System Management Interface (ROCm SMI) session showcases an individual Pod configuration sporting eight Instinct MI300X GPUs. According to official blog entries, LaminiAI has utilized bog-standard MI300 accelerators since 2023, so it is not surprising to see its partnership with AMD continue to grow. Industry predictions place the Instinct MI300X and MI300A as strong alternatives to NVIDIA's dominant H100 "Hopper" series—AMD stock is climbing on the back of encouraging financial analyst estimates.

Microsoft Sets 16 GB RAM as Minimum-Requirement for Copilot and Windows AI Features

Microsoft has reportedly set 16 GB as the minimum system requirement for AI PCs, a TrendForce market research report finds. To say that Microsoft has a pivotal role to play in PC hardware specs is an understatement. This year sees the introduction of the first "AI PCs," or PCs with on-device AI acceleration for several new features native to Windows 11 23H2, mainly Microsoft Copilot. From the looks of it, Copilot is receiving the highest corporate attention from Microsoft, as the company looks to integrate the AI chatbot that automates and generates work, into the mainstream PC. In fact, Microsoft is even pushing for a dedicated Copilot button on PC keyboards along the lines of the key that brings up the Start menu. The company's biggest move with Copilot will be the 2024 introduction of Copilot Pro, an AI assistant integrated with Office and 365, which the company plans to sell on a subscription basis alone.

Besides cloud-based acceleration, Microsoft's various AI features will rely on some basic hardware specs for local acceleration. One of them, of course, is the NPU, with Intel's AI Boost and AMD's Ryzen AI being introduced with their latest mobile processors. The other requirement is memory. AI acceleration is a highly memory-sensitive operation, and LLMs require a sizable amount of fast, frequently accessed memory. Microsoft has therefore arrived at 16 GB as the bare minimum amount of memory for not just native acceleration but also cloud-based Copilot AI features to work. This should see the notebooks of 2024 set 16 GB as their baseline memory spec, with commercial notebooks scaling up to 32 GB or even 64 GB, depending on organizational requirements. The development bodes particularly well for the DRAM industry.
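
As a rough illustration of why 16 GB is a sensible floor, consider a locally run 7-billion-parameter assistant at 4-bit precision running alongside the rest of the system. The model size follows from the parameter count; the runtime and OS overheads below are illustrative assumptions, not Microsoft's figures.

```python
# Illustrative memory budget for an "AI PC" running a local 7B assistant (assumed overheads).
model_weights_gb = 7e9 * 0.5 / 1e9   # 7B parameters at 4-bit -> ~3.5 GB of weights
runtime_overhead_gb = 1.5            # assumed KV cache plus inference-runtime working set
os_and_apps_gb = 6.0                 # assumed baseline for Windows 11 and everyday applications

total_gb = model_weights_gb + runtime_overhead_gb + os_and_apps_gb
print(f"~{total_gb:.0f} GB in use -> comfortable on 16 GB, tight on 8 GB")
```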