The post Exploring Hybrid CPU/GPU LLM Inference appeared first on Puget Systems.