But economically, it is still much better to buy a lower spec't laptop and to pay a monthly subscription for AI.
However, I agree with the article that people will run big LLMs on their laptop N years down the line. Especially if hardware outgrows best-in-class LLM model requirements. If a phone could run a 512GB LLM model fast, you would want it.
> economically, it is still much better to buy a lower spec't laptop and to pay a monthly subscription for AI
Uber is economical, too; but folks prefer to own cars, sometimes multiple.
And just as there's a market for all kinds of vanity cars, fast sports cars, expensive supercars... I imagine PCs & laptops will have such a market, too: in probably less than a decade, maybe a £20k laptop running a 671b+ LLM locally will be the norm among pros.
When LLM use approaches this number, running one locally would be, yes. What you and the other commenter seem to miss is that "Uber" is a stand-in for cloud-based LLMs: someone else builds and owns those servers, runs the LLMs, pays the electricity bills... while its users find it "economical" to rent it.
(btw, taxis are considered economical in parts of the world where owning cars is a luxury)
One time I took an Uber to work because my car broke down and was in the shop, and the Uber driver (somewhat pointedly) commented that I must be really rich to commute to work via Uber, because Ubers are so expensive.
Then idk why they say that most laptops are bad at running LLMs, Apple has a huge marketshare in the laptop market and even their cheapest laptops are capable in that realm. And their PC competitors are more likely to be generously specced out in terms of included memory.
> However, for the average laptop that’s over a year old, the number of useful AI models you can run locally on your PC is close to zero.
Apple has a 10-18% market share for laptops. That's significant but it certainly isn't "most".
Most laptops can run at best a 7-14b model, even if you buy one with a high spec graphics chip. These are not useful models unless you're writing spam.
Most desktops have a decent amount of system memory but that can't be used for running LLMs at a useful speed, especially since the stuff you could run in 32-64GB RAM would need lots of interaction and hand holding.
And that's for the easy part, inference. Training is much more expensive.
Most laptops have 16GB of RAM or less. A little more than a year ago I think the base model Mac laptop had 8GB of RAM which really isn't fantastic for running LLMs.
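To put rough numbers on the "7-14b at best" and "16GB or less" claims in this subthread, here is a back-of-envelope sketch. It only counts the weights (parameters times bits per weight), and the 4 GB reserved for the OS, other apps and the KV cache is an assumption picked for illustration, not a measured figure.

```python
def weights_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8


def fits(params_billions: float, bits_per_weight: int, ram_gb: float,
         reserved_gb: float = 4.0) -> bool:
    """True if the weights fit in RAM after reserving space for everything else.

    reserved_gb is an assumed allowance for the OS, apps and KV cache.
    """
    return weights_gb(params_billions, bits_per_weight) <= ram_gb - reserved_gb


for ram in (8, 16):
    for params, bits in ((7, 4), (14, 4), (14, 8)):
        verdict = "fits" if fits(params, bits, ram) else "does not fit"
        print(f"{params}B @ {bits}-bit ({weights_gb(params, bits):.1f} GB) "
              f"on {ram} GB: {verdict}")
```

Under those assumptions an 8GB machine tops out around a 4-bit 7B model, and a 16GB machine around a 4-bit 14B model, which is roughly the ceiling described above.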
So I'm hearing a lot of people running LLMs on Apple hardware. But is there actually anything useful you can run? Does it run at a usable speed? And is it worth the cost? Because the last time I checked the answer to all three questions appeared to be no.
Though maybe it depends on what you're doing? (Although if you're doing something simple like embeddings, then you don't need the Apple hardware in the first place.)
You still need ridiculously high spec hardware, and at Apple’s prices, that isn’t cheap. Even if you can afford it (most won't), the local models you can run are still limited and they still underperform. It’s much cheaper to pay for a cloud solution and get significantly better results. In my opinion, the article is right. We need a better way to run LLMs locally.
> You still need ridiculously high spec hardware, and at Apple’s prices, that isn’t cheap.
You can easily run models like Mistral and Stable Diffusion in Ollama and Draw Things, and you can run newer models like Devstral (the MLX version) and Z Image Turbo with a little effort using LM Studio and Comfyui. It isn't as fast as using a good nVidia GPU or a cloud GPU but it's certainly good enough to play around with and learn more about it. I've written a bunch of apps that give me a browser UI talking to an API that's provided by an app running a model locally and it works perfectly well. I did that on an 8GB M1 for 18 months and then upgraded to a 24GB M4 Pro recently. I still have the M1 on my network for doing AI things in the background.
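For anyone wondering what "an API that's provided by an app running a model locally" can look like, here is a minimal sketch. It assumes Ollama is serving on its default port (11434) and that a model tagged `mistral` has already been pulled; the helper names `build_request` and `ask` are made up for illustration, and this is not the code from the apps mentioned above.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "mistral") -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, model: str = "mistral") -> str:
    """Send the prompt and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]
```

With the server up, `ask("Explain unified memory in one sentence.")` should return the model's reply as a string. A browser UI would make the same POST from the front end, or go through a small proxy.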
Yeah, any Mac system specced with a decent amount of RAM since the M1 will run LLMs locally very well. And that’s exactly how the built-in Apple Intelligence service works: when enabled, it downloads a smallish local model. Since all Macs since the M1 have very fast memory available to the integrated GPU, they’re very good at AI.
The article kinda sucks at explaining how NPUs aren’t really even needed, they just have potential to make things more efficient in the future rather than depending on the power consumption involved with running your GPU.
Strictly speaking, you don't need that much VRAM or even plain old RAM - just enough to store your context and model activations. It's just that as you run with less and less (V)RAM you'll start to bottleneck on things like SSD transfer bandwidth and your inference speed goes down to a crawl. But even that may or may not be an issue depending on your exact requirements: perhaps you don't need your answer instantly and can wait while it gets computed in the background. Or maybe you're running with the latest PCIe 5 storage which overall gives you comparable bandwidth to something like DDR3/DDR4 memory.
A lazy, easy cheap shot. But do you deny that these developments from the article are coming? Or that they'll still be here in 5 years?
- Addition of more—and faster—memory.
- Consolidation of memory.
- Combination of chips on the same silicon.
All of these are also happening for non AI reasons. The move to SoC that really started with the M1 wasn't because of AI, but unified memory being the default is something we will see in 5 years. Unlike 3D TV.
We just had a series of articles and sysadmin outcry that major vendors were bringing 8gb laptops back to standard models because of the ram prices. In the short term, we're seeing a reduction.
- People wanting more memory is not a novel feature. I am excited to find out how many people immediately want to disable the AI nonsense to free up memory for things they actually want to do.
- Same answer.
- I think the drive towards SOCs has been happening already. Apple's M-series utterly demolishes every PC chip apart from the absolute bleeding-edge available, includes dedicated memory and processors for ML tasks, and it's mature technology. Been there for years. To the extent PC makers are chasing this, I would say it's far more in response to that than anything to do with AI.
I was in the market for a laptop this month. Many new laptops now advertise AI features like this "HP OmniBook 5 Next Gen AI PC" which advertises:
"SNAPDRAGON X PLUS PROCESSOR - Achieve more everyday with responsive performance for seamless multitasking with AI tools that enhance productivity and connectivity while providing long battery life"
I don't want this garbage on my laptop, especially when it's running off its battery! Running AI on your laptop is like playing Starcraft Remastered on the Xbox or Factorio on your Steam Deck. I hear you can play DOOM on a pregnancy test too. Sure, you can, but it's just going to be a tedious, inferior experience.
Really, this is just a fine example of how overhyped AI is right now.
Laptop manufacturers are too desperate to cash in on the AI craze. There's nothing special about an 'AI PC'. It's just a regular PC with Windows Copilot... which is a standard Windows feature anyway.
> I don't want this garbage on my laptop, especially when it's running off its battery!
The one bit of good news is it's not going to impact your battery life because it doesn't do any on-device processing. It's just calling an LLM in the cloud.
That's not quite correct. Snapdragon chips that are advertised as being good for "AI" also come with the Hexagon DSP, which is now used for (or targeted at) AI applications. It's essentially a separate vector processor with large vector sizes.
> It's just a regular PC with Windows Copilot... which is a standard Windows feature anyway.
"AI PC" branded devices get "Copilot+" and additional crap that comes with that due to the NPU. Despite desktops having GPUs with up to 50x more TOPs than the requirement, they don't get all that for some reason https://www.thurrott.com/mobile/copilot-pc/323616/microsoft-...
Doesn't this lead to a lot of tension between the hardware makers and Microsoft?
MS wants everyone to run Copilot on their shiny new data centre, so they can collect the data on the way.
Laptop manufacturers are making laptops that can run an LLM locally, but there's no point in that unless there's a local LLM to run (and Windows won't have that because Copilot). Are they going to be pre-installing Llama on new laptops?
Are we going to see a new power user / normal user split? Where power users buy laptops with LLMs installed, that can run them, and normal folks buy something that can call Copilot?
It isn't just copilot that these laptops come with; manufacturers are already putting their own AI chat apps as well.
For example, the LG gram I recently got came with just such an app named Chat, though the "ai button" on the keyboard (really just right alt or control, I forget which) defaults to copilot.
If there's any tension at all, it's just who gets to be the default app for the "ai button" on the keyboard that I assume almost nobody actually uses.
> MS wants everyone to run Copilot on their shiny new data centre, so they can collect the data on the way.
MS doesn't care where your data is, they're happy to go digging through your C drive to collect/mine whatever they want, assuming you can avoid all the dark patterns they use to push you to save everything on OneDrive anyway and they'll record all your interactions with any other AI using Recall
It's just marketing. The laptop makers will market it as if your laptop power makes a difference knowing full well that it's offloaded to the cloud.
For a slightly more charitable perspective, agentic AI means that there is still a bunch of stuff happening on the local machine, it's just not the inference itself.
There's nothing special about it: Intel has lowered the bar for what counts as an AI PC so vendors can market it. Ollama can run a 4b model plenty fine on Tiger Lake with 8gb of classic RAM.
But unified memory IS truly what makes an AI ready PC. The Apple Silicon proves that. People are willing to pay the premium, and I suspect unified memory will still be around and bringing us benefits even if no one cares about LLMs in 5 years.
Even collecting and sending all that data to the cloud is going to drain battery life. I'd really rather my devices only do what I ask them to than have AI running the background all the time trying to be helpful or just silently collecting data.
Windows is going more and more into AI and embedding it into the core of the OS as much as it can. It’s not “an app”, even if that was true now it wouldn't be true for very long. The strategy is well communicated.
Unfortunately still loads of hurdles for most people.
AAA Games with anti-cheat that don't support Linux.
Video editing (DaVinci Resolve exists but is a pain to get up and running on many distros, KDenLive/OpenShot don't really cut it for most)
Adobe Suite (Photoshop/Lightroom specifically, and Premiere for Video Editing) - would like to see Affinity support Linux but hasn't happened so far. GIMP and DarkTable aren't really substitutions unless you pour a lot of time into them.
Tried moving to Linux on my laptop this past month, made it a month before a reinstall of Windows 11. Had issues with the WiFi chip (managed to fix it but had to edit config files deep in the system, not ideal); on Fedora with LUKS encryption, the keyboard wouldn't work to input the encryption key after a kernel update; and there's no Windows Hello-like support (face ID). Had the most success with EndeavourOS but running Arch is a chore for most.
It's getting there, best it's ever been, but there's still hurdles.
> AAA Games with anti-cheat that don't support Linux.
I really don't understand people that want to play games so badly that they are willing to install a literal rootkit on their devices. I can understand if you're a pro gamer but it feels stupid to do it otherwise.
According to my friends, Arc Raiders works well on Linux. So it's very much just a small selection of AAA games that don't, so that they can run anti-cheat that probably doesn't even work. Can you name a triple-A you want to play that Proton says is incompatible?
Gimp isn't a solution, sure but it works for what I need. Darktable does way more than I've ever wanted, so I can forgive it for the one time it crashed. Inkscape and blender both exceed my needs as well.
And Adobe is so user hostile, that I feel I need to call you a mean name to prove how I feel.... dummy!
Yes, I already feel bad, and I'm sorry. But trolling aside, listing applications that treat users like shit isn't a reason to stay on a platform that also treats you like shit.
I get it, sometimes being treated like shit is worth it because it's easier now that you're used to being disrespected. But an aversion to the effort it'd take to climb the learning curve of something different isn't a valid reason to help the disrespectful trash companies making the world worse recruit more people for them to treat like trash.
Just because you use it, doesn't make it worth recommending.
I don't really PC game anymore (not at the moment, anyway); I use my Xbox or play a few older games my laptop's iGPU can handle. Battlefield 6 is a big one recently that if I had a gaming PC set-up I'd probably want to play.
I know Adobe are... c-words, but their software is industry standard for a reason.
> Battlefield 6 is a big one recently that if I had a gaming PC set-up I'd probably want to play.
We definitely play very different games, I wouldn't touch it if you paid me. So I'm sure we both have a bit of sample bias in our expected rates of linux compatibility. Especially since EA is another company like Adobe. Also, the internet seems to think they have a cheating problem. I wonder how bad it really is, and if it's worth the cost of the anti-cheat.
They're industry standard because they were first. Not necessarily because they were better. They do have a feature set that's near impossible to beat, not even I can pretend like they don't. I'm just saying, respect and fairness is more important to me, than content aware fill ever will be.
The thing is nowhere near the performance of a MacBook, but it's silent and the battery lasts ages, which is a far cry from the same laptop with an Intel CPU, which is what many are running.
"Local AI" could be many different things. NPUs are too puny to run many recent models, such as image generation and llms. The article seems to gloss over many important details like this, for example the creative agency, what AI work are they doing?
> marketing firm Aigency Amsterdam, told me earlier this year that although she prefers macOS, her agency doesn’t use Mac computers for AI work.
re NPUs: they've been a marketing thing for years now, but I really have no idea how many of them are actually used when you run [whatever]. particularly after a year or two of software updates.
anyone have numbers? are they just an added expense that is supported for first party stuff for 6 months before they need a bigger model, or do they have staying power? clearly they are capable of being used to save power, but does anything do that in practice, in consumer hardware?
Was never really into Apple hardware (mainly the price), however I recently got an M1 Mac Mini and an iPhone for app development, and the inference speed for as you say, a 5 year old chip is actually crazy.
If they made the M series fully open for Linux (I know Asahi is working away) I probably would never buy another non-M series processor again.
I got an M1 Mac Mini somewhat recently as well, to replace my ~2012 Mac Mini that I use as a media center PC. And frankly, it's overkill. Used ones can be had for $200-$300 USD, lower side with cosmetic damage. An absolute steal, IMO.
You can still get an M1 Macbook Air at retail for $599 ($300 for refurbs), which is a Chromebook price for a laptop that is better in pretty much every respect than any Chromebook.
With the wild RAM prices, which btw are probably going to last through 2026, I expect 8 GB of RAM to be the new standard going forward.
32 GB ram will be for enthusiasts with deep pockets, and professionals. Anything over that, exclusively professionals.
The conspiracy theorist inside me is telling me that big AI companies like OpenAI would rather see that people are using their puny laptops as terminals / shells only, to reach sky-based models, than to let them have beefy laptops and local models.
> The conspiracy theorist inside me is telling me that big AI companies...
I don’t believe in conspiracies but I do believe in incentives sometimes lining up. Now that there is a RAM heavy cloud application, cloud providers are suddenly in direct competition with consumers for scarce resources, with the winner being able to control where people run their models.
The thing that is supposed to happen next is high-bandwidth flash. In theory, it could allow laptops to run the larger models without being extortionately costly, by loading directly from flash into the GPU (not by executing in flash)
But I haven't seen figures for the actual bandwidth yet, and no doubt it will be expensive to start with. The underlying flash technology has much higher read latency than DRAM, so it's not really clear (to me, at least) if they can deliver the speeds needed to remove the need to cache in VRAM just by increasing parallelism.
Video games have driven the need for hardware more than office work. Sadly games are already being scaled back and more time is being spent on optimization instead of content since consumers can't be expected to have the kind of RAM available they normally would and everyone will be forced to make do with whatever RAM they have for a long time.
That might not be the case. The kind of memory that will flood the second-hand market could not be the kind of memory we can stuff in laptops or even desktop systems.
I was musing this summer about whether I should get a refurbed Thinkpad P16 with 96GB of RAM to run VMs purely in memory. Now that 96GB of RAM costs as much as a second P16.
I feel you, so much. I was thinking of getting a second 64gb node for my homelab and I thought I’d save that money… now the RAM alone costs as much as the node, and I’m crying.
Lesson learned: you should always listen to that voice inside your head that says: “but I need it…” lol
I rebuilt a workstation after a failed motherboard a year ago. I was not very excited about being forced to replace it on a day's notice and cheaped out on the RAM (only got 32GB). This is like the third or fourth time I've taught myself the lesson to not pinch pennies when buying equipment/infrastructure assets. It's the second time the lesson was about RAM, so clearly I'm a slow learner.
By "we" do you mean consumers? No, "we" will get neither. This is an unexpected, irresistible opportunity to create a new class, by controlling the technology that people are required and desiring to use (large genAI) with a comprehensive moat: financial, legislative and technological. Why make affordable devices that enable at least partial autonomy? Of course the focus will be on better remote operation (networking, on-device secure computation, advancing a narrative that equates local computation with extremism and sociopathy).
I think only a small percentage of users care enough about running LLMs locally to pay for extra hardware for it, put up with slower and lower-quality responses, etc. It’ll never be as good as non-local offerings, and is more hassle.
I'm running GPT-OSS 120B on a MacBook Pro M3 Max w/128 GB. It is pretty good, not great, but better than nothing when the wifi on the plane basically doesn't work.
I feel like there's no point to get a graphics card nowadays. Clearly, graphics cards are optimized for graphics; they just happened to be good for AI but based on the increased significance of AI, I'd be surprised if we don't get more specialized chips and specialized machines just for LLMs. One for LLMs, a different one for stable diffusion.
With graphics processing, you need a lot of bandwidth to get stuff in and out of the graphics card for rendering on a high-resolution screen, lots of pixels, lots of refreshes, lots of bandwidth... With LLMs, a relatively small amount of text goes in and a relatively small amount of text comes out over a reasonably long amount of time. The amount of internal processing is huge relative to the size of input and output. I think NVIDIA and a few other companies already started going down that route.
But probably graphics cards will still be useful for stable diffusion; especially AI-generated videos as the inputs and output bandwidth is much higher.
First, GPGPU is powerful and flexible. You can make an "AI-specific accelerator", but it wouldn't be much simpler or much more power-efficient - while being a lot less flexible. And since you need to run traditional graphics and AI workloads both in consumer hardware? It makes sense to run both on the same hardware.
And bandwidth? GPUs are notorious for not being bandwidth starved. 4K@60FPS seems like a lot of data to push in or out, but it's nothing compared to how fast modern PCIe 5.0 x16 goes. AI accelerators are more of the same.
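A quick sanity check of the 4K@60 vs PCIe 5.0 x16 comparison, as a sketch. It assumes uncompressed 24-bit frames, and for the link it uses the nominal 32 GT/s per lane with 128b/130b encoding, ignoring all other protocol overhead, so the link figure is a theoretical ceiling per direction rather than a benchmark.

```python
def video_bytes_per_s(width: int, height: int, fps: int,
                      bytes_per_pixel: int = 3) -> int:
    """Raw data rate of an uncompressed video stream, in bytes per second."""
    return width * height * bytes_per_pixel * fps


def pcie_bytes_per_s(gigatransfers_per_s: float, lanes: int,
                     encoding: float = 128 / 130) -> float:
    """Theoretical one-direction link rate, counting only line-encoding overhead."""
    return gigatransfers_per_s * 1e9 * encoding * lanes / 8


uhd = video_bytes_per_s(3840, 2160, 60)   # 4K at 60 FPS, 24-bit color
link = pcie_bytes_per_s(32, 16)           # PCIe 5.0 x16

print(f"4K@60:        {uhd / 1e9:.2f} GB/s")
print(f"PCIe 5.0 x16: {link / 1e9:.1f} GB/s")
print(f"headroom:     {link / uhd:.0f}x")
```

That works out to about 1.5 GB/s for the frames against roughly 63 GB/s for the link.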
GPUs might not be bandwidth starved most of the time, but they absolutely are when generating text from an llm.
It’s the whole reason why low precision floating point numbers are being pushed by nvidia.
LLMs are enormously bandwidth hungry. You have to shuffle your 800GB neural network in and out of memory for every token, which can take more time/energy than actually doing the matrix multiplies. GPUs are almost not high bandwidth enough.
But even so, for a single user, the output rate for a very fast LLM would be like 100 tokens per second. With graphics, we're talking like 2 million pixels, 60 times a second; 120 million pixels per second for a standard high res screen. Big difference between 100 tokens vs 120 million pixels.
24 bit pixels gives 16 million possible colors... For tokens, it's probably enough to represent every word of the entire vocabulary of every major national language on earth combined.
> You have to shuffle your 800GB neural network in and out of memory
Do you really though? That seems more like a constraint imposed by graphics cards. A specialized AI chip would just keep the weights and all parameters in memory/hardware right where they are and update them in-situ. It seems a lot more efficient.
I think that it's because graphics cards have such high bandwidth that people decided to use this approach but it seems suboptimal.
But if we want to be optimal; then ideally, only the inputs and outputs would need to move in and out of the chip. This shuffling should be seen as an inefficiency; a tradeoff to get a certain kind of flexibility in the software stack... But you waste a huge amount of CPU cycles moving data between RAM, CPU cache and Graphics card memory.
It stays in the HBM, but it needs to get shuffled to the place where the computation can actually be done. It’s a lot like a normal CPU: the CPU can’t do anything with data in system memory, it has to be loaded into a CPU register.
For every token that is generated, a dense llm has to read every parameter in the model.
This doesn't seem right. Where is it shuffling to and from? My drives aren't fast enough to load the model every token that fast, and I don't have enough system memory to unload models to.
If you're using a MoE model like DeepSeek V3 the full model is 671 GB but only 37 GB are active per token, so it's more like running a 37 GB model from the memory bandwidth perspective. If you do a quant of that it could e.g. be more like 18 GB.
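The dense vs MoE point can be turned into a rough upper bound: if every generated token has to read all active weights once, decode speed can't exceed memory bandwidth divided by the bytes read per token. The sketch below uses the sizes quoted above; the 400 GB/s bandwidth is an assumed, illustrative number, and the estimate ignores compute time and KV-cache reads.

```python
def max_tokens_per_s(bandwidth_gb_s: float, gb_read_per_token: float) -> float:
    """Upper bound on decode speed when memory bandwidth is the only limit."""
    return bandwidth_gb_s / gb_read_per_token


BANDWIDTH_GB_S = 400.0  # assumption: illustrative figure, not a measurement

cases = {
    "dense, all 671 GB read per token": 671.0,
    "MoE, 37 GB active per token": 37.0,
    "MoE, quantized to ~18 GB active": 18.0,
}

for label, gb in cases.items():
    print(f"{label}: <= {max_tokens_per_s(BANDWIDTH_GB_S, gb):.1f} tokens/s")
```

With those inputs the bound goes from under 1 token/s for the dense case to about 11 and 22 tokens/s for the two MoE cases, which is why the active size matters more than the total size for speed (the total size still has to fit somewhere).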
There won't be a single time you can observe yourself carrying the weight of everything being moved out of the house because that's not what's happening. Instead you can observe yourself taking many tiny loads until everything is finally moved, at which point you yourself should not be loaded as a result of carrying things from the house anymore (but you may be loaded for whatever else you're doing).
Viewing active memory bandwidth can be more complicated than it'd seem to set up, so the easier way is to just view your VRAM usage as you load in the model freshly into the card. The "nvtop" utility can do this for most any GPU on Linux, as well as other stats you might care about as you watch LLMs run.
My confusion was on the shuffling process happening per token. If this was happening per token, it would be effectively the same as loading the model from disk every token.
I don't doubt that there will be specialized chips that make AI easier, but they'll be more expensive than the graphics cards sold to consumers, which means that a lot of companies will just go with graphics cards, either because the extra speed of specialized chips won't be worth the cost, or because they'll be flat out too expensive and priced for the small number of massive spenders who'll shell out insane amounts of money for any/every advantage (whatever they think that means) they can get over everyone else.
I’ve been running LLMs on my laptop (M3 Max 64GB) for a year now and I think they are ready, especially with how good mid sized models are getting. I’m pretty sure unified memory and energy efficient GPUs will be more than just a thing on Apple laptops in the next few years.
You doing code completion and agentic stuff successfully with local models? Got any tips? I've been out of the game for [checks watch] a few months and am behind on the latest. Is Cline the move?
Memory prices will rise short term and generally fall long term, even with the current supply hiccup the answer is to just build out more capacity (which will happen if there is healthy competition). I meant, I expect the other mobile chip providers to adopt unified architecture and beefy GPU cores on chip and lots of bandwidth to connect it to memory (at the max or ultra level, at least), I think AMD is already doing UM at least?
> Memory prices will rise short term and generally fall long term, even with the current supply hiccup the answer is to just build out more capacity (which will happen if there is healthy competition)
Don't worry! Sam Altman is on it. Making sure there never is healthy competition that is.
High margins are exactly what should create a strong incentive to build more capacity. But that dynamic has been tamped down so far because we're all scared of a possible AI bubble that might pop at any moment.
My recent shower thought was that Moore's law hasn't slowed at all, we just went multi-core. It's crazy that the Intel folks were so interested in optimizing for single-thread CPU design that they completely misunderstood where the best effort would be spent. If I had been around back then (speaking as an Elixir dev) I would have been way more interested in having 500-thread CPUs than getting down to nanometer-scale dies. That's what you get when everyone on the team is a bunch of C programmers.
Before LLMs, the use of parallelism on your typical laptop was limited to application level parallelism, e.g. one thread for Outlook and one for each tab in Chrome.
The "AI laptop" boom is already fading. It turns out that LLMs, local or otherwise, just aren't very useful.
Like Big Data, LLMs are useful in a small niche of areas, like poorly summarizing meeting notes, or grammar check at a middle-school level.
On LLMs for coding tasks: I asked a programmer why they loved Claude and he showed me the output. Twenty years ago, that kind of code would have gotten someone PIP'd. Today it's considered better than most junior programmers...which is a sign of how far programming standards have fallen, and explains why most programs and apps are such buggy pieces of sh$t these days.
This article is so dumb. It totally ignores the memory price explosion that will make large fast memory laptops unfeasible for years and states stuff like this:
> How many TOPS do you need to run state-of-the-art models with hundreds of millions of parameters? No one knows exactly. It’s not possible to run these models on today’s consumer hardware, so real-world tests just can’t be done.
We know exactly the performance needed for a given responsiveness. TOPS is just a measurement, independent of the type of hardware it runs on.
The fewer TOPS, the slower the model runs, so the user experience suffers. Memory bandwidth and latency play a huge role too. And so does context: increase the context and the LLM becomes much slower.
We don't need to wait for consumer hardware to know how much is needed. We can calculate that for given situations.
It also pretends small models are not useful at all.
I think the massive cloud investments will put pressure away from local AI unfortunately. That trend makes local memory expensive and all those cloud billions have to be made back so all the vendors are pushing for their cloud subscriptions. I'm sure some functions will be local but the brunt of it will be cloud, sadly.
> How many TOPS do you need to run state-of-the-art models with hundreds of millions of parameters? No one knows exactly.
Why not extrapolate from the open-source AIs which are available? The most powerful open-source AI (which I know of) is Kimi K2, which is >600gb. Running this at acceptable speed requires 600+gb of GPU/NPU memory. Even $2000-3000 AI-focused PCs like the DGX Spark or Strix Halo typically top out at 128gb. Frontier models will only run on something that costs many times a typical consumer PC, and that's only going to get worse with RAM pricing.
In 2010 the typical consumer PC had 2-4gb of RAM. Now the typical PC has 12-16gb. This suggests RAM size doubling perhaps every 5 years at best. If that's the case, we're 25-30 years away from the typical PC having enough RAM to run Kimi K2.
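The 25-30 year figure follows from simple doubling arithmetic, sketched below. It takes the comment's own inputs (12-16gb today, 600gb needed, one doubling every 5 years) as given; none of these are independent data.

```python
import math


def years_until(current_gb: float, target_gb: float,
                doubling_years: float = 5.0) -> float:
    """Years for RAM to grow from current_gb to target_gb at a fixed doubling rate."""
    return doubling_years * math.log2(target_gb / current_gb)


for start in (12, 16):
    print(f"from {start} GB to 600 GB: {years_until(start, 600):.1f} years")
```

Starting from 16gb that is a bit over 26 years, and from 12gb a bit over 28, so the estimate holds as long as the 5-year doubling assumption does.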
But the typical user will never need that much RAM for basic web browsing, etc. The typical computer RAM size is not going to keep growing indefinitely.
What about cheaper models? It may be possible to run a "good enough" model on consumer hardware eventually. But I suspect that for at least 10-15 years, typical consumers (HN readers may not be typical!) will prefer capability, cheapness, and especially reliability (not making mistakes) over being able to run the model locally. (Yes AI datacenters are being subsidized by investors; but they will remain cheaper, even if that ends, due to economies of scale.)
The economics dictate that AI PCs are going to remain a niche product, similar to gaming PCs. Useful AI capability is just too expensive to add to every PC by default. It's like saying flying is so important, everyone should own an airplane. For at least a decade, likely two, it's just not cost-effective.
> It may be possible to run a "good enough" model on consumer hardware eventually
10-15 years?!!!! What is the definition of good enough? Qwen3 8B or 30B-A3B are quite capable models which run on a lot of hardware even today. SOTA is not just getting bigger, it's also getting more intelligent and running more efficiently. There have been massive gains in intelligence at the smaller model sizes. It is just highly task dependent. Arguably some of these models are "good enough" already, and the level of intelligence and instruction following is much better than even 1 year ago. Sure, not Opus 4.5 level, but still much could be done without that level of intelligence.
"Good enough" has to mean users won't be frequently frustrated if they transition to it from a frontier model.
> it is highly task dependent... much could be done without that level of intelligence
This is an enthusiast's glass-half-full perspective, but casual end users are gonna have a glass-half-empty perspective. Qwen3-8B is impressive, but how many people use it as a daily driver? Most casual users will toss it as soon as it screws up once or twice.
The phrase you quoted in particular was imprecise (sorry) but my argument as a whole still stands. Replace "consumer hardware" with "typical PCs" - think $500 bestseller laptops from Walmart. AI PCs will remain niche luxury products, like gaming PCs. But gaming PCs benefit from being part of gaming culture and because cloud gaming adds input latency. Neither of these affects AI much.
You may be correct, but I wonder if we'll see Mac Mini sized external AI boxes that do have the 1TB of RAM and other hardware for running local models.
Maybe 100% of computer users wouldn't have one, but maybe 10-20% of power users would, including programmers who want to keep their personal code out of the training set, and so on.
I would not be surprised though if some consumer application made it desirable for each individual, or each family, to have local AI compute.
It's interesting to note that everyone owns their own computer, even though a personal computer sits idle half the day, and many personal computers hardly ever run at 80% of their CPU capacity. So the inefficiency of owning a personal AI server may not be as much of a barrier as it would seem.
> In 2010 the typical consumer PC had 2-4gb of RAM. Now the typical PC has 12-16gb. This suggests RAM size doubling perhaps every 5 years at best. If that's the case, we're 25-30 years away from the typical PC having enough RAM to run Kimi K2.
Part of the reason that RAM isn't growing faster is that there's no need for that much RAM at the moment. Technically you can put multiple TB of RAM in your machine, but no one does that because it's a complete waste of money [0]. Unless you're working in a specialist field, 16GB of RAM is enough, and adding more doesn't make anything noticeably faster.
But given a decent use-case, like running an LLM locally, and you'd find demand for lots more RAM, and that would drive supply, and new technology developments, and in ten years it'll be normal to have 128TB of RAM in a baseline laptop.
Of course, that does require that there is a decent use-case for running an LLM locally, and your point that that is not necessarily true is well-made. I guess we'll find out.
[0] apart from a friend of mine working on crypto who had a desktop Linux box with 4TB of RAM in it.
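The doubling estimate quoted above can be checked with a quick back-of-the-envelope sketch. The 5-year doubling period and the 16GB starting point come from the quoted comment; the 512GB and 1TB targets are my assumptions for what "enough RAM to run Kimi K2" might mean, not figures from the thread.

```python
import math

def years_to_reach(current_gb, target_gb, doubling_years=5.0):
    """Years for RAM to grow from current_gb to target_gb at a fixed doubling period."""
    return doubling_years * math.log2(target_gb / current_gb)

# Assumed targets (hypothetical): 512 GB and 1024 GB of RAM.
for target_gb in (512, 1024):
    print(target_gb, "GB:", years_to_reach(16, target_gb), "years")
```

Under those assumptions the 25-30 year range falls out directly; a shorter doubling period, which is what the reply above argues demand could produce, shrinks it proportionally.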
The problem with this is that NPUs have terrible, terrible support in the various software ecosystems because they are unique to their particular SoC or whatever. No consistency even within particular companies.
You don't understand the needs of a common laptop user. Define the use cases that require reaching for a laptop instead of using the phone that is nearby. Those use cases don't need an LLM for a common laptop user.
I mean, having a more powerful laptop is great, but at the same time, these guys are calling for a >10x increase in RAM and a far more powerful NPU. How will this affect pricing? How will it affect power management? It made it seem like most of the laptop will be dedicated to gen AI services, which I'm still not entirely convinced are quite THAT useful. I still want a cheap laptop that lasts all day and I also want to be able to tap that device's full power for heavy compute jobs!
The point is that when you run it on your own hardware you can feed the model your health data, bank statements and private journals and can be 5000% sure they’re not going anywhere
I've been playing around with my own home-built AI server for a couple months now. It is so much better than using a cloud provider. It is the difference between drag racing in your own car, and renting one from a dealership. You are going to learn far more doing things yourself. Your tools will be much more consistent and you will walk away with a far greater understanding of every process.
A basic last-generation PC with something like a 3060 (12GB) is more than enough to get started. My current rig pulls less than 500W with two cards (3060+5060). And, given the current temperature outside, the rig helps heat my home. So I am not contributing to global warming, water consumption, or any other datacenter-related environmental evil.
The author seems unaware of how well recent Apple laptops run LLMs. This is puzzling and puts into question the validity of anything in this article.
But economically, it is still much better to buy a lower spec't laptop and to pay a monthly subscription for AI.
However, I agree with the article that people will run big LLMs on their laptop N years down the line. Especially if hardware outgrows best-in-class LLM model requirements. If a phone could run a 512GB LLM model fast, you would want it.
> economically, it is still much better to buy a lower spec't laptop and to pay a monthly subscription for AI
Uber is economical, too; but folks prefer to own cars, sometimes multiple.
And just as there's a market for all kinds of vanity cars, fast sportscars, expensive supercars... I imagine PCs & laptops will have such a market, too: in probably less than a decade, maybe a £20k laptop running a 671b+ LLM locally will be the norm among pros.
Paying $30-$70/day to commute is economical?
> Paying $30-$70/day to commute is economical?
When LLM use approaches this number, running one locally would be, yes. What you and the other commenter seem to miss is that "Uber" is a stand-in for cloud-based LLMs: someone else builds and owns those servers, runs the LLMs, pays the electricity bills... while its users find it "economical" to rent it.
(btw, taxis are considered economical in parts of the world where owning cars is a luxury)
> Uber is economical, too
One time I took an Uber to work because my car broke down and was in the shop and the Uber driver (somewhat pointedly) made a comment that I must be really rich to commute to work via Uber because Ubers are so expensive
I think the author is aware of Apple silicon. The article mentions the fact Apple has unified memory and that this is advantageous for running LLMs.
Then idk why they say that most laptops are bad at running LLMs, Apple has a huge marketshare in the laptop market and even their cheapest laptops are capable in that realm. And their PC competitors are more likely to be generously specced out in terms of included memory.
> However, for the average laptop that’s over a year old, the number of useful AI models you can run locally on your PC is close to zero.
This straight up isn’t true.
Apple has a 10-18% market share for laptops. That's significant but it certainly isn't "most".
Most laptops can run at best a 7-14b model, even if you buy one with a high spec graphics chip. These are not useful models unless you're writing spam.
Most desktops have a decent amount of system memory but that can't be used for running LLMs at a useful speed, especially since the stuff you could run in 32-64GB RAM would need lots of interaction and hand holding.
And that's for the easy part, inference. Training is much more expensive.
Most laptops have 16GB of RAM or less. A little more than a year ago I think the base model Mac laptop had 8GB of RAM which really isn't fantastic for running LLMs.
So I'm hearing a lot of people running LLMs on Apple hardware. But is there actually anything useful you can run? Does it run at a usable speed? And is it worth the cost? Because the last time I checked the answer to all three questions appeared to be no.
Though maybe it depends on what you're doing? (Although if you're doing something simple like embeddings, then you don't need the Apple hardware in the first place.)
This paper shows a use case running on Apple silicon that’s theoretically valuable:
https://pmc.ncbi.nlm.nih.gov/articles/PMC12067846/
Who cares if result is right / wrong etc as it will all be different in a year … just interesting to see a test of desktop class hardware go ok.
I can definitely write code with a local model like Devstral small or a quantized granite, or a quantized deep-seek on an M1 Max w/ 64gb of ram.
Of course it depends what you’re doing.
Do you work offline often?
Essential.
By “PC”, they mean non-Apple devices.
Also, macOS only has around 10% desktop market share globally.
> Apple has a huge marketshare in the laptop market
Hello, from outside of California!
You still need ridiculously high spec hardware, and at Apple’s prices, that isn’t cheap. Even if you can afford it (most won't), the local models you can run are still limited and they still underperform. It’s much cheaper to pay for a cloud solution and get significantly better result. In my opinion, the article is right. We need a better way to run LLMs locally.
> You still need ridiculously high spec hardware, and at Apple’s prices, that isn’t cheap.
You can easily run models like Mistral and Stable Diffusion in Ollama and Draw Things, and you can run newer models like Devstral (the MLX version) and Z Image Turbo with a little effort using LM Studio and Comfyui. It isn't as fast as using a good nVidia GPU or a cloud GPU but it's certainly good enough to play around with and learn more about it. I've written a bunch of apps that give me a browser UI talking to an API that's provided by an app running a model locally and it works perfectly well. I did that on an 8GB M1 for 18 months and then upgraded to a 24GB M4 Pro recently. I still have the M1 on my network for doing AI things in the background.
749 for an M4 air at Amazon right now
Try running anything interesting on these 8gb of ram.
You need 96gb or 128gb to do non trivial things. That is not yet 749 usd
Fair enough, but they start at 16GB nowadays.
64gb is fine.
I bought my M1 Max w/ 64gb of ram used. It's not that expensive.
Yes, the models it can run do not perform like chatgpt or claude 4.5, but they're still very useful.
Yeah, any Mac system specced with a decent amount of RAM since the M1 will run LLMs locally very well. And that’s exactly how the built-in Apple Intelligence service works: when enabled, it downloads a smallish local model. Since all Macs since the M1 have very fast memory available to the integrated GPU, they’re very good at AI.
The article kinda sucks at explaining how NPUs aren’t really even needed, they just have potential to make things more efficient in the future rather than depending on the power consumption involved with running your GPU.
Only if you want to take all the proprietary baggage and telemetry that comes with Apple platforms by default.
A Lenovo T15g with a 16gb 3080 mobile doesn’t do too badly and will run more than just Windows.
"How many TOPS do you need to run state-of-the-art models with hundreds of millions of parameters? No one knows exactly."
What's he talking about? It's trivial to calculate that.
Isn't the ability to run it more dependent on (V)RAM? With TOPS just dictating the speed at which it runs?
Strictly speaking, you don't need that much VRAM or even plain old RAM - just enough to store your context and model activations. It's just that as you run with less and less (V)RAM you'll start to bottleneck on things like SSD transfer bandwidth and your inference speed goes down to a crawl. But even that may or may not be an issue depending on your exact requirements: perhaps you don't need your answer instantly and can wait while it gets computed in the background. Or maybe you're running with the latest PCIe 5 storage which overall gives you comparable bandwidth to something like DDR3/DDR4 memory.
A good rule of thumb is that PP (Prompt Processing) is compute bound while TG (Token Generation) is (V)RAM speed bound.
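For the compute-bound side of that rule of thumb, here is a rough sketch of the "trivial to calculate" estimate. It assumes the common approximation that a dense transformer does about 2 operations (one multiply, one add) per parameter per token; the model size and token rate are hypothetical, and real workloads add attention and other overheads on top.

```python
def required_tops(params, tokens_per_s, ops_per_param=2):
    """Approximate tera-operations per second for a dense model.

    Assumption: ~2 operations per parameter per token processed.
    """
    return params * ops_per_param * tokens_per_s / 1e12

# Hypothetical example: a 300-million-parameter model
# processing 1,000 prompt tokens per second.
print(required_tops(300e6, 1_000))
```

The same function applied to a model with hundreds of billions of parameters gives numbers three orders of magnitude larger, which is why the parameter count in the quoted sentence matters so much.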
It's also been done before...[0]
[0]: https://www.edge-ai-vision.com/2024/05/2024-edge-ai-and-visi...
It’s trivial to ask an AI to answer that. Well, I guess we know it’s not an AI generated article!
> state-of-the-art models
> hundreds of millions of parameters
lol
lmao, even
See: "3D TVs are driving the biggest change in TVs in decades"
A lazy, easy cheap shot. But do you deny that these aspects from the article are coming? Or that they will still be here in 5 years?
- Addition of more—and faster—memory.
- Consolidation of memory.
- Combination of chips on the same silicon.
All of these are also happening for non AI reasons. The move to SoC that really started with the M1 wasn't because of AI, but unified memory being the default is something we will see in 5 years. Unlike 3D TV.
We just had a series of articles and sysadmin outcry that major vendors were bringing 8gb laptops back to standard models because of the ram prices. In the short term, we're seeing a reduction.
Memory is absolutely not coming in the near future. Nobody can afford it.
> Addition of more—and faster—memory.
probably not after scam altman bought up half the world's supply for his shit company
> The move to SoC that really started with the M1
No it did not. There were numerous SoCs that came before it, and it was inevitable in this space.
In order:
- People wanting more memory is not a novel feature. I am excited to find out how many people immediately want to disable the AI nonsense to free up memory for things they actually want to do.
- Same answer.
- I think the drive towards SOCs has been happening already. Apple's M-series utterly demolishes every PC chip apart from the absolute bleeding-edge available, includes dedicated memory and processors for ML tasks, and it's mature technology. Been there for years. To the extent PC makers are chasing this, I would say it's far more in response to that than anything to do with AI.
This article is just saying more laptops will have power efficient GPUs in it. A bit better than 3D TVs.
They might not use Apple silicon often. Other options are encouraging.
I was in the market for a laptop this month. Many new laptops now advertise AI features like this "HP OmniBook 5 Next Gen AI PC" which advertises:
"SNAPDRAGON X PLUS PROCESSOR - Achieve more everyday with responsive performance for seamless multitasking with AI tools that enhance productivity and connectivity while providing long battery life"
I don't want this garbage on my laptop, especially when it's running off its battery! Running AI on your laptop is like playing Starcraft Remastered on the Xbox or Factorio on your Steam Deck. I hear you can play DOOM on a pregnancy test too. Sure, you can, but it's just going to be a tedious, inferior experience.
Really, this is just a fine example of how overhyped AI is right now.
Laptop manufacturers are too desperate to cash in on the AI craze. There's nothing special about an 'AI PC'. It's just a regular PC with Windows Copilot... which is a standard Windows feature anyway.
> I don't want this garbage on my laptop, especially when it's running off its battery!
The one bit of good news is it's not going to impact your battery life because it doesn't do any on-device processing. It's just calling an LLM in the cloud.
That's not quite correct. Snapdragon chips that are advertised as being good for "AI" also come with the Hexagon DSP, which is now used for (or targeted at) AI applications. It's essentially a separate vector processor with large vector sizes.
> It's just a regular PC with Windows Copilot... which is a standard Windows feature anyway.
"AI PC" branded devices get "Copilot+" and additional crap that comes with that due to the NPU. Despite desktops having GPUs with up to 50x more TOPs than the requirement, they don't get all that for some reason https://www.thurrott.com/mobile/copilot-pc/323616/microsoft-...
Is Microsoft trying to help NPU chip makers?
When is Wintel going to finally happen?
Microsoft has roughly $102 billion in cash (+ short-term investments). Intel’s market value is approximately $176 billion.
I've never really understood why Microsoft helped Intel's bottom line over decades.
With Azure, Microsoft has even more reason to buy Intel.
Doesn't this lead to a lot of tension between the hardware makers and Microsoft?
MS wants everyone to run Copilot on their shiny new data centre, so they can collect the data on the way.
Laptop manufacturers are making laptops that can run an LLM locally, but there's no point in that unless there's a local LLM to run (and Windows won't have that because Copilot). Are they going to be pre-installing Llama on new laptops?
Are we going to see a new power user / normal user split? Where power users buy laptops with LLMs installed, that can run them, and normal folks buy something that can call Copilot?
Any ideas?
It isn't just copilot that these laptops come with; manufacturers are already putting their own AI chat apps as well.
For example, the LG gram I recently got came with just such an app named Chat, though the "ai button" on the keyboard (really just right alt or control, I forget which) defaults to copilot.
If there's any tension at all, it's just who gets to be the default app for the "ai button" on the keyboard that I assume almost nobody actually uses.
Interesting. Yeah, that'll be the argument
> MS wants everyone to run Copilot on their shiny new data centre, so they can collect the data on the way.
MS doesn't care where your data is, they're happy to go digging through your C drive to collect/mine whatever they want, assuming you can avoid all the dark patterns they use to push you to save everything on OneDrive anyway and they'll record all your interactions with any other AI using Recall
I had assumed that they needed the usage to justify the investment in the data centre, but you could be right and they don't care.
Copilot is a local LLM (well SLM). https://learn.microsoft.com/en-us/windows/ai/apis/phi-silica
It's just marketing. The laptop makers will market it as if your laptop power makes a difference knowing full well that it's offloaded to the cloud.
For a slightly more charitable perspective, agentic AI means that there is still a bunch of stuff happening on the local machine, it's just not the inference itself.
There's nothing special here: Intel has lowered the bar for what counts as an AI PC so vendors can market it. Ollama can run a 4b model plenty fine on Tiger Lake with 8GB of classic RAM.
But unified memory IS truly what makes an AI ready PC. The Apple Silicon proves that. People are willing to pay the premium, and I suspect unified memory will still be around and bringing us benefits even if no one cares about LLMs in 5 years.
Even collecting and sending all that data to the cloud is going to drain battery life. I'd really rather my devices only do what I ask them to than have AI running the background all the time trying to be helpful or just silently collecting data.
Copilot is just ChatGPT as an app.
If you don't use it, it will have no impact on your device. And it's not sending your data to the cloud except for anything you paste into it.
So, the new AI features like recall don’t exist?
Windows is going more and more into AI and embedding it into the core of the OS as much as it can. It’s not “an app”, even if that was true now it wouldn't be true for very long. The strategy is well communicated.
>> I'd really rather my devices only do what I ask them to
Linux hears your cry. You have a choice. Make it.
Unfortunately still loads of hurdles for most people.
AAA Games with anti-cheat that don't support Linux.
Video editing (DaVinci Resolve exists but is a pain to get up and running on many distros, KDenLive/OpenShot don't really cut it for most)
Adobe Suite (Photoshop/Lightroom specifically, and Premiere for Video Editing) - would like to see Affinity support Linux but hasn't happened so far. GIMP and DarkTable aren't really substitutions unless you pour a lot of time into them.
Tried moving to Linux on my laptop this past month, made it a month before a reinstall of Windows 11. Had issues with the WiFi chip (managed to fix but had to edit config files deep in the system, not ideal); on Fedora with LUKS encryption, the keyboard wouldn't work to input the encryption key after a kernel update; and there's no Windows Hello-like support (face ID). Had the most success with EndeavourOS but running Arch is a chore for most.
It's getting there, best it's ever been, but there's still hurdles.
> AAA Games with anti-cheat that don't support Linux.
I really don't understand people that want to play games so badly that they are willing to install a literal rootkit on their devices. I can understand if you're a pro gamer but it feels stupid to do it otherwise.
Most of the time they're not really informed that they are. I know Valorant does (Riot Games), one I've avoided in the past because of it.
But a lot of the time it's peer-pressure for wanting to play with friends who couldn't care less.
Riot Vanguard is a popular rootkit.
According to my friends, Arc Raiders works well on Linux. So it's very much just a small selection of AAA games, so they can run anti-cheat that probably doesn't even work. Can you name a AAA game you want to play that Proton says is incompatible?
Gimp isn't a solution, sure but it works for what I need. Darktable does way more than I've ever wanted, so I can forgive it for the one time it crashed. Inkscape and blender both exceed my needs as well.
And Adobe is so user hostile, that I feel I need to call you a mean name to prove how I feel.... dummy!
Yes, I already feel bad, and I'm sorry. But trolling aside, listing applications that treat users like shit, aren't reasons to stay on the platform that also treats you like shit.
I get it, sometimes, being treated like shit is worth it because it's easier now that you're used to being disrespected. But an aversion to the effort it'd take for you to climb the learning curve of something different, isn't valid reason to help the disrespectful trash companies making the world worse, recruit more people for them to treat like trash.
Just because you use it, doesn't make it worth recommending.
I don't really PC game anymore, use my Xbox or a few older games my laptop's iGPU can handle, not at the moment anyway. Battlefield 6 is a big one recently that if I had a gaming PC set-up I'd probably want to play.
I know Adobe are... c-words, but their software is industry standard for a reason.
> Battlefield 6 is a big one recently that if I had a gaming PC set-up I'd probably want to play.
We definitely play very different games, I wouldn't touch it if you paid me. So I'm sure we both have a bit of sample bias in our expected rates of linux compatibility. Especially since EA is another company like Adobe. Also, the internet seems to think they have a cheating problem. I wonder how bad it really is, and if it's worth the cost of the anti-cheat.
They're industry standard because they were first. Not necessarily because they were better. They do have a feature set that's near impossible to beat, not even I can pretend like they don't. I'm just saying, respect and fairness is more important to me, than content aware fill ever will be.
Also, doesn't the Adobe suite work on Linux?
I think older versions do, like CS6 through WINE.
Photoshop CC 2024 apparently works somewhat, but no GPU support and the removal tool doesn't work apparently.
https://appdb.winehq.org/objectManager.php?sClass=version&iI...
Basically, no.
Part of me is starting to think Valve is going to be the best thing to happen to Linux (in this regard) since Ubuntu.
AI PCs also have NPUs which I guess provide accelerated matmuls, albeit less accelerated than a good discrete GPU.
I have a Snapdragon laptop and it is the best I've ever had. But the NPU is really almost useless.
This is a nice companion to the article: https://www.pcworld.com/article/2965927/the-great-npu-failur...
Agreed, I have the ARM based T14s for work.
The thing is nowhere near the performance of a MacBook, but it's silent and the battery lasts ages, which is a far cry from the same laptop with an Intel CPU, which is what many are running.
Company removes a lot of the AI bloat though.
It’s true that the AI marketing is largely nonsense, but the NPUs also don’t hurt, and you don’t have to make use of them.
> Running AI on your laptop is like playing Starcraft Remastered on the Xbox
A great analogy because there is Starcraft for a console - Nintendo 64 - and it is quite awkward. Split-screen multiplayer included.
Factorio runs really well on the deck though...
But yeah, fresh install of OS is a must for any new computer.
"Local AI" could be many different things. NPUs are too puny to run many recent models, such as image generation and llms. The article seems to gloss over many important details like this, for example the creative agency, what AI work are they doing?
> marketing firm Aigency Amsterdam, told me earlier this year that although she prefers macOS, her agency doesn’t use Mac computers for AI work.
re NPUs: they've been a marketing thing for years now, but I really have no idea how many of them are actually used when you run [whatever]. particularly after a year or two of software updates.
anyone have numbers? are they just an added expense that is supported for first party stuff for 6 months before they need a bigger model, or do they have staying power? clearly they are capable of being used to save power, but does anything do that in practice, in consumer hardware?
This mostly just shows you how far behind the M1 (which came out 5 years ago) all the non Apple laptops are.
Was never really into Apple hardware (mainly the price), however I recently got an M1 Mac Mini and an iPhone for app development, and the inference speed for as you say, a 5 year old chip is actually crazy.
If they made the M series fully open for Linux (I know Asahi is working away) I probably would never buy another non-M series processor again.
I got an M1 Mac Mini somewhat recently as well, to replace my ~2012 Mac Mini that I use as a media center PC. And frankly, it's overkill. Used ones can be had for $200-$300 USD, lower side with cosmetic damage. An absolute steal, IMO.
You can still get an M1 Macbook Air at retail for $599 ($300 for refurbs), which is a Chromebook price for a laptop that is better in pretty much every respect than any Chromebook.
Outside of Apple laptops (and arguably the Ryzen AI MAX 390), an "AI ready" laptop is simply marketing speak for "is capable of making HTTP requests."
The price of RAM is going to throw a wrench at that
With the wild RAM prices, which btw are probably going to last through 2026, I expect 8 GB of RAM to be the new standard going forward.
32 GB ram will be for enthusiasts with deep pockets, and professionals. Anything over that, exclusively professionals.
The conspiracy theorist inside me is telling me that big AI companies like OpenAI would rather see that people are using their puny laptops as terminals / shells only, to reach sky-based models, than to let them have beefy laptops and local models.
Not if a few investigations into the foundries and their datacenter deals stop that.
I predict we will see compute-in-flash before we see cheap laptops with 128+ gigs of ram.
The thing that is supposed to happen next is high-bandwidth flash. In theory, it could allow laptops to run the larger models without being extortionately costly, by loading directly from flash into the GPU (not by executing in flash). But I haven't seen figures of the actual bandwidth yet, and no doubt to start with it will be expensive. The underlying technology of flash has much higher read latency than DRAM, so it's not really clear (to me, at least) if they can deliver the speeds needed to remove the need to cache in VRAM just by increasing parallelism.
There was a company that did compute-in-dram, which was recently acquired by Qualcomm: https://www.emergentmind.com/topics/upmem-pim-system
I can't tell if this is optimism for compute-in-flash or pessimism with how RAM has been going lately!
We’ve had “compute in flash” for a few years now: https://mythic.ai/product/
Yeah especially since what is happening in the memory market
Feast and famine.
In three years we will be swimming in more ram than we know what to do with.
Kind of feel that's already the case today... 4GB I find is still plenty for even business workloads.
Video games have driven the need for hardware more than office work. Sadly games are already being scaled back and more time is being spent on optimization instead of content since consumers can't be expected to have the kind of RAM available they normally would and everyone will be forced to make do with whatever RAM they have for a long time.
That might not be the case. The kind of memory that will flood the second-hand market might not be the kind of memory we can stuff in laptops or even desktop systems.
Memristors are (IME) missing from the news. They promised to act as both persistent storage and fast RAM.
If only memristors weren't vaporware that has "shown promise" for 3 decades now and went nowhere.
You could get 128gb ram laptops from the time ddr4 came around: workstation class laptops with 4 ram slots would happily take 128gb of memory.
The fact that nowadays there are few to no laptops with 4 RAM slots is entirely artificial.
I was musing this summer about whether I should get a refurbed Thinkpad P16 with 96GB of RAM to run VMs purely in memory. Now that 96GB of RAM costs as much as a second P16.
I feel you, so much. I was thinking of getting a second 64GB node for my homelab and I thought I’d save that money… now the RAM alone costs as much as the node, and I’m crying.
Lesson learned: you should always listen to that voice inside your head that says: “but I need it…” lol
I rebuilt a workstation after a failed motherboard a year ago. I was not very excited about being forced to replace it on a day's notice and cheaped out on the RAM (only got 32GB). This is like the third or fourth time I've taught myself the lesson to not pinch pennies when buying equipment/infrastructure assets. It's the second time the lesson was about RAM, so clearly I'm a slow learner.
By "we" do you mean consumers? No, "we" will get neither. This is an unexpected, irresistible opportunity to create a new class, by controlling the technology that people are required and desire to use (large genAI) with a comprehensive moat: financial, legislative and technological. Why make affordable devices that enable at least partial autonomy? Of course the focus will be on better remote operation (networking, on-device secure computation, advancing a narrative that equates local computation with extremism and sociopathy).
Push Washington to grill the foundries and their customers. Repeat until prices drop.
I think only a small percentage of users care enough about running LLMs locally to pay for extra hardware for it, put up with slower and lower-quality responses, etc. It’ll never be as good as non-local offerings, and is more hassle.
I'm running GPT-OSS 120B on a MacBook Pro M3 Max w/128 GB. It is pretty good, not great, but better than nothing when the wifi on the plane basically doesn't work.
I feel like there's no point to get a graphics card nowadays. Clearly, graphics cards are optimized for graphics; they just happened to be good for AI but based on the increased significance of AI, I'd be surprised if we don't get more specialized chips and specialized machines just for LLMs. One for LLMs, a different one for stable diffusion.
With graphics processing, you need a lot of bandwidth to get stuff in and out of the graphics card for rendering on a high-resolution screen, lots of pixels, lots of refreshes, lots of bandwidth... With LLMs, a relatively small amount of text goes in and a relatively small amount of text comes out over a reasonably long amount of time. The amount of internal processing is huge relative to the size of input and output. I think NVIDIA and a few other companies already started going down that route.
But probably graphics cards will still be useful for stable diffusion; especially AI-generated videos as the inputs and output bandwidth is much higher.
Nah, that's just plain wrong.
First, GPGPU is powerful and flexible. You can make an "AI-specific accelerator", but it wouldn't be much simpler or much more power-efficient - while being a lot less flexible. And since you need to run traditional graphics and AI workloads both in consumer hardware? It makes sense to run both on the same hardware.
And bandwidth? GPUs are notorious for not being bandwidth starved. 4K@60FPS seems like a lot of data to push in or out, but it's nothing compared to how fast modern PCIe 5.0 x16 goes. AI accelerators are more of the same.
GPUs might not be bandwidth starved most of the time, but they absolutely are when generating text from an LLM. It’s the whole reason why low precision floating point numbers are being pushed by NVIDIA.
LLMs are enormously bandwidth hungry. You have to shuffle your 800GB neural network in and out of memory for every token, which can take more time/energy than actually doing the matrix multiplies. GPUs are almost not high bandwidth enough.
But even so, for a single user, the output rate for a very fast LLM would be like 100 tokens per second. With graphics, we're talking like 2 million pixels, 60 times a second; 120 million pixels per second for a standard high res screen. Big difference between 100 tokens vs 120 million pixels.
24 bit pixels gives 16 million possible colors... For tokens, it's probably enough to represent every word of the entire vocabulary of every major national language on earth combined.
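For what it's worth, the gap being described here can be put into rough numbers. A quick sketch, where the 1920x1080 resolution and the 200,000-entry vocabulary are illustrative assumptions (the 60 Hz, 24-bit and 100 tokens/s figures come from the comments above). It only counts data crossing the user-facing boundary, not traffic inside the chip:

```python
import math

def display_bits_per_second(width, height, fps, bits_per_pixel=24):
    """Raw, uncompressed pixel stream sent to a display."""
    return width * height * fps * bits_per_pixel

def token_bits_per_second(tokens_per_second, vocab_size):
    """Bits needed to name one token ID per generated token."""
    bits_per_token = math.ceil(math.log2(vocab_size))
    return tokens_per_second * bits_per_token

# Assumed: 1920x1080 screen at 60 Hz with 24-bit colour.
display_rate = display_bits_per_second(1920, 1080, 60)

# Assumed: 100 tokens/s from a hypothetical 200,000-entry vocabulary.
token_rate = token_bits_per_second(100, 200_000)

print(f"display: {display_rate / 1e9:.2f} Gbit/s")
print(f"tokens:  {token_rate / 1e3:.2f} kbit/s")
print(f"ratio:   {display_rate / token_rate:,.0f}x")
```

Under those assumptions the display stream is about 3 Gbit/s while the token stream is under 2 kbit/s, so the external I/O really is tiny for text generation.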
> You have to shuffle your 800GB neural network in and out of memory
Do you really though? That seems more like a constraint imposed by graphics cards. A specialized AI chip would just keep the weights and all parameters in memory/hardware right where they are and update them in-situ. It seems a lot more efficient.
I think that it's because graphics cards have such high bandwidth that people decided to use this approach but it seems suboptimal.
But if we want to be optimal; then ideally, only the inputs and outputs would need to move in and out of the chip. This shuffling should be seen as an inefficiency; a tradeoff to get a certain kind of flexibility in the software stack... But you waste a huge amount of CPU cycles moving data between RAM, CPU cache and Graphics card memory.
> Do you really though?
Yes.
It stays in the HBM, but it needs to get shuffled to the place where the computation can actually be done. It’s a lot like a normal CPU. The CPU can’t do anything with data in the system memory; it has to be loaded into a CPU register. For every token that is generated, a dense LLM has to read every parameter in the model.
If we did that it would be much more expensive. Keeping all weights in SRAM is what Groq does, for example.
This doesn't seem right. Where is it shuffling to and from? My drives aren't fast enough to load the model every token that fast, and I don't have enough system memory to unload models to.
From VRAM to the tensor cores and back. On a modern GPU you can have 1-2 TB moving around inside the GPU every second.
This is why they use high bandwidth memory for VRAM.
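A rough way to see why that bandwidth matters: if every generated token requires reading all the weights once, then memory bandwidth divided by model size gives a ceiling on tokens per second. A minimal sketch, assuming batch size 1 and ignoring KV-cache reads, compute time and any on-chip caching. The function name is made up for illustration:

```python
def max_tokens_per_second(bandwidth_gb_per_s, weights_read_gb):
    """Upper bound on decode speed when each token streams the weights once.

    bandwidth_gb_per_s: memory bandwidth between VRAM and the compute units.
    weights_read_gb:    gigabytes of weights that must be read per token.
    """
    if weights_read_gb <= 0:
        raise ValueError("weights_read_gb must be positive")
    return bandwidth_gb_per_s / weights_read_gb

# Numbers taken from the thread: 1-2 TB/s of bandwidth, a 20 GB model
# and the hypothetical 800 GB model mentioned earlier.
for bandwidth in (1000, 2000):
    for model_gb in (20, 800):
        rate = max_tokens_per_second(bandwidth, model_gb)
        print(f"{bandwidth} GB/s, {model_gb} GB model: <= {rate:.2f} tokens/s")
```

Real throughput will land below these ceilings, but the scaling is the point: a model that is twice as big roughly halves the best-case speed on the same memory.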
This makes sense now, thanks!
If you're using a MoE model like DeepSeek V3 the full model is 671 GB but only 37 GB are active per token, so it's more like running a 37 GB model from the memory bandwidth perspective. If you do a quant of that it could e.g. be more like 18 GB.
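The distinction here is between what has to sit in memory and what has to be read per token. A small sketch using the figures from this comment, where the 0.5 factor is an assumed "half the bytes per weight" quantization and the helper name is hypothetical:

```python
def moe_footprint(total_gb, active_gb, quant_factor=1.0):
    """Split a mixture-of-experts model into capacity and per-token traffic.

    total_gb:     size of all experts together (must fit in memory).
    active_gb:    size of the experts actually used for one token.
    quant_factor: assumed size multiplier after quantization (0.5 = half).
    """
    return {
        "resident_gb": total_gb * quant_factor,
        "read_per_token_gb": active_gb * quant_factor,
    }

full = moe_footprint(671, 37)
quantized = moe_footprint(671, 37, quant_factor=0.5)

print("unquantized:", full)
print("quantized:  ", quantized)
```

So the quantized model still needs roughly 335 GB of memory to hold, even though each token only touches about 18.5 GB of it. MoE helps with speed much more than it helps with fitting the model on a laptop.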
You're probably not using an 800GB model.
It is right. The shuffling is from CPU memory to GPU memory, and from GPU memory to GPU. If you don’t have enough memory you can’t run the model.
How can I observe it being loaded into CPU memory? When I run a 20 GB model with ollama, htop reports 3 GB of total RAM usage.
Think of it like loading a moving truck where:
- The house is the disk
- You are the RAM
- The truck is the VRAM
There won't be a single time you can observe yourself carrying the weight of everything being moved out of the house because that's not what's happening. Instead you can observe yourself taking many tiny loads until everything is finally moved, at which point you yourself should not be loaded as a result of carrying things from the house anymore (but you may be loaded for whatever else you're doing).
Viewing active memory bandwidth can be more complicated than it'd seem to set up, so the easier way is to just view your VRAM usage as you load in the model freshly into the card. The "nvtop" utility can do this for most any GPU on Linux, as well as other stats you might care about as you watch LLMs run.
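If you'd rather log the numbers than watch an interactive tool, here is a sketch that polls VRAM usage from a script. It swaps nvtop for nvidia-smi, so it assumes an NVIDIA card with nvidia-smi on the PATH. The function names are made up, and the parser assumes the fields come back as plain integers in MiB (it will fail if the driver reports something like "[N/A]"):

```python
import subprocess
import time

# Ask nvidia-smi for used and total memory, one CSV line per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]

def parse_memory_line(line):
    """Turn a line like '1234, 24576' into (used_mib, total_mib)."""
    used, total = (int(field.strip()) for field in line.split(","))
    return used, total

def read_vram_mib():
    """Return a list of (used_mib, total_mib), one entry per GPU."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return [parse_memory_line(l) for l in result.stdout.splitlines() if l.strip()]

def watch(seconds=30, interval=1.0):
    """Print VRAM usage once per interval while a model loads elsewhere."""
    for _ in range(int(seconds / interval)):
        for index, (used, total) in enumerate(read_vram_mib()):
            print(f"GPU {index}: {used} / {total} MiB")
        time.sleep(interval)

# Example: call watch(seconds=60) in one terminal, then start the model
# in another and watch the "used" figure climb as the weights arrive.
```

You should see usage rise in steps up to roughly the model size and then stay flat during generation, which matches the truck analogy: the weights get moved once, not once per token.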
My confusion was on the shuffling process happening per token. If this was happening per token, it would be effectively the same as loading the model from disk every token.
Depends on the map_location arg in torch.load: the model might be loaded straight to GPU memory.
> Clearly, graphics cards are optimized for graphics; they just happened to be good for AI
I feel like the reverse has been true since after the Pascal era.
I don't doubt that there will be specialized chips that make AI easier, but they'll be more expensive than the graphics cards sold to consumers, which means that a lot of companies will just go with graphics cards, either because the extra speed of specialized chips won't be worth the cost, or because they'll be flat-out too expensive and priced for the small number of massive spenders who'll shell out insane amounts of money for any/every advantage (whatever they think that means) they can get over everyone else.
I’ve been running LLMs on my laptop (M3 Max 64GB) for a year now and I think they are ready, especially with how good mid sized models are getting. I’m pretty sure unified memory and energy efficient GPUs will be more than just a thing on Apple laptops in the next few years.
You doing code completion and agentic stuff successfully with local models? Got any tips? I've been out of the game for [checks watch] a few months and am behind on the latest. Is Cline the move?
Only because of Apple's unified memory architecture. The groundwork is there, we just need memory to be cheaper so we can fit 512+GB now ;)
Memory prices will rise short term and generally fall long term, even with the current supply hiccup the answer is to just build out more capacity (which will happen if there is healthy competition). I meant, I expect the other mobile chip providers to adopt unified architecture and beefy GPU cores on chip and lots of bandwidth to connect it to memory (at the max or ultra level, at least), I think AMD is already doing UM at least?
> Memory prices will rise short term and generally fall long term, even with the current supply hiccup the answer is to just build out more capacity (which will happen if there is healthy competition)
Don't worry! Sam Altman is on it. Making sure there never is healthy competition that is.
https://www.mooreslawisdead.com/post/sam-altman-s-dirty-dram...
We’ve been through multiple DRAM scarcity/surplus cycles in the last couple of decades. Why do we think it will be different now?
> Why do we think it will be different now?
Margins. AI usage can pay a lot more. Even if they sell less, they can still be more profitable.
In the past there wasn’t a high margin usage. Servers didn’t charge such a high premium.
High margins are exactly what should create a strong incentive to build more capacity. But that dynamic has been tamped down so far because we're all scared of a possible AI bubble that might pop at any moment.
My recent shower thought was the idea that Moore's law hasn't slowed at all, we just went multi-core. It's crazy that the Intel folks were so interested in optimizing for single-thread CPU design that they completely misunderstood where the best effort would be spent - if I had been around back then (speaking as an Elixir dev) I would have been way more interested in having 500-thread CPUs than getting down to nanometer scale dies. That's what you get when everyone on the team is a bunch of C programmers.
Before LLMs, the use of parallelism on your typical laptop was limited to application level parallelism, e.g. one thread for Outlook and one for each tab in Chrome.
The "AI laptop" boom is already fading. It turns out that LLMs, local or otherwise, just aren't very useful.
Like Big Data, LLMs are useful in a small niche of areas, like poorly summarizing meeting notes, or grammar check at a middle-school level.
On LLMs for coding tasks: I asked a programmer why they loved Claude and he showed me the output. Twenty years ago, that kind of code would have gotten someone PIP'd. Today it's considered better than most junior programmers...which is a sign of how far programming standards have fallen, and explains why most programs and apps are such buggy pieces of sh$t these days.
This article is so dumb. It totally ignores the memory price explosion that will make large fast memory laptops unfeasible for years and states stuff like this:
> How many TOPS do you need to run state-of-the-art models with hundreds of millions of parameters? No one knows exactly. It’s not possible to run these models on today’s consumer hardware, so real-world tests just can’t be done.
We know exactly the performance needed for a given responsiveness. TOPS is just a measurement, independent of the type of hardware it runs on.
The fewer TOPS, the slower the model runs, so the user experience suffers. Memory bandwidth and latency play a huge role too. And context: increase the context and the LLM becomes much slower.
We don't need to wait for consumer hardware to know how much is needed. We can calculate that for given situations.
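As a sketch of what such a calculation could look like: a common rule of thumb is that generating one token with a dense model costs about 2 operations per parameter (one multiply, one add). That figure, the 8-billion-parameter model, the 1 byte per weight and the 20 tokens/s target are all assumptions for illustration. The estimate ignores attention over long contexts and prompt processing:

```python
def required_tops(params, tokens_per_second, ops_per_param=2):
    """Compute needed for decoding, in tera-operations per second."""
    return params * ops_per_param * tokens_per_second / 1e12

def required_bandwidth_gb_s(params, bytes_per_param, tokens_per_second):
    """Memory traffic needed if every weight is read once per token."""
    return params * bytes_per_param * tokens_per_second / 1e9

# Hypothetical target: an 8-billion-parameter dense model, 8-bit weights,
# 20 tokens per second for a single user.
params = 8e9
target_rate = 20

print(f"compute:   {required_tops(params, target_rate):.2f} TOPS")
print(f"bandwidth: {required_bandwidth_gb_s(params, 1, target_rate):.0f} GB/s")
```

With these inputs the compute requirement is well under 1 TOPS while the memory traffic is 160 GB/s, which is why bandwidth tends to be the harder number to hit for single-user text generation.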
It also pretends small models are not useful at all.
I think the massive cloud investments will push the market away from local AI, unfortunately. That trend makes local memory expensive, and all those cloud billions have to be made back, so all the vendors are pushing their cloud subscriptions. I'm sure some functions will be local but the brunt of it will be cloud, sadly.
Horrible article. Low effort, low knowledge. Had no idea the bar was so low for an IEEE publication
The article is from mid-November (and probably was written even earlier), where the RAM price explosion wasn’t as striking yet.
Also, state of the art models have hundreds of _billions_ of parameters.
It tells you about their ambitions.
I suppose it depends on the model; for code it was useless. As a lossy copy of an interactive Wikipedia it could be ok. Not good or great, just ok.
Maybe for creative suggestions and editing it’d be ok.
Seems like wishful thinking.
> How many TOPS do you need to run state-of-the-art models with hundreds of millions of parameters? No one knows exactly.
Why not extrapolate from open-source AIs which are available? The most powerful open-source AI (which I know of) is Kimi K2, at >600 GB. Running this at acceptable speed requires 600+ GB of GPU/NPU memory. Even $2000-3000 AI-focused PCs like the DGX Spark or Strix Halo typically top out at 128 GB. Frontier models will only run on something that costs many times a typical consumer PC, and that is only going to get worse with RAM pricing.
In 2010 the typical consumer PC had 2-4gb of RAM. Now the typical PC has 12-16gb. This suggests RAM size doubling perhaps every 5 years at best. If that's the case, we're 25-30 years away from the typical PC having enough RAM to run Kimi K2.
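The extrapolation can be checked with a couple of logarithms. A quick sketch, assuming "now" is about 15 years after 2010, taking the midpoints of the quoted RAM ranges (3 GB and 14 GB), and using 600 GB as the target from the Kimi K2 figure above:

```python
import math

def years_per_doubling(start_gb, end_gb, years):
    """Observed doubling period given growth from start_gb to end_gb."""
    return years / math.log2(end_gb / start_gb)

def years_until(current_gb, target_gb, doubling_years):
    """Years needed to grow from current_gb to target_gb at a fixed doubling period."""
    return math.log2(target_gb / current_gb) * doubling_years

# Assumed midpoints: ~3 GB in 2010, ~14 GB about 15 years later.
observed = years_per_doubling(3, 14, 15)

# Using the comment's more generous 5-year doubling from a 16 GB machine.
wait = years_until(16, 600, 5)

print(f"observed doubling period: {observed:.1f} years")
print(f"years until 600 GB at 5-year doubling: {wait:.1f}")
```

With those inputs the observed doubling period comes out closer to 7 years than 5, so "doubling every 5 years at best" is on the optimistic side, and the 25-30 year estimate holds up even with the optimistic figure.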
But the typical user will never need that much RAM for basic web browsing, etc. The typical computer RAM size is not going to keep growing indefinitely.
What about cheaper models? It may be possible to run a "good enough" model on consumer hardware eventually. But I suspect that for at least 10-15 years, typical consumers (HN readers may not be typical!) will prefer capability, cheapness, and especially reliability (not making mistakes) over being able to run the model locally. (Yes AI datacenters are being subsidized by investors; but they will remain cheaper, even if that ends, due to economies of scale.)
The economics dictate that AI PCs are going to remain a niche product, similar to gaming PCs. Useful AI capability is just too expensive to add to every PC by default. It's like saying flying is so important, everyone should own an airplane. For at least a decade, likely two, it's just not cost-effective.
> It may be possible to run a "good enough" model on consumer hardware eventually
10-15 years?!!!! What is the definition of good enough? Qwen3 8B or 30B-A3B are quite capable models which run on a lot of hardware even today. SOTA is not just getting bigger, it's also getting more intelligent and running more efficiently. There have been massive gains in intelligence at the smaller model sizes. It is just highly task dependent. Arguably some of these models are "good enough" already, and the level of intelligence and instruction following is much better than even 1 year ago. Sure, not Opus 4.5 level, but still much could be done without that level of intelligence.
"Good enough" has to mean users won't be frequently frustrated if they transition to it from a frontier model.
> it is highly task dependent... much could be done without that level of intelligence
This is an enthusiast's glass-half-full perspective, but casual end users are gonna have a glass-half-empty perspective. Qwen3-8B is impressive, but how many people use it as a daily driver? Most casual users will toss it as soon as it screws up once or twice.
The phrase you quoted in particular was imprecise (sorry) but my argument as a whole still stands. Replace "consumer hardware" with "typical PCs" - think $500 bestseller laptops from Walmart. AI PCs will remain niche luxury products, like gaming PCs. But gaming PCs benefit from being part of gaming culture and because cloud gaming adds input latency. Neither of these affects AI much.
You may be correct, but I wonder if we'll see Mac Mini sized external AI boxes that do have the 1TB of RAM and other hardware for running local models.
Maybe 100% of computer users wouldn't have one, but maybe 10-20% of power users would, including programmers who want to keep their personal code out of the training set, and so on.
I would not be surprised though if some consumer application made it desirable for each individual, or each family, to have local AI compute.
It's interesting to note that everyone owns their own computer, even though a personal computer sits idle half the day, and many personal computers hardly ever run at 80% of their CPU capacity. So the inefficiency of owning a personal AI server may not be as much of a barrier as it would seem.
But will it ever lead to a Mac Mini-priced external AI box? Or will this always be a premium "pro" tier that seems to rival used car prices?
> but I wonder if we'll see Mac Mini sized external AI boxes that do have the 1TB of RAM
Isn't that the Mac Studio already? Ok, it seems to max at 512 GB.
> In 2010 the typical consumer PC had 2-4gb of RAM. Now the typical PC has 12-16gb. This suggests RAM size doubling perhaps every 5 years at best. If that's the case, we're 25-30 years away from the typical PC having enough RAM to run Kimi K2.
Part of the reason that RAM isn't growing faster is that there's no need for that much RAM at the moment. Technically you can put multiple TB of RAM in your machine, but no-one does that because it's a complete waste of money [0]. Unless you're working in a specialist field, 16 GB of RAM is enough, and adding more doesn't make anything noticeably faster.
But given a decent use-case, like running an LLM locally, you'd find demand for lots more RAM, and that would drive supply and new technology developments, and in ten years it'll be normal to have 128TB of RAM in a baseline laptop.
Of course, that does require that there is a decent use-case for running an LLM locally, and your point that that is not necessarily true is well-made. I guess we'll find out.
[0] apart from a friend of mine working on crypto who had a desktop Linux box with 4TB of RAM in it.
I spent a good 30 seconds trying to figure out what DDS was an acronym for in this context.
The problem with this is that NPUs have terrible, terrible support in the various software ecosystems because they are unique to their particular SoC or whatever. There's no consistency even within particular companies.
You don't understand the needs of a common laptop user. Define the use cases that require reaching for a laptop instead of using the phone that is nearby. For a common laptop user, those use cases don't need an LLM.
This must be referring mostly to Windows, or non-Apple laptops.
I mean, having a more powerful laptop is great, but at the same time, these guys are calling for a >10x increase in RAM and a far more powerful NPU. How will this affect pricing? How will it affect power management? It made it seem like most of the laptop will be dedicated to gen AI services, which I'm still not entirely convinced are quite THAT useful. I still want a cheap laptop that lasts all day and I also want to be able to tap that device's full power for heavy compute jobs!
I have no desire to run an LLM on my laptop when I can run one on a computer the size of six football fields.
The point is that when you run it on your own hardware you can feed the model your health data, bank statements and private journals and can be 5000% sure they’re not going anywhere
Regular people don't understand nor care about any of that. They'll happily take the Faustian bargain.
I've been playing around with my own home-built AI server for a couple months now. It is so much better than using a cloud provider. It is the difference between drag racing in your own car, and renting one from a dealership. You are going to learn far more doing things yourself. Your tools will be much more consistent and you will walk away with a far greater understanding of every process.
A basic last-generation PC with something like a 3060 (12GB) is more than enough to get started. My current rig pulls less than 500 W with two cards (3060+5060). And, given the current temperature outside, the rig helps heat my home. So I am not contributing to global warming, water consumption, or any other datacenter-related environmental evil.
> I am not contributing to global warming
lol