Experimenting with Local LLMs on macOS

(blog.6nok.org)

374 points | by frontsideair a day ago ago

263 comments

coffeecoders a day ago ago
I agree that it's kind of magical that you can download a ~10GB file and suddenly your laptop is running something that can summarize text, answer questions and even reason a bit.
The trick is balancing model size vs RAM: 12B–20B is about the upper limit for a 16GB machine without it choking.
What I find interesting is that these models don't actually hit Apple's Neural Engine, they run on the GPU via Metal. Core ML isn't great for custom runtimes and Apple hasn't given low-level developer access to the ANE afaik. And then there is memory bandwidth and dedicated SRAM issues. Hopefully Apple optimizes Core ML to map transformer workloads to the ANE.
[-]
- giancarlostoro a day ago ago
  I feel like Apple needs a new CEO, I've felt this way for a long time. If I had been in charge of Apple I would have embraced local LLMs and built an inference engine that optimizes models that are designed for Nvidia, I also would have probably toyed around with the idea of selling server-grade Apple Silicon processors and opening up the GPU spec so people can build against it. Seems like Apple tries to play it too safe. While Tim Cook is good as a COO, he's still running Apple as a COO. They need a man of vision, not a COO at the helm.
  [-]
  - aurareturn a day ago ago
    I think if Cook had vision, he could have started something called Apple Enterprise and sold Apple Silicon as a server and made AI chips. I agree he’s too conservative and has no product vision. Great manager though.
    [-]
    - seanmcdirmid a day ago ago
      I was pleasantly surprised Apple Silicon came out at all. Someone has their eye on long term vision at Apple at least, they just didn't do this on a whim.
      [-]
      - flutas 17 hours ago ago
        Or someone told Tim "we can save $XYZ per phone if we switch to custom designed silicon, and potentially expand it to Mac as well so we no longer have Intel overheating our Macbooks."
        He was after-all more of an operations guy than a product guy before moving into the CEO role.
        [-]
        seanmcdirmid 16 hours ago ago
        The unified GPU and unified memory design was pretty important. They just didn’t go and replace intel, they replaced AMD/NVIDIA also. The GPUs in high end Apple silicon are even good enough for mid model inference, and unified memory makes it somewhat cost effective…that advantage probably wasn’t planned and comes from just a lot of good execution and smart R&D.
        [-]
        jychang 16 hours ago ago
        To be fair, Apple HATES Nvidia after the 8400M and 9400M debacle. They probably saw replacing Nvidia as a bigger benefit than replacing Intel.
    - nxobject 11 hours ago ago
      I think that would spread Apple’s chip team too thinly between competing priorities - and require them to do E2E stuff they’d never be interested in doing. What’s always happened, even during Jobs, is that Apple would do something nice and backend-y, and then not be able to keep it up as they’d pour resources into some consumer product. (See: WebObjects, Xserve, Mac OS Server.)
    - alt227 8 hours ago ago
      Apple silicon does not compete well in multicore spaces. People seem to think that because it can run single core things really well on a laptop, it can do anything. Servers regularly have 100-200 cpu cores maxing out of rapid fire threads. This is not what Apple silicon excels at.
      On top of that, it only performs so well on consumer devices because they control the hardware and OS and can tune both together. Creating server hardware would mean allowing linux to be installed on it, and would need to run equally well. Apple would never put the development time into linux kernel/drivers to make this happen.
      [-]
      - swiftcoder 4 hours ago ago
        > This is not what Apple silicon excels at
        Not at the moment, no. I feel like the Apple silicon team probably would rise to that challenge though
      - packetlost 6 hours ago ago
        I know off the top of my head at least 3 places that would happily purchase a couple of XServers (one of which probably still has one) running MacOS Server. Linux isn't as hard of a requirement as you think.
        [-]
        giancarlostoro 3 hours ago ago
        Hell... I can think of loads of places running servers on WINDOWS (namely all of my employers, including F500 companies) I am not surprised that someone would run macOS as a server. At least MacOS is Unix based ;)
      - otterley 5 hours ago ago
        > Apple silicon does not compete well in multicore spaces.
        Can you elaborate on this? Maybe with some useful metrics?
    - mrexroad a day ago ago
      They did have Xserve back in the day. As great as Apple silicon is for running local llms along with being a general-purpose computing device, it’s not clear that Apple silicon have enough of a differentiating advantage over a rack of nvidia gpus to make it worthwhile in enterprise…
      [-]
      - Miraste a day ago ago
        Strange to be saying this about Apple products, but its advantage is that it's way, way cheaper.
        [-]
        renmillar 9 hours ago ago
        Would probably be different if NVIDIA viewed it as competition for data center market share
    - brookst 8 hours ago ago
      Is expansion to all possible markets really a sign of product vision? Windows is in everything from ATMs to servers to cheap laptops, and I am not sure it’s a better product for it OR that Microsoft makes more money that way. Certainly the support burden for a huge number of long tail applications is huge.
      And I suppose we’re giving credit to other people for Watch, AirPods, Vision Pro?
    - giancarlostoro 7 hours ago ago
      It doesn't just end with AI, but it seems the most blatant. At a bare minimum, he could assign someone to fulfill that vision for AI. Google has their own chips which they scale. Apple doesn't need to rebuild ChatGPT, but they could very much do what Microsoft does with Phi and provide Apple Silicon trained and optimized base models for all their users. It seems they are already doing something for XCode and Swift, but they're just barely scratching the surface.
      I remember when the iPhone X became a thing, it was because consumers were extremely underwhelmed by Apple at the time. It's like they kicked it up less than a notch sadly.
      If Tim Cook decided to be a little more of a visionary, I would say keep him. I would at least prefer he would delegate someone to do the visionary work, he will eventually need a successor.
    - alexashka 21 hours ago ago
      [flagged]
      [-]
      - saagarjha 20 hours ago ago
        By calling everyone who buys Apple products 80 IQ, you are lowering the quality of the discourse here. Please don't do that.
        [-]
        alexashka 15 hours ago ago
        [flagged]
        [-]
        saagarjha 13 hours ago ago
        Don't do this here either.
      - xanderlewis 21 hours ago ago
        Anyone who doesn’t happen to do exactly what I do and have the same interests as me is ‘80 IQ’ — whatever that means. Got it.
      - giancarlostoro 19 hours ago ago
        Doesn't Google sell $2000 phones? I really dont get the argument here.
      - pmarreck 20 hours ago ago
        Oh look, it's a poor, green-text Google apologist who thinks phones with preinstalled crapware, an energy management model that doesn't stop any app from saturating your bandwidth, CPU or battery draw, and a security model that ensures you stand a good chance of becoming part of a crypto farm or botnet just because you downloaded an emulator from a third-party app store, means you have above an 80 IQ! LOL, way to virtue-signal your poverty, bro. These are tough times, I get it... But the first 2 Android phones I ever tried, I crashed within 5 minutes just by... get this... turning on their fucking Bluetooth. WHAT QUALITY. More like "what Chinese shovelware," amirite?
        (How does it feel? Literally turning around your inane opinion back onto you.)
        [-]
        bigyabai 19 hours ago ago
        It feels like you are particularly insecure and didn't need to spout that any more than the parent did.
        [-]
        pmarreck 7 hours ago ago
        Nope. I just want to show a douche what it looks like.
        [-]
        bigyabai 4 hours ago ago
        Oh, good. Your comment was materially indistinct from someone who took the "iPad and Vision Pro are toys" thing a bit too personally.
  - jbverschoor a day ago ago
    Local llm.. everybody is scared of privacy.. many people don’t want to buy subscriptions (still).
    Just sell a proper HomePod with 64GB-128GB ram, which handles everything including your personal LLM, Time Machine if needed, back to Mac (Tailscale/zerotier)
    + they can compete efficiently with the other. Cloud providers.
    [-]
    - brookst 8 hours ago ago
      It’s a mistake to generalize from the HN population.
      Most people don’t care about privacy (see: success of Facebook and TikTok). Most people don’t care about subscriptions (see: cable TV, Netflix).
      There may be a niche market for a local inference device that costs $1000 and has to be replaced every year or two during the early days of AI, but it’s not a market with decent ROI for Apple.
    - VagabundoP 10 hours ago ago
      Have a team pushing out opitmised open source models. Over time this thing could become the house AI. Basically Star Treks computer.
    - bigyabai a day ago ago
      > Just sell a proper HomePod with 64GB-128GB ram
      The same Homepod that almost sold as poorly as Vision Pro despite a $349.99 MSRP? Apple charges $400 to upgrade an M4 to 64GB and a whopping $1,200 for the 128GB upgrade.
      The consumer demand for a $800+ device like this is probably zilch, I can't imagine it's worth Apple's time to gussy up a nice UX or support it long-term. What you are describing is a Mac with extra steps, you could probably hack together a similar experience with Shortcuts if you had enough money and a use-case. An AI Homepod-server would only be efficient at wasting money.
      [-]
      - redundantly 17 hours ago ago
        > The same Homepod that almost sold as poorly as Vision Pro despite a $349.99 MSRP?
        The HomePod did poorly because competitor offerings with similar and better performing features were priced under $100. The difference in sound quality was not worth the >3x markup.
  - ako a day ago ago
    They have local LLMs, apple foundation models: https://developer.apple.com/documentation/FoundationModels
    [-]
    - andruby a day ago ago
      Apple often wants to do it their way. Unfortunately, their foundation models are way behind even the open models.
    - _delirium 7 hours ago ago
      There are local LLM coding models that ship with XCode now too.
  - jbs789 a day ago ago
    Sounds like you’ve got a solid handle on things - go do it!
    [-]
    - giancarlostoro 8 hours ago ago
      Give me a majority share in AAPL if that's what you want ;)
  - woooooo 10 hours ago ago
    Not to mention, build a car with all that cash they have. Xiaomi makes awesome cars, Apple branded electric could scoop all the brand equity that Elon passed away.
  - elAhmo a day ago ago
    I think shareholders are fine with Tim Cook as a CEO.
    [-]
    - utyop22 21 hours ago ago
      I sometimes read posts on here and just laugh.
      Its easy to sit in the armchair and say "just be a visionary bro" when they forget Tim worked under Steve for awhile before his death - he has some sense and understanding of what it takes to get a great product out of the door.
      Nvidia is generating a lot of revenue, sure - but what is the downstream impact on its customers with the hardware? All they have right now is negative returns to show for their spending. Could this change? Maybe. Is it likely? Not in my view.
      As it stands, Apple has made the absolute right choice in not wasting its cash and is demonstrating discipline. Which when all this LLM mania quietens, shareholders will respect.
      [-]
      - nxobject 11 hours ago ago
        Arguably, it’s why investors go in for Apple in the first place: Apple’s revenue fundamentally comes from consumer spending, whose prospects are relatively well understood by the average investor.
        (I think it’s why big shareholders don’t get angry that Apple doesn’t splash their cash around: their core value proposition is focused in a dizzying tech market; take it or leave it. It’s very Warren Buffett.)
      - moduspol 20 hours ago ago
        This. I wouldn’t exactly give them bonus points for the handling of Apple Intelligence, but beyond that, they’ve taken a much more measured and evidence-based approach to LLMs than the rest of big tech.
        If it ends up that we are in a bubble and it pops, Apple may be among the least impacted in big tech.
        [-]
        ChrisMarshallNY 8 hours ago ago
        Friend of mine, used to work for Apple.
        He told me that a popular Apple saying is "We're late to the party, but always best-dressed."
        I understand this. I'm not sure their choice of outfit has always been the best, but they have had enough success to continue making money.
        billbrown 5 hours ago ago
        Toyota did this with the EV mania until they lost their nerve and got rid of Toyoda as CEO. I hope Apple doesn't fall into the same trap. (I never thought Toyota would give in either.)
    - spease 16 hours ago ago
      Yes. And everyone is glossing over the benefit of unified memory for LLM applications. Apple may not have the models, but it has customer goodwill, a platform, and the logistical infrastructure to roll them out. It probably even has the cash to buy some AI companies outright; maybe not the big ones (for a reasonable amount, anyway) but small to midsize ones with domain-specific models that could be combined.
      Not to mention the “default browser” leverage it has with with iPhones, iPods, and watches.
  - brookst 8 hours ago ago
    Under Cook, Apple’s market cap has increased 10x, at a CAGR of 18%.
    Do you really think that they need something different? As a shareholder would you bet on your vision of focusing on server parts?
  - saagarjha 20 hours ago ago
    One does not simply put a 5090 into an existing chip.
    [-]
    - giancarlostoro 8 hours ago ago
      Not what I am suggesting. However, having trained a few different things on a modest M4 Pro chip (so not even their absolute most powerful chips mind you), and using it for local-first AI inference, I can see the value. A single server could serve an LLM for a small business and cost a lot less than running the same inference through a 5090 in terms of power usage.
      I could also see universities giving this type of compute access to students for cheaper to work on more basic less resource intensive models.
      [-]
      - saagarjha 2 hours ago ago
        I think a 5090 will handily beat it on power usage.
  - __loam 7 hours ago ago
    I'm glad Tim is the CEO instead of you.
    [-]
    - jasonvorhe 2 hours ago ago
      Why? This is something that plays into all of Apple's supposed strengths: Privacy/no strict cloud dependency/on-device compute, hardware/software optimization while owning the stack and combine that with good UI/UX for a broad target audience without sacrificing too much for the power users. OP never said that local AI would be the only topic a new CEO should focus on.
  - bigyabai a day ago ago
    Software-wise, it makes sense: Nvidia has the IP lead, industry buy-in and supports the OSes everyone wants to use.
    Hardware-wise though, I actually agree - Apple has dropped the ball so hard here that it's dumbfounding. They're the only TSMC customer that could realistically ship a comparable volume of chips as Nvidia, even without really impacting their smartphone business. They have hardware designers who can design GPUs from scratch, write proprietary graphics APIs and fine-tune for power efficiency. The only organizational roadblock that I can see is the executive vision, which has been pretty wishy-washy on AI for a while now. Apple wants to build a CoreML silo in a world where better products exist everywhere, it's a dead-end approach that should have died back in 2018.
    Contextually it's weird too, I've seen tons of people defend Cook's relationship with Trump as "his duty to shareholders" and the like. But whenever you mention crypto mining or AI datacenter markets, people act like Apple is above selling products that people want. Future MBAs will be taught about this hubris once the shape of the total damages come into view.
    [-]
    - nxobject 11 hours ago ago
      > They have hardware designers who can design GPUs from scratch, write proprietary graphics APIs and fine-tune for power efficiency. The only organizational roadblock that I can see is the executive vision, which has been pretty wishy-washy on AI since for a while now.
      The vision since Jobs has always been “build a great consumer product and own as much as you can while doing so”. That’s exactly how all of the design parameters of Ax/Mx series were determined and relentlessly optimized for - the fact that they have a highly competitive uarch was a salutary side-effect, but not a planned goal.
    - jen20 a day ago ago
      > But whenever you mention crypto mining or AI datacenter markets, people act like Apple is above selling products that people want.
      People also want comfortable mattresses and high quality coffee machines. Should Apple make them too?
      Apple not being in a particular industry is a perfectly valid choice, which is not remotely comparable to protecting their interests in the industries they are currently in. Selling datacenter-bound products is something Apple is not _remotely_ equipped for, and staffing up to do so at reasonable scale would not be a trivial task.
      As for crypto mining... JFC.
      [-]
      - bigyabai a day ago ago
        Apple is perfectly well equipped to sell datacenter products. They've done it in the past, even supporting Nvidia's compute drivers along the way. If they have the staff to design consumer-facing and developer-facing experiences, why wouldn't they address the datacenter?
        Money is money. 10 years ago people would have laughed at the notion of Nvidia abandoning the gaming market, now it's their most lucrative option. Apple can and should be looking at other avenues of profit while the App Store comes under scrutiny and the Mac market share refuses to budge. It should be especially urgent if unit margins are going down as suppliers leave China.
        [-]
        saagarjha 20 hours ago ago
        Apple makes more profit on iPhones than Nvidia does on its entire datacenter business. Why would they want to enter a highly competitive market that they have no expertise in on a whim?
        jen20 a day ago ago
        > They've done it in the past, even supporting Nvidia's compute drivers along the way. If they have the staff to design consumer-facing and developer-facing experiences, why wouldn't they address the datacenter?
        They did a horrific job of it before. The staff to design consumer facing experiences are busy doing exactly that. The developer facing experiences are very lean. The bandwidth simply isn't there to do DC products. Nor is the supply chain. Nor is the service supply chain. Etc, etc.
- zozbot234 18 hours ago ago
  From reverse engineered information (in the context of Asahi Linux, which can have raw hardware access to the ANE) it seems that the M1/M2 Apple Neural Engine provides exclusively for statically scheduled MADD's of INT8 or FP16 values.[0] This wastes a lot of memory bandwidth on padding in the context of newer local models which generally are more heavily quantized.
  (That is, when in-memory model values must be padded to FP16/INT8 this slashes your effective use of memory bandwidth, which is what determines token generation speed. GPU compute doesn't have that issue; one can simply de-quantize/pad the input in fast local registers to feed the matrix compute units, so memory bandwidth is used efficiently.)
  The NPU/ANE is still potentially useful for lowering power use in the context of prompt pre-processing, which is limited by raw compute as opposed to the memory bandwidth bound of token generation. (Lower power usage in this context will save on battery and may help performance by avoiding power/thermal throttling, especially on passively-cooled laptops. So this is definitely worth going for.)
  [0] Some historical information about bare-metal use of the ANE is available from the Whisper.cpp pull req: https://github.com/ggml-org/whisper.cpp/pull/1021 Even older information at: https://github.com/eiln/ane/tree/33a61249d773f8f50c02ab0b9fe... .
  More extensive information at https://github.com/tinygrad/tinygrad/tree/master/extra/accel... (from the Tinygrad folks) seems to basically confirm the above.
  (The jury is still out for M3/M4 which currently have no Asahi support - thus, no current prospects for driving the ANE bare-metal. Note however that the M3/Pro/Max ANE reported performance numbers are quite close to the M2 version, so there may not be a real improvement there either. M3 Ultra and especially the M4 series may be a different story.)
- GeekyBear a day ago ago
  > Hopefully Apple optimizes Core ML to map transformer workloads to the ANE.
  If you want to convert models to run on the ANE there are tools provided:
  > Convert models from TensorFlow, PyTorch, and other libraries to Core ML.
  https://apple.github.io/coremltools/docs-guides/index.html
  [-]
  - ls-a a day ago ago
    I thought Apple MLX can do that if you convert your model using it https://mlx-framework.org/
    [-]
    - woadwarrior01 a day ago ago
      MLX does not support the ANE.
      https://github.com/ml-explore/mlx/issues/18
      [-]
      - elpakal 21 hours ago ago
        Yes it does.
        That’s just an issue with stale and incorrect information. Here are the docs https://opensource.apple.com/projects/mlx/
        [-]
        woadwarrior01 13 hours ago ago
        No, it categorically doesn't. Not just that, it's CPU support is quite lacking (fp32 only). Currently, there are two ways to support the ANE: CoreML and MPSGraph.
        y1n0 20 hours ago ago
        Nothing in that documentation says anything about the Apple Neural Engine. MLX runs on the GPU.
        jychang 16 hours ago ago
        None of that uses the ANE.
    - GeekyBear a day ago ago
      It does indeed, and is more modern than Core ML.
  - coffeecoders a day ago ago
    It is less about conversion and more about extending ANE support for transformer-style models or giving developers more control.
    The issue is in targeting specific hardware blocks. When you convert with coremltools, Core ML takes over and doesn't provide fine-grained control - run on GPU, CPU or ANE. Also, ANE isn't really designed with transformers in mind, so most LLM inference defaults to GPU.
    [-]
    - aurareturn a day ago ago
      Neural Engine is optimized for power efficiency, not performance.
      Look for Apple to add matmul acceleration into the GPU instead. Thats how to truly speed up local LLMs.
- slacka a day ago ago
  I too found that interesting that Apple's Neural Engine doesn't work with local LLMs. Seems like Apple, AMD, and Intel are missing the AI boat by not properly supporting their NPUs in llama.cpp. Any thoughts on why this is?
  [-]
  - numpad0 a day ago ago
    Perhaps due to sizes? AI/NN models before LLM were magnitudes smaller, as evident in effectively all LLMs carrying "Large" in its name regardless of relative size differences.
  - Someone a day ago ago
    I guess that hardware doesn’t make things faster (¿yet?). If so I guess they would have mentioned it in https://machinelearning.apple.com/research/core-ml-on-device.... That is updated for Sequoia and says
    “This technical post details how to optimize and deploy an LLM to Apple silicon, achieving the performance required for real time use cases. In this example we use Llama-3.1-8B-Instruct, a popular mid-size LLM, and we show how using Apple’s Core ML framework and the optimizations described here, this model can be run locally on a Mac with M1 Max with about ~33 tokens/s decoding speed. While this post focuses on a particular Llama model, the principles outlined here apply generally to other transformer-based LLMs of different sizes.”
    [-]
    - cma 4 hours ago ago
      If it uses a lot less power it could still be a win for some use cases, like while on battery you might still want to run transformer based speech to text, RTX voice-like microphone denoising, image generation/infill in photo editing programs. In some use cases like RTX-voice like stuff during multiplayer gaming, you might want the GPU free to run the game even if it still suffers some memory bandwidth impact from having it running.
  - bigyabai a day ago ago
    Most NPUs are almost universally too weak to use for serious LLM inference. Most of the time you get better performance-per-watt out of GPU compute shaders, the majority of NPUs are dark silicon.
    Keep in mind - Nvidia has no NPU hardware because that functionality is baked-into their GPU architecture. AMD, Apple and Intel are all in this awkward NPU boat because they wanted to avoid competition with Nvidia and continue shipping simple raster designs.
    [-]
    - aurareturn a day ago ago
      Apple is in this NPU boat because they are optimized for mobile first.
      Nvidia does not optimize for mobile first.
      AMD and Intel were forced by Microsoft to add NPUs in order to sell “AI PCs”. Turns out the kind of AI that people want to run locally can’t run on an NPU. It’s too weak like you said.
      AMD and Intel both have matmul acceleration directly in their GPUs. Only Apple does not.
      [-]
      - bigyabai a day ago ago
        Nvidia's approach works just fine on mobile. Devices like the Switch have complex GPGPU pipelines and don't compromise whatsoever on power efficiency.
        Nonetheless, Apple's architecture on mobile doesn't have to define how they approach laptops, destops and datacenters. If the mobile-first approach is limiting their addressable market, then maybe Tim's obsessing over the wrong audience?
        [-]
        aurareturn 15 hours ago ago
        MacBooks benefit from mobile optimization. Apple just needs to add matmul hardware acceleration into their GPUs.
  - GeekyBear a day ago ago
    There is no NPU "standard".
    Llama.cpp would have to target every hardware vendor's NPU individually and those NPUs tend to have breaking changes when newer generations of hardware are released.
    Even Nvidia GPUs often have breaking changes moving from one generation to the next.
    [-]
    - montebicyclelo a day ago ago
      I think OP is suggesting that Apple / AMD / Intel do the work of integrating their NPUs into popular libraries like `llama.cpp`. Which might make sense. My impression is that by the time the vendors support a certain model with their NPUs the model is too old and nobody cares anyway. Whereas llama.cpp keeps up with the latest and greatest.
  - svachalek a day ago ago
    I think I saw something that got Ollama to run models on it? But it only works with tiny models. Seems like the neural engine is extremely power efficient but not fast enough to do LLMs with billions of parameters.
    [-]
    - reddit_clone 4 hours ago ago
      I am running Ollama with 'SimonPu/Qwen3-Coder:30B-Instruct_Q4_K_XL' on a M4 pro MBP with 48 GB of memory.
      From Emacs/gptel, it seems pretty fast.
      I have never used the proper hosted LLMS, so I don't have a direct comparison. But the above LLM answered coding questions in a handful of seconds.
      The cost of memory (and disk) upgrades in apple machines is exorbitant.
  - jondwillis a day ago ago
    https://github.com/Anemll/Anemll
  - a day ago ago
    [deleted]
- witnessme 6 hours ago ago
  Don't try 12-20B on 16GB. You should stick with 4-8B instead. You'll get way too slow tps and marginal perf improvements going higher on a 16GB machine.
- ai-christianson a day ago ago
  I can run GLM 4.5 Air and gpt-oss-120b both very reasonably. GPT OSS has particularly good latency.
  I'm on a 128GB M4 macbook. This is "powerful" today, but it will be old news in a few years.
  These models are just about getting as good as the frontier models.
- daemonologist 18 hours ago ago
  ONNX Runtime purports to support CoreML: https://onnxruntime.ai/docs/execution-providers/CoreML-Execu... , which gives a decent amount of compatibility for inference. I have no idea to what extent workloads actually end up on the ANE though.
  (Unfortunately ONNX doesn't support Vulkan, which limits it on other platforms. It's always something...)
- ru552 a day ago ago
  You're better served using Apple's MLX if you want to run models locally.
- zackmorris a day ago ago
  Don't get me started. Many new computers come with an NPU of some kind, which is superfluous to a GPU.
  But what's really going on is that we never got the highly multicore and distributed computers that could have started going mainstream in the 1980s, and certainly by the late 1990s when high-speed internet hit. So single-threaded performance is about the same now as 20 years ago. Meanwhile video cards have gotten exponentially more powerful and affordable, but without the virtual memory and virtualization capabilities of CPUs, so we're seeing ridiculous artificial limitations like not being able to run certain LLMs because the hardware "isn't powerful enough", rather than just having a slower experience or borrowing the PC in the next room for more computing power.
  To go to the incredible lengths that Apple went to in designing the M1, not just wrt hardware but in adding yet another layer of software emulation since the 68000 days, without actually bringing multicore with local memories to the level that today's VLSI design rules could allow, is laughable for me. If it wasn't so tragic.
  It's hard for me to live and work in a tech status quo so far removed from what I had envisioned growing up. We're practically at AGI, but also mired in ensh@ttification. Reflected in politics too. We'll have the first trillionaire before we solve world hunger, and I'm bracing for Skynet/Ultron before we have C3P0/JARVIS.
- jondwillis a day ago ago
  https://github.com/Anemll/Anemll
- wslh a day ago ago
  I find surprising that you can also do that from the browser (e.g. WebLLM). I imagine that in the near future we will run these engines locally for many use cases, instead of via APIs.
- wer232essf a day ago ago
  [flagged]
  [-]
  - o11c a day ago ago
    It's useless to mention number of parameters without also mentioning quantization, and to a lesser-but-still-significant extent context size, which determine how much RAM is needed.
    "It will run" is a different thing than "it will run without swapping or otherwise hitting a slow storage access path". That makes a speed difference of multiple orders of magnitude.
    This is one thing Ollama is good for. Possibly the only thing, if you listen to some of its competitors. But the choice of runner does nothing to avoid the fact that all LLMs are just toys.
punitvthakkar a day ago ago
So far I've not run into the kind of use cases that local LLMs can convincingly provide without making me feel like I'm using the first ever ChatGPT from 2022, in that they are limited and quite limiting. I am curious about what use cases the community has found that work for them. The example that one user has given in this thread about their local LLM inventing a Sun Tzu interview is exactly the kind of limitation I'm talking about. How does one use a local LLM to do something actually useful?
[-]
- narrator a day ago ago
  I have tried a lot of different LLMs and Gemma3:27b on a 48gb+ Macbook is probably the best for analyzing diaries and personal stuff you don't want to share with the cloud. The China models are comically bad with life advice. For example, I asked Deepseek to read my diaries and talk to me about my life goals and it told me in a very Confucian manner what the proper relationships in my life were for my stage of life and station in society. Gemma is much more western.
  [-]
  - solardev a day ago ago
    Lol, that's actually kinda cool. Did you get any interesting Eastern responses to your diary entries?
    I'm imagining something like...
    > Dear diary, I got bullied again today, and the bread was stale in my PB&J :(
    >> My son, remember this: The one who mocks others wounds his own virtue. The one who suffers mockery must guard his heart. To endure without hatred is strength; to strike without cause is disgrace. The noble one corrects himself first, then the world will follow.
  - punitvthakkar 11 hours ago ago
    That is fascinating. One insight I read about LLMs is that they do represent a world-view of the people who train it, and hence the country that ships the dominant LLM technology can spread widely its world-view on others. Your experience seems to validate that insight.
  - elorant a day ago ago
    Chinese models are also awful with translations. Even the Deepseek R1 model performs worse than Mistral small.
- crazygringo a day ago ago
  I see local LLM's being used mainly for automation as opposed to factual knowledge -- for classification, summarization, search, and things like grammar checking.
  So they need to be smart about your desired language(s) and all the everyday concepts we use in it (so they can understand the content of documents and messages), but they don't need any of the detailed factual knowledge around human history, programming languages and libraries, health, and everything else.
  The idea is that you don't prompt the LLM directly, but your OS tools make use of it, and applications prompt it as frequently as they fetch URL's.
  [-]
  - theshrike79 11 hours ago ago
    And local models are static, predictable and don't just go away when a new one comes out.
    This makes them perfect for automation tasks.
- vorticalbox 15 hours ago ago
  I keep a lot of notes, all my thoughts feelings both happy and sad, things I’ve done, etc. in obsidian. These are deeply personal and I don’t want this going to a cloud provider even if they “say” they don’t train on my chats.
  I forget a lot of things so I feed these into chromeDB and then use a LLM to chat with all my notes.
  I’ve started using abliterated models which have their refusal removed [0]
  Other use case is for work. I work with financial data and I have created an mcp that automates some of my job. Running model locally allows me to not worry about the information I feed it.
  [0] https://github.com/Sumandora/remove-refusals-with-transforme...
- dxetech a day ago ago
  There are situations where internet access is limited, or where there are frequent outages. An outdated LLM might be more useful than none at all. For example: my internet is out due to a severe storm, what safety precautions do I need to take?
  [-]
  - theshrike79 10 hours ago ago
    Just use Kiwix: https://kiwix.org/en/
  - volemo 10 hours ago ago
    Surely not the ones you get from an LLM?
  - punitvthakkar 11 hours ago ago
    Yes - emergency use cases make tons of sense.
- jondwillis a day ago ago
  I use, or at least try to use local models while prototyping/developing apps.
  First, they control costs during development, which depending on what you're doing, can get quite expensive for low or no budget projects.
  Second, they force me to have more constraints and more carefully compose things. If a local model (albeit something somewhat capable like gpt-oss or qwen3) can start to piece together this agentic workflow I am trying to model, chances are, it'll start working quite well and quite quickly if I switch to even a budget cloud model (something like gpt-5-mini.)
  However, dealing with these constraints might not be worth the time if you can stuff all of the documents in your context window for the cloud models and get good results, but it will probably be cheaper and faster on an ongoing basis to have split the task up.
- dragonwriter a day ago ago
  Well, a lot of what is possible with local models depends on what your local hardware is, but docling is a pretty good example of a library that can use local models (VLMs instead of regular LLMs) “under the hood” for productive tasks.
- ActorNightly 4 hours ago ago
  Smaller models require a lot more direction, a.k.a system prompt engineering, and sometimes custom wrappers . For example Gemma models are very eager to generate code even if you tell them not to.
- rukuu001 a day ago ago
  I'm running Gemma3-270M locally (MLX). I got a Python script that pulls down emails based on a whitelist and summarises them. The 270M model does a good job of this. This is running in a terminal. It means I barely look at my email during the day.
  [-]
  - ghilston 20 hours ago ago
    Any willingness to share this script? I've been working on some code to ingest things and summarize for them and I haven't gotten to email just yet.
    [-]
    - rukuu001 17 hours ago ago
      Watch this space. It’s pretty scrappy code and needs a cleanup. It also does other random stuff relating to calendar entries that I want to be reminded to appropriately prepare for.
      But yes I’ll share, and I guess post an update in this thread?
      [-]
      - renmillar 29 minutes ago ago
        I’d suggest the simple approach: run that script through Claude and have it extract just the email processing parts to create a clean CLI tool. This seems like exactly the type of refactoring task that LLMs are really good at.
      - ghilston 9 hours ago ago
        Okay will do. Yeah, I just finished my calendar reading code. I haven't prepared the data for my LLM ingestion yet. Sounds great, I'll refresh this thread in a few days
- jeffybefffy519 a day ago ago
  Gemma3 is pretty useful on a long haul flight without internet
  [-]
  - kristopolous 16 hours ago ago
    kimi v2 by moonshot. Give it a go
    [-]
    - mhuffman 11 hours ago ago
      What Mac are you using that can run kimi k2?
- bityard a day ago ago
  I use a local LLM for lots of little things that I used to use search engines for. Defining words, looking up unicode symbols for copy/paste, reminders on how to do X in bash or Python. Sometimes I use it as a starting point for high-level questions and curiosity and then move to actual human content or larger online models for more details and/or fact-checking if needed.
  If your computer is somewhat modern and has a decent amount of RAM to spare, it can probably run one of the smaller-but-still-useful models just fine, even without a GPU.
  My reasons:
  1) Search engines are actively incentivized to not show useful results. SEO-optimized clickbait articles contain long fluffy, contentless prose intermixed with ads. The longer they can keep you "searching" for the information instead of "finding" it, the better is for their bottom line. Because if you actually manage to find the information you're looking for, you close the tab and stop looking at ads. If you don't find what you need, you keep scrolling and generate more ad revenue for the advertisers and search engines. It's exactly the same reasons online dating sites are futile for most people: every successful match made results in two lost customers which is bad for revenue.
  LLMs (even local ones in some cases) are quite good at giving you direct answers to direct questions which is 90% of my use for search engines to begin with. Yes, sometimes they hallucinate. No, it's not usually a big deal if you apply some common sense.
  2) Most datacenter-hosted LLMs don't have ads built into them now, but they will. As soon as we get used to "trusting" hosted models due to how good they have become, the model developers and operators will figure out how to turn the model into a sneaky salesman. You'll ask it for the specs on a certain model of Dell laptop and it will pretend it didn't hear you and reply, "You should try HP's latest line of up business-class notebooks, they're fast, affordable, and come in 5 fabulous colors to suit your unique personal style!" I want to make sure I'm emphasizing that it's not IF this happens, it's WHEN.
  Local LLMs COULD have advertising at some point, but it will probably be rare and/or weird as these smaller models are meant mainly for development and further experimentation. I have faith that some open-weight models will always exist in some form, even if they never rival commercially-hosted models in overall quality.
  3) I've made peace with the fact that data privacy in the age of Big Tech is a myth, but that doesn't mean I can't minimize my exposure by keeping some of my random musings and queries to myself. Self-hosted AI models will never be as "good" as the ones hosted in datacenters, but they are still plenty useful.
  4) I'm still in the early stages of this, but I can develop my own tools around small local models without paying a hosted model provider and/or becoming their product.
  5) I was a huge skeptic about the overall value of AI during all of the initial hype. Then I realized that this stuff isn't some fad that will disappear tomorrow. It will get better. The experience will get more refined. It will get more accurate. It will consume less energy. It will be totally ubiquitous. If you fail to come to speed on some important new technology or trend, you will be left in the dust by those who do. I understand the skepticism and pushback, but the future moves forward regardless.
  [-]
  - punitvthakkar 11 hours ago ago
    All totally valid points and insights. This is great, thank you!
- luckydata a day ago ago
  Local models can do embedding very well, which is useful for things like building a screenshot manager for example.
  [-]
  - punitvthakkar 11 hours ago ago
    Whoa. I didn't think about using embeddings for screenshot management. How would I do this?
- mentalgear a day ago ago
  something like rewind or openRecall can use local LLMs for on-device semantic search.
- ivape a day ago ago
  I use Claude code in the terminal only mostly to figure out what to commit along with what to write for the commit message. I believe a solid 7-8b model can do this locally.
  So, that’s at least one small highly useful workflow robot I have a use for (and very easy to cook up on your own).
  I also have a use for terminal command autocompletion, which again, a small model can be great for.
  Something felt kind really wrong about sending entire folder contents over to Claude online, so I am absolutely looking to create the toolkit locally.
  The universe off offline is just getting started, and these big companies literally are telling you “watch out, we save this stuff”.
- bigyabai a day ago ago
  Qwen3 A3B (in my experience) writes code as-good-as ChatGPT 4o and much better than GPT-OSS.
  [-]
  - hu3 a day ago ago
    I just tested Qwen3 A3B vs ChatGPE a random prompt from my head and:
    > Please write a C# middleware to block requests from browser agents that contain any word in a specified list of words: openai, grok, gemini, claude.
    I used ChatpGPT 4o from GitHub Copilot inside VSCode. And Qwen3 A3B from here: https://deepinfra.com/Qwen/Qwen3-30B-A3B
    ChatGPT 4o was considerably better. Less verbose and less unnecessary abstractions.
    [-]
    - bigyabai a day ago ago
      You want the 2507 update of the model, I think the one you used is ~8-10 months out-of-date.
      [-]
      - jasonjmcghee a day ago ago
        No they want Qwen3-Coder-30B-A3B-Instruct
- segmondy a day ago ago
  The same way you use a cloud LLM.
  [-]
  - oblio a day ago ago
    I think the point was that for example for programming, people perceive state of the art LLMs as being net positive contributors, at least for mainstream programming languages and tasks, and I guess local LLMs aren't net positive contributors (i.e. an experienced programmer can build the same thing at least as fast when using an LLM).
    [-]
    - segmondy a day ago ago
      I know this is false, DeepSeekv3.1, GLM4.5, KimiK2-0905, Qwen-235B are all solid open models. Last night, I vibed rough 1300 lines of C server code in about an hour. 0 compilation error, ran without errors and got the job done. I want to meet this experienced programmer that can knock out 1300 lines of C code in an hour.
      [-]
      - drusepth a day ago ago
        Are 235B models classified as local LLMs? I guess they probably are, but others in this thread are probably looking more toward 20B-30B models and sizes that generally fit on the RAM you'd expect in average or slightly-higher-end hardware.
        My beefy 3D gamedev workstation with a 4090 and 128GB RAM can't even run a 235B model unless it's extremely quantized (and even then, only at like single-digit tokens/minute).
      - codazoda a day ago ago
        How much machine do you have to be able to run Qwen-235B locally?
      - nomel a day ago ago
        Without knowing what you were doing with that 1300 lines of code, there's not much insight that can be had from this.
      - brookst 8 hours ago ago
        I’m a mediocre C programmer on my best day and I assure you a highly competent programmer could probably use 200 lines of code to do what I achieve in 1300.
        Just counting lines is not a good proxy for how much effort it would take a good programmer.
        (And I am 100% pro LLM coding, just saying this isn’t a great argument)
      - oblio 15 hours ago ago
        Can you run any of those models without $20 000 worth of hardware that uses as much power and makes as much noise as a small factory?
daoboy a day ago ago
I'm running Hermes Mistral and the very first thing it did was start hallucinating.
I recently started an audio dream journal and want to keep it private. Set up whisper to transcribe the .wav file and dump it in an Obsidian folder.
The plan was to put a local llm step in to clean up the punctuation and paragraphs. I entered instructions to clean the transcript without changing or adding anything else.
Hermes responded by inventing an intereview with Sun Tzu about why he wrote the Art of War. When I stopped the process it apologized and advised it misunderstood when I talked about Sun Tzu. I never mentioned Sun Tzu or even provided a transcript. Just instructions.
We went around with this for a while before I could even get it to admit the mistake, and it refused to identify why it occurred in the first place.
Having to meticulously check for weird hallucinations will be far more time consuming than just doing the editing myself. This same logic applies to a lot of the areas I'd like to have a local llm for. Hopefully they'll get there soon.
[-]
- simonh a day ago ago
  It’s often been assumed that accuracy and ‘correctness’ would be easy to implement on computers because they operate on logic, in some sense. It’s originality and creativity that would be hard, or impossible because it’s not logical. Science Fiction has been full of such assumptions. Yet here we are, the actual problem is inventing new heavy enough training sticks to beat our AIs out of constantly making stuff up and lying about it.
  I suppose we shouldn’t be surprised in hindsight. We trained them on human communicative behaviour after all. Maybe using Reddit as a source wasn’t the smartest move. Reddit in, Reddit out.
  [-]
  - smallmancontrov a day ago ago
    Pre-training gets you GPT-3, not InstructGPT/ChatGPT. During fine-tuning OpenAI (and everyone else) specifically chose to "beat in" a heavy bias-to-action because a model that just answers everything with "it depends" and "needs more info" is even more useless than a model that turns every prompt into a creative writing exercise. Striking a balance is simply a hard problem -- and one that many humans have not mastered for themselves.
  - root_axis a day ago ago
    > It’s often been assumed that accuracy and ‘correctness’ would be easy to implement on computers because they operate on logic, in some sense. It’s originality and creativity that would be hard
    More fundamental than the training data is the fact that the generative outputs are statistical, not logical. This is why they can produce a sequence of logical steps but still come to incorrect or contradictory conclusions. This is also why they tackle creativity more easily since the acceptable boundaries of creative output is less rigid. A photorealistic video of someone sawing a cloud in half can still be entertaining art despite the logical inconsistencies in the idea.
  - HankStallone a day ago ago
    The worst news I've seen about AI was a study that said the major ones get 40% of their references from Reddit (I don't know how they determined that). That explains the cloying way it tries to be friendly and supportive, too.
    [-]
    - sandbags a day ago ago
      I saw someone reference this today and the question I had was whether this counted the trillions of words accrued from books and other sources. i.e. is it 40%? Or 40% of what they can find a direct attribution link for?
  - dragonwriter a day ago ago
    > It’s often been assumed that accuracy and ‘correctness’ would be easy to implement on computers because they operate on logic, in some sense. It’s originality and creativity that would be hard, or impossible because it’s not logical.
    It is easy, comparatively. Accuracy and correctness is what computers have been doing for decades, except when people have deliberately compromised that for performance or other priorities (or used underlying tools where someone else had done that, perhaps unwittingly.)
    > Yet here we are, the actual problem is inventing new heavy enough training sticks to beat our AIs out of constantly making stuff up and lying about it.
    LLMs and related AI technologies are very much an instance of extreme deliberate compromise of accuracy, correctness, and controllability to get some useful performance in areas where we have no idea how to analytically model the expected behavior but have lots of more or less accurate examples.
- a day ago ago
  [deleted]
JumpCrisscross a day ago ago
I don't think we're anywhere close to running cutting-edge LLMs on our phones or laptops.
What may be around the corner is running great models on a box at home. The AI lives at home. Your thin client talks to it, maybe runs a smaller AI on device to balance latency and quality. (This would be a natural extension for Apple to go into with its Mac Pro line. $10 to 20k for a home LLM device isn't ridiculous.)
[-]
- simonw a day ago ago
  Right now you can run some of the best available open weight models on a 512GB Mac Studio, which retails for around $10,000. Here's Qwen3-Coder-480B-A35B-Instruct running at 24 tokens/second at 4bit: https://twitter.com/awnihannun/status/1947771502058672219 and Deep Seek V3 0324 in 4-bit at 20 toks/sec https://twitter.com/awnihannun/status/1904177084609827054
  You can also string two 512GB Mac Studios together using MLX to load even larger models - here's 671B 8-bit DeepSeek R1 doing that: https://twitter.com/alexocheema/status/1899735281781411907
  [-]
  - zargon a day ago ago
    What these tweets about Apple silicon never show you: waiting 20+ minutes for it to ingest 32k context tokens. (Probably a lot longer for these big models.)
    [-]
    - logicprog 19 hours ago ago
      Yeah, I bought a used Mac Studio (an M1, to be fair, but still a Max and things haven't changed since) hoping to be able to run a decent LLM on it, and was sorely disappointed thanks to the prompt processing speed especially.
      [-]
      - alt227 8 hours ago ago
        No offense to you personally, but I find it very funny when people hear marketing copy for a product and think it can do anything they said it can.
        Apple silicon is still just a single consumer grade chip. It might be able to run certain end user software well, but it cannot replace a server rack of GPUs.
        [-]
        zargon 3 hours ago ago
        I don’t think this is a fair take in this particular situation. My comment is in response to Simon Willison, who has a very popular blog in the LLM space. This isn’t company marketing copy; it’s trusted third parties spreading this misleading information.
- brokencode a day ago ago
  Not sure about the Mac Pro, since you pay a lot for the big fancy case. The Studio seems more sensible.
  And of course Nvidia and AMD are coming out with options for massive amounts of high bandwidth GPU memory in desktop form factors.
  I like the idea of having basically a local LLM server that your laptop or other devices can connect to. Then your laptop doesn’t have to burn its battery on LLM work and it’s still local.
  [-]
  - theshrike79 10 hours ago ago
    It's really easy to whip up a simple box that runs local LLM for a whole home.
    Marketing it though? Not doable.
    Apple is pretty much the only company I see attempting this with some kind of AppleTV Pro.
  - JumpCrisscross a day ago ago
    > Not sure about the Mac Pro, since you pay a lot for the big fancy case. The Studio seems more sensible
    Oh wow, a maxed out Studio could run a 600B parameter model entirely in memory. Not bad for $12k.
    There may be a business in creating the software that links that box to an app on your phone.
    [-]
    - simonw a day ago ago
      I have been using a Tailscale VPN to make LM Studio and Ollama running on my Mac available to my iPhone when I leave the house.
    - brokencode a day ago ago
      Perhaps said software could even form an end to end encrypted tunnel from your phone to your local LLM server anywhere over the internet via a simple server intermediary.
      The amount of data transferred is tiny and the latency costs are typically going to be dominated by the LLM inference anyway. Not much advantage to doing LAN only except that you don’t need a server.
      Though the amount of people who care enough to buy a $3k - $10k server and set this up compared to just using ChatGPT is probably very small.
      [-]
      - JumpCrisscross a day ago ago
        > amount of people who care enough to buy a $3k - $10k server and set this up compared to just using ChatGPT is probably very small
        So I maxed that out, and it’s with Apple’s margins. I suspect you could do it for $5k.
        I’d also note that for heavy users of ChatGPT, the difference in energy costs for a home setup and the price for ChatGPT tokens may make this financially compelling for heavy users.
        [-]
        brokencode a day ago ago
        True, it may be profitable for pro users. At $200 a month for ChatGPT Pro, it may only take a few years to recoup the initial costs. Not sure about energy costs though.
        And of course you’d be getting a worse model, since no open source model currently is as good as the best proprietary ones.
        Though that gap should narrow as the open models improve and the proprietary ones seemingly plateau.
    - dghlsakjg a day ago ago
      That software is an HTTP request, no?
      Any number of AI apps allow you to specify a custom endpoint. As long as your AI server accepts connections to the internet, you're gravy.
      [-]
      - JumpCrisscross a day ago ago
        > That software is an HTTP request, no?
        You and I could write it. Most folks couldn’t. If AI plateaus, this would be a good hill to have occupied.
        [-]
        dghlsakjg a day ago ago
        My point is, what is there to build?
        The person that is willing to buy that appliance is likely heavily overlapped with the person that is more than capable of pointing one of the dozens of existing apps at a custom domain.
        Everyone else will continue to just use app based subscriptions.
        Streaming platforms have plateaued (at best), but self hosted media appliances are still vanishingly rare.
        Why would AI buck the trend that every other computing service has followed?
        [-]
        itsn0tm3 18 hours ago ago
        You don’t tell your media player company secrets ;)
        I think there is a market here, solely based on actual data privacy. Not sure how big it is but I can see quite some companies have use for it.
        [-]
        dghlsakjg 18 hours ago ago
        > You don’t tell your media player company secrets ;)
        No, but my email provider has a de-facto repository of incredibly sensitive documents. When you put convenience and cost up against privacy, the market has proven over and over that no one gives a shit.
        JumpCrisscross a day ago ago
        > what is there to build?
        Integrated solution. You buy the box. You download the app. It works like the ChatGPT app, except it's tunneling to the box you have at home which has been preconfigured to work with the app. Maybe you have a subscription to keep everything up to date. Maybe you have an open-source model 'store'.
- data-ottawa a day ago ago
  This is what I’m doing with my amd 395+.
  I’m running docker containers with different apps and it works well enough for a lot of my use cases.
  I mostly use Qwen Code and GPT OSS 120b right now.
  When the next generation of this tech comes through I will probably upgrade despite the price, the value is worth it to me.
  [-]
  - milgrum 21 hours ago ago
    How many TPS do you get running GPT OSS 120b on the 395+? Considering a Framework desktop for a similar use case, but I’ve been reading mixed things about performance (specifically with regards to memory bandwidth, but I’m not sure if that’s really the underlying issue)
    [-]
    - data-ottawa 7 hours ago ago
      30-40 at 64k context, but it's a mixture of experts model.
      A 70b dense model is slower
      Qwen coder 30b Q4 runs 40+.
- ben_w a day ago ago
  > $10 to 20k for a home LLM device isn't ridiculous.
  That price is ridiculous for most people. Silicon Valley payscales can afford that much, but see how few Apple Vision Pros got sold for far less.
- vonneumannstan a day ago ago
  Doesnt gpt-oss-120b perform better across the board at a fraction of the memory? Just specced a $4k mac studio that can easily run that at 128 gb memory.
- bigyabai a day ago ago
  > $10 to 20k for a home LLM device isn't ridiculous.
  At that point you are almost paying more than the datacenter does for inference hardware.
  [-]
  - JumpCrisscross a day ago ago
    > At that point you are almost paying more than the datacenter does for inference hardware
    Of course. You and I don't have their economies of scale.
    [-]
    - bigyabai a day ago ago
      Then please excuse me for calling your one-man $10,000 inference device ridiculous.
      [-]
      - JumpCrisscross a day ago ago
        > please excuse me for calling your one-man $10,000 inference device ridiculous
        It’s about the real price of early microcomputers.
        Until the frontier stabilizes, this will be the cost of competitive local inference. Not pretending what we can run on a laptop will compete with a data centre.
      - brookst 8 hours ago ago
        How is it not impressive to be able to do something at quantity 1 for roughly the same price megacorps get at quantity 100,000?
        Try building a F1 car at home. I guarantee your unit cost will be several orders of magnitude higher than the companies who make several a year.
      - simonw a day ago ago
        Plenty of hobbies are significantly more expensive than that.
        [-]
        bigyabai a day ago ago
        The rallying cry of money-wasters the world over. "At least it's not avgas!"
        [-]
        seanmcdirmid a day ago ago
        Some people lose lots of money on boats, some people buy a fancy computer instead and lose less, although still a lot of, money.
      - rpdillon a day ago ago
        I mean, not really? Yeah, I pay to go to the movies and sit in a theater that they let me buy a ticket for, but that doesn't mean people that want to set up a nice home theater are ridiculous, they just care more about controlling and customizing their experience.
        [-]
        grim_io 7 hours ago ago
        Some would argue that the home theater is a superior experience to a crowded, far away movie theater where the person's head in front of you takes up a quarter of the screen.
        The same can't be said for local inference. It is always interior in experience and quality.
        A reasonable home theater pays for itself over time if you watch a lot of movies. Plus you get to watch shows as well, which the limited theater program doesn't allow.
        I can buy over 8 years of the Claude max $100 plan for the price of the 512GB M3 Ultra. And I can't imagine the M3 being great at this after 5 years of hardware advancement.
  - vonneumannstan a day ago ago
    Almost? Isn't a single h100 like 30k which is the bare minimum to run a big model?
floweronthehill a day ago ago
I believe local llms are the future. It will only get better. Once we get to the level of even last year's state of the art I don't see any reason to use chatgpt/anthropic/other.
We don't even need one big model good at everything. Imagine loading a small model from a collection of dozens of models depending on the tasks you have in mind. There is no moat.
[-]
- root_axis a day ago ago
  It's true that local LLMs are only going to get better, but it's not clear they will become generally practical for the foreseeable future. There have been huge improvements to the reasoning and coding capabilities of local models, but most of that comes from refinements to training data and training techniques (e.g. RLHF, DPO, CoT etc), while the most important factor by far remains the capability to reduce hallucinations to comfortable margins using the raw statistical power you get with massive full-precision parameter counts. The hardware gap between today's SOTA models and what's available to the consumer are so massive that it'll likely be at least a decade before they become practical.
- a day ago ago
  [deleted]
- nomel a day ago ago
  Secure/private cloud compute seems to be the obvious future, to me.
linux2647 20 hours ago ago
Unrelated but I really enjoyed the wavy text effect on “opinions” in the first paragraph
[-]
- frontsideair 15 hours ago ago
  Thank you, it was the integral part of the whole post!
atentaten a day ago ago
Every blog post or article about running local LLMs should include something about which hardware was used.
[-]
- frontsideair 15 hours ago ago
  Good point, let me add a quick note.
Olshansky a day ago ago
+1 to LM Studio. Helped build a lot of intuition.
Seeing and navigating all the configs helped me build intuition around what my macbook can or cannot do, how things are configured, how they work, etc...
Great way to spend an hour or two.
[-]
- deepsquirrelnet a day ago ago
  I also like that it ships with some cli tools, including an openai compatible server. It’s great to be able to take a model that’s loaded and open up an endpoint to it for running local scripts.
  You can get a quick feel for how it works via the chat interface and then extend it programmatically.
grim_io a day ago ago
It's a crazy upside-down world where the Mac Studio M3 Ultra 512GB is the reasonable option among the alternatives if you intend to run larger models at usable(ish) speeds.
TYPE_FASTER a day ago ago
Also see https://github.com/Mozilla-Ocho/llamafile.
mg a day ago ago
Is anyone working on software that lets you run local LLMs in the browser?
In theory, it should be possible, shouldn't it?
The page could hold only the software in JavaScript that uses WebGL to run the neural net. And offer an "upload" button that the user can click to select a model from their file system. The button would not upload the model to a server - it would just let the JS code access it to convert it into WebGL and move it into the GPU.
This way, one could download models from HuggingFace, store them locally and use them as needed. Nicely sandboxed and independent of the operating system.
[-]
- simonw a day ago ago
  Transformers.js (https://huggingface.co/docs/transformers.js/en/index) is this. Some demos (should work in Chrome and Firefox on Windows, or Firefox Nightly on macOS and Linux):
  https://huggingface.co/spaces/webml-community/llama-3.2-webg... loads a 1.24GB Llama 3.2 q4f16 ONNX build
  https://huggingface.co/spaces/webml-community/janus-pro-webg... loads a 2.24 GB DeepSeek Janus Pro model which is multi-modal for output - it can respond with generated images in addition to text.
  https://huggingface.co/blog/embeddinggemma#transformersjs loads 400MB for an EmbeddingGemma demo (embeddings, not LLMs)
  I've collected a few more of these demos here: https://simonwillison.net/tags/transformers-js/
  You can also get this working with web-llm - https://github.com/mlc-ai/web-llm - here's my write-up of a demo that uses that: https://simonwillison.net/2024/Nov/29/structured-generation-...
  [-]
  - mg a day ago ago
    This might be a misunderstanding. Did you see the "button that the user can click to select a model from their file system" part of my comment?
    I tried some of the demos of transformers.js but they all seem to load the model from a server. Which is super slow. I would like to have a page the lets me use any model I have on my disk.
    [-]
    - simonw a day ago ago
      Oh sorry, I missed that bit.
      I got Codex + GPT-5 to modify that Llama chat example to implement the "load from local directory" pattern. It appears to work.
      First you'll need to grab the checkout of the local model (~1.3GB):
      git lfs install git clone https://huggingface.co/onnx-community/Llama-3.2-1B-Instruct-q4f16
      Then visit this page: https://static.simonwillison.net/static/2025/llama-3.2-webgp... - in Chrome or Firefox Nightly.
      Now click "Browse folder" and select the folder you just checked out with Git.
      Click the confusing "Upload" confirmation (it doesn't upload anything, just opens those files in the current browser session).
      Now click "Load local model" - and you should get a full working chat interface.
      Code is here: https://github.com/simonw/transformers.js-examples/commit/cd...
      Here's the full Codex session that I used to build this: https://gist.github.com/simonw/3c46c9e609f6ee77367a760b5ca01...
      I ran Codex against the https://github.com/huggingface/transformers.js-examples/tree... folder and prompted:
      > Modify this application such that it offers the user a file browse button for selecting their own local copy of the model file instead of loading it over the network. Provide a "download model" option too.
      Then later:
      > Build the production app and then make it available on localhost somehow
      And:
      > Uncaught (in promise) Error: Invalid configuration detected: both local and remote models are disabled. Fix by setting `env.allowLocalModels` or `env.allowRemoteModels` to `true`.
      And:
      > Add a bash script which will build the application such that I can upload a folder called llama-3.2-webgpu to http://static.simonwillison.net/static/2025/llama-3.2-webgpu... and http://static.simonwillison.net/static/2025/llama-3.2-webgpu... will serve the app
      (Note that this doesn't allow you to use any model on your machine, but it proves that it's possible.)
      [-]
      - simonw a day ago ago
        Wrote this all up on my blog here, including a GIF demo showing how to use it: https://simonwillison.net/2025/Sep/8/webgpu-local-folder/
      - mg 16 hours ago ago
        Awesome!
        Bookmarked. I will surely try it out once FireFox or Chromium on Linux support WebGPU in their default config.
- SparkyMcUnicorn a day ago ago
  Yes. MLC's inference engine runs on WebGPU/WASM.
  https://github.com/mlc-ai/web-llm-chat
  https://github.com/mlc-ai/mlc-llm
  https://github.com/mlc-ai/web-llm
  [-]
  - mg a day ago ago
    Yeah, something like that, but without the WebGPU requirement.
    Neither FireFox nor Chromium support WebGPU on Linux. Maybe behind flags. But before using a technology, I would wait until it is available in the default config.
    Lets see when browsers will bring WebGPU to Linux.
    [-]
    - SparkyMcUnicorn a day ago ago
      This should be what you're looking for. It doesn't utilize the GPU, but WebGL support is in the TODOs.
      https://github.com/ngxson/wllama
      https://huggingface.co/spaces/ngxson/wllama
    - simonw a day ago ago
      Firefox Nightly on macOS now supports WebGPU, and the documentation says the Linux build supports it too.
- generalizations a day ago ago
  This is an in-browser llamacpp implementation: https://github.com/ngxson/wllama
  And related is the whisper implementation: https://ggml.ai/whisper.cpp/
- vonneumannstan a day ago ago
  This one is pretty cool. Compile the gguf of an OSS LLM directly into an executable. Will open an interface in the browser to chat. Can also launch an OpenAI API style interface hosted locally.
  Doesn't work quite as well on Windows due to the executable file size limit but seems great for Mac/Linux flavors.
  https://github.com/Mozilla-Ocho/llamafile
- adastra22 a day ago ago
  You don’t need a browser to sandbox something. Easier and more performant to do GOU pass through to a container or VM.
  [-]
  - 01HNNWZ0MV43FF a day ago ago
    Container or VM is a bigger commitment. VMs need root and containers need Docker group and something like docker-compose or a shell script or something.
    idk it's just like, do I want to run to the store and buy a 24-pack of water bottles, and stash them somewhere, or do I want to open the tap and have clean drinking water
- paulirish a day ago ago
  Beyond all the wasm/webgpu approaches other folks have linked (mostly in the transformers.js ecosystem), there's been a standardized API brewing since 2019: https://webmachinelearning.github.io/webnn-intro/
  Demos here: https://webmachinelearning.github.io/webnn-samples/ I'm not sure any of them allow you to select a model file from disk, but that should be entirely straightforward.
- samsolomon a day ago ago
  Is Open WebUI something like you are looking for? The design has some awkwardness, but overall it's incorporated a ton of great features.
  https://openwebui.com/
  [-]
  - mg a day ago ago
    No, I'm looking for an html page with a button "Select LLM". After pressing that button and selecting a local LLM from disk, it would show an input field where you can type your question and then it would use the given LLM to create the answer.
    I'm not sure what OpenWebUI is, but if it was what I mean, they would surely have the page live and not ask users to install Docker etc.
    [-]
    - tmdetect a day ago ago
      I think what you want is this: https://github.com/mlc-ai/web-llm
    - bravetraveler a day ago ago
      It's both what you want and not; the chat/question interface is as you describe, lack-of-installation is not. The LLM work is offloaded to other software, not the browser.
      I would like to skip maintaining all this crap, though: I like your approach
    - Jemaclus a day ago ago
      You should install it, because it's exactly what you just described.
      Edit: From a UI perspective, it's exactly what you described. There's a dropdown where you select the LLM, and there's a ChatGPT-style chatbox. You just docker-up and go to town.
      Maybe I don't understand the rest of the request, but I can't imagine a software where a webpage exists and it just magically has LLMs available in the browser with no installation?
      [-]
      - craftkiller a day ago ago
        It doesn't seem exactly like what they are describing. The end-user interface is what they are describing but it sounds like they want the actual LLM to run in the browser (perhaps via webgpu compute shaders). Open WebUI seems to rely on some external executor like ollama/llama.cpp, which naturally can still be self-hosted but they are not executing INSIDE the browser.
        [-]
        Jemaclus a day ago ago
        Does that even exist? It's basically what they described but with some additional installation? Once you install it, you can select the LLM on disk and run it? That's what they asked for.
        Maybe I'm misunderstanding something.
        [-]
        craftkiller a day ago ago
        Apparently it does, though I'm learning about it for the first time in this thread also. Personally, I just run llama.cpp locally in docker-compose with anythingllm for the UI but I can see the appeal of having it all just run in the browser.
        https://github.com/mlc-ai/web-llm https://github.com/ngxson/wllama
        [-]
        Jemaclus a day ago ago
        Oh, interesting. Well, TIL.
      - andsoitis a day ago ago
        > You should install it, because it's exactly what you just described.
        Not OP, but it really isn't what' they're looking for. Needing to install stuff VS simply going to a web page are two very different things.
- mudkipdev a day ago ago
  It was done with gemma-3-270m, I hope someone will post a link to it below
- vavikk a day ago ago
  Not browser but Electron. For the browser you would have to run a local nodejs server and point the browser app to use the local API. I use electron with nodejs and react for UI. Yes I can switch models.
- coip a day ago ago
  Have you seen/used the webGPU spaces?
  https://huggingface.co/docs/transformers.js/en/guides/webgpu
  eta: its predecessor was using webGL
  [-]
  - mg a day ago ago
    WebGPU is not yet available in the default config of Linux browsers, so WebGL would have been perfect :)
balder1991 16 hours ago ago
As someone who sometimes downloads random models to play around on my 16GB Mac Mini, I like his suggestions of models. I guess these are the best ones for their sizes if you get down to 4 or 5 worth keeping.
SLWW a day ago ago
The use of the word "emergent" is concerning to me. I believe this to be an... exaggeration of the observed effect. Depending on the perspective and the knowledge of the domain, this might seem to some ad emergent, however we saw equally interesting developments with more complex Markov chaining given the sheer lack of computational resources and time. What we are observing is just another step up that ladder, another angle to enumerate and pick the best token next in the sequence given the information revealed by the proceeding words. Linguistics is all about efficient, lossless data-transfer. While it's "cool" and very surprising.. I don't believe we should be treating it as somewhere between a spell-checker and a sentient being. People aren't simple heuristic models, and to imply these machines are remotely close is woefully inaccurate and will lead to further confusion and disappointment in the future.
cchance 4 hours ago ago
#1 thing they need to do is open up ANE for developers to properly access
jerryliu12 a day ago ago
My main concern with running LLMs locally so far is that it absolutely kills your battery if you're constantly inferencing.
[-]
- seanmcdirmid a day ago ago
  It really does. On the other hand, if you have a power outlet handy, you can inference on the plane even without a net connection.
noja a day ago ago
I really like On-Device AI on iPhone (also runs on Mac): https://ondevice-ai.app in addition to LM Studio. It has a nice interface, with multiple prompt integration, and a good selection of models. Also the developer is responsive.
[-]
- LeoPanthera a day ago ago
  But it has a paid recurring subscription, which is hard to justify for something that runs entirely locally.
  [-]
  - noja a day ago ago
    I am using it without one so far. But if they continue to develop it I will upgrade.
- gazpachotron a day ago ago
  [dead]
tpae a day ago ago
Check out Osaurus - MIT Licensed, native, Apple Silicon–only local LLM server - https://github.com/dinoki-ai/osaurus
[-]
- colecut 21 hours ago ago
  thank you
tolerance a day ago ago
DEVONThink 4’s support for local models is great and could possibly contribute to the software’s enduring success for the next 10 years. I’ve found it helpful for summarizing documents and selections of text, but it can do a lot more than that apparently.
https://www.devontechnologies.com/blog/20250513-local-ai-in-...
anArbitraryOne 7 hours ago ago
I still don't think MacOS is such a great idea
jasonjmcghee a day ago ago
I think the best models around right now that most people can fit some quantization on their computer if it's a apple silicon Mac or gaming PC would be:
For non-coding: Qwen3-30B-A3B-Instruct-2507 (or the thinking variant, depending on use case)
For coding: Qwen3-Coder-30B-A3B-Instruct
---
If you have a bit more vram, GLM-4.5-Air or the full GLM-4.5
[-]
- all2 a day ago ago
  Note that Qwen3 and Deepseek are hobbled in Ollama; they cannot use tools as the tool portion of the system prompt is missing.
  Recommendation: use something else to run the model. Ollama is convenient, but insufficient for tool use for these models.
  [-]
  - theshrike79 9 hours ago ago
    Could you give a recommendation that works instead of saying what doesn't work?
    [-]
    - simonw 8 hours ago ago
      Try LM Studio or llama-server: https://simonwillison.net/2025/Aug/19/gpt-oss-with-llama-cpp...
Damogran6 a day ago ago
Oddly, my 2013 MacPro (Trashcan) runs LLMs pretty well, mostly because 64Gb of old school RAM is, like, $25.
[-]
- frontsideair 15 hours ago ago
  I'm interested in this, my impression was that the newer chips have unified memory and high memory bandwidth. Do you do inference on the CPU or the external GPU?
  [-]
  - Damogran6 6 hours ago ago
    I don't, I'm a REALLY light user. smaller LLMs work pretty well. I used a 40gb LLM and it was _pokey_, but it worked, and switching them is pretty easy. This is a 12 core Xeon with 64Gb RAM...my M4 mini is....okay with smaller LLMs, I have a Ryzen 9 with a RTX3070ti that's the best of the bunch, but none of this holds a candle to people that spend real money to experiment in this field.
jftuga a day ago ago
I have a macbook air M4 with 32 GB. What LM Studio models would you recommend for:
* General Q&A
* Specific to programming - mostly Python and Go.
I forgot the command now, but I did run a command that allowed MacOS to allocate and use maybe 28 GB of RAM to the GPU for use with LLMs.
[-]
- frontsideair 15 hours ago ago
  This is the command probably:
```
  sudo sysctl iogpu.wired_limit_mb=184320
```
  Source: https://github.com/ggml-org/llama.cpp/discussions/15396
- DrAwdeOccarim a day ago ago
  I adore Qwen 3 30b a3b 2507. Pretty easy to write an MCP to let us search the web with Brave API key. I run it on my Macbook Pro M3 Pro 36 GB.
  [-]
  - theshrike79 9 hours ago ago
    What are you running it on that lets you connect tools to it?
    [-]
    - DrAwdeOccarim 8 hours ago ago
      LM Studio. I just vibe code the nodeJS code.
- balder1991 16 hours ago ago
  You’ll certainly find better answers on /r/LocalLlama in Reddit for this.
KolmogorovComp a day ago ago
The really though spot is finding a good model for your use case. I’ve a 16Gb MB and have been paralyzed by the many options. I’ve settle for a quantisied 14B Qwen for now, but no idea if this is a good idea.
[-]
- frontsideair 14 hours ago ago
  14B Qwen was a good choice, but it became outdated a bit and seems like the new version of 4B surpassed it in benchmarks somehow.
  It's a balancing game, how slow a token generation speed can you tolerate? Would you rather get an answer quick, or wait for a few seconds (or sometimes minutes) for reasoning?
  For quick answers, Gemma 3 12B is still good. GPT-OSS 20B is pretty quick when reasoning is set to low, which usually doesn't think longer than one sentence. I haven't gotten much use out of Qwen3 4B Thinking (2507) but at least it's fast while reasoning.
lawxls a day ago ago
What is the best local model for cursor style autocomplete/code suggestions? And is there an extension for vs code which can integrate local model for such use?
[-]
- kergonath a day ago ago
  I have been playing with the continue.dev extension for vscodium. I got it to work with Ollama and the Mistral models (codestral, devstral and mistral-small). I did not go much further than experimenting yet, but it looks promising, entirely local and mostly open source. And even then, it’s much further than I got with most other tools I tried.
coldtea 12 hours ago ago
>I also use them for brain-dumping. I find it hard to keep a journal, because I find it boring, but when you’re pretending to be writing to someone, it’s easier. If you have friends, that’s much better, but some topics are too personal and a friend may not be available at 4 AM. I mostly ignore its responses, because it’s for me to unload, not to listen to a machine spew slop. I suggest you do the same, because we’re anthropomorphization machines and I’d rather not experience AI psychosis. It’s better if you don’t give it a chance to convince you it’s real. I could use a system prompt so it doesn’t follow up with dumb questions (or “YoU’Re AbSoLuTeLy CoRrEcT”s), but I never bothered as I already don’t read it.
Reads like someone starting to get their daily drinks, already using them for "company" and fun, and saying "I'm not an alcoholic, I can quit anytime".
jokoon a day ago ago
I am still looking for a local image captioner, any suggestion which are the 3 easiest to use?
[-]
- DrAwdeOccarim a day ago ago
  Minstral small 3.2 Q4_K_M and Gemma 3 12b 4 bit are amazing. I run both in LM Studio on a Macbook Pro M3 Pro with 36GB of RAM.
  [-]
  - jokoon 7 hours ago ago
    can I call it from the command line?
OvidStavrica a day ago ago
By far, the easiest (open source/Mac) is with Pico AI Server with Witsy for a front end:
https://picogpt.app/
https://apps.apple.com/us/app/pico-ai-server-llm-vlm-mlx/id6...
Witsy:
https://github.com/nbonamy/witsy
...and you really want at least 48G RAM to run >24B models.
a-dub a day ago ago
ollama is another good choice for this purpose. it's essentially a wrapper around llamacpp that adds easy downloading and management of running instances. it's great! also works on linux!
[-]
- frontsideair 15 hours ago ago
  Ollama adding a paid cloud version made me postpone this post for a few weeks at least. I don't object them to make money, but it was hard to recommend a tool for local usage and make the first instruction to go to settings and enable airplane mode.
  Luckily llama.cpp has come a long way and was at a point that I could easily recommend as the open source option instead.
jus3sixty a day ago ago
An awful lot of Monday morning quarterback CEOs are here running their mouths about what Tim Cook should do or what they would do. Chill out with the extremely confident ignorance. Tim Cook brought Apple to a billion dollars in free cash he doesn’t need to ride the hype train.
Also let’s not forget they are first and foremost designers of hardware and the arms race is only getting started.
[-]
- j45 a day ago ago
  Not sure I can think of anything that is more performant per watt for LLMs than Apple Silicon.
  [-]
  - saagarjha 20 hours ago ago
    A datacenter GPU is going to be an order of magnitude more efficient.
techlatest_net a day ago ago
[dead]
curtisszmania 17 hours ago ago
[dead]
wer232essf a day ago ago
[flagged]
wer232essf a day ago ago
[flagged]
wer232essf a day ago ago
[flagged]
[-]
- saagarjha 20 hours ago ago
  Your bot is broken my guy