5 comments

  • vessenes 24 minutes ago

    As a user of a lot of coding tokens I’m most interested in latency - these numbers are presumably for heavily batched workloads. I dearly wish Claude had a Cerebras endpoint.

    I’m sure I’d use more tokens because I’d get more revs, but I don’t think token usage would increase linearly with speed: I need time to think about what I want to do and what’s happened or is proposed. But I feel like I would be able to stay in flow state if the responses were faster, and that’s super appealing.

  • kingstnap an hour ago

    Impressive performance work. It's interesting that you still see 40+% perf gains like this.

    Makes you think that you will continue to see the costs for a fixed level of "intelligence" dropping.

    • whoevercares 18 minutes ago

      Absolutely. LLM inference is still a greenfield — things like overlap scheduling and JIT CUDA kernels are very recent. We’re just getting started optimizing for modern LLM architectures, so cost/perf will keep improving fast.

  • androiddrew 24 minutes ago

    Now all we need is better support for AMD GPUs, both CDNA and RDNA types.

  • danielhanchen an hour ago

    Love vLLM!