18 comments

  • 34679 an hour ago

    Don't use LLMs for financial workflows. Use them to create software for financial workflows. Software doesn't "drift".

  • raffisk 3 hours ago

    Empirical study on LLM output consistency in regulated financial tasks (RAG, JSON, SQL). Governance focus: smaller models (Qwen2.5-7B, Granite-3-8B) hit 100% determinism at T=0.0, passing audits (FSB/BIS/CFTC), vs. larger models like GPT-OSS-120B at 12.5%. The gaps are huge (87.5%, p<0.0001, n=16) and survive multiple-testing corrections.

    Caveat: Measures reproducibility (edit distance), not full accuracy—determinism is necessary for compliance but needs semantic checks (e.g., embeddings to ground truth). Includes harness, invariants (±5%), and attestation.

    Thoughts on the inverse size-reliability relationship? Planning a follow-up with accuracy metrics rather than just reproducibility.
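
    A rough sketch of the kind of reproducibility metric involved (exact-match rate against the modal output; the function name is mine, and the study uses edit distance rather than strict equality, but the idea is the same):

```python
from collections import Counter

def determinism_rate(outputs):
    """Fraction of runs whose output exactly matches the modal output."""
    _, count = Counter(outputs).most_common(1)[0]
    return count / len(outputs)

# 16 repeated runs of the same text-to-SQL prompt at T=0.0
# (strings here are illustrative, not from the study):
runs = ["SELECT SUM(amount) FROM trades"] * 14 + \
       ["SELECT sum(amount) FROM trades"] * 2
print(determinism_rate(runs))  # 0.875
```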

    • throwdbaaway 2 hours ago

      It is the reasoning. During the reasoning process, the top few tokens have very similar or even same logprobs. With gpt-oss-120b, you should be able to get deterministic output by turning off reasoning, e.g. by appending:

          {"role": "assistant", "content": "<think></think>"}
      
      Of course, the model will be less capable without reasoning.
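
      Concretely, the prefill looks something like this in an OpenAI-compatible chat payload (model name and prompt here are just placeholders):

```python
# Sketch of the empty-reasoning prefill trick, assuming an
# OpenAI-compatible chat completions payload:
payload = {
    "model": "gpt-oss-120b",
    "temperature": 0.0,
    "messages": [
        {"role": "user", "content": "Summarize Q3 trade exposure as JSON."},
        # Pre-filled empty reasoning block: the model continues from
        # here and skips its chain-of-thought.
        {"role": "assistant", "content": "<think></think>"},
    ],
}
print(payload["messages"][-1]["content"])  # <think></think>
```
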

      • raffisk an hour ago

        Good call: reasoning-token variance is likely a factor, especially with logprob clustering at T=0. Your <think></think> workaround would work, but we need reasoning intact for financial QA accuracy.

        Also, the Mistral Medium model we tested had ~70% deterministic outputs across the 16 runs for the text-to-SQL generation and JSON summarization tasks, and it had reasoning on. Llama 3.3 70B started to degrade and doesn't have reasoning. But it's a relevant variable to consider.

    • doctorpangloss 2 hours ago

      “Determinism is necessary for compliance”

      Says who?

      The stuff you comply with changes in real time. How’s that for determinism?

      • raffisk an hour ago

        Author here. Fair point, regs are a moving target. But FSB/BIS/CFTC explicitly require reproducible outputs for audits (no random drift in financial reports). At the very least, determinism gives you traceability even when the rules update.

        Most groups I work with stick to traditional automation/rules systems, but top-down mandates are pushing them toward frontier models for general tasks—which then get plugged into these workflows. A lot stays in sandbox, but you'd be surprised what's already live in fin services.

        The authorities I cited (FSB/BIS/CFTC) literally said just last month that AI monitoring is "still at an early stage": https://www.fsb.org/2024/11/the-financial-stability-implicat...

        Curious how you'd tackle regs that change in real time?

      • ulrashida 32 minutes ago

        Please give an example of a statutory compliance item that "changes in real time".

        That's not the way regulations work. Your compliance is measured against a fixed version of legislation.

      • nomel an hour ago

        Also, what happens if you add a space to the end of the prompt? Or change a 12.00 to a 12.000?

    • colechristensen 2 hours ago

      Outputs not being deterministic at temperature = 0 doesn't match my understanding of what "temperature" means; I thought T=0 was, by definition, deterministic.

      Is this perhaps inference implementation details somehow introducing randomness?
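
      One way implementation details can do this: floating-point addition isn't associative, so a different reduction order (e.g. from dynamic batching or a different kernel schedule) can flip the argmax between two near-tied logits. A toy illustration (numbers chosen to make the effect visible, not taken from any real model):

```python
# Floating-point addition isn't associative: the same per-element
# contributions summed in a different order yield a different logit.
contribs = [1e16, 1.0, -1e16]

logit_a_order1 = (contribs[0] + contribs[1]) + contribs[2]  # 0.0
logit_a_order2 = (contribs[0] + contribs[2]) + contribs[1]  # 1.0

# If a rival token's logit lies between the two values, greedy (T=0)
# decoding picks a different token depending on reduction order:
logit_b = 0.5
print("A" if logit_a_order1 > logit_b else "B")  # B
print("A" if logit_a_order2 > logit_b else "B")  # A
```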

  • measurablefunc 3 hours ago

    This is because these things are Markov chains. You cannot expect consistent results and outputs.

    • SrslyJosh 2 hours ago

      Using an LLM for a "financial workflow" makes as much sense as integrating one with Excel. But who needs correct results when you're just working with money, right? ¯\_(ツ)_/¯

      • mirekrusin 2 hours ago

        Humans are non-deterministic, yet they use Excel, work with financial workflows, and deal with money.

        • thfuran an hour ago

          And because one system that aims to achieve deterministic operation can’t quite perfectly do so, we might as well abandon any attempt at determinism?

        • Terr_ an hour ago

          "Humans make math errors, yet they do math anyway, therefore this calculator that makes errors is also OK."

          What do you call the fallacy where the universe is imperfect, therefore nobody can have higher standards for anything?

          Mankind has spent literal centuries observing deficiencies and faults in human bookkeeping and calculation, constantly trying to improve it with processes and machinery. There's no good reason to suddenly stop caring about those issues simply because the latest proposal is marketed as "AI".

    • ACCount37 2 hours ago

      Did you actually read what the paper was about before leaving a low-quality comment?