I use my Meta glasses heavily on vacation, and occasionally elsewhere. The latest Llama isn't as smart as OpenAI's models, so after a few wrong answers I gave up on day-to-day queries.
That said, the scenarios they are good at, they are really good at. I was traveling in Europe and the glasses were translating engravings on castle walls, translating and summarizing historical plaques, and just generally letting me know what was going on around me.
Yeah, I'm in charge of trying some AI experiments at my company and I look around the landscape for a little inspiration ... is everything just a wrapper on ChatGPT or whatever?
I can do that too but it's also not very useful and I'm just shipping data off to some AI company too. Don't know if I want to feed client data elsewhere like that.
> There is like one or two really clever uses I've seen - disappointingly, one of them was Jira. The internal jargon dictionary tool was legitimately impressive. Will it make any more money? Probably not.
Sounds like Microsoft 365 Copilot at my org. Sucks at nearly everything, but it actually makes a fantastic search engine for emails, Teams convos, SharePoint docs, etc. Much better than Microsoft's own global search stuff. Outside of coding, that's the only other real-world use case I've found for LLMs - "get me all the emails, chats, and documents related to this upcoming meeting" - and it's pretty good at that.
Though I'm not sure we should be killing the earth for better search, there are probably other, better ways to do it.
Agreed - 95% of the questions I ask Copilot, I could answer myself by searching emails, Teams messages and files - BUT Copilot does a far far better job than me, and quicker. I went from barely using it, to using it daily. I wouldn't say it is a massive speed boost for me, but I'd miss it if it was taken away.
Then the other 5% is the 'extra' it does for me, and gets me details I wouldn't have even known where to find.
But it is just fancy search for me so far - but fancy search I see as valuable.
My favorite copilot use is when I join a MS Teams meeting a few minutes late I can ask copilot: what have I missed? It does a fantastic job of summarizing who said what.
They also seem to be coming down in power usage substantially, at least for inference. There are pretty good models that can run on laptops now, and I still very much think we're in the Model T phase of this technology, so I expect further efficiency refinements. It also seems like they have recently hit a "cap" on the intelligence gains models get from additional raw compute.
The trendline right now makes me wonder if we'll be talking about "dark datacenters" in the future the same way we talked about dark fiber after the dot com bubble.
This is an eye-opening sentence. It's quite hard to imagine how to live one's daily life with "few questions to ask." Perhaps this is a neurodivergent thing?
I meant mostly in the context of daily life tasks as a person with ADHD - so maybe a hair neurodivergent.
My issue isn't that I don't wonder things, it is that indulging the wonder would interrupt me from accomplishing almost anything. I would not be very highly functioning if I allowed non-critical thoughts to interrupt the flow.
When outside of trying to do specific things and in less focus-dependent tasks, I absolutely wonder and google and get lost on weird random topics.
I think I probably could have worded it more as "I rarely have questions worth knowing the answer to", where the cost of knowing answers is tied to the following rabbit holes and delays/forgotten tasks.
I'm autistic and I probably ask many more questions than most people.
I would also argue that ND people seem to be the heavier AI users, at least in my experience. It's a bit like the stereotypical 'wikipedia deep dive' but 10x.
I think this highlights an interesting point: Sensible use cases are unsexy. But the pushers want stuff, however unrealistic, that lends itself to breathless hype that can be blown out of proportion.
It's very much enough for drones tho... all you need is a tiny Jensen's chip, moped engine, some boom boom play-doh and you're ready to rock. No remote control needed.
Why not though? Current autopilot just attempts to keep the plane on course/speed/altitude. Some can go further with auto-landing, but that's for extreme emergency use only. I could see airlines seeking any fuel savings possible by allowing AI to test slight changes to altitude/speed/course to conserve fuel based on live inputs.
The mathematics that LLMs and machine learning are based on started off being developed for aircraft decades ago. It’s called “control theory”. So we had “AI” on airplanes first. Specifically we had adaptive control algorithms explicitly because of the problems introduced by fuel levels changing during the course of a flight.
In physics, we typically start with a mass-spring-damper system representation. Elementary physics and engineering typically assume things like constant mass. You develop all sorts of dynamical models and intuition with that assumption. But an aircraft burns fuel as it flies, meaning its mass changes during the course of the flight. Thus your models drift and you have to adapt to that.
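For concreteness, the standard textbook form (generic notation, not any specific flight-control model): with constant mass the plant is

    m\ddot{x}(t) + c\dot{x}(t) + kx(t) = F(t)

whereas on an aircraft the mass term becomes time-varying,

    m(t)\,\ddot{x}(t) + c\dot{x}(t) + kx(t) = F(t)

so controller gains tuned once for a fixed m slowly become wrong as fuel burns off, which is exactly the gap adaptive control fills.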
Pilots would have tomes they'd have to switch between at various points of the journey and adaptive control algorithms alleviated this. They still needed the actual reference guide in the cockpit as a risk mitigation.
The difference from that decades-old application is that you don't need a billion-parameter model to do flight control. Most people do not understand the historic development of these techniques. The foundation of them has been around for a while. What we have done with the newest batch of "AI" is massively scale them up.
> "“Every single Monday was called 'AI Monday.' You couldn’t have customer calls, you couldn’t work on budgets, you had to only work on AI projects.”"
> "Vaughan saw that his team was not fully on board. His ultimate response? He replaced nearly 80% of the staff within a year"
Being that this is Fortune magazine, it makes sense that they're portraying it this way, but reading between the lines a little bit, it seems like the staff knew what would happen and weren't keen on replacing themselves.
5% are succeeding. People are trying AI for just about everything right now. 5% is pretty damn good, when AI clearly has a lot of room to get better.
The good models are quite expensive and slow. The fast & cheap models aren't that great - unless very specifically fine-tuned.
Will it get better enough that the success rate of pilots grows from 5% to 25% in 5 years, or 20? Who knows, but it almost certainly will grow.
It's hard to tell how much better the top foundation models will get over the next 5-10 years, but one thing that's certain is that the cost will go down substantially for the same quality over that time frame.
Not to mention all the new use cases people will keep trying over that timeline.
If in 10-years time, AI is succeeding in 2x as many use cases - that might not justify current valuations, but it will be a much better future - and necessary if we're planning on having ~25% of the population being retired / not working by then.
Without AI replacing a lot of jobs, we're gonna have a tough time retiring all the people we promised retirements to.
> 5% is pretty damn good, when AI clearly has a lot of room to get better.
That depends if the AI successes depended much on the leading edge of LLM developments, or if actually most of the value was just "low hanging fruit".
If the latter, that would imply the utility curve is levelling out, because new developments are not proving instrumental enough.
I'm thinking of an S curve: slow improvements through the 2010s, then a burst of activity as the tech became good enough to do something "real", followed by more gradual wins in efficiency and accuracy.
The MIT NANDA lab seems to have a link rot problem.
Their cardinal code repo is also 404. The NANDA Lab also does coding, their publication at AAAI 2025 is titled: "CoDream: Exchanging dreams instead of models for federated aggregation with heterogeneous models" [1]. However, the link to the Github repo is broken. Fascinating paper, sad about the missing code.
> I heard that SAP has an 80-90% deployment failure rate
Something to keep in mind is that ERP "failure" is frequently defined as went over budget or over time, even if it ultimately completed and provided the desired functionality.
It's a much smaller percentage of projects that are either cancelled or went live and significantly did not function as the business needed.
Not if every manufacturing company in the world decided to use your software anyway.
ERP rollouts can "fail" for lots of reasons that aren't to do with the software. They are usually business failures. Mostly, companies end up spending so much on trying to endlessly customize it to their idiosyncratic workflows that they exceed their project budgets and abandon the effort. In really bad cases like Birmingham they go live before actually finishing setup, and then lose control of their books and have to resort to hiring people to do the admin manually.
There's a saying about SAP: at some point gaining competitive advantage in manufacturing/retail became all about who could make SAP deployment a success.
This is no different to many other IT projects, most of them fail too. I think people who have never worked in an enterprise context don't realize that; it's not like working in the tech sector. In the tech industry if a project fails, it's probably because it was too ambitious and the tech itself just didn't work well. Or it was a startup whose tech worked, but they couldn't find PMF. But in normal, mature, profitable non-tech businesses a staggering number of business automation projects just fail for social or business reasons.
AI deployments inside companies are going to be like that. The tech works. The business side problems are where the failures are going to happen. Reasons will include:
• Not really knowing what they want the AI to do.
• No way to measure improved productivity, so no way to decide if the API spend is worth it.
• Concluding the only way to get a return is entirely replace people with AI and then having to re-hire them because the AI can't handle the last 5% of the work.
• Non-tech executives doing deals to use models or tech stacks that aren't the right kind or good enough.
Not if most of those failures are medium sized businesses with <1000 employees and your successes include a majority of the world's largest corporations that sell goods.
I think you're on the right track here. Most technology pilots fail. As long as risk/investment is managed appropriately, this is healthy. This seems to follow from Sturgeon's Law... 90% of everything is crap [0].
> Despite the rush to integrate powerful new models, about 5% of AI pilot programs achieve rapid revenue acceleration; the vast majority stall, delivering little to no measurable impact on P&L.
This summer, I built two very sophisticated pieces of software. A financial ledger to power accrual accounting operations and a code generation framework that scaffolds a database from a defined data model to the frontend components and everything in between.
I used ChatGPT substantially. I'm not sure how long it would have taken without generative AI, but in reality, I would have just given up out of frustration or exhaustion. From the outside, it would appear to any domain expert that at least three other people worked on these, given the pace at which they got completed.
The completion of those two was a seminal moment for me. I can't imagine how anyone, in any field of information systems, is not multiples more effective than they were five years ago. That directly affects a P&L, and I can't think of anything in my career that is even remotely close in magnitude.
I don't know what encapsulates an AI pilot in these orgs, and I'm sure they are massively more complex than anything I've done. But to hear 95% of these efforts don't have a demonstrable effect is just wild.
> Did several domain experts tell you this or are you making it up?
It's an assertion among eight other engineers on the project with ~15 years of experience in the domain. They are domain experts. This part isn't up for debate.
> Generic tools like ChatGPT excel for individuals because of their flexibility, but they stall in enterprise use since they don’t learn from or adapt to workflows, Challapally explained.
Maybe I misunderstood this, but I took this to mean that people inside enterprises are struggling using tools like ChatGPT. They do point out that perhaps the tools are being deployed in the wrong areas:
> The data also reveals a misalignment in resource allocation. More than half of generative AI budgets are devoted to sales and marketing tools, yet MIT found the biggest ROI in back-office automation—eliminating business process outsourcing, cutting external agency costs, and streamlining operations.
But I've seen some amazing automation done in sales and marketing that directly affected sales efficiency and reduced sales and marketing expenses.
“AI pilots” in the article refers to developing AI-based tools, not to using AI for software development. These projects have a 95% failure rate of successfully deploying the AI tool being developed into production.
Regarding use of AI in software development (which is not what the article is about), the proof of the pudding isn’t in greenfield projects, it’s in longer-term software evolution and legacy code. Few disagree that AI saves time for prototyping or creating a first MVP.
You are correct. As I pointed out in another reply, I misinterpreted this part:
> Generic tools like ChatGPT excel for individuals because of their flexibility, but they stall in enterprise use since they don’t learn from or adapt to workflows, Challapally explained.
I didn't read the actual report (and probably won't), so I figured "AI pilots" _could_ (and honestly, should!) include the deployment of models to assist in any and all work (not necessarily even just coding - I just used it as an example).
> But to hear 95% of these efforts don't have a demonstrable effect is just wild.
Why tho? You used AI to make some software, but did you use AI to achieve rapid revenue acceleration?
That you used AI to build software seems tangential to whether it can increase revenues. Over the years, we've seen many technologies that didn't deliver on promises of rapidly increasing revenues despite being useful for creating software (cough OOP cough), so this new one failing to live up to expectations isn't surprising. Actually given the history of technologies that over promise and under deliver on massive hype, disappointment should be the null hypothesis.
We've talked with a ton of AI companies and I was surprised how much of the challenges were the usual challenges in any project. Just amplified by the rush to do AI right now, but I haven't seen anything as bad as "You couldn’t have customer calls, you couldn’t work on budgets, you had to only work on AI projects.”
This is proof LLMs are viable and productive in my opinion. The baseline rate for business failure over 5 years is around 90%, so they say. With how much hype surrounds LLM wrapper startups this is still an astounding amount of novel business model creation.
At this rate, how is it better than pure random chance?
The article mentions 19-20 year old founders, focused on solving single user problems, were the successes.
The sample size is 300 public AI deployments and an undisclosed number of private in-house AI projects. And the survey seems to only consider business applications, as compared with end-user applications like media and software. That's significant but not definitive.
Isn't it more likely that these were existing problems with low-hanging fruit, perhaps with unpopular answers, that could be solved by leaning on "AI" - and that "AI" wasn't really the key to success?
This resonates. Upskilling to AI tools is perhaps the biggest problem of our day. One idea we have to tackle this problem is to bring onboarding/learning directly into the user's work environment, track struggles and offer targeted support, and create continuous feedback loops. If anyone has faced challenges with increasing activation and retention of users on pilots (or external-facing products), I'd love to chat and see how we can help.
I think for a long time, cutting corners so that the number can go up next quarter has worked surprisingly well. Genuinely, I don't think a lot of corporations view offering a better product as a viable means of competing in the 2025 marketplace.
For them, AI is not the next industrial revolution, it's the next overseas outsourcing; AI isn't a way to bring new value to customers, it's a way to bring roughly the same value (read worse) but at a much cheaper cost to them. If they get their way, everything will get worse, while they make more money. That's the value proposition at play here.
Because for the typical office - documents are strewn about on random network drives and are not formatted similarly. This, combined with the inability to nail down 100% accuracy on even just internal doc search, is just too much to overcome for non-tech-industry offices. My office is mind blown if I use Gemini to extract data from a PDF and convert it to an .xlsx or .csv
As a technically minded person but not a comp sci guy, refining document search is like staring into a void and every option uses different (confusing) terminology. This makes it extra difficult for me to both do my regular job AND learn the multiple names/ways to do the exact same thing between platforms.
The only solution that has any reliability for me so far is Gemini instances where I upload only the files I wish to search and just keep it locked to a few questions per instance before it starts to hallucinate.
My attempt at RAG search implementation was a disaster that left me more confused than anything.
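For what it's worth, the retrieval core of a RAG setup is smaller than most vendor docs make it look: embed the document chunks, embed the question, take the nearest chunks and paste them into the prompt. A minimal sketch in Python (the sentence-transformers model and the chunking are assumptions, not a recommendation):

    # Minimal retrieval sketch: embed chunks once, embed the query, return the
    # closest chunks to include in the LLM prompt as context.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")  # arbitrary small embedding model

    def build_index(chunks):
        # chunks: list of strings, e.g. paragraphs split out of your documents
        return np.asarray(model.encode(chunks, normalize_embeddings=True))

    def search(query, chunks, vecs, k=3):
        q = model.encode([query], normalize_embeddings=True)[0]
        scores = vecs @ q                    # cosine similarity (vectors are unit length)
        top = np.argsort(-scores)[:k]
        return [chunks[i] for i in top]

    # The retrieved chunks then go into the prompt, e.g.
    # "Answer using only the following excerpts:\n" + "\n---\n".join(search(q, chunks, vecs))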
Because you mentioned the use case specifically, I wanted to point you to the fact that Excel has been able to convert images to tables for a while now. Literally screenshot a table from your PDF and it will convert to table. Not trying to diminish any additional capabilities you're getting from Gemini, but this screenshot to table feature has been huge for my finance team.
There are very few use cases at companies where you need to generate something. You want to work with the company's often very private disparate data (with access controls etc.) You wouldn't even have enough data to train a custom LLM, much less use a generic one.
In my experience, LLMs get you 80% of the way to a solution almost immediately, but that last 20% - missing knowledge, data, or accuracy - is a complete tar pit and will wreck adoption. Especially since many vendors are selling products that are wrappers and provide generic, non-customised solutions. I hear the same from others doing trials with various AI tools as well.
Any consumer facing AI project has to contend with the fact that GenAI is predominantly associated with "slop." If you're not actively using an AI tool, most of your experience with GenAI is seeing social media or Youtube flooded with low quality AI content, or having to deal with useless AI customer support. This gives the impression that AI is just cheap garbage, and something that should be actively avoided.
I think one reason for this is that LLMs are sort of maximally, if accidentally, designed to fuck up our brains. Despite all the advancements in the last five years, I see them as still, fundamentally, text transformation machines which have only a very limited sort of intelligence. Yet because nothing in history has been able to generate language except humans, most of us are not prepared to make rational judgements about their capabilities, and those of us that may be also often fail to do so.
The fact that we live in an era where tech people have been so investor pilled that overstating the capabilities of technology is basically second nature does not help.
I'm arriving at the conclusion that deploying LLMs is most suitable in areas where the cost of false positives and, crucially, of false negatives is low.
If you cannot tolerate false negatives, I don't see how you get around the inaccuracy of LLMs. As long as you can spot false positives and their rate is sufficiently low, they are merely an annoyance.
I think this is a good consideration before starting a project leveraging LLMs.
I agree, and it's why I think AI is a good $50 billion industry but not a $5 trillion industry.
I completely agree. These are useful in fuzzy cases but we live in a fuzzy world. Most things are fuzzy and nothing is completely true or completely false.
If I as a human deploy code, it is not certain that it necessarily works - just like with LLMs. The extent is different however.
100%. Where we are having a lot of success is in processes that require somewhat repeatable fuzzy processing, which before could only be performed by people.
The cool thing is that, since LLMs are comparatively cheap, I can afford to run the same process a few times to get a sense of confidence in the response.
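To make that concrete, a rough sketch of what "run it a few times and compare" can look like, assuming an OpenAI-style chat API (the client, model name and threshold here are placeholders, not a prescription):

    # Ask the same question several times and keep the majority answer;
    # disagreement between runs is a cheap signal to route the item to a human.
    from collections import Counter
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def ask_with_consensus(prompt, n=5, min_agreement=0.8, model="gpt-4o-mini"):
        answers = []
        for _ in range(n):
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                temperature=1.0,  # keep some randomness so runs are not identical
            )
            answers.append(resp.choices[0].message.content.strip())
        best, count = Counter(answers).most_common(1)[0]
        confidence = count / n
        if confidence < min_agreement:
            return None, confidence  # not confident enough; flag for review
        return best, confidence

This only really works when the answers are short and canonical (labels, IDs, extracted values); free-form text rarely matches verbatim across runs.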
In our latest project, the client expressed that our AI aided process was 11 times faster, and much more accurate than their previous process.
Have inaccuracies been an issue for any of the systems you have developed using LLMs? I hear this complaint quite a bit, but it does not align with my experience. Definitely, one-shotting a chatbot around an esoteric problem introduces possible inaccuracies. If I get an LLM to interrogate a PDF or other document, that error rate drops significantly and is mostly on the part of the structuring process, not the LLM.
Genuinely curious what others have experienced but specifically those that are using LLMs for business workflows. It is not to say any system is perfect but for purpose driven data pipelines LLMs can be pretty great.
Yes I've seen issues with both, but in part what's tricky about false negatives is also that you don't necessarily realise they are there. In the systems I've worked on we've made it simple for operators to verify the work the LLM has done, but this only guards against false positives, which are less problematic.
I've had pretty good success using LLMs for coding and in some ways they are perfect for that. False positives are usually obvious and false negatives don't matter because as long as the LLM finds a solution, it's not a huge deal if there was a better way to do it. Even when the LLM cannot solve the problem at all, it usually produces some useful artifacts for the human to build on.
That’s fair and I typically have utilized LLM workflows where I believe the current gen of models shine. Classifications, data structuring, summarization, etc.
> as long as the LLM finds a solution, it's not a huge deal if there was a better way to do it
It might not matter short term, but midterm such debt becomes a huge burden.
I don't really track issues, as I don't need to. Just a recent example: "please extract the tabular data from this visual" - and the model had incorrectly aligned records in one column, so the IDs were off by 1 in the data.
I'm sure in 95% of cases it gets it right, but it didn't this time, and I'm not sure how to actually work around that fact.
Not an attack on your experience at all! I would definitely counter that multimodal models are still error-prone and much better output is achieved using a tool like Textract and then an LLM on the output data.
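A rough sketch of that split, as I understand the suggestion: let a dedicated OCR service pull the raw text out first, and only then hand plain text to the LLM for structuring (boto3/Textract plus the model and prompt below are illustrative only):

    # Two-stage pipeline: OCR with AWS Textract, then an LLM over the extracted text.
    import boto3
    from openai import OpenAI

    def ocr_lines(path):
        textract = boto3.client("textract")
        with open(path, "rb") as f:
            resp = textract.detect_document_text(Document={"Bytes": f.read()})
        # Keep only full text lines, in the order Textract returns them
        return "\n".join(b["Text"] for b in resp["Blocks"] if b["BlockType"] == "LINE")

    def structure_as_csv(text):
        client = OpenAI()
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model
            messages=[{"role": "user",
                       "content": "Convert the following extracted lines into CSV, "
                                  "keeping IDs aligned with their rows:\n\n" + text}],
        )
        return resp.choices[0].message.content

    # csv_text = structure_as_csv(ocr_lines("invoice.png"))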
I asked an LLM to guide me through a Salesforce process last week. It gave me step-by-step instructions, about 50% of which were fine while the others referenced options that didn't exist in the system. So I followed the steps until I got to a wrong one, then told it that was wrong, at which point it acknowledged the mistake and gave me different instructions. After a few cycles of that and some trial-and-error, I had a working process.
It probably did save me some time, so I'd call it a mild success; but it didn't save a lot of time, and I only succeeded in the end because I know Salesforce pretty well and was just inexperienced at this one area, so I was able to see where it was probably going off the rails. Someone new to Salesforce would have been hopelessly lost by its advice.
It's understandable that an LLM wouldn't be very good at Salesforce, because there's a lot of bad information in the support forums out there, and the ways of doing things in it have changed multiple times over the years. But that's true of a lot of systems, so it's not an excuse, just a symptom of using LLMs that's probably not going to change.
I'm working on some AI projects and I'm building in a "what just happened" kind of interface so folks understand whether the result is in fact what they wanted.
Management types seem baffled by the idea that we would want this, even if they come around the next hour and say "hey, a user did something, can you tell me what happened".
Like guys ... it's not 100%...
> The data also reveals a misalignment in resource allocation. More than half of generative AI budgets are devoted to sales and marketing tools, yet MIT found the biggest ROI in back-office automation—eliminating business process outsourcing, cutting external agency costs, and streamlining operations.
Makes sense. The people in charge of setting AI initiatives and policies are office people and managers who could be easily replaced by AI, but the people in charge are not going to let themselves be replaced. Salesmen and engineers are the hardest to replace, yet they aren't in charge, so they get replaced the fastest.
I think this is being overly complimentary to AI. I think the most obvious reason is that for almost all business use cases it's not very helpful. All these initiatives have the same problem: staff asking 'how can this actually help me?', because they can't get it to help them with anything other than polishing emails, polishing code, and writing summaries, which is not what most people's jobs are. Then you have to proofread all of this, because AI makes a lot of mistakes and poor assumptions, on top of hallucinations.
I don't think Joe and Jane worker are purposely not using it to protect their jobs - everyone wants ease at work - it's just that these LLM-based AIs don't offer much outside of some use cases. AI is vastly over-hyped and now we're in the part of the hype cycle where people are more comfortable saying to power, "This thing you love and think will raise your stock price is actually pretty terrible for almost all the things you said it would help with."
AI has its place, but it's not some kind of universal mind that will change everything and be applicable in significant and fundamentally changing ways outside of some narrow use cases.
I'm on week 3 of making a video game (something I've never done before) with Claude/ChatGPT, and once I got past the 'tutorial level' design, these tools really struggle. I think even where an LLM would naturally be successful (structured logical languages), it's still very underwhelming. I think we're just seeing people push back on hype and feeling empowered to say "This weird text autogenerator isn't helping me."
3D apps are particularly bad for AI. The LLMs are fantastic at web apps that produce an HTML DOM. But they suck at generating code for a 3D app that needs rendering, game logic, physics and similar stuff. All of that is much more complicated than a DOM. Plus, there is 100x the amount of training data for web apps. It is similarly harder to test 3D apps. Testing web code is glorious. You can access the UI via the DOM, execute events, and then check the DOM for success. None of that is possible in 3D, where there is just an image and a mouse, and no way to find and push a button or check the results. A few of the LLM IDEs allow you to add images, which could really help cross this gap, but most do not, and those that do are not designed to be able to detect rendering artifacts, or detect if a given object is in the right place.
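As a concrete example of that "drive the UI, then check the DOM" loop, roughly what a web test looks like (Playwright here; the URL and selectors are made up):

    # A browser test can click a real button and assert on the resulting DOM.
    # There is no equivalent handle to grab inside a rendered 3D canvas.
    from playwright.sync_api import sync_playwright

    def test_checkout_flow():
        with sync_playwright() as p:
            browser = p.chromium.launch()
            page = browser.new_page()
            page.goto("https://example.com/cart")   # hypothetical page
            page.click("#checkout")                  # drive the UI via the DOM
            assert page.text_content(".order-status") == "Order placed"
            browser.close()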
Part of it is that the bosses often don't know what they want, so they leave the details up to marketing or whoever, so replacing marketing or whoever with AI would mean figuring out what they want. The boss can tell marketing, "Make a brochure for new product ABC," and marketing can run with that and present him with a mock-up, he can make a couple revisions, they shine it up based on those, and then they're done. To replace them completely with AI, he would have to provide a lot more guidance and it would take more iterations to get a correct result that he likes. It wouldn't be completely unlike the current process, but it would demand more of him, which wouldn't make him happy.
Last week I was talking to my boss about a project I've been working on for him, and he asked whether AI could help me with it to save time. I pointed out that a lot of the holdup in the project has been his not knowing exactly what he wants (because he's not sure what the software we're working with can do until I do it and show it to him), and an AI can't tell him what he wants any more than I can. Sometimes you just have to do the work, and technology can't help you.
There is a reason why sales and marketing is first. It has to do with hallucination.
People have figured out that even if you mess up sales/support/marketing, worst case you apologize and give a gift coupon. And then there is also the verbose nature of LLMs, which makes them better suited to writing marketing copy etc.
On business process outsourcing like customer support, a lot of companies are using LLMs, so that part is unclear to me.
Other BPO processes are accounting & finance, IT, human resources, etc. And while companies can take that hallucination risk with customer-facing work, they see it as a serious risk in these functions. If, for example, accounting and finance operations get messed up due to AI hallucination, companies will be in real hot water. Same goes for other back-office functions like HR, compliance, etc. So, most likely this statement is just hogwash.
> MIT found the biggest ROI in back-office automation
Can't find any source for this, even after searching on Google. As someone who knows a bit about this area, I don't find it very believable. Compared to humans, AI struggles in places where a fixed structure and process is required.
Nobody actually wants half the useless tools companies are coming up with because most of the solutions are not really novel. They are just wrapping an LLM.
It's kinda like what I realized with the Meta Ray-Bans: I can have these things on my face, and they can tell me the answer to virtually any question in 10 seconds or less.
But I, as a human, rarely have questions to ask. When you walk in to your local grocery store - you generally know what you want and where to find it. A ton of companies are just gluing LLM text boxes into apps and then scratching their heads when people don't use them.
Why?
Because the customer wasn't the user - it was their boss and shareholders. It was all done to make someone else think 'woah, they are following the trend!'.
The core issue with generative AI is that it all works best when focused in a narrow sense. There is like one or two really clever uses I've seen - disappointingly, one of them was Jira. The internal jargon dictionary tool was legitimately impressive. Will it make any more money? Probably not.
> But I, as a human, rarely have questions to ask.
Wow. This just does not match my personal experience. I do an hour or so walk around the reservoir near my house 4-5 times a week, letting my mind wander freely -- and I find that I stop on average at least five or ten times to take notes about questions to learn the answers to later, and occasionally decide that it's worth it to break pace to start learning the answer right then and there.
That's super reasonable - I'm a person with ADHD, so if I'm asking questions in a grocery store context, I might fully forget things or take way too long to get things done. Going for a walk in nature is absolutely a much better place for questions like that for me, though. I think I would prefer to not have tech in the moment to take me out of the space.
As a fellow ADHDer, can confirm. I must aggressively mono-task to ensure things get done. I have to consciously manage which mode I'm in, "Goal" or "Explore". A simple heuristic I sometimes share with others is: "I can either 'think deeply' or 'do/talk/listen'. Doing both modes at once is possible but at reduced throughput and quality of each. Switching modes is laggy." It's not precisely accurate and there are exceptions but it gets the general idea across.
But do you need AI for those answers? I sometimes do the same thing, but Google/DDG/whatever works fine for most, and a niche app works for others (IDing a bird = Merlin app, for example).
Last year one of my berry bushes had browning leaves with some spots. Google search said infection, treatment plan, etc.
This year I snapped a pic and sent it to ChatGPT. Normal end-of-year die-off, cut the brown branches away, here is a fertilizer schedule for the end of the year to support new growth next year.
ChatGPT makes gardening so much easier, and that is just one of many areas. Recipes are another - don't trust the math, but ChatGPT can remix and elevate recipes so much better than Google's recipe blog spam posts can.
Not the OP, but I ask way more questions now than I used to. Before, I’d sometimes wonder about things, but not enough to actually go and research them. Now, it’s as simple as asking the AI, and more often than not, I get a satisfying answer.
I read that as I-Ding a bird. It was a second of wondering what I-Ding a bird was until I got to "Merlin" and realized it was ID-ing a bird (face-palm emoji here).
I am in the same boat. I am always thinking about things and recently often asking ChatGPT for an answer. Having a natural language interface for questions has opened the door for me to many more questions.
It happens to me all the time; however, I want to have real answers. And while an LLM is sometimes involved, I usually go deeper, with some cross-referencing, fact-checking and primary sources. LLMs are great at giving you a starting point, but the problem with them is that it is impossible to distinguish between fact and fiction, so I always have to verify. Really, I have seen my fair share of falsehoods popping up in LLMs, sometimes on simple and uncontroversial topics.
On hot topics like politics, illegal drugs, gender and racial differences, etc., it may be impossible to even get an answer past the filters.
I rarely have questions of others but I always question myself. :shrug:
There's a difference between asking out loud of another being vs asking yourself internally.
> I rarely have questions of others but I always question myself.
There's only so many questions I have the ability to answer myself. Of those, there's only so many that I have the lifespan to answer myself. We stand on the shoulders of giants, and even on the shoulders of average people -- really it's shoulders all the way down. Unless the questioning itself is the source of joy (which it certainly sometimes is), I prefer to find out what others have learned when they asked the same questions. It's vanishingly rare that I believe I'm the first to think through something.
I think not having those instant answers available is a big part of why your mind wanders in that setting.
I mirror that experience, except for the latter half. I enjoy just being outside and letting my mind wander, letting it wonder about odd questions in the moment. I never actually want or care about the answers, I just like the feeling of thinking.
I already have my phone, I could look up the answers immediately. The reason I don't isn't that I can't. It's that asking the question is the point, not answering it.
My walk is also around a reservoir, also 4-5 times a week and the length of the walk around it is also 1 hour.
Are you the guy that walks the poodle?
When I walk around, I have many questions in my head. But I never stop to do something about it. If the question is important enough, it will stick and I'll do something about it once I get back.
This is the modern curse: I know I can get an answer to nearly every question, and I can get it quickly - just taking my phone out of my pocket and dictating it takes zero effort. I feel it's worth it to restrain oneself and just enjoy the walk. It just feels better.
I've tried to express a similar sentiment to people in the past - that 443rd redesign of the UI for JIRA that moves a button from one side to another. It isn't actually for you. You aren't the user of the software. The user of the software is the product manager (or equivalent role). They need to justify their current role or their next promotion.
Sadly, it takes away from my productivity when I was already used to the position of the button previously.
I do understand that sometimes things need to be redesigned. But crowing like you landed on the moon because your new phone icons now have "rounded edges with shading" or somesuch fuckery that will just slow down the rendering.. gets old and annoying really fast.
>Because the customer wasn't the user - it was their boss and shareholders. It was all done to make someone else think 'woah, they are following the trend!'.
I'm seeing this again and again. Customers as users seems like the last concern, if it is a concern at all. Adherence to the narrative du jour, fundraising from investors and hyping the useless product up to dump on retail are the primary concerns.
Vaporware or a useless, unlaunched product are advantageous here. Actual users might report how underwhelming or useless it is. Sky high development costs are touted as wins.
> Because the customer wasn't the user - it was their boss and shareholders.
It's kinda funny that some online shops are now bragging how great their customer support is because they DON'T use LLM bots xD
Dealing with real humans in the future will be the ultimate VIP treatment.
I just finished implementing a chatbot-in-a-box for a client's SaaS. What problem does it solve? None that I can tell, other than now the SaaS "has AI".
I still have access to the OpenAI dashboard. I can confirm nobody is actually using it.
We recently got a customer support request asking if we were going to "implement AI" on our website and then saying we could use it in our marketing if we did. No suggestion as to why they would find it useful, or what feature could be augmented with it. It's crazy that the hype is so high that random non-tech users suggest adding AI for marketing.
Embedded AIs are pretty dumb as a product in my opinion. Why would the customer pay you instead of their existing model vendor of choice? Why do they have to learn your chatbox - when it's probably using a crappier model and lacks the context of their preferred vendor.
I really don't want to pay for 5 different AI subscriptions, I want one subscription that works with all my other services (which I already pay for).
Now the SaaS can sass you
> Because the customer wasn't the user - it was their boss and shareholders.
I'm starting to get asked, "Could AI help you do such-and-such faster?" At first I tried to explain why the answer is no, because such-and-such doesn't lend itself to what AI is good at. But I'm starting to realize I'm going to have to tell them I am using it and maybe give them an example once in a while, because they're hearing too much about its wonderfulness to believe there's something it can't help with. They're going to think I'm just being stubborn even though I tell them I'm not opposed to using AI where it makes sense. If that means the job actually takes a little longer to add in the part where I use AI to speed it up, they'll be happier.
I think those kinds of glasses may be really useful for blind people. I have seen similar glasses targeted at blind people that, at least in theory, seemed to me like a good idea.
I recall the glasses can also write text on the display inside the lens, which makes me think they may be good for deaf people as well.
It's just that these use-cases seem uncool, and big companies seem to have to be cool in order to keep either their status or their profits. But I have a feeling the technology may be really useful for some really vulnerable people.
Yes, there are people working on image recognition glasses for blind people.
Nobody seems to have been successful yet, and I think the focus on applying LLMs instead of dumb UI and mixed dumb and ML image processing is a large reason why.
Oh, I do still enjoy the glasses; they are actually rather incredible, even though they do not have a screen. That said, these actually do have a Be My Eyes integration, and it is incredibly impressive.
"Because the customer wasn't the user - it was their boss and shareholders".
Previous management fads: https://en.wikipedia.org/wiki/Management_fad
Obviously in the right contexts, these methods provided value. But they became widely misapplied, causing a lot of harm.
And the Wikipedia list is far from exhaustive.
I use my Meta glasses heavily on vacation, and then occasionally elsewhere. The latest Llama isn't as smart as OpenAI's models, so after a few wrong answers I gave up on day-to-day queries.
That said, the scenarios they are good at, they are really good at. I was traveling in Europe and the glasses were translating engravings on castle walls, translating and summarizing historical plaques, and just generally letting me know what was going on around me.
Yeah I'm in charge of trying some AI experiments with my company and I look around the landscape for a little inspiration ... is everything just a wrapper on chatgpt or whatever?
I can do that too but it's also not very useful and I'm just shipping data off to some AI company too. Don't know if I want to feed client data elsewhere like that.
> There is like one or two really clever uses I've seen - disappointingly, one of them was Jira. The internal jargon dictionary tool was legitimately impressive. Will it make any more money? Probably not.
Sounds like Microsoft 365 Copilot at my org. Sucks at nearly everything, but it actually makes a fantastic search engine for emails, Teams convos, SharePoint docs, etc. Much better than Microsoft's own global search stuff. Outside of coding, that's the only other real-world use case I've found for LLMs: "get me all the emails, chats, and documents related to this upcoming meeting", and it's pretty good at that.
Though I'm not sure we should be killing the earth for better search, there are probably other, better ways to do it.
Agreed - 95% of the questions I ask Copilot, I could answer myself by searching emails, Teams messages and files - BUT Copilot does a far far better job than me, and quicker. I went from barely using it, to using it daily. I wouldn't say it is a massive speed boost for me, but I'd miss it if it was taken away.
Then the other 5% is the 'extra' it does for me, getting me details I wouldn't even have known where to find.
But it is just fancy search for me so far - but fancy search I see as valuable.
My favorite copilot use is when I join a MS Teams meeting a few minutes late I can ask copilot: what have I missed? It does a fantastic job of summarizing who said what.
> Though I'm not sure we should be killing the earth for better search
Are we, though? What I have read so far suggests the carbon footprint of training models like gpt4 was "a couple weeks of flights from SFO to NYC" https://andymasley.substack.com/p/individual-ai-use-is-not-b...
They also seem to be coming down in power usage substantially, at least for inference. There's pretty good models that can run on laptops now, and I still very much think we're in the model T phase of this technology so I expect further efficiency refinements. It also seems like they have recently hit a "cap" on the increase in intelligence models are getting for more raw power.
The trendline right now makes me wonder if we'll be talking about "dark datacenters" in the future the same way we talked about dark fiber after the dot com bubble.
> I, as a human, rarely have questions to ask
This is an eye-opening sentence. It's quite hard to imagine how to live one's daily life with "few questions to ask." Perhaps this is a neurodivergent thing?
I meant mostly in the context of daily life tasks as a person with ADHD, so maybe a hair neurodivergent. My issue isn't that I don't wonder things, it is that indulging the wonder would interrupt me from accomplishing almost anything. I would not be very highly functioning if I allowed non-critical thoughts to interrupt the flow. When I'm outside of trying to do specific things and in less focus-dependent tasks, I absolutely wonder and google and get lost on weird random topics.
I think I probably could have worded it more as "I rarely have questions worth knowing the answer to", where the cost of knowing answers is tied to the following rabbit holes and delays/forgotten tasks.
I always ponder how many people have a refrigerator in their home their entire life, and what percentage of them don't know how it works.
I've asked several gfs, and they don't have even a hint of how it works. Guy friends do a bit better but not as well as you'd think.
So yes, people live their entire lives not asking obvious questions.
I'm autistic and I probably ask many more questions than most people.
I would also argue that ND people seem to be the heavier AI users, at least in my experience. It's a bit like the stereotypical 'wikipedia deep dive' but 10x.
Don’t try and diagnose people like this please. Even if you’re qualified, and I doubt you are, it’s very insensitive.
Oh what a blissful environment the mind that is not full of constant questions begging to be answered and explored must be.
I'll just be over here, floating (often treading water) in a raging river of "what ifs ...", "I wonder ifs..." And, "Hmmms?"
> … disappointingly, one of them was Jira.
I think this highlights an interesting point: Sensible use cases are unsexy. But the pushers want stuff, however unrealistic, that lends itself to breathless hype that can be blown out of proportion.
Am I the only one who looked at this shortened headline and wondered why anyone is allowing AIs to fly airplanes?
No. I also thought that even a 95% success rate wouldn't be good enough for airplanes.
I just assumed it was developed by Boeing.
As a rule of thumb, airplane subsystems are expected to have 99.99999% reliability, so the whole gets 99.9999%.
Airline airplanes are currently more than one order of magnitude better than this. But if you have that, you can claim your plane works.
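As a rough illustration of how per-subsystem reliability compounds into system reliability (a minimal sketch; the count of ten independent subsystems is just an assumption for the arithmetic):

```python
# Sketch: with ~10 independent subsystems, each at 99.99999% reliability,
# the whole system lands around 99.9999%. The subsystem count is assumed.
per_subsystem = 0.9999999
subsystems = 10
print(f"system reliability: {per_subsystem ** subsystems:.7f}")  # ~0.9999990
```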
It's very much enough for drones tho... all you need is a tiny Jensen's chip, moped engine, some boom boom play-doh and you're ready to rock. No remote control needed.
we can do it once we know how they work. which will be never.
Why not though? Current autopilot just attempts to keep the plane on course/speed/altitude. Some can go further with auto-landing, but that's for extreme emergency use only. I could see the airlines wanting to seek any fuel savings possible by allowing AI to test slight changes to altitude/speed/course to conserve fuel based on some live inputs.
The mathematics that LLMs and machine learning are based on started off being developed for aircraft decades ago. It’s called “control theory”. So we had “AI” on airplanes first. Specifically we had adaptive control algorithms explicitly because of the problems introduced by fuel levels changing during the course of a flight.
In physics, we typically start with a mass-spring-damper system representation. Elementary physics and engineering typically assume things such as mass being constant. You develop all sorts of dynamical models and intuition with that assumption. But an aircraft burns fuel as it flies, meaning its mass changes during the course of the flight. Thus your models drift and you have to adapt to that.
Pilots would have tomes they'd have to switch between at various points of the journey and adaptive control algorithms alleviated this. They still needed the actual reference guide in the cockpit as a risk mitigation.
The difference from that decades-old application is that you don't need a billion-parameter model to do flight control. Most people do not understand the historic development of these techniques. The foundation of them has been around for a while. What we have done with the newest batch of "AI" is massively scale them up.
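To make the fuel-burn point concrete, here is a toy sketch (not any real flight-control law, and all numbers are invented): a 1-D double integrator under a PD controller, where the commanded force either assumes the design mass (fixed gains) or scales with the current mass estimate, a crude stand-in for the gain-scheduled/adaptive schemes described above. As mass drops, the fixed-gain loop's step-response overshoot drifts away from its design value, while the mass-scaled loop's does not:

```python
# Toy illustration only: "altitude hold" on a 1-D double integrator.
# The PD law computes a desired acceleration; the force command either
# assumes the design mass (fixed gains) or uses the current mass.
def step_overshoot(mass: float, scale_by_mass: bool,
                   design_mass: float = 1000.0,
                   kp: float = 4.0, kd: float = 1.0,
                   dt: float = 0.001, t_end: float = 20.0) -> float:
    x, v, target = 0.0, 0.0, 1.0
    peak = 0.0
    for _ in range(int(t_end / dt)):
        err = target - x
        a_cmd = kp * err - kd * v                       # desired acceleration
        force = a_cmd * (mass if scale_by_mass else design_mass)
        a = force / mass                                # true dynamics use the true mass
        v += a * dt
        x += v * dt
        peak = max(peak, x)
    return (peak - target) / target                     # fractional overshoot

for m in (1000.0, 500.0):                               # e.g. full vs. part-burned fuel load
    print(f"mass={m:6.0f}  fixed gains: {step_overshoot(m, False):.0%}  "
          f"mass-scaled gains: {step_overshoot(m, True):.0%}")
```

Real adaptive controllers estimate the changing parameters online rather than being handed the true mass, but the drift they compensate for is the same kind this sketch shows.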
Yes, I wish it was written "Pilot Programs" or something.
It certainly made me do a double-take.
haha I thought the same and also thought "but everyone uses autopilot, what's the problem"
> "“Every single Monday was called 'AI Monday.' You couldn’t have customer calls, you couldn’t work on budgets, you had to only work on AI projects.”"
> "Vaughan saw that his team was not fully on board. His ultimate response? He replaced nearly 80% of the staff within a year"
Being that this is Fortune magazine, it makes sense that they're portraying it this way, but reading between the lines a little bit, it seems like the staff knew what would happen and weren't keen on replacing themselves.
This seems like a glass-half-empty view.
5% are succeeding. People are trying AI for just about everything right now. 5% is pretty damn good, when AI clearly has a lot of room to get better.
The good models are quite expensive and slow. The fast & cheap models aren't that great - unless very specifically fine-tuned.
Will it get better enough that the success rate for pilots grows from 5% to 25% in 5 years, or 20? Who knows, but it almost certainly will grow.
It's hard to tell how much better the top foundation models will get over the next 5-10 years, but one thing that's certain is that the cost will go down substantially for the same quality over that time frame.
Not to mention all the new use cases people will keep trying over that timeline.
If in 10 years' time AI is succeeding in 2x as many use cases, that might not justify current valuations, but it will be a much better future, and a necessary one if we're planning on having ~25% of the population retired / not working by then.
Without AI replacing a lot of jobs, we're gonna have a tough time retiring all the people we promised retirements to.
> 5% is pretty damn good, when AI clearly has a lot of room to get better.
That depends if the AI successes depended much on the leading edge of LLM developments, or if actually most of the value was just "low hanging fruit".
If the latter, that would imply the utility curve is levelling out, because new developments are not proving instrumental enough.
I'm thinking of an S curve: slow improvements through the 2010s, then a burst of activity as the tech became good enough to do something "real", followed by more gradual wins in efficiency and accuracy.
I agree it's an S-curve, but it's anyone's guess where on the S we are.
And regardless, I still see this as very positive for society - and don't care as much about whether or not this is an AI bubble or not.
5% success is actually way higher than I thought it would be. At that rate I suppose there will be actually profitable AI companies with VC subsidies.
5% success rate might mean: if you get it to work, you are capturing value that the other 95% are not.
A lot of this must come down to execution. And there's a lot of snake oil out there at the execution layer.
"So you're telling me there's a chance"
https://www.youtube.com/watch?v=KX5jNnDMfxA
5% is not unexpected, as startup success rates are normally about 1:22 over 3 years. lol =3
The MIT report linked in the article is giving a 404 for some reason. Here is the web archive version: https://web.archive.org/web/20250818145714if_/https://nanda....
https://github.com/Papr-ai/papers/blob/main/v0.1%20State%20o...
The MIT NANDA lab seems to have a link rot problem.
Their cardinal code repo is also 404. The NANDA Lab also does coding, their publication at AAAI 2025 is titled: "CoDream: Exchanging dreams instead of models for federated aggregation with heterogeneous models" [1]. However, the link to the Github repo is broken. Fascinating paper, sad about the missing code.
[1] https://mitmedialab.github.io/codream.github.io/
What's the failure rate of technology pilots in general, for comparison?
For example, I heard that SAP has an 80-90% deployment failure rate back in the day, but don't have a citable source for it.
> I heard that SAP has an 80-90% deployment failure rate
Something to keep in mind is that ERP "failure" is frequently defined as went over budget or over time, even if it ultimately completed and provided the desired functionality.
It's a much smaller percentage of projects that are either cancelled or went live and significantly did not function as the business needed.
Depends on industry I would think. In my previous industry it was something like 25%, in my current industry it is closer to 80%.
That is not remotely true tbh. The company would have failed long ago if it were.
Not if every manufacturing company in the world decided to use your software anyway.
ERP rollouts can "fail" for lots of reasons that aren't to do with the software. They are usually business failures. Mostly, companies end up spending so much on trying to endlessly customize it to their idiosyncratic workflows that they exceed their project budgets and abandon the effort. In really bad cases like Birmingham they go live before actually finishing setup, and then lose control of their books and have to resort to hiring people to do the admin manually.
There's a saying about SAP: at some point gaining competitive advantage in manufacturing/retail became all about who could make SAP deployment a success.
This is no different to many other IT projects, most of them fail too. I think people who have never worked in an enterprise context don't realize that; it's not like working in the tech sector. In the tech industry if a project fails, it's probably because it was too ambitious and the tech itself just didn't work well. Or it was a startup whose tech worked, but they couldn't find PMF. But in normal, mature, profitable non-tech businesses a staggering number of business automation projects just fail for social or business reasons.
AI deployments inside companies are going to be like that. The tech works. The business side problems are where the failures are going to happen. Reasons will include:
• Not really knowing what they want the AI to do.
• No way to measure improved productivity, so no way to decide if the API spend is worth it.
• Concluding the only way to get a return is to entirely replace people with AI, and then having to re-hire them because the AI can't handle the last 5% of the work.
• Non-tech executives doing deals to use models or tech stacks that aren't the right kind or good enough.
etc
Not if most of those failures are medium sized businesses with <1000 employees and your successes include a majority of the world's largest corporations that sell goods.
I think you're on the right track here. Most technology pilots fail. As long as risk/investment is managed appropriately, this is healthy. This seems to follow from Sturgeon's Law: 90% of everything is crap [0].
[0] https://en.wikipedia.org/wiki/Sturgeon%27s_law
https://archive.is/bdi7b
> Despite the rush to integrate powerful new models, about 5% of AI pilot programs achieve rapid revenue acceleration; the vast majority stall, delivering little to no measurable impact on P&L.
This summer, I built two very sophisticated pieces of software. A financial ledger to power accrual accounting operations and a code generation framework that scaffolds a database from a defined data model to the frontend components and everything in between.
I used ChatGPT substantially. I'm not sure how long it would have taken without generative AI, but in reality, I would have just given up out of frustration or exhaustion. From the outside, it would appear to any domain expert that at least three other people worked on these, given the pace at which they got completed.
The completion of those two was a seminal moment for me. I can't imagine how anyone, in any field of information systems, is not multiples more effective than they were five years ago. That directly affects a P&L, and I can't think of anything in my career that even comes close in magnitude.
I don't know what encapsulates an AI pilot in these orgs, and I'm sure they are massively more complex than anything I've done. But to hear 95% of these efforts don't have a demonstrable effect is just wild.
> From the outside, it would appear to any domain expert that at least three other people worked on these, given the pace at which they got completed.
Did several domain experts tell you this or are you making it up?
> I can't imagine how anyone, in any field of information systems, is not multiples more effective than they were five years ago.
Perhaps "they are massively more complex than anything I've done"
> Did several domain experts tell you this or are you making it up?
It's an assertion among eight other engineers on the project with ~15 years of experience in the domain. They are domain experts. This part isn't up for debate.
I think they mean integrating AI into the business system directly and not using it to code things. I can see that having a more neutral impact
> Generic tools like ChatGPT excel for individuals because of their flexibility, but they stall in enterprise use since they don’t learn from or adapt to workflows, Challapally explained.
Maybe I misunderstood this, but I took this to mean that people inside enterprises are struggling using tools like ChatGPT. They do point out that perhaps the tools are being deployed in the wrong areas:
> The data also reveals a misalignment in resource allocation. More than half of generative AI budgets are devoted to sales and marketing tools, yet MIT found the biggest ROI in back-office automation—eliminating business process outsourcing, cutting external agency costs, and streamlining operations.
But I've seen some amazing automation done in sales and marketing that directly affected sales efficiency and reduced sales and marketing expenses.
“AI pilots” in the article refers to developing AI-based tools, not to using AI for software development. These projects have a 95% failure rate of successfully deploying the AI tool being developed into production.
Regarding use of AI in software development (which is not what the article is about), the proof of the pudding isn’t in greenfield projects, it’s in longer-term software evolution and legacy code. Few disagree that AI saves time for prototyping or creating a first MVP.
You are correct. As I pointed out in another reply, I misinterpreted this part:
> Generic tools like ChatGPT excel for individuals because of their flexibility, but they stall in enterprise use since they don’t learn from or adapt to workflows, Challapally explained.
I didn't read the actual report (and probably won't), so I figured "AI pilots" _could_ (and honestly, should!) include the deployment of models to assist in any and all work (not necessarily even just coding; I just used it as an example).
> But to hear 95% of these efforts don't have a demonstrable effect is just wild.
Why tho? You used AI to make some software, but did you use AI to achieve rapid revenue acceleration?
That you used AI to build software seems tangential to whether it can increase revenues. Over the years, we've seen many technologies that didn't deliver on promises of rapidly increasing revenues despite being useful for creating software (cough OOP cough), so this new one failing to live up to expectations isn't surprising. Actually given the history of technologies that over promise and under deliver on massive hype, disappointment should be the null hypothesis.
I can’t help feeling that we’re rapidly heading towards the “trough of disillusionment”.
(How should I invest if I have this thesis)
short nvidia?
We've talked with a ton of AI companies and I was surprised how many of the challenges were the usual challenges in any project. They're just amplified by the rush to do AI right now, but I haven't seen anything as bad as "You couldn't have customer calls, you couldn't work on budgets, you had to only work on AI projects."
Warning for gratuitous self promotion: https://humansignal.com/blog/9-criteria-for-successful-ai-pr...
Actual report (State of AI in Business 2025): https://news.ycombinator.com/item?id=44941374
How much money can you pull out as a failed startup founder?
About a mil? Maybe two? Seems realistic…
People have to invent whatever number seems reasonable while squinting, given how much accumulation of capital there is.
The guys with money are easy to fool. Just lie to them about your „product”, get the cash, get out of the rat race, smooth sailing.
Of course easier said than done. I can’t lie this convincingly, I don’t have the con man skillset or connections.
So I’m stuck in a 9 to 5. Zzz…
> Of course easier said than done. I can’t lie this convincingly, I don’t have the con man skillset or connections.
Isn't the idea that you're not a shitty human being enough in and of itself?
I am. I'm working for a despicable company for money.
And an incompetent one at that. I can't grab a bag and leave.
This is proof LLMs are viable and productive in my opinion. The baseline rate for business failure over 5 years is around 90%, so they say. With how much hype surrounds LLM wrapper startups this is still an astounding amount of novel business model creation.
At this rate, how is it better than pure random chance?
The article mentions that 19-20 year old founders, focused on solving single-user problems, were the successes.
The sample size is 300 public AI deployments and an undisclosed number of private in-house AI projects. And the survey seems to only consider business applications, as compared with end-user applications like media and software. That's significant but not definitive.
Isn't it more likely that these were existing problems with low-hanging fruit, perhaps with unpopular answers, that could be solved by leaning on "AI"? And perhaps "AI" wasn't the key to success?
This resonates. Upskilling to AI tools is perhaps the biggest problem of our day. One idea we have to tackle this problem is to bring onboarding/learning directly into the user's work environment, track struggles and offer targeted support, and create continuous feedback loops. If anyone has faced challenges with increasing activation and retention of users on pilots (or external-facing products), I would love to chat and see how we can help.
There was an article on HN about the valuations of AI being out of touch with the question: what problem is being solved?
We use generative imagery/video at my job and it's adding value. I see value being added for coders.
There's real innovation happening, but I find it's mostly companies cutting corners making customer service even shittier than it already was.
> real innovation happening, but I find it's mostly companies cutting corners
There's a meme that I think fits: https://i.redd.it/20rpdamxef0f1.jpeg
I think for a long time, cutting corners so that the number can go up next quarter has worked surprisingly well. Genuinely, I don't think a lot of corporations view offering a better product as a viable means of competing in the 2025 marketplace.
For them, AI is not the next industrial revolution, it's the next overseas outsourcing; AI isn't a way to bring new value to customers, it's a way to bring roughly the same value (read: worse) but at a much cheaper cost to them. If they get their way, everything will get worse, while they make more money. That's the value proposition at play here.
actual report here: https://github.com/Papr-ai/papers/blob/main/v0.1%20State%20o...
Same source as https://news.ycombinator.com/item?id=44940944
Oh god, what is this website? It gives me a headache with all the pop-ups and auto-playing videos.
I remember when it was being said that computers in business had basically the same impact.
Comparing a universal computing machine to what is essentially a fancy autocomplete is just bonkers.
The title led me to assume it was about the aircraft type of pilot.
Lots of bad partial solutions looking for problems, which companies rushed to implement.
Why so bad?
Because for the typical office, documents are strewn about on random network drives and are not formatted consistently. This, combined with the inability to nail down 100% accuracy on even just internal doc search, is just too much to overcome for non-tech-industry offices. My office is mind-blown if I use Gemini to extract data from a PDF and convert it to an .xlsx or .csv.
As a technically minded person but not a comp sci guy, refining document search is like staring into a void and every option uses different (confusing) terminology. This makes it extra difficult for me to both do my regular job AND learn the multiple names/ways to do the exact same thing between platforms.
The only solution that has any reliability for me so far are Gemini instances where i upload only the files i wish to search and just keep it locked to a few questions per instance before it starts to hallucinate.
My attempt at RAG search implementation was a disaster that left me more confused than anything.
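For what it's worth, the "only search the files I hand-picked" pattern can be prototyped without a vendor tool. Here is a bare-bones sketch with TF-IDF retrieval standing in for embedding search; the document names and contents below are made up, and the top-scoring chunks would then be pasted into whichever LLM you already use:

```python
# Bare-bones "search only the documents I chose" retrieval sketch.
# TF-IDF keyword matching stands in for a vendor's embedding search.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def chunk(name: str, text: str, size: int = 1500):
    """Split one document into fixed-size character chunks."""
    return [(name, text[i:i + size]) for i in range(0, len(text), size)]

def top_chunks(query: str, chunks, k: int = 3):
    """Rank the hand-picked chunks against the query by cosine similarity."""
    vec = TfidfVectorizer(stop_words="english")
    matrix = vec.fit_transform(text for _, text in chunks)
    scores = cosine_similarity(vec.transform([query]), matrix)[0]
    ranked = sorted(zip(scores, chunks), key=lambda pair: pair[0], reverse=True)
    return [(name, score) for score, (name, _) in ranked[:k]]

# Hypothetical hand-picked documents (in practice, read from the files you chose).
docs = (chunk("msa_2024.txt", "Either party may terminate with 60 days written notice...") +
        chunk("onboarding.txt", "New vendors must complete a security review before access..."))
for name, score in top_chunks("termination notice period", docs):
    print(f"{score:.2f}  {name}")
```

Swapping the TF-IDF step for an embedding model is the usual next step, but even this keyword version keeps the search scoped to exactly the documents you picked, which is most of what the locked-down Gemini instance buys you.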
Because you mentioned the use case specifically, I wanted to point out that Excel has been able to convert images to tables for a while now. Literally screenshot a table from your PDF and it will convert it to a table. Not trying to diminish any additional capabilities you're getting from Gemini, but this screenshot-to-table feature has been huge for my finance team.
https://support.microsoft.com/en-us/office/insert-data-from-...
try https://www.papr.ai for RAG. built it to solve this problem
It's in the name: generative AIs.
There are very few use cases at companies where you need to generate something. You want to work with the company's often very private, disparate data (with access controls, etc.). You wouldn't even have enough data to train a custom LLM, much less use a generic one.
Turns out that garbage text has very little intrinsic value
In my experience, LLMs get you 80% of the way to a solution almost immediately, but that last 20%, where knowledge, data, or accuracy is missing, is a complete tar pit and will wreck adoption. Especially since many vendors are selling products that are wrappers and provide generic, non-customised solutions. I hear the same from others doing trials with various AI tools as well.
Any consumer facing AI project has to contend with the fact that GenAI is predominantly associated with "slop." If you're not actively using an AI tool, most of your experience with GenAI is seeing social media or Youtube flooded with low quality AI content, or having to deal with useless AI customer support. This gives the impression that AI is just cheap garbage, and something that should be actively avoided.
I think one reason for this is that LLMs are sort of maximally, if accidentally, designed to fuck up our brains. Despite all the advancements in the last five years I see them as still, fundamentally, text transformation machines with only a very limited sort of intelligence. Yet because nothing in history has been able to generate language except humans, most of us are not prepared to make rational judgements about their capabilities, and those of us who may be often fail to do so anyway.
The fact that we live in an era where tech people have been so investor pilled that overstating the capabilities of technology is basically second nature does not help.
I mean 5% not failing is pretty standard for any startup-driven thing.