I always thought the companies I worked for would implement chaos testing shortly after this talk/blog was released. However, only last year did we do anything even approaching chaos testing. I think this goes to show that the adage “the future is already here, just unevenly distributed” carries some truth in some contexts!
I think the companies I worked for were prioritizing no-issue deployments (built from a series of documented and undocumented manual processes!) over making services resilient through chaos testing. As a younger dev this priority struck me as heresy (come on guys, follow the herd!); as a more mature dev I understand that time & effort are scarce resources and the daily toil tax needs to be paid to make forward progress… it’s tough living in a non-ideal world!
Chaos testing rarely uncovers anything significant or actionable beyond what you can suss out yourself with a thorough review, but it has the added potential for customer harm if you don't have all your ducks in a row. It also neatly requires, as a prerequisite, that you have your ducks in a row.
I think that's why most companies don't do it: a lot of tedium, and the main benefit is really just getting your ducks in a row.
It's a great way of thinking about resiliency and fault tolerance, but it's also definitely on the very mature end of the systems engineering spectrum.
If you know things will break when you start making non-deterministic configuration changes, you aren't ready for chaos engineering. Most companies never get out of this state.
I distilled these ideas over subsequent years into several talks on “Failing Over without Falling Over”. Investing anything in resilience without testing that it actually works is a waste of resources. That's the underlying lesson. https://github.com/adrianco/slides/blob/master/FailingWithou...
I recently made a "garbage monkey" script for work that spams random buttons in the UI to make sure animations and such still work correctly even if the user is somehow pressing things faster than a human could. It has been pretty useful in uncovering some problems, though it only works with "buttons" and won't do touchscreen events, etc.
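For illustration, a minimal sketch of that kind of button-spamming monkey might look like the following (hypothetical TypeScript for the browser; startGarbageMonkey, the interval, and the filtering are made up here, not the commenter's actual script):

    // Hypothetical minimal sketch of a "garbage monkey" for the browser:
    // repeatedly clicks a random enabled, visible <button> faster than a human could.
    function startGarbageMonkey(intervalMs = 50): () => void {
      const timer = setInterval(() => {
        const buttons = Array.from(document.querySelectorAll<HTMLButtonElement>("button"))
          .filter(b => !b.disabled && b.offsetParent !== null); // skip disabled/hidden buttons
        if (buttons.length === 0) return;
        const target = buttons[Math.floor(Math.random() * buttons.length)];
        target.click(); // plain click only; no touch events, as noted above
      }, intervalMs);
      return () => clearInterval(timer); // call the returned function to stop the monkey
    }

    // Usage: const stop = startGarbageMonkey(); ...later... stop();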
Wish we lived in the universe where the term 'monkey' won over 'agent'. Would have given everything a cool Planet of the Apes feel.
I remember this getting a lot of buzz at the time, but few orgs are at the level of sophistication to implement chaos testing effectively.
Companies all want a robust DR strategy, but most outages are self-inflicted and time spent on DR would be better spent improving DX, testing, deployment and rollback.
Anyone have experience chaos testing Postgres?
I was reading this the other day looking for ideas on how to test query retries in our app. I suppose we could go at it from the network side by introducing latency and such.
However, it’d be great if there was also a proxy or something that could inject pg error codes.
I know of https://github.com/Shopify/toxiproxy, but it is not protocol-aware; you might be able to add that yourself.
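For illustration, a rough sketch of the network-side approach mentioned above (hypothetical Node/TypeScript; the ports, delay, and drop probability are made-up assumptions). Like toxiproxy, it is not protocol-aware, so it injects latency and dropped connections rather than pg error codes:

    // Hypothetical sketch: a dumb TCP chaos proxy in front of Postgres.
    // The app connects to :6432 instead of :5432; the proxy adds latency and
    // occasionally kills connections to exercise retry logic.
    import * as net from "net";

    const LISTEN_PORT = 6432;                            // point the app's connection string here
    const UPSTREAM = { host: "127.0.0.1", port: 5432 };  // real Postgres
    const MAX_DELAY_MS = 200;                            // upper bound on injected latency
    const DROP_PROBABILITY = 0.02;                       // chance per chunk of killing the connection

    net.createServer(client => {
      const upstream = net.connect(UPSTREAM);
      const delayMs = Math.random() * MAX_DELAY_MS; // fixed per connection so chunks stay ordered

      const forward = (from: net.Socket, to: net.Socket) => {
        from.on("data", chunk => {
          if (Math.random() < DROP_PROBABILITY) {
            client.destroy();                      // simulate an abrupt connection failure
            upstream.destroy();
            return;
          }
          setTimeout(() => to.write(chunk), delayMs); // add latency in this direction
        });
        from.on("close", () => to.destroy());
        from.on("error", () => to.destroy());
      };

      forward(client, upstream);
      forward(upstream, client);
    }).listen(LISTEN_PORT, () =>
      console.log(`chaos proxy on :${LISTEN_PORT} -> ${UPSTREAM.host}:${UPSTREAM.port}`));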
I suspect Netflix built the Simian Army largely out of necessity, since at the time AWS did not offer many native ways to deliberately inject failure or validate resilience or compliance at scale.
Today, many of these ideas map directly to AWS managed services: AWS Fault Injection Simulator, AWS Resilience Hub, AWS Config, Amazon Inspector, Security Hub, GuardDuty, and IAM Access Analyzer, for example.
There is also a big third-party ecosystem (Gremlin, LitmusChaos, Chaos Mesh, Steadybit, etc...) offering similar capabilities, often with better multi-cloud or CI/CD integration.
I don't think some of these Netflix tools get much maintenance now, but as free options they can be cheaper to run than AWS managed services or Marketplace offerings...