OK, so people use NTP to "synchronize" their clocks and then write applications that assume the clocks are exactly in sync and can use timestamps for synchronization, even though NTP itself can see the clocks aren't always in sync. Do I have that right?
If you are an engineer at Google dealing with Spanner, then you can in fact assume clocks are well synchronized and can use timestamps for synchronization. If you get commit timestamps from Spanner you can compare them to determine exactly which commit happened first. That’s a stronger guarantee than you get from a typical Serializable database like PostgreSQL: https://www.postgresql.org/docs/current/transaction-iso.html...
That’s the radical developer simplicity promised by TrueTime mentioned in the article.
That’s actually not at all what TrueTime guarantees, and assuming they’ve solved a physical impossibility is technically dangerous as a founding assumption for higher-level tech (which, thankfully, Spanner does not do).
What TrueTime says is that clocks are synchronized to within some delta, just like NTP, but that delta is significantly smaller thanks to GPS time sync. That lets applications put a tighter bound on how long they wait to see whether a conflict may exist before committing, which is why Spanner is fast. CockroachDB works similarly, but given the logistical challenge of getting GPS receivers into data centers, they worked to achieve a smaller delta through better NTP-like timestamps and generally get fairly close performance.
https://programmingappliedai.substack.com/p/what-is-true-tim...
> Bounded Uncertainty: TrueTime provides a time interval, [earliest, latest], rather than a single timestamp. This interval represents the possible range of the current time with bounded uncertainty. The uncertainty is caused by clock drift, synchronization delays, and other factors in distributed systems.
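To make the interval idea concrete, here is a minimal sketch of a TrueTime-like clock and the "wait out the uncertainty" step before a commit becomes visible. Everything in it is an assumption for illustration: `ToyTrueTime`, `TTInterval`, `commit`, and the fixed `epsilon_s` bound are made-up names, not Google's API, and a real system derives the bound from its time sources instead of hard-coding it.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class TTInterval:
    earliest: float
    latest: float


class ToyTrueTime:
    """Hypothetical stand-in for a TrueTime-like API (not Google's code)."""

    def __init__(self, epsilon_s, clock=time.monotonic):
        self.epsilon_s = epsilon_s  # assumed worst-case clock error, in seconds
        self.clock = clock

    def now(self):
        # Return an interval that is assumed to contain the true time.
        t = self.clock()
        return TTInterval(t - self.epsilon_s, t + self.epsilon_s)

    def after(self, ts):
        # True only once ts is definitely in the past on every clock.
        return self.now().earliest > ts


def commit(tt, sleep=time.sleep):
    # Pick the commit timestamp at the top of the uncertainty interval...
    commit_ts = tt.now().latest
    # ...then wait until that timestamp has definitely passed.
    while not tt.after(commit_ts):
        sleep(tt.epsilon_s / 10)
    return commit_ts
```

The wait is roughly twice `epsilon_s`, which is why a smaller uncertainty bound translates directly into lower commit latency.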
Alternatively, you could guarantee the same synchronization using PPS and PTP, wired to the DCD pin of each host's serial port or to specialized hardware such as modern PTP-enabled smart NICs/FPGAs that can accept PPS input. GPS+PPS gets you to within 20-80ns of global synchronization, depending on implementation (assuming you're all mostly in the same inertial frame), and allows you to make much stronger guarantees than TrueTime (due to higher-precision distributed ordering guarantees, which translate to lower-latency, higher-throughput distributed writes).
TrueTime is based on GPS and local atomic clocks. Google's latest timemasters are even better, around 10ns average.
Isn't that because Google has its own atomic clocks, rather than NTP which is (generally) using publicly available atomic clocks?
Unfortunate that the author doesn’t bring up FoundationDB version stamps, which to me feel like the right solution to the problem. Essentially, you can write a value you can’t read until after the transaction is committed and the synchronization infrastructure guarantees that value ends up being monotonically increasing per transaction. They use similar “write only” operations for atomic operations like increment.
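A toy model of that idea, for anyone who hasn't seen it: the store, not the client, stamps each write with a version chosen at commit time, so the transaction body never sees the value and the stamps come out in commit order. This is an illustrative sketch only. `ToyStore`, `commit`, and `scan` are hypothetical names and this is not the FoundationDB API; a single in-process counter stands in for the real synchronization infrastructure.

```python
import itertools


class ToyStore:
    """Toy single-process store. NOT the FoundationDB API, just the idea."""

    def __init__(self):
        self._versions = itertools.count(1)  # stands in for the commit sequencer
        self.rows = {}

    def commit(self, pending):
        """Apply a list of (prefix, value) writes atomically.

        The commit version is chosen here, at commit time, so the
        transaction body never gets to read it.
        """
        version = next(self._versions)
        for order, (prefix, value) in enumerate(pending):
            self.rows[(prefix, version, order)] = value
        return version

    def scan(self, prefix):
        """Return the values stored under prefix, in commit order."""
        keys = sorted(k for k in self.rows if k[0] == prefix)
        return [self.rows[k] for k in keys]


store = ToyStore()
store.commit([("log", "a"), ("log", "b")])
store.commit([("log", "c")])
print(store.scan("log"))
```

No wall clock is involved anywhere; ordering comes entirely from the sequencer.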
Yes. A consistent total ordering is what you need (want) in distributed computing. Ultimately, causality is what is important, but consistent ordering of concurrent operations makes things much easier to work with.
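One classic way to get a total order that still respects causality, without any synchronized clocks, is a Lamport clock with the node id as a tie-breaker. A minimal sketch (the class and method names are mine, purely for illustration):

```python
class LamportClock:
    """Logical clock whose stamps sort into one consistent total order."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.counter = 0

    def tick(self):
        """Local event or message send: advance and return a stamp."""
        self.counter += 1
        return (self.counter, self.node_id)

    def receive(self, stamp):
        """Merge the stamp carried by an incoming message, then tick."""
        self.counter = max(self.counter, stamp[0])
        return self.tick()


a, b, c = LamportClock("a"), LamportClock("b"), LamportClock("c")
sent = a.tick()            # a sends a message
got = b.receive(sent)      # b receives it, so sent < got is guaranteed
other = c.tick()           # concurrent with both; the order is arbitrary but consistent
print(sorted([got, other, sent]))
```

Causally related events always sort in the right order; concurrent ones get an arbitrary order that every node agrees on, which is the "easier to work with" part.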
Even just a single accurate clock is a nightmare... https://www.npr.org/2025/12/21/nx-s1-5651317/colorado-us-off...
Timesync isn’t a nightmare at all. But it is a deep rabbit hole.
The best approach, imho, is to abandon the concept of a global time. All timestamps are wrt a specific clock. That clock will skew at a rate that varies with time. You can, hopefully, rely on any particular clock being monotonic!
My mental model is that you form a connected graph of clocks and this allows you to convert arbitrary timestamps from any clock to any clock. This is a lossy conversion that has jitter and can change with time. The fewer stops the better.
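Roughly what that graph could look like in code, as a sketch under some loud assumptions: each link between two clocks is modelled as a fixed rate and offset plus a jitter bound (real links drift, so these would be re-estimated continuously), and `ClockGraph`, `add_link`, and `convert` are hypothetical names.

```python
from collections import deque


class ClockGraph:
    """Convert timestamps between clocks by chaining pairwise estimates."""

    def __init__(self):
        self.edges = {}  # clock name -> list of (neighbor, rate, offset, jitter)

    def add_link(self, a, b, rate, offset, jitter):
        """Record t_b ~= rate * t_a + offset, good to within +/- jitter."""
        self.edges.setdefault(a, []).append((b, rate, offset, jitter))
        # Inverse mapping for the other direction: t_a = (t_b - offset) / rate.
        self.edges.setdefault(b, []).append(
            (a, 1.0 / rate, -offset / rate, jitter / abs(rate))
        )

    def convert(self, t, src, dst):
        """Return (estimate, error_bound) for t on dst, using the fewest hops."""
        queue = deque([(src, t, 0.0)])
        seen = {src}
        while queue:
            node, value, err = queue.popleft()
            if node == dst:
                return value, err
            for nxt, rate, offset, jitter in self.edges.get(node, []):
                if nxt not in seen:
                    seen.add(nxt)
                    # Error accumulates at every hop, hence "fewer stops the better".
                    queue.append((nxt, rate * value + offset, abs(rate) * err + jitter))
        raise ValueError(f"no path from {src} to {dst}")


g = ClockGraph()
g.add_link("imu", "cpu", rate=1.0, offset=5.0, jitter=0.001)
g.add_link("cpu", "gps", rate=1.0, offset=100.0, jitter=0.002)
print(g.convert(10.0, "imu", "gps"))
```

The breadth-first search picks the path with the fewest hops, and the error bound grows with every hop it crosses.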
I kinda don’t like PTP. Too complicated and requires specialized hardware.
This article only touches on one class of timesync. An entirely separate class is timesync within a device. Your phone is a highly distributed compute system with many chips each of which has their own independent clock source. It’s a pain in the ass.
You also have local timesync across devices such as wearables or robotics. Connecting to a PTP system with GPS and atomic clocks is not ideal (or necessary).
TicSync is cool and useful. https://sci-hub.se/10.1109/icra.2011.5980112
Vector clocks are one of the other things Barbara Liskov is known for.
I wouldn't say it's a 'nightmare'. It's just more complicated than how regular folk think computers work when it comes to time sync. There's nothing nightmarish or scary about this, it's just using the best solution for your scenario, understanding limitations and adjusting expectations/requirements accordingly, perhaps relaxing consistency requirements.
I worked on the NTP infra for a very large organization some time ago and the scariest thing I found was just how bad some of the clocks were on 'commodity hardware', but this just added a new parameter for triaging hardware for manufacturer replacement.
This is an ok article but it's just so very superficial. It goes too wide for such a deep subject matter.
I took to distributed systems like a duck to water. It was only much later that I figured out that while there are things I can figure out in one minute that took other people five, there were a lot of other things you'd have to walk them through step by step or they would never get there. That really explained some interactions I’d had when I was younger.
In particular I don’t think the intuitions necessary to do distributed computing well would come to someone who snoozed through physics, who never took intro to computer engineering.
PTP isn't even that much more difficult, as long as you planned for it from the start.
You buy the hardware, plug it all in, and it works.
Sometimes hardware that has PTP support in the specs doesn't perform very well though, so if you do things at scale, being able to validate things like switches and network card drivers is useful too!
It's to the point timing server vendors I've spoken to have their own test labs where they have to validate network gear and then publish lists of recommended and tested configurations.
Even some older cards where you'd think the PTP issues would be solved still have weird driver quirks in Linux!
The Huygens algorithm is also worth a look:
https://www.usenix.org/system/files/conference/nsdi18/nsdi18...
Another protocol that's not mentioned is PPS and its variants, such as White Rabbit.
A regular pulse is emitted from a specialized high-precision device, possibly over a specialized high-precision network.
This enables picosecond accuracy (or at least sub-nanosecond).
Normally I would nod at the title. Having lived it.
But I just watched/listened to a Richard Feynman talk on the nature of time and clocks and the futility of "synchronizing" clocks. So I'm chuckling a bit. In the general sense, I mean. Yes yes, for practical purposes in the same reference frame on earth, it's difficult but there's hope. Now, in general ... synchronizing two clocks is ... meaningless?
https://www.youtube.com/watch?v=zUHtlXA1f-w
Einstein was worried about whether people in two different relativistic frames would see cause and effect reversed.
Wild. My layperson mind goes to a simple example, which may or may not be possible, but please tell me if this is the gist:
Alice and Bob, in different reference frames, both witness events C and D occurring. Alice says C happened before D. Bob says D happened before C. They're both correct. (And good luck synchronizing your watches, Alice and Bob!)
Yes that definitely happens. People orbiting Polaris would be seeing two supernovas explode at different times than us due to the speed of light. Polaris is 400 light years away so the gap could be large.
But when you are moving you may see very closely spaced events in different order, because you’re moving toward Carol but at an angle to Doug. Versus someone else moving toward Doug at an angle to Carol.
That will be the case when Alice stands close to where C happens, and Bob stands close to where D happens.
It's a little trickier to imagine introducing cause-and-effect though. (Alice sees that C caused D to happen, Bob sees that D caused C to happen).
I think a "light cone" is the thought-experiment to look up here.
If Bob and Alice are moving at half the speed of light in opposite directions.
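You can check the Alice-and-Bob scenario with the Lorentz transformation. A small sketch, with my own made-up numbers: events C and D are 1 second and 3 light-seconds apart in the "ground" frame, and the two observers move at half the speed of light in opposite directions. Units are chosen so that c = 1, and `time_gap_in_frame` is just an illustrative helper name.

```python
import math


def time_gap_in_frame(v, dt, dx):
    """Time from event C to event D in a frame moving at velocity v.

    v is a fraction of the speed of light; dt is in seconds and dx in
    light-seconds, both measured in the ground frame.
    Uses dt' = gamma * (dt - v * dx) with c = 1.
    """
    gamma = 1.0 / math.sqrt(1.0 - v * v)
    return gamma * (dt - v * dx)


# Too far apart for light to connect them (dx > dt): the order can flip.
print(time_gap_in_frame(+0.5, dt=1.0, dx=3.0))  # negative: D comes before C
print(time_gap_in_frame(-0.5, dt=1.0, dx=3.0))  # positive: C comes before D

# Close enough for C to have caused D (dx < dt): every frame agrees.
print(time_gap_in_frame(+0.5, dt=3.0, dx=1.0))  # positive
print(time_gap_in_frame(-0.5, dt=3.0, dx=1.0))  # positive
```

So the ordering only flips for events that could not have influenced each other. If C could have caused D, the sign never changes for any speed below c, which is the light cone point: nobody sees cause and effect reversed.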
it might be meaningless, but in practical terms just don't check util.c from the gravity well into the git repo in orbit.
Love learning new things. This also explains why my Casio clock sync starts skewing over time.
PTP requires support not only on your network, but also on your peripheral bus and inside your CPU. It can't achieve better-than-NTP results without disabling PCI power saving features and deep CPU sleep states.
You can if you just run PTP (almost) entirely on your NIC. The best PTP implementations take their packet timestamps at the MAC on the NIC and keep time based on that. Nothing about CPU processing is time-critical in that case.
How so? If the NIC is processing the timestamps as it arrives/leaves on the wire, the latency and jitter in the rest of the system shouldn't matter.
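For reference, the arithmetic behind this is the standard four-timestamp exchange, and it shows why where the timestamp is taken matters so much. A sketch with made-up numbers in nanoseconds; `offset_and_delay` is an illustrative helper, not a call from any PTP stack, and the formula assumes the path delay is the same in both directions.

```python
def offset_and_delay(t1, t2, t3, t4):
    """Estimate clock offset and one-way delay from one exchange.

    t1: master sends Sync            (master clock)
    t2: follower receives Sync       (follower clock)
    t3: follower sends Delay_Req     (follower clock)
    t4: master receives Delay_Req    (master clock)
    Assumes the path delay is symmetric.
    """
    offset = ((t2 - t1) - (t4 - t3)) / 2
    delay = ((t2 - t1) + (t4 - t3)) / 2
    return offset, delay


# Made-up scenario: follower is 500 ns ahead, the wire takes 1000 ns each way.
print(offset_and_delay(t1=0, t2=1_500, t3=2_000, t4=2_500))

# Same exchange, but t2 is stamped in software 40 us after the packet
# actually arrived (interrupt latency, a sleeping CPU, and so on).
print(offset_and_delay(t1=0, t2=41_500, t3=2_000, t4=2_500))
```

Half of any stamping latency lands directly in the offset estimate, so stamping at the MAC keeps bus and CPU jitter out of t1 through t4 entirely.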
> Google faced the clock synchronization problem at an unprecedented scale with Spanner, its globally distributed database. They needed strong consistency guarantees across data centers spanning continents, which requires knowing the order of transactions.
> Here’s a video of me explaining this.
Do you need a video? Do we need a 42 minute video to explain this?
I generally agree with Feynman on this stuff. We let explanations be far more complex than they need to be for most things, and it makes the hunt for accidental complexity harder because everything looks almost as complex as the problems that need more study to divine what is actually going on there.
For Spanner to be useful they needed a high transaction rate, and in a distributed system that requires very tight grace periods for First Writer Wins. Tighter than you can achieve with NTP or system clocks. That’s it. That’s why they invented a new clock.
Google puts it this way:
> Under external consistency, the system behaves as if all transactions run sequentially, even though Spanner actually runs them across multiple servers (and possibly in multiple datacenters) for higher performance and availability.
But that’s a bit thick for people who don’t spend weeks or years thinking about distributed systems.