This article is stuck in a very wide valley: perhaps somewhat familiar with the domain concepts, but nowhere near deep enough to draw any of the conclusions being drawn. It is close to being completely wrong.
The primary tradeoff of initcwnd is picking a reasonable window before you've learned anything about the path. BBR has little to say about this because it takes, in relative terms, quite a while to go through its phases. An early BBR session is therefore not really superior to other congestion controls, because the very start of a connection is not the problem BBR is focused on.
Jacking up the initcwnd, you start to risk tail loss, which is the worst kind of loss for a sliding window, especially in the primordial connection. There are ways of trying to deal with all that, but they amount to loss prediction.
If you are a big enough operator, maybe you have some a priori knowledge to jack this up for certain situations. But people are also reckless and do not understand the tradeoffs or overall fairness that the transport community tries to achieve.
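For reference, on Linux this knob lives on the route rather than on the socket. Below is a minimal sketch of how an operator with that kind of a priori knowledge might raise it for one specific prefix; the prefix, gateway, and device names are placeholders, and it assumes root plus iproute2 on Linux.

```python
import subprocess

# Hypothetical values: a destination prefix we have a priori knowledge about,
# plus the gateway and device that serve it. Requires root and iproute2 on Linux.
PREFIX = "203.0.113.0/24"   # placeholder (TEST-NET-3)
GATEWAY = "192.0.2.1"       # placeholder gateway
DEVICE = "eth0"             # placeholder interface

# initcwnd raises the initial congestion window used when sending toward this
# route; initrwnd raises the initial receive window advertised to it. Both
# trade a larger first flight against burst and tail loss on an unknown path.
subprocess.run(
    ["ip", "route", "replace", PREFIX,
     "via", GATEWAY, "dev", DEVICE,
     "initcwnd", "20", "initrwnd", "20"],
    check=True,
)

# Show the route to confirm the attributes took effect.
print(subprocess.run(["ip", "route", "show", PREFIX],
                     capture_output=True, text=True).stdout)
```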
As other comments have pointed out, QUIC stacks also reimplement congestion control and other algorithms based on the TCP RFCs. These are usually much simpler and less feature-complete than the mainline Linux TCP stack. It's not a free lunch, and it doesn't do away with the tradeoffs every transport protocol has to make.
Google has probably sent data to almost every /24 in the last hour. Probably 99% of their egress data goes to destinations where they've sent enough data recently to make a good estimate of bottleneck link speed and queue size.
Having to pick a particular initcwnd to be used for every new TCP connection is an architectural limitation. If they could collect data about each destination and start each TCP connection with a congestion window based on the recent history of transfers from any of their servers to that destination, it could be much better.
It's not a trivial problem to collect bandwidth and buffer size estimates and provide them to every server without delaying the connection, but it would be fun to build such a system.
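A rough sketch of what the serving side of such a system might look like, under the purely hypothetical assumption that observations are aggregated per /24 and used to seed the initial window. This is not how Google does it, just an illustration of the idea.

```python
import ipaddress
import time
from dataclasses import dataclass

@dataclass
class PathHint:
    btl_bw_bps: float      # estimated bottleneck bandwidth (bits/sec)
    min_rtt_s: float       # lowest RTT observed recently (seconds)
    updated_at: float      # unix timestamp of the last observation

class HintCache:
    """Per-/24 cache of recent transfer observations (hypothetical design)."""

    def __init__(self, ttl_s: float = 3600.0, mss: int = 1460):
        self.ttl_s = ttl_s
        self.mss = mss
        self._hints: dict[str, PathHint] = {}

    @staticmethod
    def _key(ip: str) -> str:
        return str(ipaddress.ip_network(f"{ip}/24", strict=False))

    def record(self, ip: str, btl_bw_bps: float, min_rtt_s: float) -> None:
        self._hints[self._key(ip)] = PathHint(btl_bw_bps, min_rtt_s, time.time())

    def initial_cwnd(self, ip: str, default: int = 10, cap: int = 100) -> int:
        """Suggest an initial cwnd in segments: a fraction of the estimated
        bandwidth-delay product, falling back to the default when stale."""
        hint = self._hints.get(self._key(ip))
        if hint is None or time.time() - hint.updated_at > self.ttl_s:
            return default
        bdp_bytes = hint.btl_bw_bps / 8 * hint.min_rtt_s
        # Start at half the BDP to leave headroom for queues and cross traffic.
        return max(default, min(cap, int(bdp_bytes / 2 / self.mss)))

cache = HintCache()
cache.record("198.51.100.7", btl_bw_bps=50e6, min_rtt_s=0.030)  # 50 Mbit/s, 30 ms
print(cache.initial_cwnd("198.51.100.99"))  # same /24 -> ~64 segments
```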
You're forgetting wireless access networks with varying signal strength (home wifi; phone in the basement).
Wifi does some amount of retransmits of dropped packets at the link level.
This!
They also miss the fact that even with an initcwnd of 10, the TLS negotiation isn't going to consume it, so the window starts growing long before content is actually sent (see the sketch below).
Plus there’s no discussion of things like packet pacing
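To put a rough number on the TLS point above: in slow start the window grows with every segment ACKed, so the handshake flight itself enlarges the window before the first byte of content goes out. A back-of-the-envelope sketch; the 5 kB handshake size is an assumption standing in for a typical certificate chain.

```python
MSS = 1460  # typical TCP payload bytes per segment

def cwnd_after_handshake(initcwnd: int = 10, handshake_bytes: int = 5000) -> int:
    """Approximate cwnd (in segments) once the TLS server flight is ACKed,
    assuming classic slow start growth of one segment per segment ACKed."""
    handshake_segments = -(-handshake_bytes // MSS)  # ceil division
    return initcwnd + handshake_segments

# The handshake flight uses ~4 of the initial 10 segments, and ACKing them
# bumps the window to ~14 segments before any application data is sent.
print(cwnd_after_handshake())
```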
can you please explain how packet pacing factors into this?
> Google also developed QUIC, which is HTTP over UDP. There’s no longer any congestion window to deal with, so entire messages can be sent at once.
I don't think that's true. QUIC implementations typically use the same congestion control algorithms, including both CUBIC and BBR, at least nominally. The latest RFCs for those discuss use with both TCP and QUIC. Though, perhaps when used with QUIC they have more degrees of freedom to tune things.
Also QUIC is not HTTP over UDP. HTTP/3 is HTTP over QUIC. QUIC is bidi streams and best-effort datagrams over UDP.
Vendors are pushing QUIC without the other networking layers on top, hence the performance increase. For example, SMB3 over QUIC over the Internet, sans VPN.
L4S[0] also helps a lot with sensing congestion before young connections have suffered their first lost packet...
Basically it sqrt's your actual packet loss rate as far as feedback frequency/density is concerned, typically without even having to actually enact that loss. For example, you could get congestion feedback on every 100th packet (as you would with 1% packet loss) under network conditions that would traditionally only produce 0.01% packet loss. From [1]:
Unless an AQM node schedules application flows explicitly, the likelihood that the AQM drops a Not-ECT Classic packet (p_C) MUST be roughly proportional to the square of the likelihood that it would have marked it if it had been an L4S packet (p_L). That is:
p_C ~= (p_L / k)^2
The constant of proportionality (k) does not have to be standardized for interoperability, but a value of 2 is RECOMMENDED. The term 'likelihood' is used above to allow for marking and dropping to be either probabilistic or deterministic.
[0]: https://www.rfc-editor.org/rfc/rfc9330.html [1]: https://www.rfc-editor.org/rfc/rfc9331#name-the-strength-of-...
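A quick worked example of that coupling with the RECOMMENDED k = 2 (note that the "every 100th packet" figure above ignores the factor k):

```python
# RFC 9331 coupling between the L4S marking probability p_L and the Classic
# drop probability p_C: p_C ~= (p_L / k)^2, with k = 2 RECOMMENDED.
def classic_drop_prob(p_l: float, k: float = 2.0) -> float:
    return (p_l / k) ** 2

def l4s_mark_prob(p_c: float, k: float = 2.0) -> float:
    # Inverse view: how dense the ECN feedback is under conditions where a
    # Classic flow would see drop probability p_c.
    return k * p_c ** 0.5

print(classic_drop_prob(0.01))   # 1% marking couples to a 0.0025% Classic drop rate
print(l4s_mark_prob(0.0001))     # 0.01% loss conditions -> 2% marking (~every 50th packet)
```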
This might be one of those induced demand situations like building more lanes on a highway, which generally makes traffic worse.
The actual limiting factor for how horribly bloated frontend code becomes is that at some point it becomes so bad that it noticeably impacts your business negatively, and you need to improve it.
Increasing the TCP window so it can manage at least the basic asset delivery makes sense, but if you need to cold start regularly and you have hundreds of kB of JavaScript, perhaps fix your stuff?
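To make the "hundreds of kB of JavaScript" point concrete, a back-of-the-envelope sketch of how many round trips slow start needs for a given payload, ignoring loss, TLS framing, and pacing:

```python
MSS = 1460  # typical TCP payload bytes per segment

def round_trips(payload_bytes: int, initcwnd: int = 10) -> int:
    """Round trips to deliver payload_bytes with slow start doubling the
    congestion window every RTT, ignoring loss, TLS framing and pacing."""
    segments_left = -(-payload_bytes // MSS)  # ceil division
    cwnd, rtts = initcwnd, 0
    while segments_left > 0:
        segments_left -= cwnd
        cwnd *= 2
        rtts += 1
    return rtts

for size_kb in (14, 100, 500):
    print(size_kb, "kB:",
          round_trips(size_kb * 1024, initcwnd=10), "RTTs at initcwnd 10 vs",
          round_trips(size_kb * 1024, initcwnd=1), "at initcwnd 1")
```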
IIRC, all latency-driven congestion control algorithms suffer under violent RTT variance, which happens frequently in wireless networks. How does BBR perform under such circumstances?
Just the other day I discovered TCP congestion windows and spent the day tweaking and benchmarking Vegas, Reno, CUBIC, and TCTP.
I've also tweaked and marked benches in Vegas and Reno, to my great shame.
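For anyone who wants to reproduce that kind of benchmarking: on Linux the congestion control can be chosen per socket via the TCP_CONGESTION option, rather than system-wide through sysctl. A minimal sketch, assuming the relevant kernel modules (e.g. tcp_vegas, tcp_bbr) are loaded:

```python
import socket

# Pick the congestion control for a single connection instead of changing
# the system-wide sysctl; the corresponding kernel module must be loaded.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, b"cubic")
s.connect(("example.com", 80))

# Read back the algorithm actually in use (returned as a null-padded buffer).
in_use = s.getsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, 16)
print(in_use.rstrip(b"\x00").decode())
s.close()
```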
How would this affect DDoS attacks? Would it make you more vulnerable?
Someone needs to reread all of the Sally Floyd et al. research from the International Computer Science Institute at UC Berkeley.
Her work touched the congestion algorithms of all the TCP variants: Tahoe, Reno, New Reno, Carson, Vegas, SACK, Westwood, Illinois, Hybla, Compound, HighSpeed, BIC, CUBIC, DCTCP, BBR, BCP, XCP, RCP.
And it all boils down to:
* how much propagation delay there is,
* how long each packet is, and
* whether there is sufficient buffer space (hence the "bufferbloat" concern).
Also, TCP congestion algorithms are neatly pegged as
* reactive (loss-based)
* proactive (delay-based)
* predictive (bandwidth estimation)
https://egbert.net/blog/articles/tcp-evolution.html
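As a rough illustration of those buckets, a small sketch; the mapping is my own shorthand and certainly debatable, and the /proc read assumes Linux.

```python
# Rough (and debatable) bucketing of well-known Linux congestion control modules.
CATEGORY = {
    "reno": "reactive (loss-based)",
    "bic": "reactive (loss-based)",
    "cubic": "reactive (loss-based)",
    "highspeed": "reactive (loss-based)",
    "vegas": "proactive (delay-based)",
    "nv": "proactive (delay-based)",
    "westwood": "predictive (bandwidth estimation)",
    "bbr": "predictive (bandwidth estimation)",
}

# Cross-check against what the running kernel actually offers (Linux only).
with open("/proc/sys/net/ipv4/tcp_available_congestion_control") as f:
    for name in f.read().split():
        print(f"{name:>10}: {CATEGORY.get(name, 'not classified above')}")
```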
also citations:
DUAL (Wang & Crowcroft, 1992) https://www.cs.wustl.edu/~jain/cis788-95/ftp/tcpip_cong/
TCP Veno (Fu & Liew, 2003) https://www.ie.cuhk.edu.hk/wp-content/uploads/fileadmin//sta... https://citeseerx.ist.psu.edu/document?doi=003084a34929d8d2c...
TCP Nice (Venkataramani, Kokku, Dahlin, 2002) https://citeseerx.ist.psu.edu/document?doi=10.1.1.12.8742
TCP-LP (Low Priority TCP, Kuzmanovic & Knightly, 2003) https://www.cs.rice.edu/~ek7/papers/tcplp.pdf
Scalable TCP (Kelly, 2003) https://www.hep.ucl.ac.uk/~ytl/talks/scalable-tcp.pdf
H-TCP (Leith & Shorten, 2004) https://www.hamilton.ie/net/htcp/
FAST TCP (Jin, Wei, Low, 2004/2005) https://netlab.caltech.edu/publications/FAST-TCP.pdf
TCP Africa (King, Baraniuk, Riedi, 2005) https://www.cs.rice.edu/~ied/comp600/PROJECTS/Africa.pdf
TCP Libra (Marfia, Palazzi, Pau, Gerla, Sanadidi, Roccetti, 2007) https://www.cs.ucla.edu/NRL/hpi/tcp-libra/
YeAH-TCP (Yet Another High-speed TCP, Baiocchi, Castellani, Vacirca, 2007) https://dl.acm.org/doi/10.1145/1282380.1282391
TCP-Nice and other background CCAs https://en.wikipedia.org/wiki/TCP_congestion_control
TCP-FIT (Wang, 2016) https://www.sciencedirect.com/science/article/abs/pii/S10848...
I seem to remember this coming up a few times over the years and it’s always bad iirc.
I reckon bufferbloat is overhyped as a problem. It mattered to a small slice of Internet connections in the 2010s and promptly went away as connectivity changed and improved, yet we continue to look at it like it was yesterday's problem.
Bufferbloat is alive and well. Try a t-mobile 5g home gateway. Oof.
I think cable modems have had a ton of improvement, and more fiber in our diet helps, but mobile can be tricky, and wifi is still highly variable (there's promising signs, but I don't know how many people update their access points)
Is it though, or is it just a scapegoat or a red herring, especially in the case with a wireless medium? That's been my experience with quick claims to bufferbloat, it's usually something else at play. But again ymmv.
There's always something else at play. Bufferbloat hides problems from the systems that can easily solve them. It doesn't cause problems, it makes them worse.
I mean, I did a speed test with t-mobile 5g home internet, download speed was impressive, but so was the difference in ping time during the download vs otherwise.
Sure, wireless is complex, but there were definitely some way too big buffers in the path. Add in some difficulty integrating their box into my network, and it wasn't for me.
Fair enough, I concede your point. My understanding of bufferbloat (which I have to relearn every time I look at it) is that the telltale sign is that a ping to any destination traversing the uplink exhibits higher latency than usual while you're saturating your download. It's just a tricky thing to test given the variability of conditions (and what might be deemed expected operation), which is why I'm usually hesitant and sceptical, and I don't trust those speedtest websites to gauge it properly.
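For what it's worth, that telltale sign is easy enough to measure yourself without trusting a speed-test site. A rough sketch, where the download URL is a placeholder for any large file on a fast server, and TCP connect time to 1.1.1.1:443 stands in for ping:

```python
import socket
import statistics
import threading
import time
import urllib.request

PROBE = ("1.1.1.1", 443)                      # RTT probe target (TCP connect time)
BIG_FILE = "https://example.com/large.bin"    # placeholder: any big file on a fast server

def connect_rtt_ms() -> float:
    t0 = time.monotonic()
    with socket.create_connection(PROBE, timeout=5):
        pass
    return (time.monotonic() - t0) * 1000

def sample(n: int = 10) -> float:
    return statistics.median(connect_rtt_ms() for _ in range(n))

def saturate(stop: threading.Event) -> None:
    # Pull the file and throw the bytes away to keep the downlink busy.
    with urllib.request.urlopen(BIG_FILE) as resp:
        while not stop.is_set() and resp.read(1 << 16):
            pass

idle = sample()
stop = threading.Event()
threading.Thread(target=saturate, args=(stop,), daemon=True).start()
time.sleep(2)            # let the transfer ramp up and fill any buffers
loaded = sample()
stop.set()

print(f"idle ~{idle:.0f} ms, under load ~{loaded:.0f} ms")
# A large increase under load suggests an oversized, unmanaged buffer in the path.
```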
Every speed test I tried that measures latency under load shows a large difference between fq_codel on and off.
This is very much a problem ISPs have to deal with as big pipes feed small pipes.
What do ISPs actually have at their disposal to determine these issues, outside of sketchy speedtest websites and vague reports or concerns from customers? What about placing probes in the right places (e.g. in conditions where there is no additional loss or introduced latency between the end user and the uplink)? Also is this an actual problem that users are really having, or is it perceived because some benchmark / speedtest gave you a score.
There are a lot of issues and variables at play; this isn't a case of "it's always DNS". What tools do ISPs even have at their disposal, how accurate are they, and do they uncover the actual problem users are experiencing? This is the real issue that ISPs of all sizes have to deal with.
> Also is this an actual problem that users are really having, or is it perceived because some benchmark / speedtest gave you a score.
The actual problem is I'm on a VoIP call and someone starts a big download (Steam), and latency and jitter go to hell and the call is unusable. A bufferbloat test confirms that latency dramatically increases under load. Or the same call, but someone starts uploading something big.
If the troublesome buffers are at the last-mile connection and the ISP provides the modem/router, adding QoS that limits downloads and uploads to about 90% of the achieved physical rate will avoid the issue (see the sketch below). The buffers are still too big, but they won't fill under normal conditions, so it's not a problem. You could still fill them with a big flow that doesn't use effective congestion control, or with enough flows that their combined minimum send rate is still too much, or when the physical connection rate changes, but it's good enough.
Otherwise, ISP visibility can be limited. Not all equipment will report on buffer use, and even if it does, it may not report on a per port basis, and even then, the timing of measurement might miss things. What you're looking for is a 'standing buffer' where a port always has at least N packets waiting and the buffer does not drain for a meaningful amount of time. Ideally, you'd actually measure the buffer length in milliseconds, rather than packets, but that's asking a lot of the equipment.
There's a balance to be struck as well. Smaller buffers mean packet drops, which is appropriate when dealing with standing buffers, but buffers that are too small cause problems if your traffic is prone to "micro bursts": lots of packets at once, potentially on many flows, and then calm for a while. It's better to have room to buffer those.
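A minimal sketch of the ~90% shaping idea from the comment above, using the CAKE qdisc. It assumes Linux with root, a router with separate LAN- and WAN-facing interfaces, and measured rates of roughly 100 Mbit/s down and 20 Mbit/s up; the interface names are placeholders.

```python
import subprocess

LAN_IF = "eth0"   # interface facing the LAN (shapes the download direction)
WAN_IF = "eth1"   # interface facing the ISP (shapes the upload direction)

def shape(iface: str, rate_mbit: float) -> None:
    # Replace the root qdisc with CAKE, capped just under the achieved rate
    # so the oversized buffers upstream never get a chance to fill.
    subprocess.run(
        ["tc", "qdisc", "replace", "dev", iface, "root",
         "cake", "bandwidth", f"{rate_mbit:g}mbit"],
        check=True,
    )

shape(LAN_IF, 0.9 * 100)   # ~90% of the measured 100 Mbit/s downstream
shape(WAN_IF, 0.9 * 20)    # ~90% of the measured 20 Mbit/s upstream
```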
CoDel/CAKE also came from that project, no middlebox needed.
Internet connectivity improvement has slowed a lot. It was improving at a good clip in the 00's but then a lot of usage moved to mobile data which also caused investment to shift away from broadband speedups. If we had 00's rate of improvement, people would have 100G connections at home now.
(wifi also dampened bandwidth demand for a long time - it didn't make sense to pay for faster-than-wifi broadband)
A relative of mine runs a WISP (800+ customers). He's using LibreQoS to prevent bufferbloat (not its only feature) for his entire network.
Someone hasn't travelled a lot outside of the house I see?
Wifi is still mostly shitty in most places in the world.
Then there are countries like Philippines with just all around slow internet everywhere.
Never said there isn't connectivity or service quality issues inside or outside the home; just that bufferbloat specifically is a trope that should be put to rest.
Google has a long history of performing networking research, making changes, and pushing those changes to the entire internet. In 2011, they published one of my favorite papers, which described their decision to increase the TCP initial congestion window from 3 to 10 on their entire infrastructure.