It may also be worth noting that Japan has a pretty long history of marching to their own drummer in computing. They either created their own architectures or adopted others after pretty much everyone had moved on.
When you're building your own CPUs, why be beholden to US companies for GPUs? This makes perfect sense.
GPUs are great if your workload can use them, but not so great for more general tasks. These chips are better suited to traditional supercomputing tasks, in that they're not optimized for low-precision AI work the way NVIDIA GPUs are.
Something doesn't add up here. The listed peak fp64 performance assumes one fp64 operation per clock per thread, yet there's very little description of how each PE would actually perform 8 flops per cycle - only "threads are paired up such that one can take over processing when another one stalls...", which is classic latency hiding. So the peak figures must assume that each PE has either an 8-wide SIMD unit (16-wide for fp32), 8 separately schedulable execution units, or 4 FMA units, none of which seems likely given the supposed simplicity of the core. Am I missing something?
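A back-of-envelope sketch of that inference, in CUDA-compilable C++. The peak, PE count, and clock below are hypothetical placeholders, not figures from the article or from PEZY documentation; the point is just that the implied FLOPs per PE per cycle falls straight out of the peak number:

    #include <cstdio>
    int main() {
        // Hypothetical placeholders -- substitute the article's actual figures.
        const double peak_fp64_flops = 98.3e12;  // quoted FP64 peak (hypothetical)
        const double num_pe          = 8192;     // PE count (hypothetical)
        const double clock_hz        = 1.5e9;    // PE clock (hypothetical)
        // With these made-up numbers this prints 8.0, i.e. the peak would
        // require 8 FP64 FLOPs per PE per cycle (4 FMA units or 8-wide SIMD).
        printf("implied FP64 FLOPs per PE per cycle: %.1f\n",
               peak_fp64_flops / (num_pe * clock_hz));
        return 0;
    }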
Interesting that they’re investing in standard _AI_ toolchains, rather than standard HPC toolchains, even though I imagine Japanese supercomputing has more demand for the latter.
Great article documenting PEZY. It's incredible how close they are to NVidia despite being a very small team.
To me, this looks like a win.
Governments are there to finance projects like this, which let the country retain skill sets that wouldn't otherwise exist because other countries have better solutions on the global market.
The fp64 GFLOPS per watt metric in the post is almost entirely meaningless for comparing these accelerators with NVIDIA GPUs. For example, it says:
> Hopper H200 is 47.9 gigaflops per watt at FP64 (33.5 teraflops divided by 700 watts)
But then if you consider H100 PCIe [0] instead, it's going to be 26000/350 = 74.29 GFLOPS per watt. If you go look harder you can find ones with better on-paper fp64 performance, for example AMD MI300X has 81.7 TFLOPs with typical board power of "750W Peak", which gives 108.9 GFLOPS per watt.
The truth is the power allocation of most GPGPUs is heavily tilted toward tensor usage. This was the trend well before the B300.
That's all for HPC.
And Pezy processors are certainly not designed for "AI" (i.e. linear algebra with lower input precision). For AI inference, since around 2020 everyone has been talking about how many T(FL)OPS per watt, not G.
[0] which is a nerfed version of H200's precursor.
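For reference, a quick sketch that recomputes FP64 GFLOPS per watt from the spec-sheet numbers quoted in this thread (rated board power, not measured draw):

    #include <cstdio>
    struct Part { const char* name; double fp64_tflops, board_watts; };
    int main() {
        // Figures as quoted above and in the article's excerpt.
        const Part parts[] = {
            {"H200 SXM",  33.5,  700},
            {"B200",      40.0, 1200},
            {"B300",       1.25,1400},
            {"H100 PCIe", 26.0,  350},
            {"MI300X",    81.7,  750},
        };
        for (const Part& p : parts)
            printf("%-10s %6.1f FP64 GFLOPS/W\n", p.name,
                   1000.0 * p.fp64_tflops / p.board_watts);
        return 0;
    }

That prints roughly 47.9, 33.3, 0.9, 74.3, and 108.9 respectively, which is the whole point: the ranking flips around depending on which SKU you pick as the NVIDIA/AMD baseline.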
So are companies (Itanium, Windows Mobile, etc.), but what governments do well is funding the competitive baseline needed for big advances. We live in an age of wonders built on American research investment in the mid-20th century, and that worked because the government did not try to pick winners but invested in good work by qualified people (everything NIH, NSF, etc. do through competitive grants) or promised to pay for capabilities not yet available (a lot of NASA and military stuff).
Just as an ecosystem based on a single species doesn't work, a society has to blend government and private spending. They work on different incentives and timeframes, and each has pitfalls that the other might handle better.
But what governments often can do is break out of the local optima clustered around the quarterly-earnings economy, take moonshot chances, and find paths that would otherwise never be taken. Hopefully one of these paths turns out to be great.
The difficult thing becomes deciding when to pull the plug. Is ITER a good thing or not? (Results-wise, it is, but for the money? Who can really tell.)
I wonder how much progress (if any) is being made on floating point formats other than IEEE floats, on serious adoption in hardware in particular. Stuff like posits [1], for instance, looks very promising.
The problem with posits is that they aren't enough better to be worth a switch. Switching the industry over would cost billions in software rewrites, and while there are benefits, they are fairly marginal.
For deep learning workloads the software for posits isn't the issue, it's doing anything that's not NVIDIA if you want to do it as a standalone product. For NVIDIA it's likely the penalty of not being able to share logic with standard size IEEE floats. If adopting posits allowed significantly smaller data types then NVIDIA would likely have adopted already.
I don't disagree, it's just that the advantage hasn't yet been shown to be big enough to justify dedicating die area in a mainstream chip. There's potential there, if I were designing an accelerator today I would look hard at posits and variations of blocked representations especially around four bits. A few years back I got to have coffee with John Gustafson which was pretty neat and got me more excited about the idea.
Adding to what everyone else has said, Japan is known to be a threshold nuclear state (from a weapons perspective). They explicitly stay just weeks away from being able to perform a nuclear weapons test, and they are commonly referred to as being "a screwdriver's turn" away from having a nuclear weapon.
They have massive government investment in not only maintaining that status, but also doing so on a completely domestic supply chain as much as possible.
Therefore they have the same need for supercomputers that the US national labs do (perhaps more so, since they're even more reliant on simulation), and heavily prefer locally sourced pieces of that critical infrastructure.
I wouldn't be surprised if an incredibly large part of the local push for Rapidus is to pull them off TSMC and away from the supply-chain risk to their nuclear program in case the whole China/Taiwan thing comes to a head.
Oh, that's very interesting. Unfortunately for all of us, it seems many countries are revisiting nuclear ambitions, but I can see how that makes sense from the Japanese perspective, given an environment with more aggression in general and a much less reliable US as an ally.
Funny because I met quite a few Japanese that liked Trump leading up to his first term. They thought he was really going to fuck with China.
> They explicitly stay around just weeks away from being able to perform a nuclear weapons test
Do you have a citation for "weeks away"? Wikipedia only says "within one year": https://en.wikipedia.org/wiki/Japanese_nuclear_weapons_progr...
There’s no citation, because why would there be?
But: if you consider the amount of nuclear generating capacity Japan has (4th in the world, more than Russia), and its advanced space program, "within one year" probably means closer to "weeks or months" than "three hundred and sixty-four days".
But... all of the Pezy chips in the article are fabbed by TSMC.
Hence my last paragraph speculating about some of the ambitions and definitions of success behind Rapidus.
Because the LLM craze has rendered the latest-gen tensor accelerators from NVIDIA (& others) useless for all those FP64 HPC workloads. From the article:
> The Hopper H200 is 47.9 gigaflops per watt at FP64 (33.5 teraflops divided by 700 watts), and the Blackwell B200 is rated at 33.3 gigaflops per watt (40 teraflops divided by 1,200 watts). The Blackwell B300 has FP64 severely deprecated at 1.25 teraflops and burns 1,400 watts, which is 0.89 gigaflops per watt. (The B300 is really aimed at low precision AI inference.)
Do cards with intentionally handicapped FP64 actually use anywhere near their TDP when doing FP64? It's my understanding that FP64 performance is limited at the hardware level--whether by fusing off the extra circuits, or omitting them from the die entirely--in order to prevent aftermarket unlocks. So I would be quite surprised if the card could draw that much power when it's intentionally using only a small fraction of the silicon.
It's really to save die space for other functions; AFAIU there is no fusing to lock the features or anything like that.
I'm finding conflicting info on this. It seems to be down to the specific GPU/core/microarchitecture. In some cases, the "missing" FP64 units do physically exist on the dies, but have been disabled--likely some of them were defective in manufacturing anyway--and this disabling can't be undone with custom firmware AFAIK (though I believe modern nVidia cards will only load nVidia-signed firmware anyway). Then, there are also dies that don't include the "missing" FP64 units at all, and so there's nothing to disable (though manufacturing defects may still lead to other components getting disabled for market segmentation and improved yields). This also seems to be changing over time; having lots of FP64 units and disabling them on consumer cards seems to have been more common in the past.
Nevertheless, my point is more that if FP64 performance is poor on purpose, then you're probably not using anywhere near the card's TDP to do FP64 calculations, so FLOPS/watt(TDP) is misleading.
In general: consumer cards with very bad FP64 performance have it fused off for product segmentation reasons, datacenter GPUs with bad FP64 performance have it removed from the chip layout to specialize for low precision. In either case, the main concern shouldn't be FLOPS/W but the fact that you're paying for so much silicon that doesn't do anything useful for HPC.
This theory only makes sense if consumer cards are sharing dies with enterprise/datacenter cards. If the consumer card SKUs are on their own dies, they're not going to etch something into silicon only to then fuse it off after the fact.
Regardless, there are "tricks" you can use to sort of extend the precision of hardware floating point - using a pair of, e.g., FP32 numbers to implement something that's "almost" an FP64. Well known among numerics practitioners.
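A minimal sketch of those building blocks - Knuth's two-sum and an FMA-based two-prod, combined into a simplified "float-float" add - written as CUDA host/device functions. The struct, names, and demo values are mine, not from any particular library:

    #include <cmath>
    #include <cstdio>

    struct ff { float hi, lo; };   // represented value is hi + lo

    // Knuth's two-sum: hi + lo == a + b exactly (barring overflow).
    __host__ __device__ inline ff two_sum(float a, float b) {
        float s  = a + b;
        float bb = s - a;
        float e  = (a - (s - bb)) + (b - bb);
        return { s, e };
    }

    // FMA-based two-prod: hi + lo == a * b exactly.
    __host__ __device__ inline ff two_prod(float a, float b) {
        float p = a * b;
        return { p, fmaf(a, b, -p) };
    }

    // Simplified float-float add: roughly double FP32's 24-bit significand,
    // still short of FP64's 53 bits and with only FP32's exponent range,
    // hence "almost" an FP64.
    __host__ __device__ inline ff ff_add(ff x, ff y) {
        ff s = two_sum(x.hi, y.hi);
        float lo = s.lo + (x.lo + y.lo);
        return two_sum(s.hi, lo);   // renormalize
    }

    int main() {   // host-side sanity check: the error term of a product
        ff p = two_prod(1.0f + 1e-3f, 1.0f - 1e-3f);
        printf("hi = %.9g, lo = %.9g\n", p.hi, p.lo);
        return 0;
    }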
Until recently, consumer, workstation, and datacenter GPUs would all share a single core design that was instantiated in varying quantities per die to create a product stack. The largest die would often have little to no presence in the consumer market, but fundamentally it was made from the same building blocks. Now, having an entirely separate or at least heavily specialized microarchitecture for data center parts is common (because the extra design costs are worth it), but most workstation cards are still using the same silicon as consumer cards with different binning and feature fusing.
Consumer cards don't share dies with datacenter cards, but they do share dies with workstation cards (the former Quadro line); e.g. the GB202 die is used by both the RTX PRO 5000/6000 Blackwell and the RTX 5090.
I know some consumer cards have artificially limited FP64, but the AI-focused datacenter cards have physically fewer FP64 units. Recently, the GB300 removed almost all of them, to the point that a GB300 actually has fewer FP64 TFLOPS than a 9-year-old P100. FP32 is the highest precision used during training, so it makes sense.
A 53×53 bit multiplier is more than 4× the size of a 24×24 bit multiplier.
Pezy and the other Japanese native chips are first and foremost about HPC. The world may have picked up AI in the last 2 years, but the Japanese chipmakers are still thinking primarily about HPC, with AI as just one HPC workload.
These Pezy chips are also made for large clusters. There is a whole system design around the chips that wasn't presented here. The Pezy-SC2, for instance, was built around liquid immersion cooling. I am not sure you could ever buy an air-cooled version.
>liquid immersion cooling
Is the whole board submersed in liquid? Or just the processor?
https://www.wikiwand.com/en/articles/Gyoukou
"Each immersion tank can contain 16 Bricks. A Brick consists of a backplane board, 32 PEZY-SC2 modules, 4 Intel Xeon D host processors, and 4 InfiniBand EDR cards. Modules inside a Brick are connected by hierarchical PCI Express fabric switches, and the Bricks are interconnected by InfiniBand."
I remember some offhand remarks on that. Apparently the rooms for these systems had cheap ladles hung somewhere, and engineers would have fun scooping out the water puddles that collected on top of enclosures full of Fluorinert coolant. That's tanks full of PFAS, in layman's terms...
But, importantly, nontoxic PFAS.
Funny site. Seems to be a reskin of the Wikipedia article https://en.wikipedia.org/wiki/Gyoukou
> Pezy-SC2, for instance, was built around liquid immersion cooling
Well that was a disappointing end to a sentence. I was hoping another company would invest a few million in HPC to play SC2!
https://www.youtube.com/watch?v=UuhECwm31dM
I think they just do what they have done well. LLMs don't take demand away from HPC, like physics and weather simulations. Arguably if some of their competitors divert resources to LLMs it might even be better for them.
It's unfortunate that they don't sell them on the open market. There are a few of these accelerators that could threaten NVIDIA's monopoly if prices (and manufacturing costs!) were right.
They do sell these on the open market. You just have to be in the market for an entire cluster. The minimum order quantity for Pezy is several racks.
I thought they were more like "wire us our share of the METI grant, and we'll forward it to TSMC". Besides, they wouldn't still be in business if that were chasing away 100% of their customers.
Another one of these I still sometimes think about is the NEC VectorEngine - 5 TFLOPS FP32 with 48GB of HBM2 at 1.5TB/s of bandwidth for $10k in 2020. That was within a digit or two of NVIDIA at basically the same price. But they then didn't capitalize on it, and just kept delivering to national institutes in a ritualistic manner.
I do have a basic conceptual understanding of these grant businesses, and vague intuitions as to why the bureaucracy wants substantial capital investment and report files without commercial exploitation - with emphasis on the last part, since commercialization would disrupt internal politics inside government agencies and also create unfair government competitive pressure on the civilian sector - but at some point it starts looking like a cash campfire. I don't know exactly how slow M4 Mac Studios are relative to NVIDIA Tesla clusters normalized for VRAM, but they're considered comparable regardless, just because they run LLMs at 10-20 tok/s. So it's just unfortunate that these accelerators, of basically the same nature as the M-series chips, are built, kept idle, and then recycled.
The one that sticks in my mind as "no way these brochure figures are real" is the PFN MN-Core - though it looks like they might be doing an LLM-specific variant in the future. Hopefully they retail them.
I've come to wonder if this is just because the culture of Japan itself has been so "crabby".
It's just too inward-looking these days - probably why technical innovations in Japan don't get shaped to meet the world's needs, but instead get sold as if they were luxury artisan products (a la "Swiss-made" stuff).
The trouble with people who criticize Japan (incl. Japanese) is that they think this is because of "old people & culture" - but actually, no, the "old" Japanese (in the 1900s-1980s) seemed to have been extraordinarily curious about the world, and also very clever in marketing things. The issue is most definitely "modern", but ofc. saying that is verboten in the dogma of liberalism.
You might be onto something. Anecdotally, I just came across some armchair economist lamenting the lack of aptitude among Japanese startups for earning foreign currency, and I could only agree - I can't recall many Japanese startups or corporate expansions primarily focused on foreign sales. The mental model is always to get rich quick within Japan and/or South Asia and retire. "The world" outside is treated like a set of tiny separate bonus rooms.
Thinking about the Japanese economy in general and how it hasn't grown in 30-40 years: 40 years is technically two generations, but life in Japan hasn't deteriorated meaningfully during that period. Substantial socio-political improvements were made, university enrollment rose somewhat absurdly high, some new infrastructure was built, convenience store sandwich prices haven't doubled, and overtime and harassment at workplaces are far more strictly scrutinized. There's the problem of the employment ice age, but it's not as bad as the collapse of the Soviet Union; not at "Vladimir Putin was laid off from the KGB and drove a taxi to make ends meet" levels, only "PhDs drove trucks". So the "Japan still using FAX" narratives only partially make sense. Overall, it does feel like something strange is going on in this country, something like effective isolationism.
I guess my point is... it's unfortunate that all this effort and money goes to waste, and we don't know why these things are built only to be recycled.
Is there a known secondhand marketplace for retired supercomputer hardware?
The hardware is the easy part of accelerating NN training. Nvidia's software and infrastructure is so well designed and established that no competitor can threaten them even if they give away the hardware for free.
The math of NN training isn't complex at all. Designing the software stack to make a new pytorch backend is very doable with the budgets these AI companies have.
I suspect that whenever you look like you're making good progress on this front, nvidia gives you a lot of chips for free on condition you shelve the effort though!
The latest example being Tesla, who were designing their own hardware and software stack for NN training, then suspiciously got huge numbers of H100s ahead of other clients and cancelled the Dojo effort.
I doubt that's what happened. They had designs that were massively expensive to fab/package, had much worse performance than the latest Nvidia hardware, and still needed massive amounts of custom in-house development.
To combat all of these issues, they were fighting with Nvidia (and losing) for access to leading edge nodes, which kept going up in price. Their personnel costs kept rising as the company became more politicized, people left to join other companies (e.g. densityai), and they became embroiled in the salary wars to replace them.
My suspicion is that Musk told them to just buy Nvidia instead of waiting around for years of slow iteration to get something competitive.
The custom silicon I was involved with experienced similar issues. It was too expensive and slow to try competing with Nvidia, and no one could stomach the costs to do so.
> if they give away the hardware for free.
Seriously doubt that: free hardware (or tens of bucks) would galvanize the community and achieve huge support - look at the Raspberry Pi project's original prices and the consequences.
In fact, if any such thing would happen, I would wager Nvidia stock would tank massively.
Say, release as extensions to a RISC-V design.
I don't know about well designed but it's definitely established.
Could you elaborate?
I've only done a little work on CUDA, but I was pretty impressed with it and with their NSys tools.
I'm curious what you wish was different.
I actually really hate CUDA's programming model and feel like it's too low-level to get any productive work done. I don't really blame Nvidia, because they basically invented the programmable GPU and it wouldn't be fair to expect them to also come up with the perfect programming model right out of the gate, but at this point it's pretty clear that having independent threads work on their own programs makes no sense. High-performance code requires scheduling across multiple threads in a way that is completely different from what you're used to if you're coming from CPUs.
Of course, one might mention that GPUs are nothing like CPUs - but the programming model works super hard to try to hide this. So it's not really well designed in my book. I actually quite like the compilers people are designing these days for writing block-level code, because I feel it better represents the work people want to do, and then you pick which way you want it lowered.
As for Nsight (Systems), it is... ok, I guess? It's fine for games and such, but for HPC or AI it doesn't really surface the information that you would want. People who are running their GPUs really hard know they have kernels running all the time and need to know what the performance characteristics of those kernels are. Nsight Compute is the thing that tells you that, but it's kind of a mediocre profiler (some of this may be limitations of the hardware performance counters), and to use it effectively you basically have to read a bunch of blog posts instead of official documentation.
Despite not having used it much, my impression was that Nvidia's "moat" is that they have good networking libraries, that they are pretty good (relatively speaking) at making sure all their tools work, and that they have had consistent investment in this for a decade.
GPUs are a type of barrel processor, which are optimized for workloads without cache locality. As a fundamental principle, they replace the CPU cache with latency hiding behavior. Consequently, you can't use algorithms and data structures designed for CPUs, since most of those assume the existence of a CPU cache. Some things are very cheap on a barrel processor that are very expensive on a CPU and vice versa, which changes the way you think about optimization.
The wide vectors on GPUs are somewhat irrelevant. Scalar barrel processors exist and have the same issues. A scalar barrel processor feels deceptively CPU-like and will happily compile and run normal CPU code. The performance will nonetheless be poor unless the C++ code is designed to be a good fit for the nature of a barrel processor, code which will look weird and non-idiomatic to someone who has only written code for CPUs.
There is no way to hide that a barrel processor is not a CPU even though they superficially have a lot of CPU-like properties. A barrel processor is extremely efficient once you learn to write code for them and exceptionally well-suited to HPC since they are not latency-sensitive. However, most people never learn how to write proper code for barrel processors.
Ironically, barrel processor style code architecture is easy to translate into highly optimized CPU code, just not the reverse.
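As a rough illustration of the style difference, here's a device-side sketch (host setup omitted, names are mine) of what that looks like in CUDA: an irregular gather whose throughput depends entirely on keeping enough warps in flight to hide DRAM latency, not on any cache locality.

    // Each thread does an arbitrary (cache-unfriendly) indexed load; the GPU
    // stays busy only because each SM keeps many warps resident and issues
    // from another warp whenever one stalls on a memory access.
    __global__ void gather(const float* __restrict__ src,
                           const int*   __restrict__ idx,
                           float*       __restrict__ dst,
                           int n)
    {
        for (int i = blockIdx.x * blockDim.x + threadIdx.x;
             i < n;
             i += gridDim.x * blockDim.x)          // grid-stride loop
        {
            dst[i] = src[idx[i]];                  // latency hidden by warp
        }                                          // switching, not by a cache hit
    }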
I wanted to upvote you originally, but I'm afraid this is not correct. A GPU is not a barrel processor. In a barrel processor a single context is switched between multiple threads after each instruction. A barrel processor design has a singular instruction pipeline and a singular cache across all threads. In a GPU, due to the independence of the execution units, those threads will execute those instructions concurrently on all cores, as long as a program-based instruction dependency between threads is not introduced. It's true parallelism. Furthermore, each execution unit embeds its own instruction scheduler, its own pipeline and its own L1 cache (see [1] for NVidia's architecture).
[1] https://docs.nvidia.com/deeplearning/performance/dl-performa...
Barrel processors are a spectrum and GPUs are on one end of it. Yes, the classic canonical barrel processors (e.g. the Tera architecture) more or less work as you outline. That is a 40-year-old microarchitecture; they haven't been designed that way for decades.
Modern barrel processor implementations have complex microarchitectures that are much closer to a modern GPU in design. That is not accidental; the lineage is clearly there if you've worked on both. I will grant that vanishingly few people have ever seen or worked on a modern non-GPU barrel processor, since they are almost exclusively the domain of exotics built for government applications, AFAICT.
What are the most important representatives of the class?
They are similar enough wrt. how they hide memory access latency within each single processing core ("streaming multiprocessor") by switching across hardware threads ("wavefronts").
A context cannot be shared by multiple threads. Each thread must have its own context, otherwise all threads will crash immediately. Thus your description of a barrel processor is completely contrary to reality.
When threads are implemented only in software, without hardware support, you have what is called coarse-grained multithreading. In this case, a CPU core executes one thread, until that thread must wait for a long time, e.g. for the completion of some I/O operation. Then the operating system switches the context from the stalled thread to another thread that is ready to run, by saving all registers used by the old thread and restoring the registers of the new thread, from the values that were saved when the new thread has been executed last time.
Such multithreading is coarse-grained, because saving and restoring the registers is expensive so it cannot be done often.
When hardware assists context-switching, by being able to store internally in the CPU core multiple sets of registers, i.e. multiple thread contexts, then you can have FGMT (fine-grained multithreading). In the earliest CPUs with FGMT the switching of the thread contexts was done after each executed instruction, but in all more recent CPUs or GPUs with FGMT the context switching can be done after each clock cycle.
Barrel processors are a subset of the FGMT processors, the simplest and the least efficient of them. Barrel processors are now only of historical interest. Nobody has made barrel processors during the last decades. In barrel processors, the threads are switched in round robin, i.e. in a fixed order. You cannot choose the next thread to run. This wastes clock cycles, because the next thread in the fixed order may be stalled, waiting for some event, so nothing can be done during its allocated clock cycle.
The name "barrel", introduced by CDC 6600 in 1964, refers to the similarity with the barrel of a revolver, you can rotate it with a position, bringing the next thread for execution, but you cannot jump over a thread to reach some arbitrary position.
What is switched in a barrel CPU at each clock cycle between threads is not a context, i.e. not the registers, but the execution units of the CPU, which become attached to the context of the current thread, i.e. to its registers. For each thread there is a distinct set of registers, storing the thread context.
The descriptions of the internal architecture of GPUs are extremely confusing, because NVIDIA has chosen to replace in its documentation all the words that have been used for decades when describing CPUs with different words, with no apparent reason except obfuscating the GPU architecture. AMD has followed NVIDIA, and they have created a third set of architectural terms, mapped one to one to those of NVIDIA, but using yet other words, for maximum confusion.
For instance, NVIDIA calls "warp" what in a CPU is called "thread". What NVIDIA calls "thread" is what in a CPU is called "vector lane" or "SIMD lane". What NVIDIA calls "streaming multiprocessor" is what in a CPU is called "core".
Both GPUs and CPUs are made of multiple cores, which can execute programs in parallel.
Each core can execute multiple threads, which share the same execution units. For executing multiple threads, most if not all GPUs use FGMT, while most modern CPUs use SMT (Simultaneous Multithreading).
Unlike FGMT, SMT can exist only on superscalar processors, i.e. which can initiate the execution of multiple instructions in the same clock cycle. Only in that case it may also be possible to initiate the execution of instructions from distinct threads in the same clock cycle.
Some GPUs may be able to initiate 2 instructions per clock cycle, only when certain conditions are met, but for all such GPUs their descriptions are typically very vague and it may be impossible to determine whether those 2 instructions may come from different threads, i.e. from different warps in the NVIDIA terminology.
I mean, it could be worse... it could be Vulkan.
Who has better software than Nvidia for NN training? Meaning the least amount of friction getting a new network to train.
Just because their tools are the best doesn't mean they are designed well.
I've used DSPs, custom boards with compute hardware (FPGA image processing), and various kinds of GPUs. I would have a very hard time trying to point to ways in which the NVIDIA toolkit could be compared to what's out there and not come away with a massive sense of relief. For the most part 'it just works', the models are generic enough that you can actually get pretty close to the TDP on your own workloads with custom software and yet specific enough that you'll find stuff that makes your work easier most of the time.
I really can't complain. Now, FPGAs, however... And if there ever is a company that comes out and improves substantially on this, I'll be happy for sure, but if you asked me off the bat what they should improve I honestly wouldn't know - especially taking into account that this was an incremental effort over ~2 decades, that it originated in an industry that has nothing to do with the main use case today, and that it took some detours into unrelated industries besides (crypto, for instance).
Covering fluid dynamics, FEA, crypto, gaming, genetics, AI and many other fields with a single generic architecture, while delivering very good performance, is no mean feat.
I'd love to hear in what way you would improve on their toolset.
Not the guy you replied to, but here are some improvements that feel obvious:
1. Memory indexing. It's a pain to avoid bank conflicts and implement cooperative loading on transposed matrices. To improve this, (1) pop up a warning when bank conflicts are detected, (2) have the compiler solve cooperative loading. It wouldn't be too hard to have a second form of indexing, memory_{idx}, that the compiler solves a linear programming problem for to maximize throughput (do you spend more thread cycles on cooperative loading, or are bank conflicts fine because you have other things to work on?). The usual manual workaround today, tile padding, is in the sketch after this list.
2. Why is there no warning when shared memory is read uninitialized? It isn't hard to check whether you're accessing an index that might not have been assigned a value. The compiler should emit a warning and assign it 0.0, or maybe even just throw an error.
3. Timing - doesn't exist. Pretty much the gold standard is to run your kernel 10_000 times in a loop and subtract the time from before and after the loop (see the sketch after this list). This isn't terribly important, I'm just getting flashbacks to before I learned `timeit` was a thing in Python.
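For what it's worth, a minimal sketch of the manual workarounds for points 1 and 3 as they stand today: a shared-memory transpose tile padded by one column so the column-wise reads don't all hit the same bank, timed with the usual run-it-many-times loop around CUDA events. Matrix size and repetition count are arbitrary.

    #include <cstdio>
    #define TILE 32

    // Shared-memory transpose: the +1 column of padding keeps the column-wise
    // reads in the second half from all landing in the same shared-memory bank.
    __global__ void transpose(const float* in, float* out, int n) {
        __shared__ float tile[TILE][TILE + 1];          // +1 avoids bank conflicts
        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        if (x < n && y < n) tile[threadIdx.y][threadIdx.x] = in[y * n + x];
        __syncthreads();
        x = blockIdx.y * TILE + threadIdx.x;            // swapped block indices
        y = blockIdx.x * TILE + threadIdx.y;
        if (x < n && y < n) out[y * n + x] = tile[threadIdx.x][threadIdx.y];
    }

    int main() {
        const int n = 4096, reps = 1000;
        float *d_in, *d_out;
        cudaMalloc(&d_in,  n * n * sizeof(float));
        cudaMalloc(&d_out, n * n * sizeof(float));
        dim3 block(TILE, TILE), grid(n / TILE, n / TILE);

        // The "run it many times and divide" idiom from point 3,
        // done with CUDA events instead of host-side clocks.
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start);
        for (int i = 0; i < reps; ++i)
            transpose<<<grid, block>>>(d_in, d_out, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("avg kernel time: %.1f us\n", 1000.0f * ms / reps);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(d_in);
        cudaFree(d_out);
        return 0;
    }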
Those are good and actionable suggestions. Have you passed these on to NVIDIA?
https://forums.developer.nvidia.com/c/accelerated-computing/...
They regularly have threads asking for such suggestions.
But I don't think they add up to the general conclusion that the tooling is bad.
Who cares. It's viable so long as llama.cpp works and does 15 tok/s at under 500W or so. Whether the device accomplishes that figure with an 8B Q1 model or with 1T BF16 weight files is not a fundamental boolean limiting factor; there will probably be some uses for such an instrument as a proto-AGI device.
There is a type of research called traffic surveys, which involves hiring a few people with adequate education to sit or stand at an intersection for a whole day and count the passing entities by type. YOLO wasn't accurate enough. I have a gut feeling that a vision-enabled LLM would be. That doesn't require constant updates or upgrades to the latest NN innovations, so there's no need for full CUDA, as long as one known-good set of weights works.
It's not all about NNs and AI. Take a look at the Top500, a lot of people are doing classical HPC work on Nvidia GPUs, which are increasingly not designed for this. Unfortunately the HPC market is just a lot smaller than the AI bubble.
If the hardware isn't available at all, we'll never find out if the software moat could be overcome.
I don't know why you are getting downvoted. This is 100% true. It's not like you can take any random data and train it into a NN. You have to transform the data, you have to write the low-level GPU kernels which will actually run fast on that particular GPU, and you have to take the output and transform that as well. All of this is hard and very difficult to recreate from scratch.
If people use PyTorch on a Nvidia GPU they are running layers and layers of code written by those that know how to write fast kernels for GPUs. In some cases they use assembly as well.
Nvidia stuck to one stack and wrote all their high level libraries on it, while their competitors switched from old APIs to new ones and never made anything close to CUDA.
Because in the context of LLM transformers, you really just need matrix multiplication to be hyper-optimized, it's 90-99% (citation needed) of the FLOPs. Get some normalization and activation functions in and you're good to go. It's not a massive software ecosystem.
CUDA and CUBLAS being capable of a bunch of other things is really cool, and would take a long time to catch up with, but getting the bare minimum to run LLMs on any platform with a bunch of GDDR7 channels and cores at a reasonable price would have people writing torch/ggml backends within weeks.
Have you tried to write a kernel for basic matrix multiplication? Because I have, and I can assure you it is very hard to get 50% of maximum FLOPs, let alone 90%. It is nothing like CPUs, where you write a * b in C and the compiler gets you 99% of the performance.
Here is an example of how hard it is: https://siboehm.com/articles/22/CUDA-MMM
And this is just basic matrix mult. If you add activation functions it will slow down even more. There is nothing easy about GPU programming, if you care about performance. CUDA gives you all that optimization on a plate.
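To make that concrete, here is roughly what the first two rungs of that ladder look like: the naive one-thread-per-output kernel, and the same computation staged through shared-memory tiles. Both are correct, and both are still well short of what cuBLAS achieves on the same hardware (the linked article measures exactly this gap and keeps climbing from there):

    // Naive SGEMM: one thread per element of C = A * B (row-major,
    // A is MxK, B is KxN). This is the "obvious" kernel, and it typically
    // lands at a small fraction of what cuBLAS gets on the same GPU.
    __global__ void sgemm_naive(int M, int N, int K,
                                const float* A, const float* B, float* C) {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < M && col < N) {
            float acc = 0.0f;
            for (int k = 0; k < K; ++k)
                acc += A[row * K + k] * B[k * N + col];
            C[row * N + col] = acc;
        }
    }

    // First optimization: stage 32x32 tiles of A and B through shared
    // memory so each global load is reused 32 times. Still far from peak;
    // register tiling, vectorized loads, etc. come next.
    #define TILE 32
    __global__ void sgemm_tiled(int M, int N, int K,
                                const float* A, const float* B, float* C) {
        __shared__ float As[TILE][TILE];
        __shared__ float Bs[TILE][TILE];
        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float acc = 0.0f;
        for (int t = 0; t < K; t += TILE) {
            As[threadIdx.y][threadIdx.x] = (row < M && t + threadIdx.x < K)
                ? A[row * K + t + threadIdx.x] : 0.0f;
            Bs[threadIdx.y][threadIdx.x] = (t + threadIdx.y < K && col < N)
                ? B[(t + threadIdx.y) * N + col] : 0.0f;
            __syncthreads();
            for (int k = 0; k < TILE; ++k)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();
        }
        if (row < M && col < N)
            C[row * N + col] = acc;
    }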
Well, CUDA gives you a whole programming language where you have to figure out the optimization for your particular card's cache size and bus width.
I'm saying the API surface of what to offer for LLMs is pretty small. Yeah, optimizing it is hard but it's "one really smart person works for a few weeks" hard, and most of the tiling techniques are public. Speaking of which, thanks for that blog post, off to read it now.
it's "one really smart person works for a few weeks" hard
AMD should hire that one really smart person.
Yeah, they really should. The primary reason AMD is behind in the GPU space is that they massively under-prioritize software.
Not having written one of these (…well I've written an IDCT) I can imagine it getting complicated if there's any known sparsity to take advantage of.
I assure you from experience that it's more than a smart person for a few weeks.
Fascinating. https://en.m.wikipedia.org/wiki/Single_program,_multiple_dat... explains the relation to SIMT.
It may also be worth noting that Japan has a pretty long history of marching to their own drummer in computing. They either created their own architectures or adopted others after pretty much everyone had moved on.
When you're building your own CPUs, why be beholden to US companies for GPUs? This makes perfect sense.
GPUs are great if your workload can use them, but not so great for more general tasks. These are more appropriate for more traditional supercomputing tasks, in the sense that they're not optimized for the lower-precision AI stuff the way NVIDIA GPUs are.
Something doesn't add up here. The listed peak fp64 performance assumes one fp64 operation per clock per thread, yet there's very little description of how each PE performs 8 flops per cycle, only "threads are paired up such that one can take over processing when another one stalls...", classic latency hiding. So the performance figures must assume that each PE has an 8-wide SIMD unit (16-wide for fp32), 8 separately schedulable execution units, or 4 FMA execution units, none of which seem likely given the supposed simplicity of the core. Am I missing something?
Interesting that they’re investing in standard _AI_ toolchains, rather than standard HPC toolchains, even though I imagine Japanese supercomputing has more demand for the latter.
Great article documenting PEZY. It's incredible how close they are to Nvidia despite being a very small team.
To me, this looks like a win.
Governments are there to finance projects like this that enable the country to have certain skillsets that wouldn't exist otherwise because of other countries having better solutions in the global market.
How what?
The fp64 GFLOPS per watt metric in the post is almost entirely meaningless for comparing these accelerators with NVIDIA GPUs. For example, it says
> Hopper H200 is 47.9 gigaflops per watt at FP64 (33.5 teraflops divided by 700 watts)
But then if you consider H100 PCIe [0] instead, it's going to be 26000/350 = 74.29 GFLOPS per watt. If you go look harder you can find ones with better on-paper fp64 performance, for example AMD MI300X has 81.7 TFLOPs with typical board power of "750W Peak", which gives 108.9 GFLOPS per watt.
The truth is that the power allocation of most GPGPUs is heavily tilted toward tensor usage. This has been the trend since well before the B300.
That's all for HPC.
And Pezy processors are certainly not designed for "AI" (i.e. linear algebra with lower input precision). For AI inference, since around 2020 everyone has been talking about T(FL)OPS per watt, not GFLOPS.
[0] which is a nerfed version of H200's precursor.
Governments are terrible at picking winners.
So are companies (Itanium, Windows Mobile, etc.), but what governments do well is funding the competitive baseline needed for big advances. We live in an age of wonders invented on the back of American research investment in the mid-20th century, and that worked because the government did not try to pick winners but invested in good work by qualified people (everything NIH, NSF, etc. do through competitive grants) or promised to pay for capabilities not yet available (a lot of NASA and military stuff).
Just as it doesn't work to build an ecosystem on one species, a society has to blend government and private spending. They work on different incentives and timeframes, and both have pitfalls that the other might handle better.
Everyone is, and what survives, survives.
But what governments often can do is break out of the local optima that cluster around the quarterly economy, take moonshot chances, and find paths otherwise never taken. Hopefully one of those paths turns out great.
The difficult thing becomes deciding when to pull the plug. Is ITER a good thing or not? (Results wise, it is, but for the money? Who can tell really.)
There wouldn't be a Silicon Valley without DARPA and NASA.
Or just plain military procurement, even before ARPA existed.
There definitely could be. The incentive, mindset, and inventive spirit were there. Probably DARPA and NASA even hindered competition.
I could have created a social network in my college dorm and become a multi-billionaire mogul. What "could have been" is practically limitless.
No one is good at picking winners. Governments, like VCs, are best when they spread the wealth across many different projects.
What is an "accelerator" in this context?
You can get 8 TFLOPS of fp64 on a Xeon 6980P, which is 6K€ now.
I wonder how much progress (if any) is being made on floating-point formats other than IEEE floats, serious adoption in hardware in particular. Stuff like posits [1], for instance, looks very promising.
[1] https://posithub.org/docs/posit_standard-2.pdf
There is actual hardware available for posits. [1][2]
[1] https://youtu.be/vzVlQhaAZtQ?si=DJRmwOoyYGdq6mUQ [2] https://calligotech.com/uttunga/
The problem with posits is that they aren't enough better to be worth a switch. Switching the industry over would cost billions in software rewrites, and while there are benefits, they are fairly marginal.
For deep learning workloads the software for posits isn't the issue; it's doing anything that's not NVIDIA if you want to do it as a standalone product. For NVIDIA it's likely the penalty of not being able to share logic with standard-size IEEE floats. If adopting posits allowed significantly smaller data types, then NVIDIA would likely have adopted them already.
But 8-bit posits are actually a very nice alternative for deep learning, especially when using the quire for dot products!
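For anyone curious what the format actually looks like under the hood, here's a rough sketch of decoding an 8-bit posit (es = 2, the 2022 standard layout) into a double. Illustrative only, not a production decoder:

    #include <stdint.h>
    #include <stdio.h>
    #include <math.h>

    // Sketch: decode an 8-bit posit (es = 2) to a double.
    static double posit8_to_double(uint8_t p) {
        if (p == 0x00) return 0.0;
        if (p == 0x80) return NAN;               /* NaR, "not a real" */

        int sign = (p & 0x80) ? -1 : 1;
        uint8_t x = (sign < 0) ? (uint8_t)(0u - p) : p;  /* 2's complement */

        /* Regime: run of identical bits after the sign, terminated by the
           opposite bit (or by the end of the word). */
        int i = 6;                               /* bit index, MSB first */
        int r = (x >> 6) & 1;
        int run = 0;
        while (i >= 0 && ((x >> i) & 1) == r) { run++; i--; }
        if (i >= 0) i--;                         /* skip terminating bit */
        int k = r ? (run - 1) : -run;

        /* Up to es = 2 exponent bits; missing bits read as zero. */
        int e = 0;
        for (int j = 0; j < 2; j++) {
            e <<= 1;
            if (i >= 0) { e |= (x >> i) & 1; i--; }
        }

        /* Remaining bits are the fraction, with an implicit leading 1. */
        double f = 1.0;
        for (double w = 0.5; i >= 0; i--, w *= 0.5)
            if ((x >> i) & 1) f += w;

        return sign * ldexp(f, 4 * k + e);       /* scale = 2^(4k + e) */
    }

    int main(void) {
        /* Spot checks: 0x40 is 1.0, 0x38 is 0.5, 0xC0 is -1.0. */
        printf("%g %g %g\n", posit8_to_double(0x40),
               posit8_to_double(0x38), posit8_to_double(0xC0));
        return 0;
    }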
I don't disagree, it's just that the advantage hasn't yet been shown to be big enough to justify dedicating die area in a mainstream chip. There's potential there, if I were designing an accelerator today I would look hard at posits and variations of blocked representations especially around four bits. A few years back I got to have coffee with John Gustafson which was pretty neat and got me more excited about the idea.
Last time I heard about that, it was for supercomputers: nearly as fast as, or even faster than, the alternatives, with a massive energy-consumption advantage.