> Why do we even have linear physical and virtual addresses in the first place, when pretty much everything today is object-oriented?
Because the attempts at segmented or object-oriented address spaces failed miserably.
> Linear virtual addresses were made to be backwards-compatible with tiny computers with linear physical addresses but without virtual memory.
That is false. In the Intel World, we first had the iAPX 432, which was an object-capability design. To say it failed miserably is overselling its success by a good margin.
The 8086 was sort-of segmented to get 20-bit addresses out of a 16-bit machine; it was a stop-gap and a huge success. The 80286 did things "properly" again and went all-in on segments when going to virtual memory...and sucked. As best I remember, it was used almost exclusively as a faster 8086, with the 80286 modes used to page memory in and out and with the "reset and recover" hack to then get back to real mode for real work.
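For readers who never touched real mode, a minimal sketch of the 8086 trick mentioned above: the 16-bit segment is shifted left by four bits and added to the 16-bit offset, yielding a 20-bit physical address (and letting many segment:offset pairs alias the same byte). The addresses below are just the classic CGA text buffer, purely as an example.

```c
#include <stdint.h>
#include <stdio.h>

/* 8086 real mode: physical = (segment << 4) + offset, wrapped to 20 bits. */
static uint32_t real_mode_phys(uint16_t segment, uint16_t offset)
{
    return (((uint32_t)segment << 4) + offset) & 0xFFFFFu;
}

int main(void)
{
    /* Two different segment:offset pairs can alias the same physical byte. */
    printf("%05X\n", (unsigned)real_mode_phys(0xB800, 0x0000)); /* B8000 */
    printf("%05X\n", (unsigned)real_mode_phys(0xB000, 0x8000)); /* also B8000 */
    return 0;
}
```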
The 80386 introduced the flat address space and paged virtual memory not because of backwards-compatibility, but because it could and it was clearly The Right Thing™.
This may be misleading: the 80386 introduced flat address space and paged virtual memory _in the Intel world_, not in general. At the time it was introduced, linear / flat address space was the norm for 32 bit architectures, with examples such as the VAX, the MC68K, the NS32032 and the new RISC processors. The IBM/360 was also (mostly) linear.
So with the 80386, Intel finally abandoned their failed approach of segmented address spaces and joined the linear rest of the world. (Of course the 386 is technically still segmented, but let's ignore that).
And they made their new CPU conceptually compatible with the linear address space of the big computers of the time, the VAXens and IBM mainframes and Unix workstations. Not the "little" ones.
The important bit here is "their failed approach": just because Intel made a mess of it doesn't mean that the entire concept is flawed.
(Intel is objectively the most lucky semiconductor company, in particular if one considers how utterly incompetent their own "green-field" designs have been.
Think for a moment how lucky a company has to be, to have the major competitor they have tried to kill with all means available, legal and illegal, save your company, when you bet the entire farm on Itanic?)
It isn't 100% proof that the concept is flawed, but the fact that the most successful CPU manufacturer in the world for decades couldn't make segmentation work in multiple attempts is pretty strong evidence that at least there are, er, "issues" that aren't immediately obvious.
I think it is safe to assume that they applied what they learned from their earlier failures to their later failures.
Again, we can never be 100% certain of counterfactuals, but certainly the assertion that linear address spaces were only there for backwards compatibility with small machines is simply historically inaccurate.
Also, Intel weren't the only ones. The first MMU for the Motorola MC68K was the MC68451, which was a segmented MMU. It was later replaced by the MC68851, a paged MMU. The MC68451, and segmentation with it, was rarely used and then discontinued. The MC68851 was comparatively widely used, and later integrated in simplified form into future CPUs like the MC68030 and its successors.
So there as well, segmentation was tried first and then later abandoned. Which again, isn't definitive proof that segmentation is flawed, but way more evidence than you give credit for in your article.
People and companies again and again start out with segmentation, can't make it work and then later abandon it for linear paged memory.
My interpretation is that segmentation is one of those things that sounds great in theory, but doesn't work nearly as well in practice. Just thinking about it in the abstract, making an object boundary also a physical hardware-enforced protection boundary sounds absolutely perfect to me! For example something like the LOOM object-based virtual memory system for Smalltalk (though that was more software).
But theory ≠ practice. Another example of something that sounded great in theory was SOAR: Smalltalk on a RISC. They tried implementing a good part of the expensive bits of Smalltalk in silicon in a custom RISC design. It worked, but the benefits turned out to be minimal. What actually helped were larger caches and higher memory bandwidth, so RISC.
Another example was the Rekursiv, which also had object-capability addressing and a lot of other OO features in hardware. Also didn't go anywhere.
Again: not everything that sounds good in theory also works out in practice.
All the examples you bring up are from an entirely different time in terms of hardware, a time when the major technological limitations were things like how many pins a chip could have and two-layer PCBs.
Ideas can be good, but fail because they are premature, relative to the technological means we have to implement them. (Electrical vehicles will probably be the future text-book example of this.)
The interesting detail in the R1000's memory model is that it combines segmentation with pages, removing the need for segments to be contiguous in physical memory, which gets rid of the fragmentation issue, which was a huge issue for the architectures you mention.
But there obviously always will be a tension between how much info you stick into whatever goes for a "pointer" and how big it becomes (ie: "fat pointers"), but I think we can safely say that CHERI has documented that fat pointers are well worth their cost, and now we are just discussing what's in them.
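To make "what's in them" concrete, here is a purely illustrative C sketch of the kind of metadata a capability-style fat pointer carries. The field names and the uncompressed layout are assumptions for exposition; real CHERI compresses bounds into 128 bits and keeps the validity tag out of band in tagged memory.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative, uncompressed "fat pointer" with CHERI-like ingredients. */
typedef struct {
    uint64_t address;      /* where the pointer currently points          */
    uint64_t base;         /* lower bound of the region it may touch      */
    uint64_t length;       /* size of that region                         */
    uint32_t permissions;  /* load/store/execute/... permission bits      */
    bool     tag_valid;    /* only settable via privileged capability ops */
} fat_ptr;

/* Conceptually, every dereference is bounds- and permission-checked. */
static bool fat_ptr_check(const fat_ptr *p, uint64_t access_size,
                          uint32_t need_perms)
{
    return p->tag_valid
        && (p->permissions & need_perms) == need_perms
        && p->address >= p->base
        && access_size <= p->length
        && p->address - p->base <= p->length - access_size;
}
```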
The Burroughs large system architecture of the 1960s and 1970s (B6500, B6700 etc.) did it. Objects were called “arrays” and there was hardware support for allocating and deallocating them in Algol, the native language. These systems were initially aimed at businesses, I believe (for example, Ford was a big customer for such things as managing what parts were flowing where), but later they managed to support FORTRAN with its unsafe flat model.
These were large machines (think of a room 20m square) and, with explicit hardware support for Algol operations including the array stuff and display registers for nested functions, they were complex and power hungry, with a lot to go wrong. Eventually, with the technology of the day, they became uncompetitive against simpler architectures. By this time, too, people wanted to program in languages like C++ that were not supported.
> > Linear virtual addresses were made to be backwards-compatible with tiny computers with linear physical addresses but without virtual memory.
> That is false. In the Intel World, we first had the iAPX 432, which was an object-capability design. To say it failed miserably is overselling its success by a good margin.
That's not refuting the point he's making. The mainframe-on-chip iAPX family (and Itanium after) died and had no heirs. The current popular CPU families are all descendants of the stopgap 8086 (evolved from the tiny computer CPUs) or of ARM's straight-up embedded CPU designs.
But I do agree with your point that a flat (global) virtual memory space is a lot nicer to program. In practice we've been fast moving away from that again though, the kernel has to struggle to keep up the illusion: NUCA, NUMA, CXL.mem, various mapped accelerator memories, etc.
Regarding the iAPX 432, I do want to set the record straight, as I think you are insinuating that it failed because of its object memory design. The iAPX failed mostly because of its abject performance characteristics, but that was, in retrospect [1], not inherent to the object directory design. It lacked very simple look-ahead mechanisms, had no instruction or data caches, no registers, and not even immediates. Performance did not seem to be a top priority in the design, to paraphrase an architect. Additionally, the compiler team was not aligned and failed to deliver on time, which only compounded the performance problem.
The way you selectively quoted: yes, you removed the refutation.
And regarding the iAPX 432: it was slow in large part due to the failed object-capability model. For one, the model required multiple expensive lookups per instruction. And it required tremendous numbers of transistors, so many that despite forcing a (slow) multi-chip design there still wasn't enough transistor budget left over for performance enhancing features.
Performance enhancing features that contemporary designs with smaller transistor budgets but no object-capability model did have.
A huge factor in the iAPX432's utter lack of success was technological restrictions, like pin-count limits, laid down by Intel top brass, which forced stupid and silly limitations on the implementation.
That's not to say that iAPX432 would have succeeded under better management, but only to say that you cannot point to some random part of the design and say "That obviously does not work"
For object-oriented, you have the IBM i Series / AS400 based systems, which used an object capabilities model (as far as I understand it): a refinement and simplification of what was pioneered in the less successful System/38.
For linear, you have the Sun SPARC processor coming out in 1986, the same year that the 386 shipped in volume. I think the use by Unix of linear addressing made it more popular (the MIPS R2000 came out in January 1986, also).
Isn't the AS400 closer to the JVM than to the iAPX 432 in its implementation details? Sans IBM's proprietary lingo, TIMI is just a bytecode virtual machine and was designed as such from the beginning. Then again, on a microcoded CPU it's hard to tell the difference.
> I think the use by Unix of linear made it more popular
More like linear address space was the only logical solution since EDVAC. Then in the late 50s the Manchester Atlas invented virtual memory to abstract away the magnetic drums. Some smart minds (Robert S. Barton with his B5000, which was a direct influence on the JVM, but was he the first one?) realised that what we actually want is segment/object addressing. Multics/GE-600 went with segments (I couldn't find any evidence they were directly influenced by the B5000, but it seems so).
System/360, which was the pre-Unix lingua franca, went with a flat address space. Guess IBM folks wanted to go as conservative as possible. They also wanted S/360 to compete in HPC as well, so performance was quite important - and segment addressing doesn't give you that. Then VM/370 showed that flat/paged addressing allows you to do things segments can't. And then came the PDP-11 (which was more or less about compressing S/360 into a mini, sorry DEC fans), RMS/VMS and Unix.
> TIMI is just a bytecode virtual machine and was designed as such from the beginning.
It's a bit more complicated than that. For one, it's an ahead-of-time translation model. The object references are implemented as tagged pointers to a single large address space. The tagged pointers rely on dedicated support in the Power architecture. The Unix compatibility layer (PASE) simulates per-process address spaces by allocating dedicated address space objects for each Unix process (these are called Teraspace objects).
When I read Frank Soltis' book a few years ago, the description of how the single level store was implemented involved segmentation, although I got the impression that the segments are implemented using pages in the Power architecture. The original CISC architecture (IMPI) may have implemented segmentation directly, although there is very little documentation on that architecture.
> iAPX 432
Yes, this was a failure, the Itanium of the 1980's
I also regard Ada as a failure. I worked with it many years ago. Ada would take 30 minutes to compile a program. Turbo C++ compiled equivalent code in a few seconds.
Machines are thousands of times faster now, yet C++ compilation is still slow somehow (templates? optimization? disinterest in compiler/linker performance? who knows?) Saving grace is having tons of cores and memory for parallel builds. Linking is still slow, though.
Of course Pascal compilers (Turbo Pascal etc.) could be blazingly fast since Pascal was designed to be compiled in a single pass, but presumably linking was faster as well. I wonder how Delphi or current Pascal compilers compare? (Pascal also supports bounded strings and array bounds checks IIRC.)
> I wonder how Delphi or current Pascal compilers compare?
Just did a full build of our main Delphi application on my work laptop, sporting an Intel i7-1260P. It compiled and linked just shy of 1.9 million lines of code in 31 seconds. So, still quite fast.
> Because the attempts at segmented or object-oriented address spaces failed miserably.
> That is false. In the Intel World, we first had the iAPX 432, which was an object-capability design. To say it failed miserably is overselling its success by a good margin.
I would further posit that segmented and object-oriented address spaces have failed, and will continue to fail, for as long as we have a separation into two distinct classes of storage: ephemeral (DRAM) and persistent storage / backing store (disks, flash storage, etc.), as opposed to having a single, unified concept of nearly infinite (at least logically if not physically), always-on memory where everything is – essentially – an object.
Intel's Optane has given us a brief glimpse into what such a future could look like but, alas, that particular version of the future has not panned out.
Linear address space makes perfect sense for size-constrained DRAM, and makes little to no sense for the backing store where a file system is instead entrusted with implementing an object-like address space (files, directories are the objects, and the file system is the address space).
Once a new, successful memory technology emerges, we might see a resurgence of the segmented or object-oriented address space models, but until then, it will remain a pipe dream.
I don't see how any amount of memory technology can overcome the physical realities of locality. The closer you want the data to be to your processor, the less space you'll have to fit it. So there will always be a hierarchy where a smaller amount of data can have less latency, and there will always be an advantage to cramming as much data as you can at the top of the hierarchy.
While that's true, CPUs already have automatically managed caches. It's not too much of a stretch to imagine a world in which RAM is automatically managed as well and you don't have a distinction between RAM and persistent storage. In a spinning-rust world, that never would have been possible, but with modern NVMe, it's plausible.
CPUs manage it, but ensuring your data structures are friendly to how they manage caches is one of the keys to fast programs - which some of us care about.
«Memory technology» as in «a single tech» that blends RAM and disk into just «memory» and obviates the need for the disk as a distinct concept.
One can conjure up RAM which has become exabytes large and which does not lose data after a system shutdown. Everything is local in such a unified memory model, and is promptly available to and directly addressable by the CPU.
Please do note that multi-level CPU caches still do have their places in this scenario.
In fact, this has been successfully done in the AS/400 (or i Series), which I have mentioned elsewhere in the thread. It works well and is highly performant.
> «Memory technology» as in «a single tech» that blends RAM and disk into just «memory» and obviates the need for the disk as a distinct concept.
That already exists. Swap memory, mmap, disk paging, and so on.
Virtual memory is mostly fine for what it is, and it has been used in practice for decades. The problem that comes up is latency. Access time is limited by the speed of light [1]. And for that reason, CPU manufacturers continue to increase the capacities of the faster, closer memories (specifically registers and L1 cache).
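For concreteness, a minimal POSIX sketch of the existing machinery being referred to: mmap gives a file a spot in the linear virtual address space and the kernel pages it in and out on demand. The file path and size are made up for the example, and durability still has to be requested explicitly.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* Hypothetical file standing in for "persistent memory". */
    int fd = open("/tmp/backing.bin", O_RDWR | O_CREAT, 0600);
    if (fd < 0 || ftruncate(fd, 1 << 20) != 0)
        return 1;

    /* The file now looks like a megabyte of ordinary memory; the kernel
     * faults pages in from disk and writes dirty pages back lazily. */
    char *mem = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (mem == MAP_FAILED)
        return 1;

    strcpy(mem, "visible immediately, durable only after msync");
    msync(mem, 1 << 20, MS_SYNC);   /* the durability/visibility split again */
    munmap(mem, 1 << 20);
    close(fd);
    return 0;
}
```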
I shudder to think about the impact of concurrent data structures fsync'ing on every write because the programmer can't reason about whether the data is in memory where a handful of atomic fences/barriers are enough to reason about the correctness of the operations, or on disk where those operations simply do not exist.
Also linear regions make a ton of sense for disk, and not just for performance. WAL-based systems are the cornerstone of many databases and require the ability to reserve linear regions.
Linear regions are mostly a figment of imagination in real life, but they are a convenient abstraction and a concept.
Linear regions are nearly impossible to guarantee, unless the underlying hardware has specific, controller-level provisions.
1) For RAM, the MMU will obscure the physical address of a memory page, which can come from a completely separate memory bank. It is up to the VMM implementation and heuristics to ensure the contiguous allocation, coalesce unrelated free pages into a new, large allocation or map in a free page from a «distant» location.
2) Disks (the spinning rust variety) are not that different. A freed block can be provided from the start of the disk. However, a sophisticated file system like XFS or ZFS, and others like it, will do its best to allocate a contiguous block.
3) Flash storage (SSDs, NVMe) simply «lies» about the physical blocks and does it for a few reasons (garbage collection and the transparent reallocation of ailing blocks – to name a few). If I understand it correctly, the physical «block» numbers are hidden even from the flash storage controller and firmware themselves.
The only practical way I can think of to ensure the guaranteed contiguous allocation of blocks unfortunately involves a conventional hard drive that has a dedicated partition created just for the WAL. In fact, this is how Oracle installation worked – it required a dedicated raw device to bypass both the VMM and the file system.
When RAM and disk(s) are logically the same concept, WAL can be treated as an object of the «WAL» type with certain properties specific to this object type only to support WAL peculiarities.
Ultimately everything is an abstraction. The point I'm making is that linear regions are a useful abstraction for both disk and memory, but that's not enough to unify them. Particularly in that memory cares about the visibility of writes to other processes/threads, whereas disk cares about the durability of those writes. This is an important distinction that programmers need to differentiate between for correctness.
Perhaps a WAL was a bad example. Ultimately you need the ability to atomically reserve a region of a certain capacity and then commit it durably (or roll back). Perhaps there are other abstractions that can do this, but with linear memory and disk regions it's exceedingly easy.
Personally I think file I/O should have an atomic CAS operation on a fixed maximum number of bytes (just like shared memory between threads and processes) but afaik there is no standard way to do that.
I do not share the view that the unification of RAM and disk requires or entails linear regions of memory. In fact, the unification reduces the question of «do I have a contiguous block of size N to do X» to a mere «do I have enough memory to do X?», commits and rollbacks inclusive.
The issue of durability, however, remains a valid concern in either scenario, but the responsibility to ensure durability is delegated to the hardware.
Furthermore, commits and rollbacks are not sensitive to the memory linearity anyway; they are sensitive to durability of the operation, and they may be sensitive to the latency, although it is not a frequently occurring constraint. In the absence of a physical disk, commits/rollbacks can be implemented using software transactional memory (STM) entirely in RAM today – see the relevant Haskell library and the white paper on STM.
Lastly, when everything is an object in the system, the way the objects communicate with each other also changes from the traditional model of memory sharing to message passing, transactional outboxes, and similar, where the objects encapsulate the internal state without allowing other objects to access it – courtesy of the object-oriented address space protection, which is what the conversation initially started from.
> Show me somebody who calls the IBM S/360 a RISC design, and I will show you somebody who works with the s390 instruction set today.
Ahaha so true.
But to answer the post's main question:
> Why do we even have linear physical and virtual addresses in the first place, when pretty much everything today is object-oriented?
Because backwards compatibility is more valuable than elegant designs. Because array-crunching performance is more important than safety. Because a fix for a V8 vulnerability can be quickly deployed while a hardware vulnerability fix cannot. Because you can express any object model on top of flat memory, but expressing one object model (or flat memory) in terms of another object model usually costs a lot. Because nobody ever agreed on what the object model should be. But most importantly: because "memory safety" is not worth the costs.
But we don't have a linear address space, unless you're working with a tiny MCU. For last like 30 years we have virtual address space on every mainstream processor, and we can mix and match pages the way we want, insulate processes from one another, add sentinel pages at the ends of large structures to generate a fault, etc. We just structure process heaps as linear memory, but this is not a hard requirement, even on current hardware.
What we lack is the granularity that something like iAPX432 envisioned. Maybe some hardware breakthrough would allow for such granularity cheaply enough (like it allowed for signed pointers, for instance), so that smart compilers and OSes would offer even more protection without the expense of switching to kernel mode too often. I wonder what research exists in this field.
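As a small illustration of the "sentinel pages" point above, a POSIX sketch that allocates a buffer with one extra inaccessible page after it, so any overrun faults immediately instead of silently corrupting a neighbour. Page-size handling is simplified and error checking is minimal.

```c
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t data_size = 16 * (size_t)page;

    /* Reserve the buffer plus one extra trailing page... */
    char *buf = mmap(NULL, data_size + (size_t)page, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return 1;

    /* ...and turn that trailing page into an inaccessible sentinel. */
    mprotect(buf + data_size, (size_t)page, PROT_NONE);

    buf[data_size - 1] = 'x';   /* fine */
    /* buf[data_size] = 'x';       would fault on the sentinel page */

    munmap(buf, data_size + (size_t)page);
    return 0;
}
```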
Okay, so we went from linear address spaces to partitioned/disaggregated linear address spaces. This is hardly the victory you claim it is, because page sizes are increasing and thus the minimum addressable block of memory keeps increasing. Within a page everything is linear as usual.
The reason why linear address spaces are everywhere has to do with the fact that they are extremely cost effective and fast to implement in hardware. You can do prefix matching to check if an address is pointing at a specific hardware device and you can use multiplexers to address memory. Addresses can easily be encoded inside a single std_ulogic_vector. It's also possible to implement a Network-on-Chip architecture for your on-chip interconnect. It also makes caching easier, since you can translate the address into a cache entry.
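A toy software model of that prefix matching, with made-up base/mask values, just to show how cheap the decode is (a handful of AND/compare gates per address bit in hardware):

```c
#include <stdbool.h>
#include <stdint.h>

/* A device claims an address if the top bits match its assigned prefix. */
typedef struct {
    uint32_t base;   /* e.g. 0x40000000 for a hypothetical UART block */
    uint32_t mask;   /* e.g. 0xFFFFF000 selects a 4 KiB window        */
} decode_entry;

static bool decodes_to(const decode_entry *d, uint32_t addr)
{
    return (addr & d->mask) == (d->base & d->mask);
}
```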
When you add a scan chain to your flip flops, you're implicitly ordering your flip flops and thereby building an implicit linear address space.
There is also the fact that databases with auto incrementing integers as their primary keys use a logical linear address space, so the most obvious way to obtain a non-linear address space would require you to use randomly generated IDs instead. It seems like a huge amount of effort would have to be spent to get away from the idea of linear address spaces.
> But we don't have a linear address space, unless you're working with a tiny MCU.
We actually do, albeit for a brief duration of time – upon a cold start of the system, when the MMU is not yet active, no address translation is performed, and the entire memory space is treated as a single linear, contiguous block (even if there are physical holes in it).
When a system is powered on, the CPU runs in privileged mode to allow an operating system kernel to set up the MMU and activate it, which takes place early in the boot sequence. But until then, virtual memory is not available.
Those holes can be arbitrarily large, though, especially in weirder environments (e.g., memory-mapped optane and similar). Linear address space implies some degree of contiguity, I think.
Indeed. It can get even weirder in the embedded world, where a ROM, an E(E)PROM or a device may get mapped into an arbitrary slice of physical address space, anywhere within its bounds. It has become less common, though.
But devices are still commonly mapped at the top of the physical address space, which is a rather widespread practice.
And it's not uncommon for devices to be mapped multiple times in the address space! The different aliases provide slightly different ways of accessing it.
For example, 0x000-0x0ff providing linear access to memory bank A, 0x100-0x1ff linear access to bank B, but 0x200-0x3ff providing striped access across the two banks, with even-addressed words coming from bank A but odd ones from bank B.
Similarly, 0x000-0x0ff accessing memory through a cache, but 0x100-0x1ff accessing the same memory directly. Or 0x000-0x0ff overwriting data, 0x100-0x1ff setting bits (OR with current content), and 0x200-0x2ff clearing bits.
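A bare-metal flavoured sketch of that aliasing scheme, with a hypothetical peripheral base address and the offsets from the example above; on a hosted OS these addresses are not mapped, so this is illustration only:

```c
#include <stdint.h>

/* Hypothetical base; the offsets mirror the example above. */
#define PERIPH_BASE 0x40000000u
#define REG_DATA (*(volatile uint32_t *)(PERIPH_BASE + 0x000)) /* overwrite  */
#define REG_SET  (*(volatile uint32_t *)(PERIPH_BASE + 0x100)) /* OR bits in */
#define REG_CLR  (*(volatile uint32_t *)(PERIPH_BASE + 0x200)) /* clear bits */

static void alias_example(void)
{
    REG_DATA = 0x0000000F;  /* register now reads back 0x0F */
    REG_SET  = 0x000000F0;  /* register now reads back 0xFF */
    REG_CLR  = 0x00000003;  /* register now reads back 0xFC */
}
```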
It's entirely possible to implement segments on top of paging. What you need to do is add the kernel abstractions for implementing call gates that change segment visibility, and write some infrastructure to manage unions-of-a-bunch-of-little-regions. I haven't implemented this myself, but a friend did on a project we were working on together, and as a mechanism it works perfectly well.
Getting userspace to do the right thing without upending everything is what killed that project.
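A rough userspace-only sketch of the idea (not the project described above, which kept this in the kernel): treat a run of mmap'd pages as a "segment" and have a call gate flip its visibility with mprotect around each call.

```c
#include <stddef.h>
#include <sys/mman.h>

/* A toy "segment": page-aligned memory obtained from mmap, normally
 * mapped PROT_NONE so nothing outside the gate can touch it. */
typedef struct {
    void   *base;
    size_t  len;
} segment;

typedef int (*gated_fn)(void *seg_base);

static int call_gate(const segment *seg, gated_fn fn)
{
    /* Make the segment visible only for the duration of the call. */
    if (mprotect(seg->base, seg->len, PROT_READ | PROT_WRITE) != 0)
        return -1;
    int rc = fn(seg->base);
    mprotect(seg->base, seg->len, PROT_NONE);
    return rc;
}
```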
There is also a problem of nested virtualization. If the VM has its own "imaginary" page tables on top of the hypervisor's page tables, then the number of actual physical memory reads goes from 4–6 to 16–36.
If I understood correctly, you're talking about using descriptors to map segments; the issue with this approach is two-fold: it is slow (as each descriptor needs to be created for each segment - and sometimes more than one, if you need write-execute permissions), and there is a practical limit on the number of descriptors you can have - 8192 total, including call gates and whatnot. To extend this, you need to use LDTs, which - again - also require a descriptor in the GDT and are limited to 8192 entries. In a modern desktop system, 67 million segments would be both quite slow and at the same time quite limited.
Indeed. Also, TLB as it exists on x64 is not free, nor is very large. A multi-level "TLB", such that a process might pick an upper level of a large stretch of lower-level pages and e.g. allocate a disjoint micro-page for each stack frame, would be cool. But it takes a rather different CPU design.
> Why do we even have linear physical and virtual addresses in the first place, when pretty much everything today is object-oriented?
What a weird question, conflating one thing with the other.
I’m working on an object-capability system, and trying hard to see if I can make it work using a linear address space so I don’t have to waste two or three pages per “process” [1][2]. I really don’t see how objects have anything to do with virtual memory and memory isolation, as they are a higher abstraction. These objects have to live somewhere, unless the author is proposing a system without the classical model of addressable RAM.
—-
1: the reason I prefer a linear address space is that I want to run millions of actors/capabilities on a machine, and the latency and memory usage of switching address space and registers become really onerous. Also, I am really curious to see how ridiculously fast modern CPUs are when you’re not thrashing the TLB every millisecond or so.
2: in my case I let system processes/capabilities written in C run in linear address space where security isn’t a concern, and user space in a RISC-V VM so they can’t escape. The dream is that CHERI actually goes into production and user space can run on hardware, but that’s a big if.
The memory management story is still a big question: how do you do allocations in a linear address space? If you give out pages, there’s a lot of wastage. The alternative is a global memory allocator, which I am really not keen on. Still figuring out as I go.
Meet TIMI – the Technology Independent Machine Interface of IBM's i Series (née AS/400), which defines pointers as 128-bit values[0], which is a 1980's design.
It has allowed the AS/400 to have a single-level store, which means that «memory» and «disk» live in one conceptual address space.
A pointer can carry more than just an address – object identity, type, authority metadata – AS/400 uses tagged 16-byte pointers to stop arbitrary pointer fabrication, which supports isolation without relying on the usual per-process address-space model in the same way UNIX does.
Such a «fat pointer» approach is conceptually close to modern capability systems (for example CHERI’s 128-bit capabilities), which exist for similar [safety] reasons.
[0] 128-bit pointers in the machine interface, not a 128-bit hardware virtual address space though.
It is, although TIMI does not exist in the hardware – it is a virtual architecture that has been implemented multiple times in different hardware (i.e., CPUs – IMPI, IBM RS64, POWER, and only heavens know which CPU IBM uses today).
The software written for this virtual architecture, on the other hand, has not changed and continues to run on modern IBM iSeries systems, even when it originates from 1989 – this is accomplished through static binary translation, or AOT in modern parlance, which recompiles the virtual ISA into the target ISA at startup.
If the author is reading these comments: Please write about the fully semantic IDE as soon as you can. Very interested in hearing more about that as it sounds like you've used it a lot
Is each branch-free run of instructions an object (which in general will be smaller than "function" or "method" objects) that can be abstracted? How does one manage locality ("these objects are the text of this function")?
Maybe one compromises and treats the text of a function as linear address space with small relative offsets. Of course, other issues will crop up. You can't treat code as an array, unless it's an array of the smallest word (bytes, say) even if the instructions are variable length. How do you construct all the pointer+capability values for the program's text statically? The linker would have to be able to do that...
The Rational R1000 is an interesting (and obscure) example to use - IBM's S/38 and AS/400 (now IBM i) also took a similar approach, and saw far more widespread usage.
What a clueless post. Even ignoring their massive overstatement of the difficulty and hardware complexity of hardware mapping tables, they appear to not even understand the problems solved by mapping tables.
Okay, let us say you have a physical object store. How are the actual contents of those objects stored? Are they stored in individual, isolated memory blocks? What if I want to make a 4 GB array? Do I need to have 4 GB memory blocks? What if I only have 6 GB? That is obviously unworkable.
Okay, we can solve that by compacting our physical object store onto a physical linear store and just presenting an object store as an abstraction. Sure, we have a physical linear store, but we never present that to the user. But what if somebody deallocates an object? Obviously we should be able to reuse that underlying physical linear store. What if they allocated a 4 GB array? Obviously we need to be able to fragment that into smaller pieces for future objects. What if we deallocated 4 GB of disjoint 4 KB objects? Should we fail to allocate an 8 KB object just because the fragments are not contiguous? Oh, just keep in mind the precise structure of the underlying physical store to avoid that (what a leaky and error-prone abstraction). Oh, but what if there are multiple programs running, some potentially even buggy? How the hell am I supposed to keep track of the shared physical store, to keep track of global fragmentation of the shared resource?
Okay, we can solve all of that with a level of indirection by giving you a physical object key instead of a physical object "reference". You present the key, and then we have a runtime structure that allows us to look up where in the physical linear store we have put that data. This allows us to move and compact the underlying storage while letting you have a stable key. Now we have a mapping between object key and linear physical memory. But what if there are multiple programs on the same machine, some of which may be untrustworthy? What if they just start using keys they were not given? Obviously we need some scheme of preventing anybody from using any key. Maybe we could solve that by tagging every object in the system with a list of every program allowed to use it? But the number of programs is dynamic and if we have millions or billions of objects, each new program would require re-tagging all of those objects. We could make that list only encode "allowed" programs which would save space and amount of cleanup work, but how would the hardware do that lookup efficiently and how would it store that data efficiently?
Okay, we can solve that by having a per-program mapping from object key to linear physical memory. Oh no, that is looking suspiciously close to the per-program mapping from linear virtual memory to linear physical memory. Hopefully there are no other problems that will just result in us getting back to right where we started. Oh no, here comes another one. How is your machine storing this mapping from object key to linear physical memory? If you will remember from your data structures courses, those would usually be implemented as either a hash table or a tree. A tree sounds too suspiciously close to what currently exists, so let us use a hash table.
Okay, cool, how big should the hash table be? What if I want a billion objects in this program and a thousand objects in a different program? I guess we should use a growable hash table. All that happens is that if we allocate enough objects we allocate a new, dynamically sized storage structure, then bulk rehash and insert all the old objects. That is amortized O(1), just at the cost of an unpredictable pause on potentially any memory allocation, which can not only be gigantic, but is proportional to the number of live allocations. That is fine if our goal is just putting in a whole hardware garbage collector, but not really applicable for high performance computing. For high performance computing we would want worst-case bounded time and memory cost (not amortized, per-operation).
Okay, I guess we have to go with a per-program tree-based mapping from object key to linear physical memory. But it is still an object store, so we won, right? How is the hardware going to walk that efficiently? For the hardware to walk that efficiently, you are going to want a highly regular structure with high fanout, to both maximize the value of the cache lines you will load and reduce the worst-case number of cache lines you need to load. So you will want a B-Tree structure of some form. Oh no, that is exactly what hardware mapping tables look like.
But it is still an object store, so we won, right? But what if I deallocated 4 GB of disjoint 4 KB objects? You could move and recompact all of that memory, but why? You already have a mapping structure with a layer of indirection via object keys. Just create an interior mapping within an object between the object-relative offsets and potentially disjoint linear physical memory. Then you do not need physically contiguous backing; you can use a disjoint physical linear store to provide the abstraction of an object linear store.
And now we have a per-program tree-based mapping from linear object address to linear physical memory. But what if the objects are of various sizes? In some cases the hardware will traverse the mapping from object key to linear object store, then potentially need to traverse another mapping from a large linear object address to linear physical memory. If we just compact the linear object store mappings, then we can unify the trees and just provide a common linear address to linear physical memory mapping, and the tree-based mapping will be tightly bounded for all walks.
And there we have it, a per-program tree-based mapping between linear virtual memory and linear physical memory one step at a time.
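For concreteness, the "highly regular structure with high fanout" that the walkthrough converges on looks like this in miniature: a two-level radix tree over a key, which is exactly the shape of a hardware page-table walk. The sizes are toy values; real x86-64 page tables use four or five levels of 9-bit indices.

```c
#include <stdint.h>
#include <stdlib.h>

#define FANOUT 1024u   /* toy: a 20-bit key split into two 10-bit indices */

typedef struct {
    uint64_t *leaves[FANOUT];   /* second level: key -> "physical address" */
} radix_root;

static int radix_insert(radix_root *r, uint32_t key, uint64_t phys)
{
    uint32_t hi = (key >> 10) & (FANOUT - 1), lo = key & (FANOUT - 1);
    if (!r->leaves[hi]) {
        r->leaves[hi] = calloc(FANOUT, sizeof(uint64_t));
        if (!r->leaves[hi])
            return -1;
    }
    r->leaves[hi][lo] = phys;
    return 0;
}

static uint64_t radix_lookup(const radix_root *r, uint32_t key)
{
    uint32_t hi = (key >> 10) & (FANOUT - 1), lo = key & (FANOUT - 1);
    return r->leaves[hi] ? r->leaves[hi][lo] : 0;   /* 0 = not mapped */
}
```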
And they addressed exactly none of the relevant points, instead supporting their arguments by waving in the general direction of outcompeted designs and speculative designs.
CHERI is neat, but, as far as I am aware, still suffers from serious unsolved problems with respect to temporal safety and reclamation. Last I looked (which was probably after 2022 when this post was made), the proposed solutions were hardware garbage collectors which are almost a non-starter. Could that be solved or performant enough? Maybe. Is a memory allocation strategy that can not free objects a currently viable solution for general computing to the degree you argue people not adopting it are whiners? No.
I see no reason to accept a fallacious argument from authority in lieu of actual arguments. And for that matter, I literally do kernel development on a commercial operating system and have personally authored the entirety of memory management and hardware MMU code for multiple architectures. I am an actual authority on this topic.
> And they addressed exactly none of the relevant points
Clarification: They addressed exactly none of the points you declare as relevant. You identify as an expert in the field and come across as plausibly such, so certainly I’ll still give your opinion on what’s relevant some weight.
Perhaps the author was constrained by a print publication page size limit of, say, one? Or six? That used to be a thing in the past, where people would publish opinions in industry magazines and there was a length cap set by the editor that forced cutting out the usual academic-rigor levels of detail in order to convey a mindset very briefly. What would make a lovely fifty or hundred page paper in today’s uncapped page size world, would have to be stripped of so much detail — of so much proof — in order to fit into any restrictions at all, that it would be impossible to address all possible or even probable argument in a single sitting.
Indeed, no human is infallible. But I think when someone (whom you know to be very knowledgeable in the field) writes a post, it's pretty unreasonable to describe it as "a clueless post". The author might be mistaken, perhaps, but almost certainly not clueless.
If PHK, DJB, <insert luminary> writes a post that comes across as clueless or flat out wrong, I'm going to read it and read it carefully. It does happen that <luminary> says dumb and/or incorrect things from time to time, but most likely there will be a very cool nugget of truth in there.
Regarding what PHK seems to be asking for, I think it's... linear addressing of physical memory, yes (because what else can you do?) but with pointer values that have attached capabilities so that you can dispense with virtual to physical memory mapping (and a ton of MMU TLB and page table hardware and software complexity and slowness) and replace it with hardware capability verification. Because such pointers are inherently abstract, the fact that the underlying physical memory address space is linear is irrelevant and the memory model looks more like "every object is a segment" if you squint hard. Obviously we need to be able to address arrays of bytes, for example, so within an object you have linear addressing, and overall you have it too because physical memory is linear, but otherwise you have a fiction of a pile of objects some of which you have access to and some of which you don't.
> What a clueless post. Even ignoring their massive overstatement of the difficulty and hardware complexity of hardware mapping tables, they appear to not even understand the problems solved by mapping tables.
From the article:
> And before you tell me this is impossible: The computer is in the next room, built with 74xx-TTL (transistor-transistor logic) chips in the late 1980s. It worked back then, and it still works today.
Do you think a 1980s computer has no drawbacks compared to 2020 vintage CPUs? It "works..." very slowly and with extremely high power draw. A 1980s design does not in any way prove that the model is viable compared to the state of the art today.
I did not say it was impossible. I said that mapping tables solve a lot of problems. There are very good reasons, as I explicitly outlined, for why they are a good solution to these classes of problems and why object stores fall down when trying to scale them up to parity with modern designs for general purpose computing.
People tried a lot of dead-ends in the past before we knew better. You need a direct analysis of the use case, problems, and solutions to actually support a point that an alternative technology is better, rather than just pointing at old examples.
A lot has changed since the 1980s. RAM access is much higher latency (in cycles), we have tons more RAM, and programs use more of it.
Maybe it is still possible but "we did it in the 80s so we can do it now" doesn't work.
Vypercore were trying to make RISC-V CPUs with object-based memory. They went out of business several months ago. I don't have the inside scoop, but I expect the biggest issue is that they were trying to sell it as a performance improvement (hardware based memory allocation), which it probably was... but also they would have been starting from a slower base anyway. "38% faster than linear memory" doesn't sound so great when your chip is half as fast as the competition to start with.
It also didn't protect objects on the stack (afaik) unlike CHERI. But on the other hand it's way simpler than CHERI conceptually, and I think it handled temporal safety more elegantly.
Personally I think Rust combined with memory tagging is going to be the sweet spot. CHERI if you really need ultra-maximum security, but I think the number of people that would pay for that is likely small.
This is one of those things, where 99.999% of all IT people have never even heard or imagined that things can be different than "how we have always done it." (Obligatory Douglas Adams quote goes here.)
This makes a certain kind of people, self-secure in their own knowledge, burst out words like "clueless", "fail miserably" etc. based on insufficient depth of actual knowledge. To them I can only say: Study harder, this is so much more technologically interesting, than you can imagine.
And yes, neither the iAPX432, nor for that matter Z8000, fared well with their segmented memory models, but it is important to remember that they primarily failed for entirely different reasons, mostly out of touch top-management, so we cannot, and should not, conclude from that, that all such memory models cannot possibly work.
There are several interesting memory models, which never really got a fair chance, because they came too early to benefit from VLSI technology, and it would be stupid to ignore a good idea, just because it was untimely. (Obligatory "Mother of all demos" reference goes here.)
CHERI is one such memory model, and probably the one we will end up with, at least in critical applications: Stick with the linear physical memory, but cabin the pointers.
In many applications, that can allow you to disable all the Virtual Memory hardware entirely. (I think the "CHERIoT" project does this?)
The R1000 model is different, but as far as I can tell equally valid; however, it suffers from a much harder "getting from A to B" problem than CHERI does, yet I can see several kinds of applications where it would totally scream around any other memory model.
But if people have never even heard about it, or think that just because computers look a certain way today, every other idea we tried must by definition have been worse, nobody will ever do the back-of-the-napkin math to see if it would make sense to try it out (again).
I'm sure there are also other memory concepts, even I have not heard about. (Yes, I've worked with IBM S/38)
But what we have right now, huge flat memory spaces, physical and virtual, with a horribly expensive translation mechanism between them, and no pointer safety, is literally the worst of all imaginable memory models, for the kind of computing we do, and the kind of security challenges we face.
There are other similar "we have always done it that way" mental blocks we need to reexamine, and I will answer one tiny question below, by giving an example:
Imagine you sit somewhere in a corner of a HUGE project, like a major commercial operating system with all the bells and whistles, the integrated air-traffic control system for a continent, or the software for a state-of-the-art military gadget.
You maintain this library, which exports this function, which has a parameter which defaults to three.
For sound and sane reasons, you need to change the default to four now.
The compiler won't notice.
The linker won't notice.
People will need to know.
Who do you call ?
In the "Rational Environment" on the R1000 computer, you change 3 to 4 and, when you attempt to save your change, the semantic IDE refuses, informing you that it would change the semantics of the following three modules, which call your function without specifying that parameter explicitly - even if you do not have read permission to the source code of those modules.
The Rational Environment did that 40 years ago, can your IDE do that for you today ?
Some developers get a bit upset about that when we demo that in Datamuseum.dk :-)
The difference is that all modern IDEs regard each individual source file as "ground truth", but have nothing even remotely like an overview, or conceptual understanding, of the entire software project.
Yeah, sure, it knows what include files/declaration/exports things depend on, and which source files to link into which modules/packages/libraries, but it does not know what any of it actually means.
And sure, grep(1) is wonderful, but it only tells you what source code you need to read - provided you have the permission to do so.
In the Rational Environment ground truth is the parse tree, and what can best be described as a "preliminary symbol resolution", which is why it knows exactly which lines of code, in the entire project, call your function, with or without what parameters.
> Why do we even have linear physical and virtual addresses in the first place, when pretty much everything today is object-oriented?
Maybe it's because even though x86-64 is a 64-bit instruction set, all the CALL and JMP instructions still only support relative 8-bit or 32-bit offsets.
> Translating from linear virtual addresses to linear physical addresses is slow and complicated, because 64-bit can address a lot of memory.
Sure but spend some time thinking about how GOT and PLT aren't great solutions and can easily introduce their own set of security complications due to the above limitations.
I think you could argue there is already some effort to do type safety at the ISA register level, with e.g. shadow stack or control flow integrity. Isn't that very similar to this, except targeting program state rather than external memory?
I mean, if the stacks grew upwards, that alone would nip 90% of buffer overflow attacks in the bud. Moving the return address from the activation frame into a separate stack would help as well, but I understand that having an activation frame to be a single piece of data (a current continuation's closure, essentially) can be quite convenient.
Linux on PA-RISC also has an upward-growing stack (AFAIK, it's the only architecture Linux has ever had an upward-growing stack on; it's certainly the only currently-supported one).
Both this and parent comment about PA-RISC are very interesting.
As noted, stack growing up doesn't prevent all stack overflows, but it makes it less trivially easy to overwrite a return address. Bounded strings also made it less trivially easy to create string buffer overflows.
The PL/I stack growing up rather than down reduced potential impact of stack overflows in Multics (and PL/I already had better memory safety, with bounded strings, etc.) TFA's author would probably have appreciated the segmented memory architecture as well.
There is no reason why the C/C++ stack can't grow up rather than down. On paged hardware, both the stack and heap could (and probably should) grow up. "C's stack should grow up", one might say.
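A quick way to see which way a given machine/ABI grows the stack is to compare the address of a local in a callee with one in its caller. Strictly speaking, comparing addresses of unrelated objects is not defined by the C standard, and inlining can confuse it, so this is a diagnostic sketch rather than portable code; on x86-64 and the usual ARM ABIs it prints "down".

```c
#include <stdio.h>

static void callee(char *caller_local)
{
    char callee_local;
    printf("stack grows %s\n",
           (void *)&callee_local < (void *)caller_local ? "down" : "up");
}

int main(void)
{
    char caller_local;
    callee(&caller_local);
    return 0;
}
```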
x86-64 call instruction decrements the stack pointer to push the return address. x86-64 push instructions decrement the stack pointer. The push instructions are easy to work around because most compilers already just push the entire stack frame at once and then do offset accesses, but the call instruction would be kind of annoying.
ARM does not suffer from that problem due to the usage of link registers and generic pre/post-modify. RISC-V is probably also safe, but I have not looked specifically.
> [x86] call instruction would be kind of annoying
I wonder what the best way to do it (on current x86) would be. The stupid simple way might be to adjust SP before the call instruction, and that seems to me like something that would be relatively efficient (simple addition instruction, issued very early).
Some architectures had CALL that was just "STR [SP], IP" without anything else, and it was up to the called procedure to adjust the stack pointer further to allocate for its local variables and the return slot for further calls. The RET instruction would still normally take an immediate (just as e.g. x86/x64's RET does) and additionally adjust the stack pointer by its value (either before or after loading the return address from the tip of the stack).
For modern systems, stack buffer overflow bugs haven't been great to exploit for a while. You need at least a stack cookie leak and on Apple Silicon the return addresses are MACed so overwriting them is a fools errand (2^-16 chance of success).
Most exploitable memory corruption bugs are heap buffer overflows.
In ARMv4/v5 (non-thumb-mode) stack is purely a convention that hardware does not enforce. Nobody forces you to use r13 as the stack pointer or to make the stack descending. You can prototype your approach trivially with small changes to gcc and linux kernel. As this is a standard architectural feature, qemu and the like will support emulating this. And it would run fine on real hardware too. I'd read the paper you publish based on this.
armv8/VMSAv8-64 has huge table support with optional contiguous bit allowing mapping up to 16GB at a time [0] [1]. Which will result in (almost) no address translations on any practical amount of memory available today.
Likely the issue is between most user systems not configuring huge tables and developers not keen on using things they can't test locally. Though huge tables are prominent in single-app servers and game consoles spaces.
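On Linux, the "configuring" part looks roughly like this: an explicit huge-page mapping via MAP_HUGETLB, which fails unless the administrator has reserved hugepages beforehand (e.g. via /proc/sys/vm/nr_hugepages). Transparent huge pages are the zero-configuration alternative, at the kernel's discretion.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    /* Ask for a 2 MiB huge-page-backed mapping. */
    size_t len = 2u * 1024 * 1024;
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");   /* typically: no hugepages reserved */
        return 1;
    }
    munmap(p, len);
    return 0;
}
```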
What about an architecture, where there are pages and access permissions, but no translation (virtual address is always equal to physical)? fork() would become impossible, but Windows is fine without it anyway.
You are describing a memory protection unit (MPU). Those are common in low-resource contexts that are too simple to afford a full memory management unit (MMU). The problem with scaling that up, especially in general-purpose environments with dynamic process creation, is fragmentation of the shared address space.
You need a contiguous chunk for whatever object you are allocating. Other allocations fragment the address space, so there might be adequate space in total, but no individual contiguous chunk is large enough. You need to move around the backing storage, but then that makes your linear addresses non-stable. You solve that by adding an indirection layer mapping your "address", which is really a key/ID, to the backing storage. At that point you are basically back to an MMU.
CHERI is undeniably on the rise. Adapting existing code generally only requires rewriting less than 1% of the codebase. It offers speedups for existing as well as new languages (designed with the hardware in mind). I expect to see it everywhere in about a decade.
There's a big 0->1 jump required for it to actually be used by 99% of consumers -- x86 and ARM have to both make a pretty fundamental shift. Do you see that happening? I don't, really.
Tbh I can imagine this catching on if one of the big cloud providers endorses it. Including hardware support in a future version of AWS Graviton, or Azure cloud with a bunch of foundational software already developed to work with it. If one of those hyper scalers puts in the work, it could get to the point where you can launch a simple container running Postgres or whatever, with the full stack adapted to work with CHERI.
CHERI on its own does not fix many of the side-channels, which would need something like "BLACKOUT : Data-Oblivious Computation with Blinded Capabilities", but as I understand it, there is no consensus/infra on how to do efficient capability revocation (potentially in hardware), see https://lwn.net/Articles/1039395/.
On top of that, as I understand it, CHERI has no widespread concept of how to allow disabling/separation of workloads for ultra-low latency/high-throughput applications in mixed-criticality systems in practice. The only system I'm aware of with practical timing guarantees and allowing virtualization is seL4, but again there are no practical guides with trade-offs in numbers yet.
16 bit programming kinda sucked. I caught the tail end of it but my first project was using Win32s so I just had to cherry-pick what I wanted to work on to avoid having to learn it at all. I was fortunate that a Hype Train with a particularly long track was about to leave the station and it was 32 bit. But everyone I worked with or around would wax poetic about what a pain in the ass 16 bit was.
Meanwhile though, the PC memory model really did sort of want memory to be divided into at least a couple of classes and we had to jump through a lot of hoops to deal with that era. Even if I wasn't coding in 16 bit I was still consuming 16 bit games with boot disks.
I was recently noodling around with a retrocoding setup. I have to admit that I did grin a silly grin when I found a set of compile flags for a DOS compiler that caused sizeof(void far*) to return 6 - the first time I'd ever seen it return a non power of two in my life.
“Why don’t we do $thing_that_decisively_failed instead of $thing_that_evolved_to_beat_all_other_approaches?” Usually this sort of question comes from a lack of understanding of the history of the failure of the first and the success of the second.
The fence principle always applies “don’t tear down a fence till you understand why it was built”
Linear address spaces allow for how computers actually operate - layers. Objects are hard to deal with by layers who don’t know about them. Bytes aren’t. They are just bytes. How do you page out “an object”? Do I now need to solve the knapsack problem to efficiently tile them on disk based on their most recent use time and size? …1000 other things…
IIRC Multics (among other systems) had both segmentation and paging, and a unified memory/storage architecture.
[I had thought that Multics' "ls" command abbreviation stood for "list segments" but the full name of the command seems to have been just "list". Sadly Unix/Linux didn't retain the dual full name (list, copy, move...) + abbreviated name (ls, cp, mv...) for common commands, using abbreviated names exclusively.]
Correct. 'Segments' are the Proper Unit to think about a bag of bits doing a computation; pages sitting under segments is how VM systems worked to only load the active parts of segments; that's kind of what defined Demand Paging. We have got to get back to the garden here, we need to start valuing security and safety at least as much as raw speed. https://en.wikipedia.org/wiki/Multics?useskin=vector Tagged memory, Capabilities (highly granular), and, yes, segments. Probably needs a marketing refresh (renaming) so as to not be immediately discarded.
> Like mandatory seat belts, some people argue that there would be no need for CHERI if everyone "just used type-safe languages"[...] I'm not having any of it.
I wish the author had offered a more detailed refutation than "I'm not having it". I'm pretty sure the claim is right! I'm fairly convinced that we'd be a lot better off moving to ring0-only linear-memory architectures and relying on abstraction-theoretic security ("langsec") rather than fattening up the hardware with random whack-a-mole mitigations. We're gradually moving in that direction anyway without much of a concerted effort.
An open secret in our field is: the current market leading OSes and (to some extent) system architectures are antiquated and sub-optimal at their foundation due to backward compatibility requirements.
If we started green field today and managed to mitigate second system syndrome, we could design something faster, safer, overall simpler, and easier to program.
Every decent engineer and CS person knows this. But it’s unlikely for two reasons.
One is that doing it while avoiding second system syndrome takes teams with a huge amount of both expertise and discipline. That includes the discipline to be ruthless about exterminating complexity and saying no. That’s institutionally hard.
The second is that there isn’t strong demand. What we have is good enough for what most of the market wants, and right now all the demand for new architecture work is in the GPU/NPU/TPU space for AI. Nobody is interested in messing with the foundation when all the action is there. The CPU in that world is just a job manager for the AI tensor math machine.
Quantum computing will be similar. QC will be controlled by conventional machines, making the latter boring.
We may be past the window where rethinking architectural choices is possible. If you told me we still had Unix in 2000 years I would consider it plausible.
Aerospace, automotive, and medical devices represent a strong demand. They sometimes use and run really interesting stuff, due to the lack of such a strong backwards-compatibility demand, and a very high cost of software malfunction. Your onboard engine control system can run an OS based on seL4 with software written using Ada SPARK, or something. Nobody would bat an eye, nobody needs to run 20-years-old third-party software on it.
I don’t think these devices represent a demand in the same way at all. Secure boot firmware is another “demand” here that’s not really a demand.
All of these things, generally speaking, run unified, trusted applications, so there is no need for dynamic address space protection mechanisms or “OS level” safety. These systems can easily ban dynamic allocation, statically precompute all input sizes, and given enough effort, can mostly be statically proven given the constrained input and output space.
Or, to make this thesis more concise: I believe that OS and architecture level memory safety (object model addressing, CHERI, pointer tagging, etc.) is only necessary when the application space is not constrained. Once the application space is fully constrained you are better off fixing the application (SPARK is actually a great example in this direction).
Mobile phones are the demand and where we see the research and development happening. They’re walled off enough to be able to throw away some backwards compatibility and cross-compatibility, but still demand the ability to run multiple applications which are not statically analyzed and are untrusted by default. And indeed, this is where we see object store style / address space unflattening mitigations like pointer tagging come into play.
> we could design something faster, safer, overall simpler, and easier to program
I do remain doubtful on this for general purpose computing: hardware for low latency/high throughput is at odds with full security (absence of observable side-channels). Optimal latency/throughput requires time-constrained, hardware-level programming with FPGAs or building custom hardware (high cost), usually programmed on dedicated hardware/software or via things like system-bypass solutions.
Simplicity is at odds with generality; see weak/strong formal systems vs. strong/weak semantics.
If you factor those compromises in, then you'll end up with the current state plus historical mistakes like the missing vertical system integration of software stacks above kernel space as the TCB, bad APIs due to missing formalization, CHERI with its current shortcomings, etc.
I do expect things to change once mandatory security processors become more of a requirement, leading to multi-CPU solutions and the potential for developers to use both complex and simple CPUs on the same system, meaning roughly time-accurate ones, virtual and/or real.
> The second is that there isn’t strong demand.
This is not true for virtualization and security use cases, but that is not obvious yet due to the lack of widespread attacks; see the side-channel leaks of cloud solutions. Take a look at the growth of hardware security module vendors.
> That includes the discipline to be ruthless about exterminating complexity and saying no. That’s institutionally hard.
You need to make a product that out-performs your competitors. If their chip is faster then your work will be ignored regardless of how pure you managed to keep it.
> We may be past the window where rethinking architectural choices is possible.
I think your presumption that our architectures are extremely sub-optimal is wrong. They're exceptionally optimized. Just spend some time thinking about branch prediction and register renaming. It's a steep cliff for any new entrant. You not only have to produce something novel and worthwhile but you have to incorporate decades of deep knowledge into the core of your product, and you have to do all of that without introducing any hardware bugs.
You stand on the shoulders of giants and complain about the style of their footwear.
That’s another reason current designs are probably locked in. It’s called being stuck at a local maximum.
I’m not saying what we have is bad, just that the benefit of hindsight reveals some things.
Computing is tougher than other areas of engineering when it comes to greenfielding due to the extreme interlocking lock-in effects that emerge from things like instruction set and API compatibility. It’s easier to greenfield, say, an engine or an aircraft design, since doing so does not break compatibility with everything. If aviation were like computing, coffee mugs from propeller aircraft would fail to hold coffee (or even be mugs) on a jet aircraft.
Aviation does have a lot of backwards compatibility problems. It's one reason Boeing kept revving the 737 to make the Max version. The constraints come from things like training, certification, runway length, fuel mixes, radio protocols, regulations...
How true is this, really? When does the OS kernel take up more than a percent or so of a machine's resources nowadays? I think the problem is that there is so little juice there to squeeze that it's not worth the huge effort.
Look behind the curtains, and the space for improvement over the UNIX model is enormous. Our only saving grace is that computers have gotten ridiculously fast.
The problem isn’t direct overhead. The problem is shit APIs like blocking I/O that we constantly have to work around via heroic extensions like io_uring, an inefficient threading model that forces every app to roll its own scheduler (async etc.), lack of OS level support for advanced memory management which would be faster than doing it in user space, etc.
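For readers who have not run into it, the shape of that workaround looks roughly like this with liburing (a minimal sketch, error handling mostly omitted; the file name data.bin is made up):

    #include <fcntl.h>
    #include <liburing.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        struct io_uring ring;
        struct io_uring_cqe *cqe;
        char buf[4096];

        if (io_uring_queue_init(8, &ring, 0) < 0)
            return 1;

        int fd = open("data.bin", O_RDONLY);
        if (fd < 0)
            return 1;

        /* Queue an asynchronous read instead of blocking in read(2). */
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
        io_uring_submit(&ring);

        /* ...do other useful work here while the kernel does the I/O... */

        io_uring_wait_cqe(&ring, &cqe);
        printf("read %d bytes\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);

        close(fd);
        io_uring_queue_exit(&ring);
        return 0;
    }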
The thing about AI though is that it has indirect effects down the line. E.g. as prevalence of AI-generated code increases, I would argue that we'll need more guardrails both in development (to ground the model) and at runtime (to ensure that when it still fails, the outcome is not catastrophic).
This is like saying generic systems are bad because both you and a hacker can make sane assumptions about them, and thus, even if they are more performant/usable, they are also more vulnerable and hence shouldn't be used.
I don't understand this.
I have seen bad takes but this one takes the cake. Brilliant start to 2026...
More advocacy propaganda for corporate authoritarianism under the guise of "safety". Locked-down systems like he describes fortunately died out long ago, but they are making a vicious comeback and will take over unless we fight it as much as we can.
Whether a system is locked down is not a technology issue; it's about who has the key. You wouldn't be using MS-DOS today. Having more control over what the applications are up to would be beneficial for the user. The modern multitasking systems have their origin in the time-sharing systems (which are exactly the locked-down ones), where security was "protect the admin's authority, protect users from each other" and hence "what the application does is by definition authorized by the user that started the application". Then we started adding some "protect user data from the programs" safeguards, but on the desktop it was always an afterthought, and on mobile the new security model is "protect the platform vendor's authority from the user". Sadly, a new API designed around "protect programs from each other, enforce the user's authority" never materialized.
But all of this is about IO. What OP is talking about is the memory model, and the change they propose is not about "don't let the unauthorized ones do things" but rather "make it harder for a confused deputy to do things". This one is pretty uncontroversial in its intent, though I personally don't really agree with the approach.
People and companies again and again start out with segmentation, can't make it work and then later abandon it for linear paged memory.
My interpretation is that segmentation is one of those things that sounds great in theory, but doesn't work nearly as well in practice. Just thinking about it in the abstract, making an object boundary also a physical hardware-enforced protection boundary sounds absolutely perfect to me! For example something like the LOOM object-based virtual memory system for Smalltalk (though that was more software).
But theory ≠ practice. Another example of something that sounded great in theory was SOAR: Smalltalk on a RISC. They tried implementing a good part of the expensive bits of Smalltalk in silicon in a custom RISC design. It worked, but the benefits turned out to be minimal. What actually helped were larger caches and higher memory bandwidth, so RISC.
Another example was the Rekursiv, which also had object-capability addressing and a lot of other OO features in hardware. Also didn't go anywhere.
Again: not everything that sounds good in theory also works out in practice.
All the examples you bring up are from an entirely different time in terms of hardware, a time when the major technological limitations were things like how many pins a chip could have and two-layer PCBs.
Ideas can be good, but fail because they are premature relative to the technological means we have to implement them. (Electric vehicles will probably be the future text-book example of this.)
The interesting detail in the R1000's memory model is that it combines segmentation with pages, removing the need for segments to be contiguous in physical memory, which gets rid of the fragmentation issue, which was a huge problem for the architectures you mention.
But there obviously always will be a tension between how much info you stick into whatever goes for a "pointer" and how big it becomes (i.e. "fat pointers"), but I think we can safely say that CHERI has documented that fat pointers are well worth their cost, and now we are just discussing what's in them.
The Burroughs large system architecture of the 1960s and 1970s (B6500, B6700 etc.) did it. Objects were called “arrays” and there was hardware support for allocating and deallocating them in Algol, the native language. These systems were initially aimed at businesses (for example, Ford was a big customer for such things as managing what parts were flowing where) I believe, but later they managed to support FORTRAN with its unsafe flat model.
These were large machines (think of a room 20m square), and with explicit hardware support for Algol operations, including the array handling and display registers for nested functions, they were complex and power hungry, with a lot to go wrong. Eventually, with the technology of the day, they became uncompetitive against simpler architectures. By this time too, people wanted to program in languages like C++ that were not supported.
With today’s technology, it might be possible.
> > Linear virtual addresses were made to be backwards-compatible with tiny computers with linear physical addresses but without virtual memory.
> That is false. In the Intel World, we first had the iAPX 432, which was an object-capability design. To say it failed miserably is overselling its success by a good margin.
That's not refuting the point he's making. The mainframe-on-a-chip iAPX family (and Itanium after it) died and had no heirs. The current popular CPU families are all descendants of the stopgap 8086, evolved from the tiny-computer CPUs, or of ARM's straight-up embedded CPU designs.
But I do agree with your point that a flat (global) virtual memory space is a lot nicer to program. In practice we've been fast moving away from that again though, the kernel has to struggle to keep up the illusion: NUCA, NUMA, CXL.mem, various mapped accelerator memories, etc.
Regarding the iAPX 432, I do want to set the record straight, as I think you are insinuating that it failed because of its object memory design. The iAPX failed mostly because of its abject performance characteristics, but that was in retrospect [1] not inherent to the object directory design. It lacked very simple look-ahead mechanisms, had no instruction or data caches, no registers and not even immediates. Performance did not seem to be a top priority in the design, to paraphrase an architect. Additionally, the compiler team was not aligned and failed to deliver on time, which only compounded the performance problem.
The way you selectively quoted: yes, you removed the refutation.
And regarding the iAPX 432: it was slow in large part due to the failed object-capability model. For one, the model required multiple expensive lookups per instruction. And it required tremendous numbers of transistors, so many that despite forcing a (slow) multi-chip design there still wasn't enough transistor budget left over for performance enhancing features.
Performance enhancing features that contemporary designs with smaller transistor budgets but no object-capability model did have.
Opportunity costs matter.
A huge factor in the iAPX 432's utter lack of success was technological restrictions, like pin-count limits, laid down by Intel top brass, which forced stupid and silly limitations on the implementation.
That's not to say that the iAPX 432 would have succeeded under better management, but only that you cannot point to some random part of the design and say "that obviously does not work".
For objects, you have the IBM iSeries / AS/400 based systems, which used an object-capabilities model (as far as I understand it). A refinement and simplification of what was pioneered in the less successful System/38.
For linear, you have the Sun SPARC processor coming out in 1986, the same year that the 386 shipped in volume. I think the use of linear addressing by Unix made it more popular (the MIPS R2000 came out in January 1986, also).
> IBM i Series / AS400
Isn't the AS/400 closer to a JVM than to an iAPX 432 in its implementation details? Sans IBM's proprietary lingo, TIMI is just a bytecode virtual machine and was designed as such from the beginning. Then again, on a microcoded CPU it's hard to tell the difference.
> I think the use by Unix of linear made it more popular
More like the linear address space was the only logical solution since the EDVAC. Then in the late '50s the Manchester Atlas invented virtual memory to abstract away the magnetic drums. Some smart minds (Robert S. Barton with his B5000, which was a direct influence on the JVM, but was he the first one?) realised what we actually want is segment/object addressing. Multics/GE-600 went with segments (couldn't find any evidence they were directly influenced by the B5000, but it seems so).
System/360, which was the pre-Unix lingua franca, went with a flat address space. Guess the IBM folks wanted to go as conservative as possible. They also wanted S/360 to compete in HPC as well, so performance was quite important - and segment addressing doesn't give you that. Then VM/370 showed that flat/paged addressing allows you to do things segments can't. And then came the PDP-11 (which was more or less about compressing S/360 into a mini, sorry DEC fans), RMS/VMS and Unix.
> TIMI is just a bytecode virtual machine and was designed as such from the beginning.
It's a bit more complicated than that. For one, it's an ahead-of-time translation model. The object references are implemented as tagged pointers to a single large address space. The tagged pointers rely on dedicated support in the Power architecture. The Unix compatibility layer (PASE) simulates per-process address spaces by allocating dedicated address space objects for each Unix process (these are called Terraspace objects).
When I read Frank Soltis' book a few years ago, the description of how the single level store was implemented involved segmentation, although I got the impression that the segments are implemented using pages in the Power architecture. The original CISC architecture (IMPI) may have implemented segmentation directly, although there is very little documentation on that architecture.
This document describes the S/38 architecture, and many of the high level details (if not the specific implementation) also apply to the AS/400 and IBM i: https://homes.cs.washington.edu/~levy/capabook/Chapter8.pdf
> iAPX 432
Yes, this was a failure, the Itanium of the 1980s.
I also regard Ada as a failure. I worked with it many years ago. Ada would take 30 minutes to compile a program. Turbo C++ compiled equivalent code in a few seconds.
Machines are thousands of times faster now, yet C++ compilation is still slow somehow (templates? optimization? disinterest in compiler/linker performance? who knows?) Saving grace is having tons of cores and memory for parallel builds. Linking is still slow, though.
Of course Pascal compilers (Turbo Pascal etc.) could be blazingly fast since Pascal was designed to be compiled in a single pass, but presumably linking was faster as well. I wonder how Delphi or current Pascal compilers compare? (Pascal also supports bounded strings and array bounds checks IIRC.)
> I wonder how Delphi or current Pascal compilers compare?
Just did a full build of our main Delphi application on my work laptop, sporting an Intel i7-1260P. It compiled and linked just shy of 1.9 million lines of code in 31 seconds. So, still quite fast.
> Because the attempts at segmented or object-oriented address spaces failed miserably.
> That is false. In the Intel World, we first had the iAPX 432, which was an object-capability design. To say it failed miserably is overselling its success by a good margin.
I would further posit that segmented and object-oriented address spaces have failed and will continue to fail for as long as we have a separation into two distinct classes of storage: ephemeral (DRAM) and persistent storage / backing store (disks, flash storage, etc.), as opposed to having a single, unified concept of nearly infinite (at least logically if not physically), always-on «memory» where everything is – essentially – an object.
Intel's Optane has given us a brief glimpse into what such a future could look like but, alas, that particular version of the future has not panned out.
Linear address space makes perfect sense for size-constrained DRAM, and makes little to no sense for the backing store where a file system is instead entrusted with implementing an object-like address space (files, directories are the objects, and the file system is the address space).
Once a new, successful memory technology emerges, we might see a resurgence of the segmented or object-oriented address space models, but until then, it will remain a pipe dream.
I don't see how any amount of memory technology can overcome the physical realities of locality. The closer you want the data to be to your processor, the less space you'll have to fit it. So there will always be a hierarchy where a smaller amount of data can have less latency, and there will always be an advantage to cramming as much data as you can at the top of the hierarchy.
While that's true, CPUs already have automatically managed caches. It's not too much of a stretch to imagine a world in which RAM is automatically managed as well and you don't have a distinction between RAM and persistent storage. In a spinning-rust world, that never would have been possible, but with modern NVMe, it's plausible.
CPUs manage it, but ensuring your data structures are friendly to how they manage caches is one of the keys to fast programs - which some of us care about.
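A toy illustration of that point (array size is arbitrary, nothing vendor-specific): the same automatically managed memory behaves very differently depending on access order, which is exactly the detail programmers still have to care about.

    #include <stdio.h>

    #define N 2048

    static double a[N][N];              /* 32 MB, zero-initialized          */

    int main(void)
    {
        double sum = 0.0;

        /* Row-major walk: consecutive accesses share cache lines, so
           the caches and prefetchers do their job. */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += a[i][j];

        /* Column-major walk: each access lands N*8 bytes away from the
           previous one, so for large N nearly every access misses,
           even though the hardware "manages" the cache automatically. */
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += a[i][j];

        printf("%f\n", sum);
        return 0;
    }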
«Memory technology» as in «a single tech» that blends RAM and disk into just «memory» and obviates the need for the disk as a distinct concept.
One can conjure up RAM which has become exabytes large and which does not lose data after a system shutdown. In such a unified memory model, everything is local, promptly available to, and directly addressable by the CPU.
Please do note that multi-level CPU caches still do have their places in this scenario.
In fact, this has been successfully done in the AS/400 (or i Series), which I have mentioned elsewhere in the thread. It works well and is highly performant.
> «Memory technology» as in «a single tech» that blends RAM and disk into just «memory» and obviates the need for the disk as a distinct concept.
That already exists. Swap memory, mmap, disk paging, and so on.
Virtual memory is mostly fine for what it is, and it has been used in practice for decades. The problem that comes up is latency. Access time is limited by the speed of light [1]. And for that reason, CPU manufacturers continue to increase the capacities of the faster, closer memories (specifically registers and L1 cache).
[1] https://www.ilikebigbits.com/2014_04_21_myth_of_ram_1.html
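For concreteness, this is what that existing unification looks like from user space: a file mapped with mmap is read like RAM, and the kernel pages it in behind your back (POSIX sketch, error handling trimmed, file name assumed):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("data.bin", O_RDONLY);
        struct stat st;
        if (fd < 0 || fstat(fd, &st) != 0 || st.st_size == 0)
            return 1;

        /* "Disk" now looks like "memory": plain loads, no read(2). */
        const unsigned char *p = mmap(NULL, (size_t)st.st_size,
                                      PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED)
            return 1;

        unsigned long sum = 0;
        for (off_t i = 0; i < st.st_size; i++)
            sum += p[i];                /* page faults pull data from disk  */
        printf("checksum %lu\n", sum);

        munmap((void *)p, (size_t)st.st_size);
        close(fd);
        return 0;
    }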
I shudder to think about the impact of concurrent data structures fsync'ing on every write because the programmer can't reason about whether the data is in memory where a handful of atomic fences/barriers are enough to reason about the correctness of the operations, or on disk where those operations simply do not exist.
Also linear regions make a ton of sense for disk, and not just for performance. WAL-based systems are the cornerstone of many databases and require the ability to reserve linear regions.
OTOH, WAL systems are only necessary because storage devices present an interface of linear regions. The WAL system could move into the hardware.
Linear regions are mostly a figment of the imagination in real life, but they are a convenient abstraction and concept.
Linear regions are nearly impossible to guarantee, unless the underlying hardware has specific, controller-level provisions.
The only practical way I can think of to ensure the guaranteed contiguous allocation of blocks unfortunately involves a conventional hard drive that has a dedicated partition created just for the WAL. In fact, this is how Oracle installation worked – it required a dedicated raw device to bypass both the VMM and the file system.
When RAM and disk(s) are logically the same concept, the WAL can be treated as an object of the «WAL» type with certain properties, specific to this object type only, to support WAL peculiarities.
Ultimately everything is an abstraction. The point I'm making is that linear regions are a useful abstraction for both disk and memory, but that's not enough to unify them. Particularly in that memory cares about the visibility of writes to other processes/threads, whereas disk cares about the durability of those writes. This is an important distinction that programmers need to differentiate between for correctness.
Perhaps a WAL was a bad example. Ultimately you need the ability to atomically reserve a region of a certain capacity and then commit it durably (or roll back). Perhaps there are other abstractions that can do this, but with linear memory and disk regions it's exceedingly easy.
Personally I think file I/O should have an atomic CAS operation on a fixed maximum number of bytes (just like shared memory between threads and processes) but afaik there is no standard way to do that.
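There is no portable primitive for it, agreed; the closest approximation I know of is mmap'ing the file MAP_SHARED and doing the CAS in memory, which buys inter-process atomicity for the visibility half but says nothing about durability until an explicit msync, which is exactly the memory/disk split being discussed. A hedged Linux/POSIX sketch (file name and layout invented):

    #include <fcntl.h>
    #include <stdatomic.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("counter.bin", O_RDWR | O_CREAT, 0644);
        if (fd < 0 || ftruncate(fd, sizeof(uint64_t)) != 0)
            return 1;

        void *p = mmap(NULL, sizeof(uint64_t), PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
            return 1;

        /* Atomic w.r.t. other processes mapping the same file, but NOT
           durable until msync/fsync completes. */
        _Atomic uint64_t *slot = p;
        uint64_t expected = 0;
        atomic_compare_exchange_strong(slot, &expected, (uint64_t)42);

        msync(p, sizeof(uint64_t), MS_SYNC);      /* now it is on disk  */
        printf("value now %llu\n", (unsigned long long)atomic_load(slot));

        munmap(p, sizeof(uint64_t));
        close(fd);
        return 0;
    }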
I do not share the view that the unification of RAM and disk requires or entails linear regions of memory. In fact, the unification reduces the question of «do I have a contiguous block of size N to do X» to a mere «do I have enough memory to do X?», commits and rollbacks inclusive.
The issue of durability, however, remains a valid concern in either scenario, but the responsibility to ensure durability is delegated to the hardware.
Furthermore, commits and rollbacks are not sensitive to memory linearity anyway; they are sensitive to the durability of the operation, and they may be sensitive to latency, although that is not a frequently occurring constraint. In the absence of a physical disk, commits/rollbacks can be implemented entirely in RAM today using software transactional memory (STM) – see the relevant Haskell library and the white paper on STM.
Lastly, when everything is an object in the system, the way the objects communicate with each other also changes from the traditional model of memory sharing to message passing, transactional outboxes, and similar, where the objects encapsulate the internal state without allowing other objects to access it – courtesy of the object-oriented address space protection, which is what the conversation initially started from.
> Show me somebody who calls the IBM S/360 a RISC design, and I will show you somebody who works with the s390 instruction set today.
Ahaha so true.
But to answer the post's main question:
> Why do we even have linear physical and virtual addresses in the first place, when pretty much everything today is object-oriented?
Because backwards compatibility is more valuable than elegant designs. Because array-crunching performance is more important than safety. Because a fix for a V8 vulnerability can be quickly deployed while a hardware vulnerability fix cannot. Because you can express any object model on top of flat memory, but expressing one object model (or flat memory) in terms of another object model usually costs a lot. Because nobody ever agreed on what the object model should be. But most importantly: because "memory safety" is not worth the costs.
But we don't have a linear address space, unless you're working with a tiny MCU. For last like 30 years we have virtual address space on every mainstream processor, and we can mix and match pages the way we want, insulate processes from one another, add sentinel pages at the ends of large structures to generate a fault, etc. We just structure process heaps as linear memory, but this is not a hard requirement, even on current hardware.
What we lack is the granularity that something like iAPX432 envisioned. Maybe some hardware breakthrough would allow for such granularity cheaply enough (like it allowed for signed pointers, for instance), so that smart compilers and OSes would offer even more protection without the expense of switching to kernel mode too often. I wonder what research exists in this field.
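To make the "sentinel pages" point concrete, here is a minimal user-space sketch with POSIX mmap/mprotect (error handling mostly omitted; sizes are arbitrary):

    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        size_t page = (size_t)sysconf(_SC_PAGESIZE);
        size_t want = 4 * page;                   /* usable buffer size */

        /* Reserve the buffer plus one extra page... */
        unsigned char *base = mmap(NULL, want + page,
                                   PROT_READ | PROT_WRITE,
                                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (base == MAP_FAILED)
            return 1;

        /* ...and make that last page inaccessible: any overrun past
           'want' bytes now faults immediately instead of silently
           corrupting whatever happens to live next door. */
        mprotect(base + want, page, PROT_NONE);

        memset(base, 0, want);                    /* fine               */
        /* base[want] = 0;                           would SIGSEGV      */

        munmap(base, want + page);
        return 0;
    }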
This feels like a pointless form of pedantry.
Okay, so we went from linear address spaces to partitioned/disaggregated linear address spaces. This is hardly the victory you claim it is, because page sizes are increasing and thus the minimum addressable block of memory keeps increasing. Within a page everything is linear as usual.
The reason why linear address spaces are everywhere has to do with the fact that they are extremely cost effective and fast to implement in hardware. You can do prefix matching to check if an address is pointing at a specific hardware device and you can use multiplexers to address memory. Addresses can easily be encoded inside a single std_ulogic_vector. It's also possible to implement a Network-on-Chip architecture for your on-chip interconnect. It also makes caching easier, since you can translate the address into a cache entry.
When you add a scan chain to your flip flops, you're implicitly ordering your flip flops and thereby building an implicit linear address space.
There is also the fact that databases with auto incrementing integers as their primary keys use a logical linear address space, so the most obvious way to obtain a non-linear address space would require you to use randomly generated IDs instead. It seems like a huge amount of effort would have to be spent to get away from the idea of linear address spaces.
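A trivial illustration of the prefix-matching point (the map below is loosely modelled on a typical Cortex-M style layout; the exact bases are an assumption, not any particular chip):

    #include <stdint.h>
    #include <stdio.h>

    /* The top nibble of the linear address selects the target; the rest
       is an offset within it.  In silicon this is just a handful of
       comparators on the high address bits. */
    static const char *decode(uint32_t addr)
    {
        if ((addr & 0xF0000000u) == 0x00000000u) return "flash";
        if ((addr & 0xF0000000u) == 0x20000000u) return "SRAM";
        if ((addr & 0xF0000000u) == 0x40000000u) return "peripherals";
        return "unmapped";
    }

    int main(void)
    {
        printf("%s\n", decode(0x20001234u));
        printf("%s\n", decode(0x40020000u));
        return 0;
    }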
> But we don't have a linear address space, unless you're working with a tiny MCU.
We actually do, albeit for a brief duration of time – upon a cold start of the system, while the MMU is not yet active, no address translation is performed, and the entire memory space is treated as a single linear, contiguous block (even if there are physical holes in it).
When the system is powered on, the CPU runs in privileged mode to allow an operating system kernel to set up the MMU and activate it, which takes place early in the boot sequence. But until then, virtual memory is not available.
Those holes can be arbitrarily large, though, especially in weirder environments (e.g., memory-mapped optane and similar). Linear address space implies some degree of contiguity, I think.
Indeed. It can get even weirder in the embedded world, where a ROM, an E(E)PROM or a device may get mapped into an arbitrary slice of the physical address space, anywhere within its bounds. It has become less common, though.
But devices are still commonly mapped at the top of the physical address space, which is a rather widespread practice.
And it's not uncommon for devices to be mapped multiple times in the address space! The different aliases provide slightly different ways of accessing it.
For example, 0x000-0x0ff providing linear access to memory bank A, 0x100-0x1ff linear access to bank B, but 0x200-0x3ff providing striped access across the two banks, with evenly-addressed words coming from bank A but odd ones from bank B.
Similarly, 0x000-0x0ff accessing memory through a cache, but 0x100-0x1ff accessing the same memory directly. Or 0x000-0x0ff overwriting data, 0x100-0x1ff setting bits (OR with current content), and 0x200-0x2ff clearing bits.
It's entirely possible to implement segments on top of paging. What you need to do is add the kernel abstractions for implementing call gates that change segment visibility, and write some infrastructure to manage unions-of-a-bunch-of-little-regions. I haven't implemented this myself, but a friend did on a project we were working on together, and as a mechanism it works perfectly well.
Getting userspace to do the right thing without upending everything is what killed that project.
There is also a problem of nested virtualization. If the VM has its own "imaginary" page tables on top of the hypervisor's page tables, then the number of actual physical memory reads goes from 4–6 to 16–36.
If I understood correctly, you're talking about using descriptors to map segments; the issue with this approach is two-fold: it is slow (as a descriptor needs to be created for each segment - and sometimes more than one, if you need write-execute permissions), and there is a practical limit on the number of descriptors you can have - 8192 total, including call gates and whatnot. To extend this, you need to use LDTs, which - again - also require a descriptor in the GDT and are limited to 8192 entries. In a modern desktop system, 67 million segments would be both quite slow and at the same time quite limited.
But that wouldn't protect against out-of-bounds access (which is the whole point of segments), would it?
Indeed. Also, TLB as it exists on x64 is not free, nor is very large. A multi-level "TLB", such that a process might pick an upper level of a large stretch of lower-level pages and e.g. allocate a disjoint micro-page for each stack frame, would be cool. But it takes a rather different CPU design.
"Please give me a segmented memory model on top of paged memory" - words which have never been uttered
There is a subtle difference between "give me an option" and "thrust on me a requirement".
> Why do we even have linear physical and virtual addresses in the first place, when pretty much everything today is object-oriented?
What a weird question, conflating one thing with the other.
I’m working on an object-capability system, and trying hard to see if I can make it work using a linear address space so I don’t have to waste two or three pages per “process” [1][2]. I really don’t see how objects have anything to do with virtual memory and memory isolation, as they are a higher abstraction. These objects have to live somewhere, unless the author is proposing a system without the classical model of addressable RAM.
—-
1: the reason I prefer a linear address space is that I want to run millions of actors/capabilities on a machine, and the latency and memory usage of switching address space and registers become really onerous. Also, I am really curious to see how ridiculously fast modern CPUs are when you’re not thrashing the TLB every millisecond or so.
2: in my case I let system processes/capabilities written in C run in linear address space where security isn’t a concern, and user space in a RISC-V VM so they can’t escape. The dream is that CHERI actually goes into production and user space can run on hardware, but that’s a big if.
The memory management story is still a big question: how do you do allocations in a linear address space? If you give out pages, there’s a lot of wastage. The alternative is a global memory allocator, which I am really not keen on. Still figuring out as I go.
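One pattern that might fit (a sketch of the general idea, not a claim about this particular project): carve a small arena per actor out of the one big linear mapping, so allocation is a bump of an offset and tearing an actor down is resetting that offset, with no global allocator in the hot path.

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    typedef struct {
        uint8_t *base;
        size_t   cap;
        size_t   used;
    } arena_t;

    static arena_t arena_new(size_t cap)
    {
        arena_t a = { malloc(cap), cap, 0 };      /* stand-in for a slice
                                                     of the big mapping  */
        return a;
    }

    static void *arena_alloc(arena_t *a, size_t n)
    {
        n = (n + 15) & ~(size_t)15;               /* keep 16-byte alignment */
        if (!a->base || a->used + n > a->cap)
            return NULL;
        void *p = a->base + a->used;
        a->used += n;
        return p;
    }

    static void arena_reset(arena_t *a)           /* "the actor is done"    */
    {
        a->used = 0;
    }

    int main(void)
    {
        arena_t actor = arena_new(64 * 1024);
        char *msg = arena_alloc(&actor, 32);
        if (msg)
            strcpy(msg, "hello from one actor");
        arena_reset(&actor);                      /* one store, no free() walk */
        free(actor.base);
        return 0;
    }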
Have you looked at the Apple Newton memory architecture?
http://waltersmith.us/newton/HICSS-92.pdf
> the data bus is 128 bits wide: 64-bit for the data and 64-bit for data's type
That seems a bit wasteful if you're not using a lot of object types.
Meet TIMI – the Technology Independent Machine Interface of IBM's iSeries (née AS/400), which defines pointers as 128-bit values[0], and which is a 1980s design.
It has allowed the AS/400 to have a single-level store, which means that «memory» and «disk» live in one conceptual address space.
A pointer can carry more than just an address – object identity, type, authority metadata – AS/400 uses tagged 16-byte pointers to stop arbitrary pointer fabrication, which supports isolation without relying on the usual per-process address-space model in the same way UNIX does.
Such a «fat pointer» approach is conceptually close to modern capability systems (for example CHERI's 128-bit capabilities), which exist for similar [safety] reasons.
[0] 128-bit pointers in the machine interface, not a 128-bit hardware virtual address space though.
Is this still used in IBM hardware?
It is, although TIMI does not exist in the hardware – it is a virtual architecture that has been implemented multiple times on different hardware (i.e., CPUs – IMPI, IBM RS64, POWER, and only the heavens know which CPU IBM uses today).
The software written for this virtual architecture, on the other hand, has not changed and continues to run on modern IBM iSeries systems, even when it originates from 1989 – this is accomplished through static binary translation, or AOT in modern parlance, which recompiles the virtual ISA into the target ISA at startup.
64-bit pointers tend to be a bit wasteful as well.
Especially on a system from the 80s, did they plan to address every bit of memory available on the planet?
I am forever sad that x32 didn't take off. Lower memory use, great performance. Ah well.
If the author is reading these comments: Please write about the fully semantic IDE as soon as you can. Very interested in hearing more about that as it sounds like you've used it a lot
So how do you hook up such a system to actual RAM or EPROMs to allow it to function? Somewhere there has to be an actual address generated.
And that address is going to be contained in a linear address space (possibly with some holes).
But that address doesn't have to be visible at the ISA level.
Code has to have addresses for calls and branches. Debuggers need to be able to control it all.
> Code has to have addresses for calls and branches.
Does it mean that at that level an address has to be an offset in a linear address space?
If you have hardware powerful enough to make addresses abstract, couldn't it also provide the operations to manipulate them abstractly?
Is each branch-free run of instructions an object (which in general will be smaller than "function" or "method" objects) that can be abstracted? How does one manage locality ("these objects are the text of this function")?
Maybe one compromises and treats the text of a function as linear address space with small relative offsets. Of course, other issues will crop up. You can't treat code as an array, unless it's an array of the smallest word (bytes, say) even if the instructions are variable length. How do you construct all the pointer+capability values for the program's text statically? The linker would have to be able to do that...
The Rational R1000 is an interesting (and obscure) example to use - IBM's S/38 and AS/400 (now IBM i) also took a similar approach, and saw far more widespread usage.
What a clueless post. Even ignoring their massive overstatement of the difficulty and hardware complexity of hardware mapping tables, they appear to not even understand the problems solved by mapping tables.
Okay, let us say you have a physical object store. How are the actual contents of those objects stored? Are they stored in individual, isolated memory blocks? What if I want to make a 4 GB array? Do I need to have 4 GB memory blocks? What if I only have 6 GB? That is obviously unworkable.
Okay, we can solve that by compacting our physical object store onto a physical linear store and just presenting an object store as an abstraction. Sure, we have a physical linear store, but we never present that to the user. But what if somebody deallocates an object? Obviously we should be able to reuse that underlying physical linear store. What if they allocated a 4 GB array? Obviously we need to be able to fragment that into smaller pieces for future objects. What if we deallocated 4 GB of disjoint 4 KB objects? Should we fail to allocate an 8 KB object just because the fragments are not contiguous? Oh, just keep in mind the precise structure of the underlying physical store to avoid that (what a leaky and error-prone abstraction). Oh, but what if there are multiple programs running, some potentially even buggy; how the hell am I supposed to keep track of global fragmentation of the shared physical store?
Okay, we can solve all of that with a level of indirection by giving you a physical object key instead of a physical object "reference". You present the key, and then we have a runtime structure that allows us to look up where in the physical linear store we have put that data. This allows us to move and compact the underlying storage while letting you have a stable key. Now we have a mapping between object key and linear physical memory. But what if there are multiple programs on the same machine, some of which may be untrustworthy? What if they just start using keys they were not given? Obviously we need some scheme to prevent anybody from using any key. Maybe we could solve that by tagging every object in the system with a list of every program allowed to use it? But the number of programs is dynamic, and if we have millions or billions of objects, each new program would require re-tagging all of those objects. We could make that list only encode "allowed" programs, which would save space and the amount of cleanup work, but how would the hardware do that lookup efficiently and how would it store that data efficiently?
Okay, we can solve that by having a per-program mapping between object key to linear physical memory. Oh no, that is looking suspiciously close to the per-program mapping between linear virtual memory to linear physical memory. Hopefully there are no other problems that will just result in us getting back to right where we started. Oh no, here comes another one. How is your machine storing this mapping between object key to linear physical memory? If you will remember from your data structures courses, those would usually be implemented as either a hash table or a tree. A tree sounds too suspiciously close to what currently exists, so let us use a hash table.
Okay, cool, how big should the hash table be? What if I want a billion objects in this program and a thousand objects in a different program? I guess we should use a growable hash table. All that happens is that if we allocate enough objects, we allocate a new, dynamically sized storage structure and then bulk rehash and insert all the old objects. That is amortized O(1), just at the cost of an unpredictable pause on potentially any memory allocation, which can not only be gigantic but is proportional to the number of live allocations. That is fine if our goal is just putting in a whole hardware garbage collector, but not really applicable for high performance computing. For high performance computing we would want worst-case bounded time and memory cost (not amortized, per operation).
Okay, I guess we have to go with a per-program tree-based mapping between object key and linear physical memory. But it is still an object store, so we won, right? How is the hardware going to walk that efficiently? For the hardware to walk that efficiently, you are going to want a highly regular structure with high fanout, both to maximize the value of the cache lines you load and to reduce the worst-case number of cache lines you need to load. So you will want a B-tree structure of some form. Oh no, that is exactly what hardware mapping tables look like.
But it is still an object store, so we won, right? But what if I deallocated 4 GB of disjoint 4 KB objects? You could move and recompact all of that memory, but why? You already have a mapping structure with a layer of indirection via object keys. Just create an interior mapping within an object between the object-relative offsets and potentially disjoint linear physical memory. Then you do not need physically contiguous backing; you can use a disjoint physical linear store to provide the abstraction of an object linear store.
And now we have a per-program tree-based mapping between linear object addresses and linear physical memory. But what if the objects are of various sizes? In some cases the hardware will traverse the mapping from object key to linear object store, then potentially need to traverse another mapping from a large linear object address to linear physical memory. If we just compact the linear object store mappings, then we can unify the trees and just provide a common linear-address-to-linear-physical-memory mapping, and the tree-based mapping will be tightly bounded for all walks.
And there we have it, a per-program tree-based mapping between linear virtual memory and linear physical memory one step at a time.
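To make the convergence concrete, here is a toy version of the per-program "object key to physical frame" mapping sketched above; the 27-bit key, the 512-way fanout and the names are all invented, but the shape is the point: it is a page-table walk by another name.

    #include <stdint.h>
    #include <stdlib.h>

    #define FANOUT 512                  /* 9 bits of key per level      */

    typedef struct node {
        void *slot[FANOUT];             /* next level, or a leaf frame  */
    } node_t;

    /* Split a 27-bit object key into three 9-bit indices and walk a
       three-level radix tree - structurally the same thing an MMU does
       when it walks its page tables. */
    static void *lookup(node_t *root, uint32_t key)
    {
        node_t *l1 = root->slot[(key >> 18) & 0x1ff];
        if (!l1) return NULL;
        node_t *l2 = l1->slot[(key >> 9) & 0x1ff];
        if (!l2) return NULL;
        return l2->slot[key & 0x1ff];   /* base of the object's backing */
    }

    static int insert(node_t *root, uint32_t key, void *frame)
    {
        unsigned idx[2] = { (key >> 18) & 0x1ff, (key >> 9) & 0x1ff };
        node_t *n = root;
        for (int level = 0; level < 2; level++) {
            node_t **next = (node_t **)&n->slot[idx[level]];
            if (!*next && !(*next = calloc(1, sizeof(node_t))))
                return -1;
            n = *next;
        }
        n->slot[key & 0x1ff] = frame;
        return 0;
    }

    int main(void)
    {
        node_t root = { { 0 } };
        static char frame[4096];        /* stand-in for a physical frame */
        insert(&root, 0x1234567u, frame);
        return lookup(&root, 0x1234567u) == frame ? 0 : 1;
    }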
> What a clueless post [...a lot of OS memory handling details...]. And there we have it.
Considering who wrote the post, I would bet the author is aware of what you've described here.
And they addressed exactly none of the relevant points, instead supporting their arguments by waving in the general direction of outcompeted designs and speculative designs.
CHERI is neat, but, as far as I am aware, still suffers from serious unsolved problems with respect to temporal safety and reclamation. Last I looked (which was probably after 2022 when this post was made), the proposed solutions were hardware garbage collectors which are almost a non-starter. Could that be solved or performant enough? Maybe. Is a memory allocation strategy that can not free objects a currently viable solution for general computing to the degree you argue people not adopting it are whiners? No.
I see no reason to accept a fallacious argument from authority in lieu of actual arguments. And for that matter, I literally do kernel development on a commercial operating system and have personally authored the entirety of the memory management and hardware MMU code for multiple architectures. I am an actual authority on this topic.
> And they addressed exactly none of the relevant points
Clarification: They addressed exactly none of the points you declare as relevant. You identify as an expert in the field and come across as plausibly such, so certainly I’ll still give your opinion on what’s relevant some weight.
Perhaps the author was constrained by a print publication page size limit of, say, one? Or six? That used to be a thing in the past, where people would publish opinions in industry magazines and there was a length cap set by the editor that forced cutting out the usual academic-rigor levels of detail in order to convey a mindset very briefly. What would make a lovely fifty or hundred page paper in today’s uncapped page size world, would have to be stripped of so much detail — of so much proof — in order to fit into any restrictions at all, that it would be impossible to address all possible or even probable argument in a single sitting.
These are obvious and trivial counters to their points. They should have been addressed.
PHK is far from infallible.
Indeed, no human is infallible. But I think when someone (whom you know to be very knowledgeable in the field) writes a post, it's pretty unreasonable to describe it as "a clueless post". The author might be mistaken, perhaps, but almost certainly not clueless.
If PHK, DJB, <insert luminary> writes a post that comes across as clueless or flat out wrong, I'm going to read it and read it carefully. It does happen that <luminary> says dumb and/or incorrect things from time to time, but most likely there will be a very cool nugget of truth in there.
Regarding what PHK seems to be asking for, I think it's... linear addressing of physical memory, yes (because what else can you do?), but with pointer values that have attached capabilities, so that you can dispense with virtual-to-physical memory mapping (and a ton of MMU, TLB, and page-table hardware and software complexity and slowness) and replace it with hardware capability verification. Because such pointers are inherently abstract, the fact that the underlying physical memory address space is linear is irrelevant, and the memory model looks more like "every object is a segment" if you squint hard. Obviously we need to be able to address arrays of bytes, for example, so within an object you have linear addressing, and overall you have it too because physical memory is linear, but otherwise you have a fiction of a pile of objects, some of which you have access to and some of which you don't.
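Roughly, that fiction looks like this in software terms (a conceptual sketch only: CHERI's real 128-bit capabilities compress the bounds, carry sealing/permission bits, and are checked by hardware with tag-protected integrity, none of which is modelled here):

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    enum { PERM_READ = 1, PERM_WRITE = 2 };

    /* A "capability": an address plus the bounds and permissions that
       the hardware would check on every dereference.  Field layout is
       invented for illustration. */
    typedef struct {
        uintptr_t base;      /* start of the object                        */
        size_t    length;    /* size of the object                         */
        uintptr_t cursor;    /* current address within [base, base+length) */
        uint32_t  perms;
    } cap_t;

    /* What a checked load would do in hardware: trap (here: abort) on
       any out-of-bounds or permission-violating access. */
    static uint8_t cap_load_u8(cap_t c, size_t offset)
    {
        uintptr_t addr = c.cursor + offset;
        if (!(c.perms & PERM_READ) || addr < c.base || addr >= c.base + c.length) {
            fprintf(stderr, "capability fault\n");
            abort();
        }
        return *(uint8_t *)addr;
    }

    int main(void)
    {
        char *buf = malloc(16);
        if (!buf) return 1;
        memset(buf, 'x', 16);

        cap_t c = { (uintptr_t)buf, 16, (uintptr_t)buf, PERM_READ };

        printf("%c\n", cap_load_u8(c, 3));    /* fine                      */
        /* cap_load_u8(c, 16);                   would trap                */

        free(buf);
        return 0;
    }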
> What a clueless post. Even ignoring their massive overstatement of the difficulty and hardware complexity of hardware mapping tables, they appear to not even understand the problems solved by mapping tables.
From the article:
> And before you tell me this is impossible: The computer is in the next room, built with 74xx-TTL (transistor-transistor logic) chips in the late 1980s. It worked back then, and it still works today.
Do you think a 1980s computer has no drawbacks compared to 2020 vintage CPUs? It "works..." very slowly and with extremely high power draw. A 1980s design does not in any way prove that the model is viable compared to the state of the art today.
I did not say it was impossible. I said that mapping tables solve a lot of problems. There are very good reasons, as I explicitly outlined, for why they are a good solution to these classes of problems and why object stores fall down when trying to scale them up to parity with modern designs for general purpose computing.
People tried a lot of dead ends in the past before we knew better. You need a direct analysis of the use case, problems, and solutions to actually support the point that an alternative technology is better, rather than just pointing at old examples.
A lot has changed since the 1980s. RAM access is much higher latency (in cycles), we have tons more RAM, and programs use more of it.
Maybe it is still possible but "we did it in the 80s so we can do it now" doesn't work.
Vypercore were trying to make RISC-V CPUs with object-based memory. They went out of business several months ago. I don't have the inside scoop, but I expect the biggest issue is that they were trying to sell it as a performance improvement (hardware based memory allocation), which it probably was... but also they would have been starting from a slower base anyway. "38% faster than linear memory" doesn't sound so great when your chip is half as fast as the competition to start with.
It also didn't protect objects on the stack (afaik) unlike CHERI. But on the other hand it's way simpler than CHERI conceptually, and I think it handled temporal safety more elegantly.
Personally I think Rust combined with memory tagging is going to be the sweet spot. CHERI if you really need ultra-maximum security, but I think the number of people that would pay for that is likely small.
Author here.
This is one of those things, where 99.999% of all IT people have never even heard or imagined that things can be different than "how we have always done it." (Obligatory Douglas Adams quote goes here.)
This makes a certain kind of people, self-secure in their own knowledge, burst out words like "clueless", "fail miserably" etc. based on insufficient depth of actual knowledge. To them I can only say: Study harder, this is so much more technologically interesting, than you can imagine.
And yes, neither the iAPX 432, nor for that matter the Z8000, fared well with their segmented memory models, but it is important to remember that they primarily failed for entirely different reasons, mostly out-of-touch top management, so we cannot, and should not, conclude from that, that all such memory models cannot possibly work.
There are several interesting memory models, which never really got a fair chance, because they came too early to benefit from VLSI technology, and it would be stupid to ignore a good idea, just because it was untimely. (Obligatory "Mother of all demos" reference goes here.)
CHERI is one such memory model, and probably the one we will end up with, at least in critical applications: Stick with the linear physical memory, but cabin the pointers.
In many applications, that can allow you to disable all the Virtual Memory hardware entirely. (I think the "CHERIoT" project does this ?)
The R1000 model is different, but as far as I can tell equally valid; it suffers from a much harder "getting from A to B" problem than CHERI does, yet I can see several kinds of applications where it would totally scream around any other memory model.
But if people have never even heard about it, or think that just because computers look a certain way today, every other idea we tried must by definition have been worse, nobody will ever do the back-of-the-napkin math to see if it would make sense to try it out (again).
I'm sure there are also other memory concepts, even I have not heard about. (Yes, I've worked with IBM S/38)
But what we have right now, huge flat memory spaces, physical and virtual, with a horribly expensive translation mechanism between them, and no pointer safety, is literally the worst of all imaginable memory models, for the kind of computing we do, and the kind of security challenges we face.
There are other similar "we have always done it that way" mental blocks we need to reexamine, and I will answer one tiny question below, by giving an example:
Imagine you sit somewhere in a corner of a HUGE project, like a major commercial operating system with all the bells and whistles, the integrated air-traffic control system for a continent or the software for a state-of-the-art military gadget.
You maintain this library, which exports this function, which has a parameter which defaults to three.
For sound and sane reasons, you need to change the default to four now.
The compiler won't notice.
The linker won't notice.
People will need to know.
Who do you call ?
In the "Rational Environment" on the R1000 computer, you change 3 to 4 and, when you attempt to save your change, the semantic IDE refuses, informing you that it would change the semantics of the following three modules, which call your function without specifying that parameter explicitly - even if you do not have read permission to the source code of those modules.
The Rational Environment did that 40 years ago, can your IDE do that for you today ?
Some developers get a bit upset about that when we demo that in Datamuseum.dk :-)
The difference is that all modern IDEs regard each individual source file as "ground truth", but have nothing even remotely like an overview, or conceptual understanding, of the entire software project.
Yeah, sure, it knows what include files/declaration/exports things depend on, and which source files to link into which modules/packages/libraries, but it does not know what any of it actually means.
And sure, grep(1) is wonderful, but it only tells you what source code you need to read - provided you have the permission to do so.
In the Rational Environment ground truth is the parse tree, and what can best be described as a "preliminary symbol resolution", which is why it knows exactly which lines of code, in the entire project, call your function, with or without what parameters.
Not all ideas are good.
Not all good ideas are lucky.
Not all forgotten ideas should be ignored.
> Why do we even have linear physical and virtual addresses in the first place, when pretty much everything today is object-oriented?
Maybe it's because even though x86-64 is a 64-bit instruction set, the direct CALL and JMP instructions still only support relative 32-bit (or, for JMP, 8-bit) offsets.
> Translating from linear virtual addresses to linear physical addresses is slow and complicated, because 64-bit can address a lot of memory.
Sure but spend some time thinking about how GOT and PLT aren't great solutions and can easily introduce their own set of security complications due to the above limitations.
I think you could argue there is already some effort to do type safety at the ISA register level, with e.g. shadow stack or control flow integrity. Isn't that very similar to this, except targeting program state rather than external memory?
Tagged memory was a thing, and is a thing again on some ARM machines. Check out Google Pixel 9.
I mean, if the stacks grew upwards, that alone would nip 90% of buffer overflow attacks in the bud. Moving the return address from the activation frame into a separate stack would help as well, but I understand that having an activation frame to be a single piece of data (a current continuation's closure, essentially) can be quite convenient.
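To make the geometry concrete, here is a small GCC/Clang-specific sketch (it relies on the __builtin_frame_address builtin and assumes the conventional downward-growing stack of, e.g., x86-64 Linux); it only prints where things sit and does not actually overflow anything:

```cpp
#include <cstdio>

void victim() {
    char buf[16];
    // The frame record sits next to the saved return address on common ABIs.
    void *frame = __builtin_frame_address(0);
    std::printf("local buffer at %p\n", static_cast<void *>(buf));
    std::printf("frame record at %p\n", frame);
    // On a downward-growing stack the frame record is at a *higher* address
    // than buf, so writing past the end of buf walks straight toward the
    // saved return address. With an upward-growing stack the same overflow
    // would walk away from it, into not-yet-used stack.
}

int main() {
    victim();
    return 0;
}
```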
HP-UX on PA-RISC had an upward-growing stack. In practice, various exploits were developed which adapted to the changed direction of the stack.
One source from a few mins of searching: https://phrack.org/issues/58/11
Linux on PA-RISC also has an upward-growing stack (AFAIK, it's the only architecture Linux has ever had an upward-growing stack on; it's certainly the only currently-supported one).
Both this and parent comment about PA-RISC are very interesting.
As noted, stack growing up doesn't prevent all stack overflows, but it makes it less trivially easy to overwrite a return address. Bounded strings also made it less trivially easy to create string buffer overflows.
Yeah, my assumption is that all the PA-RISC operating systems did, but I only know about HP-UX for certain.
The PL/I stack growing up rather than down reduced potential impact of stack overflows in Multics (and PL/I already had better memory safety, with bounded strings, etc.) TFA's author would probably have appreciated the segmented memory architecture as well.
There is no reason why the C/C++ stack can't grow up rather than down. On paged hardware, both the stack and heap could (and probably should) grow up. "C's stack should grow up", one might say.
> There is no reason why the C/C++ stack can't grow up rather than down.
Historical accident. Imagine if the PDP-7/PDP-11 had easily allowed for the following memory layout:
Things could have turned out very differently than they have. Oh well.
Is there anything stopping us from doing this today on modern hardware? Why do we grow the stack down?
The x86-64 CALL instruction decrements the stack pointer to push the return address, and x86-64 PUSH instructions decrement the stack pointer as well. The PUSH instructions are easy to work around, because most compilers already allocate the entire stack frame with a single stack-pointer adjustment and then do offset accesses, but the CALL instruction would be kind of annoying.
ARM does not suffer from that problem due to the usage of link registers and generic pre/post-modify. RISC-V is probably also safe, but I have not looked specifically.
> [x86] call instruction would be kind of annoying
I wonder what the best way to do it (on current x86) would be. The stupid simple way might be to adjust SP before the call instruction, and that seems to me like something that would be relatively efficient (simple addition instruction, issued very early).
Some architectures had CALL that was just "STR [SP], IP" without anything else, and it was up to the called procedure to adjust the stack pointer further to allocate for its local variables and the return slot for further calls. The RET instruction would still normally take an immediate (just as e.g. x86/x64's RET does) and additionally adjust the stack pointer by its value (either before or after loading the return address from the tip of the stack).
Nothing stops you from having upward growing stacks in RISC-V, for example, as there are no dedicated stack instructions.
Instead of
Do:
For modern systems, stack buffer overflow bugs haven't been great to exploit for a while. You need at least a stack cookie leak, and on Apple Silicon the return addresses are MACed, so overwriting them is a fool's errand (2^-16 chance of success).
Most exploitable memory corruption bugs are heap buffer overflows.
It's still fairly easy to exploit buffer overflows if the stack grows upward
In ARMv4/v5 (non-thumb-mode) stack is purely a convention that hardware does not enforce. Nobody forces you to use r13 as the stack pointer or to make the stack descending. You can prototype your approach trivially with small changes to gcc and linux kernel. As this is a standard architectural feature, qemu and the like will support emulating this. And it would run fine on real hardware too. I'd read the paper you publish based on this.
ARMv8/VMSAv8-64 has huge page support, with an optional contiguous bit allowing mapping of up to 16GB at a time [0] [1]. Which will result in (almost) no address translations on any practical amount of memory available today.
Likely the issue is a combination of most user systems not configuring huge pages and developers not being keen on using things they can't test locally. Huge pages are prominent, though, in the single-app server and game console spaces.
- [0] https://docs.kernel.org/arch/arm64/hugetlbpage.html - [1] https://developer.arm.com/documentation/ddi0487/latest (section D8.7.1 at the time of writing)
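For a feel of the user-space side, a minimal Linux-specific sketch (assuming huge pages have already been reserved, e.g. via /proc/sys/vm/nr_hugepages; otherwise the mmap simply fails):

```cpp
#include <sys/mman.h>   // mmap, munmap, MAP_HUGETLB (Linux-specific)
#include <cstddef>
#include <cstdio>
#include <cstdlib>

int main() {
    // 64 MiB backed by default-sized huge pages (2 MiB on typical x86-64;
    // MAP_HUGE_* flags from <linux/mman.h> can select other sizes).
    const std::size_t len = 64u << 20;

    void *p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        std::perror("mmap(MAP_HUGETLB)");  // usually: no huge pages reserved
        return EXIT_FAILURE;
    }
    std::printf("huge-page-backed mapping at %p\n", p);
    munmap(p, len);
    return 0;
}
```

Each 2 MiB huge page covers 512 ordinary 4 KiB pages with a single TLB entry, which is where the reduction in translation work comes from.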
You still need address translations, they’re just coming out of the TLB most of the time.
What about an architecture, where there are pages and access permissions, but no translation (virtual address is always equal to physical)? fork() would become impossible, but Windows is fine without it anyway.
You are describing a memory protection unit (MPU). Those are common in low-resource contexts that are too simple to afford a full memory management unit (MMU). The problem with scaling that up, especially in general-purpose environments with dynamic process creation, is fragmentation of the shared address space.
You need a contiguous chunk for whatever object you are allocating. Other allocations fragment the address space, so there might be adequate space in total, but no individual contiguous chunk is large enough. You need to move around the backing storage, but then that makes your linear addresses non-stable. You solve that by adding an indirection layer mapping your "address", which is really a key/ID, to the backing storage. At that point you are basically back to an MMU.
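A toy sketch of that indirection layer, with entirely made-up names: the "address" the program holds is really just a key, every access pays a table lookup, and relocating an object means updating the table, which is more or less the job an MMU's page tables already do:

```cpp
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <unordered_map>

using Handle = std::uint64_t;   // what the program holds instead of a raw pointer

struct ObjectTable {
    std::unordered_map<Handle, void *> slots;  // the indirection layer itself
    Handle next = 1;

    Handle allocate(std::size_t size) {
        Handle h = next++;
        slots[h] = std::malloc(size);          // backing storage can live anywhere
        return h;
    }
    void *resolve(Handle h) { return slots.at(h); }       // paid on every access
    void relocate(Handle h, void *new_home) { slots[h] = new_home; }  // compaction
};

int main() {
    ObjectTable table;
    Handle obj = table.allocate(64);
    std::printf("object %llu currently lives at %p\n",
                static_cast<unsigned long long>(obj), table.resolve(obj));
    std::free(table.resolve(obj));
    return 0;
}
```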
Or you run everything with a compacting GC.
CHERI is undeniably on the rise. Adapting existing code generally only requires rewriting less than 1% of the codebase. It offers speedups for existing as well as new languages (designed with the hardware in mind). I expect to see it everywhere in about a decade.
There's a big 0->1 jump required for it to actually be used by 99% of consumers -- x86 and ARM have to both make a pretty fundamental shift. Do you see that happening? I don't, really.
Tbh I can imagine this catching on if one of the big cloud providers endorses it. Including hardware support in a future version of AWS Graviton, or Azure cloud with a bunch of foundational software already developed to work with it. If one of those hyperscalers puts in the work, it could get to the point where you can launch a simple container running Postgres or whatever, with the full stack adapted to work with CHERI.
CHERI on its own does not fix many of the side-channels, which would need something like "BLACKOUT: Data-Oblivious Computation with Blinded Capabilities", but as I understand it, there is no consensus/infra on how to do efficient capability revocation (potentially in hardware), see https://lwn.net/Articles/1039395/.
On top of that, as I understand it, CHERI has no widespread concept of how to allow disabling/separation of workloads for ultra-low-latency/high-throughput applications in mixed-criticality systems in practice. The only system I'm aware of with practical timing guarantees that also allows virtualization is seL4, but again there are no practical guides with trade-offs in numbers yet.
Interesting, what causes the speedup?
You can skip some bounds checks and then get 50% slower because the hardware is not very powerful
We’re all using the pointer math functions in Rust and testing it with miri, right? Right?
> Why do we even have linear physical and virtual addresses in the first place, when pretty much everything today is object-oriented?
But what happens when the in-memory size of objects approaches 2⁶⁴? How to even map such a thing without multi-level page tables?
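Back-of-the-envelope, assuming the usual 4 KiB pages and 512-entry tables (and noting that real hardware currently caps virtual addresses at 48 or 57 bits rather than the full 64):

```cpp
#include <cstdio>
#include <initializer_list>

int main() {
    constexpr int page_bits  = 12;  // 4 KiB pages
    constexpr int index_bits = 9;   // 512 entries of 8 bytes per table level

    for (int va_bits : {48, 57, 64}) {
        int levels = (va_bits - page_bits + index_bits - 1) / index_bits;  // ceil
        std::printf("%2d-bit VA -> %d page-table levels\n", va_bits, levels);
    }
    // 48 -> 4 levels, 57 -> 5 levels, a full 64 -> 6 levels: the closer an
    // object gets to the size of the whole address space, the deeper every
    // TLB miss has to walk.
    return 0;
}
```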
What field do you work in that you’re mapping objects of size 2^{63}? Databases? When I see anything that size it’s a bug.
Regions, like [0], for example? Multi-level page tables kinda suck.
[0] https://web.archive.org/web/20250321211345/https://www.secur...
16 bit programming kinda sucked. I caught the tail end of it but my first project was using Win32s so I just had to cherry-pick what I wanted to work on to avoid having to learn it at all. I was fortunate that a Hype Train with a particularly long track was about to leave the station and it was 32 bit. But everyone I worked with or around would wax poetic about what a pain in the ass 16 bit was.
Meanwhile though, the PC memory model really did sort of want memory to be divided into at least a couple of classes and we had to jump through a lot of hoops to deal with that era. Even if I wasn't coding in 16 bit I was still consuming 16 bit games with boot disks.
I was recently noodling around with a retrocoding setup. I have to admit that I did grin a silly grin when I found a set of compile flags for a DOS compiler that caused sizeof(void far*) to return 6 - the first time I'd ever seen it return a non power of two in my life.
I believe Multics allowed multiple segments to be laid out contiguously. When you overflowed the offset, you got into the next object/segment.
“Why don’t we do $thing_that_decisively_failed instead of $thing_that_evolved_to_beat_all_other_approaches?” Usually this sort of question comes from a lack of understanding of the history of the failure of the first and the success of the second.
The fence principle always applies: "don't tear down a fence till you understand why it was built"
Linear address spaces allow for how computers actually operate - layers. Objects are hard to deal with by layers that don't know about them. Bytes aren't. They are just bytes. How do you page out "an object"? Do I now need to solve the knapsack problem to efficiently tile them on disk based on their most recent use time and size? …1000 other things…
IIRC Multics (among other systems) had both segmentation and paging, and a unified memory/storage architecture.
[I had thought that Multics' "ls" command abbreviation stood for "list segments" but the full name of the command seems to have been just "list". Sadly Unix/Linux didn't retain the dual full name (list, copy, move...) + abbreviated name (ls, cp, mv...) for common commands, using abbreviated names exclusively.]
Correct. 'Segments' are the Proper Unit to think about a bag of bits doing a computation; pages sitting under segments is how VM systems worked to only load the active parts of segments; that's kind of what defined Demand Paging. We have got to get back to the garden here: we need to start valuing security and safety at least as much as raw speed. https://en.wikipedia.org/wiki/Multics?useskin=vector Tagged memory, Capabilities (highly granular), and, yes, segments. Probably needs a marketing refresh (renaming) so as to not be immediately discarded.
> Like mandatory seat belts, some people argue that there would be no need for CHERI if everyone "just used type-safe languages"[...] I'm not having any of it.
I wish the author had offered a more detailed refutation than "I'm not having it". I'm pretty sure the claim is right! I'm fairly convinced that we'd be a lot better off moving to ring0-only linear-memory architectures and relying on abstraction-theoretic security ("langsec") rather than fattening up the hardware with random whack-a-mole mitigations. We're gradually moving in that direction anyway without much of a concerted effort.
Did Multics solve this in any way?
An open secret in our field is: the current market leading OSes and (to some extent) system architectures are antiquated and sub-optimal at their foundation due to backward compatibility requirements.
If we started green field today and managed to mitigate second system syndrome, we could design something faster, safer, overall simpler, and easier to program.
Every decent engineer and CS person knows this. But it’s unlikely for two reasons.
One is that doing it while avoiding second system syndrome takes teams with a huge amount of both expertise and discipline. That includes the discipline to be ruthless about exterminating complexity and saying no. That’s institutionally hard.
The second is that there isn’t strong demand. What we have is good enough for what most of the market wants, and right now all the demand for new architecture work is in the GPU/NPU/TPU space for AI. Nobody is interested in messing with the foundation when all the action is there. The CPU in that world is just a job manager for the AI tensor math machine.
Quantum computing will be similar. QC will be controlled by conventional machines, making the latter boring.
We may be past the window where rethinking architectural choices is possible. If you told me we still had Unix in 2000 years I would consider it plausible.
Aerospace, automotive, and medical devices represent a strong demand. They sometimes use and run really interesting stuff, due to the lack of such a strong backwards-compatibility demand, and a very high cost of software malfunction. Your onboard engine control system can run an OS based on seL4 with software written using Ada SPARK, or something. Nobody would bat an eye, nobody needs to run 20-years-old third-party software on it.
I don’t think these devices represent a demand in the same way at all. Secure boot firmware is another “demand” here that’s not really a demand.
All of these things, generally speaking, run unified, trusted applications, so there is no need for dynamic address space protection mechanisms or “OS level” safety. These systems can easily ban dynamic allocation, statically precompute all input sizes, and given enough effort, can mostly be statically proven given the constrained input and output space.
Or, to make this thesis more concise: I believe that OS and architecture level memory safety (object model addressing, CHERI, pointer tagging, etc.) is only necessary when the application space is not constrained. Once the application space is fully constrained you are better off fixing the application (SPARK is actually a great example in this direction).
Mobile phones are the demand and where we see the research and development happening. They’re walled off enough to be able to throw away some backwards compatibility and cross-compatibility, but still demand the ability to run multiple applications which are not statically analyzed and are untrusted by default. And indeed, this is where we see object store style / address space unflattening mitigations like pointer tagging come into play.
> we could design something faster, safer, overall simpler, and easier to program
I remain doubtful about this for general-purpose computing: hardware for low latency/high throughput is at odds with full security (the absence of observable side-channels). Optimal latency/throughput requires time-constrained, i.e. hardware-level, programming with FPGAs or building custom hardware (high cost), usually programmed on dedicated hardware/software or via things like system-bypass solutions. And simplicity is at odds with generality, see weak/strong formal systems vs strong/weak semantics.
If you factor those compromises in, then you'll end up with the current state plus historical mistakes like the missing vertical integration of the software stack above kernel space as the TCB, bad APIs due to missing formalization, CHERI with its current shortcomings, etc.
I do expect things to change once security with a mandatory security processor becomes more of a requirement, leading to multi-CPU solutions and the potential for developers to use both complex and simple CPUs on the same system, meaning roughly time-accurate virtual and/or real ones.
> The second is that there isn’t strong demand.
This is not true for virtualization and security use cases, but that is not obvious yet due to the lack of widespread attacks so far; see the side-channel leaks of cloud solutions. Take a look at the growth of hardware security module vendors.
> That includes the discipline to be ruthless about exterminating complexity and saying no. That’s institutionally hard.
You need to make a product that out-performs your competitors. If their chip is faster then your work will be ignored regardless of how pure you managed to keep it.
> We may be past the window where rethinking architectural choices is possible.
I think your presumption that our architectures are extremely sub-optimal is wrong. They're exceptionally optimized. Just spend some time thinking about branch prediction and register renaming. It's a steep cliff for any new entrant. You not only have to produce something novel and worthwhile but you have to incorporate decades of deep knowledge into the core of your product, and you have to do all of that without introducing any hardware bugs.
You stand on the shoulders of giants and complain about the style of their footwear.
That’s another reason current designs are probably locked in. It’s called being stuck at a local maximum.
I’m not saying what we have is bad, just that the benefit of hindsight reveals some things.
Computing is tougher than other areas of engineering when it comes to greenfielding due to the extreme interlocking lock-in effects that emerge from things like instruction set and API compatibility. It’s easier to greenfield, say, an engine or an aircraft design, since doing so does not break compatibility with everything. If aviation were like computing, coffee mugs from propeller aircraft would fail to hold coffee (or even be mugs) on a jet aircraft.
Aviation does have a lot of backwards compatibility problems. It's one reason Boeing kept revving the 737 to make the Max version. The constraints come from things like training, certification, runway length, fuel mixes, radio protocols, regulations...
> something faster
How true is this, really? When does the OS kernel take up more than a percent or so of a machine's resources nowadays? I think the problem is that there is so little juice there to squeeze that it's not worth the huge effort.
Look behind the curtains, and the space for improvement over the UNIX model is enormous. Our only saving grace is that computers have gotten ridiculously fast.
The problem isn’t direct overhead. The problem is shit APIs like blocking I/O that we constantly have to work around via heroic extensions like io_uring, an inefficient threading model that forces every app to roll its own scheduler (async etc.), lack of OS level support for advanced memory management which would be faster than doing it in user space, etc.
The thing about AI though is that it has indirect effects down the line. E.g. as prevalence of AI-generated code increases, I would argue that we'll need more guardrails both in development (to ground the model) and at runtime (to ensure that when it still fails, the outcome is not catastrophic).
This is like saying generic systems are bad because both you and a hacker can make sane assumptions about them, so even if a system is more performant/usable it's also more vulnerable and hence shouldn't be used.
I don't understand this.
I have seen bad takes but this one takes the cake. Brilliant start to 2026...
How does object store hardware work? Doesn't it still require a cache?
Any papers on modern object store architectures (is that the right terminology?)
> Because the attempts at segmented or object-oriented address spaces failed miserably.
where, what, evidence of this please...
More advocacy propaganda for corporate authoritarianism under the guise of "safety". Locked-down systems like he describes fortunately died out long ago, but they are making a vicious comeback and will take over unless we fight it as much as we can.
Whether a system is locked down is not a technology issue; it's about who has the key. You wouldn't be using MS-DOS today. Having more control over what applications are up to would be beneficial for the user. The modern multitasking systems have their origin in the time-sharing systems (which are exactly the locked-down ones) where security was "protect the admin's authority, protect users from each other" and hence "what the application does is by definition authorized by the user that started the application". Then we started adding some "protect user data from the programs" safeguards, but on desktop it was always an afterthought, and on mobile the new security model is "protect the platform vendor's authority from the user". Sadly, a new API designed around "protect programs from each other, enforce the user's authority" never materialized.
But all of this is about I/O. What OP is talking about is the memory model, and the change they propose is not about "don't let the unauthorized ones do things" but rather "make it harder for a confused deputy to do things". This one is pretty uncontroversial in its intent, though I personally don't really agree with the approach.