Hash tables in Go and advantage of self-hosted compilers

(rushter.com)

56 points | by f311a 6 days ago

37 comments

  • pdpi 7 hours ago

    It's worth noting that the "self-hosted compiler" thing here is a red herring.

    E.g. the JVM is a C++ project, but you can easily read the HashMap implementation, because it's part of the standard library, not part of the runtime.

    • vips7L 3 hours ago

      FWIW javac is self-hosted.

  • j1elo 4 hours ago

    Interesting how the Go team is the utmost example of thinking through and bikeshedding ad infinitum even the tiniest angles of each proposal (something that I like a lot by the way), which is part of the reason that popular feature requests take years to come, and others such as the `Set` type are binned because of not providing enough added value.

    But an implementation change that will for sure balloon the memory usage of everybody's code making heavy use of Hashmap-as-set (a popular idiom)? Yeah no problem, change shipped.

    • avianlyric 3 hours ago

      There’s a big difference between a change which modifies the language's API, and one that just modifies the implementation of the API.

      Given Go's compatibility guarantee, any mistake in the design of a language API has to be preserved forever, and is very difficult to improve.

      But implementations of the Go spec and language APIs are much easier to evolve. There’s nothing preventing the Go team from rolling out future improvements to deal with this issue, without having to worry about long term consequences. There’s also nothing preventing other implementations of the Go spec from choosing a different approach.

    • voidfunc 4 hours ago

      The Go team has a lot of old school nerd cred, that's why it gets away with a lot of stupid shit. Then a fan base of nerd hero worshippers beats down any discussion about doing things a better way with: SIMPLICITY.

      It's frustrating and I say this as someone who has been writing Go for around a decade.

      • yomismoaqui 2 hours ago

        Go is the worst programming language except for all those others that I have tried from time to time

    • 9rx 4 hours ago

      It's called marketing. If Go quietly made something perfect, nobody would know of its existence. Do stupid things that get people talking and everyone soon learns about you.

  • ncruces 5 hours ago

    Issue tracking this “regression”: https://github.com/golang/go/issues/71368

  • cabirum 8 hours ago

    > Using empty structs also hurts readability

    An empty struct is idiomatic and expected to be used in a Set type. When/if the memory optimization is reintroduced, no code change will be needed to take advantage of it.

    • tym0 8 hours ago

      Using a bool instead of an empty struct also means that there are more ways to use it wrong: checking the bool instead of whether the key exists, setting the bool incorrectly, etc.

      I would argue using bool hurts readability more.

      Even better, write/use a simple library that calls things that are sets `Set`.
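
      A minimal sketch of what such a wrapper might look like. The `Set` type and its `Add`, `Has`, and `Remove` methods are hypothetical names chosen for illustration, not an existing library; the point is only that callers never see the placeholder value, so the "check the bool instead of the key" mistake cannot happen.

```go
package main

import "fmt"

// Set is a hypothetical generic set built on the map[T]struct{} idiom.
type Set[T comparable] map[T]struct{}

// Add inserts v into the set; adding an existing element is a no-op.
func (s Set[T]) Add(v T) { s[v] = struct{}{} }

// Has reports whether v is in the set. Only key presence matters,
// so there is no stored value that could be checked by mistake.
func (s Set[T]) Has(v T) bool {
	_, ok := s[v]
	return ok
}

// Remove deletes v from the set (a no-op if v is absent).
func (s Set[T]) Remove(v T) { delete(s, v) }

func main() {
	seen := Set[string]{}
	seen.Add("go")
	fmt.Println(seen.Has("go"), seen.Has("rust"), len(seen)) // prints: true false 1
}
```

      If the empty-struct memory optimization comes back, only the underlying map type benefits; code using the wrapper would not need to change.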

    • ioanaci 8 hours ago

      I also feel like map[T]struct{} communicates its purpose way better than map[T]bool. When I see a bool I expect it to represent a bit of information, I don't see why using it as a placeholder for "nothing" would be more readable than a type that can literally store nothing.

    • rplnt 3 hours ago

      Isn't it empty interface that's idiomatic? Or was anyway?

      edit: I may be wrong here

  • nickcw 7 hours ago

    I wonder if the compiler really needs to allocate 1 byte so you can get the address of the struct {}

    In the general case then yes, but here you can't take addresses of dictionary values (the compiler won't let you) so adding 1 byte to make a unique pointer for the struct {} shouldn't be necessary.

    Unless it is used in the implementation of the map I suppose.

    So I conjecture a bit of internal magic could fix this.
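
    A small sketch of the padding in question, under the assumption that the extra byte comes from the gc toolchain's rule of padding a struct whose last field has zero size. The three struct types below are hypothetical layouts for illustration only, not the runtime's actual map slot types.

```go
package main

import (
	"fmt"
	"unsafe"
)

// Hypothetical key/value layouts; not the runtime's real slot types.
type trailingEmpty struct {
	key int64
	val struct{} // zero-size field in last position
}

type leadingEmpty struct {
	val struct{} // zero-size field in first position
	key int64
}

type boolSlot struct {
	key int64
	val bool
}

// sizes reports the size of an empty struct and of each layout above.
func sizes() (empty, leading, trailing, withBool uintptr) {
	return unsafe.Sizeof(struct{}{}),
		unsafe.Sizeof(leadingEmpty{}),
		unsafe.Sizeof(trailingEmpty{}),
		unsafe.Sizeof(boolSlot{})
}

func main() {
	empty, leading, trailing, withBool := sizes()
	// With the gc toolchain, the trailing-empty layout is expected to be
	// as large as the bool layout, while the leading-empty one is smaller.
	fmt.Println(empty, leading, trailing, withBool)
}
```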

    • occamrazor 6 hours ago

      I’m curious, what was the rationale for forbidding it?

      • kbolino 6 hours ago

        I interpret this as asking "why can't you get the address of a value in a map?"

        There are two reasons, and we could also ask "why can't you get the address of a key in a map?"

        The first reason is flexibility in implementation. Maps are fairly opaque, their implementation details are some of the least exposed in the language (see also: channels), and this is done on purpose to discourage users of the language from mucking with the internals and thus making it harder for the developers of the language to change them. Denying access to internal pointers makes it a lot easier to change the implementation of a map.

        The second reason is that most ways of implementing a map move the value around copiously. Supposing you could get a pointer p := &m[k] for some map m and key k, what would it even point to? Just the value position of a slot in a hash table. If you do delete(m, k) now what does it point to? If you assign m[k2] but hash(k2) == hash(k) and the map handles the collision by picking a new slot for k, now what does it point to? And eventually you may assign so many keys that the old hash table is too small and so a new one somewhere else in memory has to be allocated, leaving the pointer dangling.

        While the above also applies to pointers-to-keys, there is another reason you can't get one of those: if you mutated the key, you would (with high probability) violate the core invariant of a hash table, namely that the slot for an entry is determined exactly by the hash of its key. The exact consequences of violating this would depend on the specific implementation, but they are mostly quite bad.

        For comparison, Rust, with its strong control over mutability and lifetimes, can give you safe references to the entries of a HashMap in a way Go cannot.
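
        As a sketch of how Go code usually works within this restriction: either copy the value out, change it, and store it back, or keep pointers as the map's values so the pointed-to data lives outside the table. The `counter` type and the `bump` and `bumpPtr` helpers are hypothetical names for illustration.

```go
package main

import "fmt"

type counter struct{ hits int }

// bump updates a struct stored by value: copy out, modify, store back.
// Neither m[k].hits++ nor p := &m[k] compiles, because map elements
// are not addressable.
func bump(m map[string]counter, k string) {
	c := m[k] // copy of the element (zero value if k is absent)
	c.hits++
	m[k] = c
}

// bumpPtr updates through a pointer stored in the map. The pointer targets
// a separate allocation, so it stays valid when the table grows or rehashes.
func bumpPtr(m map[string]*counter, k string) {
	c, ok := m[k]
	if !ok {
		c = &counter{}
		m[k] = c
	}
	c.hits++
}

func main() {
	byValue := map[string]counter{}
	bump(byValue, "home")
	bump(byValue, "home")

	byPtr := map[string]*counter{}
	bumpPtr(byPtr, "home")
	p := byPtr["home"] // points at the counter, not into the map's storage
	bumpPtr(byPtr, "home")

	fmt.Println(byValue["home"].hits, p.hits) // prints: 2 2
}
```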

        • hiddendoom45 5 hours ago

          I was burnt by the mutability of keys in Go maps a few months ago. I'm not sure exactly how Go handles it internally, but it ended up with the map growing and duplicate keys in the key list when I looked at it with a debugger.

          The footgun was that url.QueryUnescape returned a slice of the original string if nothing needed to be unescaped, so if the original string was modified, it would modify the key in the map if you put the returned slice directly into the map.

          • arccy 2 hours ago

            That just means fiber is a bad library that abuses unsafe, resulting in real bugs.

          • ncruces 5 hours ago

            Just how are you modifying strings? Cause that's your bug to fix.

            • hiddendoom45 5 hours ago

              That was probably done by fiber[1]. The code took the param inside the handler function passed to Get(path string, handlers ...Handler) Router, where c is the *fiber.Ctx passed by fiber to the handler. My code took the string from c.Param("name"), passed it to url.QueryUnescape, then to another function which had a mutex around setting the key/value in the map. I got the hint that it was slices and something modifying the keys when I found truncated keys in the key list.

              My guess is fiber used the same string for the param to avoid allocations. The fix for it is just to create a copy of the string with strings.Clone() to ensure it does not get mutated when it is used as a key. I understand it was an issue with my code, it just wasn't something I expected to be the case so it took several hours and using the debugger to find the root cause. Probably didn't help that a lot of the code was generated by Grok-4-Code/Sonic as a vibe coding test when I decided to go back a few months later and try and fix some of the issues I had myself.

              [1] https://github.com/gofiber/fiber
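
              A rough sketch of the aliasing problem and the strings.Clone fix described above. It assumes the string was built over a reused byte buffer via unsafe, which this thread suggests but does not confirm; the `demo` function and the buffer contents are made up for illustration. Overwriting memory behind a string breaks Go's immutability assumption, which is exactly why the aliased string is unsafe to use as a map key.

```go
package main

import (
	"fmt"
	"strings"
	"unsafe"
)

// demo builds a string that shares memory with a byte buffer, clones it,
// then overwrites the buffer the way a buffer-reusing library might.
func demo() (aliased, cloned string) {
	buf := []byte("alice")

	// Assumption for illustration: a zero-copy string over buf.
	aliased = unsafe.String(&buf[0], len(buf))

	// strings.Clone copies the bytes into a fresh allocation, so the
	// result no longer depends on buf. This is the value to use as a key.
	cloned = strings.Clone(aliased)

	copy(buf, "bobby") // simulate the buffer being reused for another request
	return aliased, cloned
}

func main() {
	aliased, cloned := demo()
	fmt.Printf("aliased=%q cloned=%q\n", aliased, cloned)
}
```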

          • kbolino 5 hours ago

            This sounds like a bug, whether it be in your code, the map implementation, or even the debugger. Map keys are not mutable, and neither are strings.

            • hiddendoom45 4 hours ago

              This shouldn't be a race condition: reads were done by taking an RLock() on a mutex in a struct with the map, with a defer RUnlock(), and writes were similar, where a Lock() was taken on the same mutex with a defer Unlock(). All these functions did was get/set values in the map, and they operated on a struct with just a mutex and the map. Unless I have a fundamental misunderstanding of how to use mutexes to avoid race conditions, this shouldn't have been the case. This also feels a lot like an LLM response with the Hypotheses section.

              edit: this part below was originally a part of the comment I'm replying to

              Hypotheses: you were modifying the map in another goroutine (do not share maps between goroutines unless they all treat it as read-only), the map implementation had some short-circuit logic for strings which was broken (file a bug report/it's probably already fixed), the debugger paused execution at an unsafe location (e.g. in the middle of non-user code), or the debugger incorrectly interpreted the contents of the map.
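
              For reference, the locking pattern described at the top of this comment might look like the sketch below. The `store` type and its `Get` and `Set` methods are hypothetical names, not the original code. Used this way, sync.RWMutex does rule out data races on the map itself, though it cannot protect against key bytes being overwritten through some other alias.

```go
package main

import (
	"fmt"
	"sync"
)

// store is a hypothetical wrapper: just a mutex and the map it guards.
type store struct {
	mu sync.RWMutex
	m  map[string]string
}

func newStore() *store { return &store{m: make(map[string]string)} }

// Get takes the read lock, so many readers may run concurrently.
func (s *store) Get(k string) (string, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	v, ok := s.m[k]
	return v, ok
}

// Set takes the write lock, excluding all readers and other writers.
func (s *store) Set(k, v string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.m[k] = v
}

func main() {
	s := newStore()
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			s.Set(fmt.Sprintf("k%d", i), "v")
			s.Get("k0")
		}(i)
	}
	wg.Wait()
	_, ok := s.Get("k7")
	fmt.Println(ok) // prints: true
}
```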

  • Hendrikto 8 hours ago

    > Another takeaway here, as always, is not to trust everything LLMs say.

    I would go even farther and say to not trust anything they say. Always be skeptical, always verify.

    • nasretdinov 8 hours ago

      Applies to humans as well :)

      • rplnt 3 hours ago

        Not at all. With humans you can have some expectations based on context and expertise. They are also far less likely to make up extremely specific details.

        • nasretdinov an hour ago

          Sure. I was agreeing with the conclusion though, where you should aim to verify what you hear from other humans, no matter how confident they sound. Been burned by that a few times by blindly trusting some statements from some respected people only for it to blow up in production because they were wrong :).

      • lenkite 7 hours ago

        There are many humans who are far more reliable than LLMs on a 99.9999% win streak.

        • rat9988 5 hours ago

          Yes, now generalize the theorem to any human to make it usable on a daily basis.

        • 9rx 5 hours ago

          That is, strangely, until those humans turn to a topic I know something about. Then their reliability drops like a hot potato. At least they get everything else right!

  • gethly 7 hours ago

    An empty struct is good for representing non-nil zero-length information; for example, this is ideal for many use cases where channels are involved. Or if you have an HTTP route and you want to return an empty response (200 OK or 204 No Content, instead of an error).

    A boolean, on the other hand, inherently carries information: either true or false, i.e. there will always be information and it will always be one of two values.

    This is similar to *struct{}, where we can signal no information, or false, by returning/passing nil, or an initialized pointer to an empty struct as true/value present.

    For maps, bool makes more sense, as otherwise we just want a list with fast access to determine whether a value exists in the list or not, which is often something we might want. But it should not detract from the fact that each type has its own place, and just because the new implementation of maps ignores this in this particular use case does not make them worse than the previous version.

    tl;dr it is good to know this fact about the new swiss maps, but it should not have any impact on programming and design decisions whatsoever.
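
    A minimal sketch of the channel use case mentioned at the top of this comment, where the event itself is the message and there is no payload. The `waitForWorker` function is a hypothetical name for illustration.

```go
package main

import "fmt"

// waitForWorker starts a goroutine and blocks until it signals completion
// by closing a chan struct{}. Nothing is ever sent on the channel.
func waitForWorker() string {
	done := make(chan struct{})
	var result string

	go func() {
		result = "finished"
		close(done) // the close is the signal
	}()

	<-done // returns once done is closed, so the write to result is visible
	return result
}

func main() {
	fmt.Println(waitForWorker()) // prints: finished
}
```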

  • yosefk 4 hours ago

    Rust HashSets are HashMaps with an empty type as the value type, but the compiler actually optimizes away the storage for the values based on the type being empty. Go doesn't bother to either define a set type like most languages do, or to optimize the map implementation with an empty type as the value type.

  • andunie 8 hours ago

    So what is this article about?

    1. How to do sets in Go?

    2. What changed between Go 1.24 and 1.25?

    3. Trusting an LLM?

    4. Self-hosted compilers?

    It is not clear at all. Also there are no conclusions, it's purely a waste of time, basically the story of a guy figuring out for no reason that the way maps are implemented has changed in Go.

    And the title is about self-hosted compilers, whose "advantage" turned out to be just that the guy was able to read the code? How is that an advantage? I guess it is an advantage for him.

    The TypeScript compiler is also written in Go instead of in TypeScript. So this shouldn't be an advantage? But this guy likes to read Go, so it would also be an advantage to him.

    • bxparks 4 hours ago

      I agree that the article is a bit unfocused about the supporting material. But the primary topic is clear: it's about the memory consumption of the Go map implementation.

      This is an article written by a real human person, who's going to meander a bit. I prefer that over an LLM article which is 100% focused, 100% confident, and 100% wrong. Let's give the human person a little bit of slack.

    • gethly 7 hours ago

      I think it is quite obvious - the author has found out that a memory trick that used to work in previous Go versions no longer works - in this singular use case.