Show HN: 30k IKEA items in flat text

(huggingface.co)

49 points | by tsazan 5 days ago

34 comments

  • reddalo 3 hours ago

    I don't understand why new proposed standards are still polluting the root namespace (also see llms.txt).

    These things should be put under /.well-known [1], not in the root.

    [1] https://en.wikipedia.org/wiki/Well-known_URI

    • buildbuildbuild 3 hours ago

      User friendliness. I’ve seen several less-technical people able to quickly access, create, and understand “llms.txt”.

      It’s not ideal but representative of the tension between user experience and technical correctness.

      • reddalo an hour ago

        >less-technical people able to quickly access

        Why would somebody even want to access that file? It doesn't make any sense to make that more user friendly, it's for LLMs.

    • dkdcio 3 hours ago

      I was not aware you shouldn’t do that — what’s the rationale/historical context?

      • embedding-shape 3 hours ago

        Like most standards: "Because it's a standard". Kind of like setting a .body for a GET request: you can kind of do that, but why not do it the way it's intended instead? Use POST :)

        • gunalx 2 hours ago

          I have seen POST being used instead of GET because of having encrypted parameters by default.

          • ljm an hour ago

            Sending a URL-encoded form or some JSON in a POST request is also easier for most people to understand than the myriad ways you might format a query string in the URL (which may have a stricter limit on size).

            You only have to look at how different services handle arrays in query strings to understand that serialising it is conceptually easier.

            Comes up a lot in search or filter APIs. I'm sure there was some effort many moons ago to create a QUERY method for that.
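
            A minimal sketch of the point about arrays, using only the Python standard library. The `color` filter and its values are hypothetical; the three query-string spellings are conventions commonly seen in the wild, not an exhaustive list.

```python
import json
from urllib.parse import urlencode

colors = ["red", "blue"]  # hypothetical filter values

# Convention 1: repeat the key once per value.
repeated = urlencode({"color": colors}, doseq=True)

# Convention 2: repeat the key with a "[]" suffix (brackets get percent-encoded).
bracketed = urlencode({"color[]": colors}, doseq=True)

# Convention 3: a single comma-joined value (the comma gets percent-encoded).
comma_joined = urlencode({"color": ",".join(colors)})

# A JSON body for a POST request: one unambiguous encoding of the same list.
json_body = json.dumps({"color": colors})

print(repeated)      # color=red&color=blue
print(bracketed)     # color%5B%5D=red&color%5B%5D=blue
print(comma_joined)  # color=red%2Cblue
print(json_body)     # {"color": ["red", "blue"]}
```

            A server has to know which of the three query-string conventions the client picked, whereas the JSON body carries the list structure itself.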

          • embedding-shape 2 hours ago

            Yeah, and also because of firewalls sometimes stripping the body of GET requests (not responses mind you, we're talking requests) to a server, and also because it's really uncommon to put a body on a GET request ;)

  • vachina 3 hours ago

    There’s already a schema.org spec that defines JSON-LD structured data that you can embed on every one of your product pages to provide a machine-readable interface to your product.

    For example, Google’s indexers already use this to surface pricing data. https://developers.google.com/search/docs/appearance/structu...
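
    As a rough illustration of what such embedded data looks like, here is a small sketch that builds a schema.org Product object and wraps it in a JSON-LD script tag. The product values and the `to_jsonld_script` helper are hypothetical; the property names (`name`, `sku`, `offers`, `price`, `priceCurrency`) are schema.org terms.

```python
import json

# Hypothetical product; the field names are schema.org Product/Offer terms.
product = {
    "@context": "https://schema.org",
    "@type": "Product",
    "name": "Example bookcase",
    "sku": "000.000.00",
    "offers": {
        "@type": "Offer",
        "price": "49.99",
        "priceCurrency": "EUR",
    },
}


def to_jsonld_script(data: dict) -> str:
    """Wrap a dict in the script tag used to embed JSON-LD in an HTML page."""
    return '<script type="application/ld+json">' + json.dumps(data) + "</script>"


snippet = to_jsonld_script(product)
print(snippet)
```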

    • tsazan 3 hours ago

      That's valid for search engines. But if JSON-LD were sufficient for agents, Google wouldn't have launched UCP (Universal Commerce Protocol) yesterday.

      • vachina 2 hours ago

        Took a look, UCP looks like it presents an entire shopping lifecycle for agents.

        JSON-LD is just read-only metadata for machines.

        • tsazan 2 hours ago

          True. But extracting that metadata requires parsing the full DOM. CommerceTXT is for efficient discovery. Scan inventory cheaply first, then commit to the transaction.
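
          To make the extraction cost concrete, here is a minimal sketch that pulls JSON-LD out of a page with Python's standard `html.parser`. The `JsonLdExtractor` class and the sample page are hypothetical; the point is only that the whole HTML document has to be fed through a parser before the metadata is available.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of every application/ld+json script tag."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._chunks = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._chunks = []

    def handle_data(self, data):
        # Script content may arrive in several pieces, so buffer it.
        if self._in_jsonld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.items.append(json.loads("".join(self._chunks)))
            self._in_jsonld = False


# Hypothetical product page with one embedded JSON-LD block.
page = """<html><head><title>Example bookcase</title>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product", "name": "Example bookcase",
 "offers": {"@type": "Offer", "price": "49.99", "priceCurrency": "EUR"}}
</script></head><body><p>Lots of markup the agent does not need.</p></body></html>"""

extractor = JsonLdExtractor()
extractor.feed(page)
extractor.close()
print(extractor.items[0]["offers"]["price"])  # 49.99
```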

  • bleonard an hour ago

    A blast from the past. When Taskrabbit was acquired by IKEA, I built several tools that went through the whole catalog via various crawling approaches. One tool was to estimate how long it would take to put each item together for an initial training set.

  • chuckadams 27 minutes ago

    Well of course it would be flat text... ;)

  • btrettel 3 hours ago

    Interesting. I had been thinking recently about grep-friendly structured text file formats given the constraints of regex. But I hadn't considered that you could design a structured text file format to be LLM-friendly given token constraints.

    • tsazan 3 hours ago

      You're right. If a format is easy to grep, it is almost always cheap to tokenize. We treat token density as a primary design constraint.
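
      A small sketch of the grep-friendly idea. The record and the "key: value" layout are hypothetical and are not the actual CommerceTXT syntax; character count is used only as a crude stand-in for token count, which depends on the tokenizer.

```python
import json
import re

# Hypothetical record and a hypothetical "key: value" layout; this is NOT the
# actual CommerceTXT syntax, only an illustration of a line-oriented format.
record = {"name": "Example shelf", "price": "49.99", "currency": "EUR"}

as_json = json.dumps(record)
as_flat = "\n".join(f"{key}: {value}" for key, value in record.items())

# One regex per field is enough when every field sits on its own line.
match = re.search(r"^price: (.+)$", as_flat, flags=re.MULTILINE)
price = match.group(1)

print(price)                        # 49.99
print(len(as_flat) < len(as_json))  # True
```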

  • JosephRedfern 3 hours ago

    I've heard that LLMs can perform worse with these more efficient representations compared to e.g. JSON, because they've seen far fewer examples of them during training. Do you know how true that is?

    • TechSquidTV 3 hours ago

      Absolutely, but usually when working with a bespoke format for optimization, it's paired with an LLM specifically trained on that format.

    • tsazan 3 hours ago

      You are right about cryptic formats. CommerceTXT is semantically structured. Models like GPT, Claude and Gemini understand it out-of-the-box via in-context learning (ICL).

  • sognetic 3 hours ago

    Interesting! So did you do any experiments on a relevant subset of the data to test whether LLM performance degrades by introducing a new, presumably unknown to the LLM, format?

    • tsazan 2 hours ago

      The 24% token savings come from converting JSON syntax to CommerceTXT.

  • croisillon 2 hours ago

    years ago i did a small tool that, when you entered a product number, would scan all IKEA websites that use the euro and return the price on each of them; not that i expected furniture tourism to become a thing but it was funny

    • tsazan 2 hours ago

      Reminds me of a friend who built a comment sentiment analyzer years ago. At the time, it looked like great innovation...

  • colinbartlett 4 hours ago

    Any practical use for this IKEA data specifically?

    Or just a handy open data set you could use to prove out the concept?

    • DennisP 3 hours ago

      I assumed it's because IKEA is famous for flat packing its furniture.

      • tsazan 3 hours ago

        Exactly! IKEA removes the air from the box to save space, CommerceTXT removes the HTML/JSON bloat to save tokens. You made my day!

        • embedding-shape 3 hours ago

          > IKEA removes the air from the box to save space

          Huh? I don't think that's true. There are usually some structural elements inside the package, meant to be thrown away (usually made of cardboard/paper), and all IKEA boxes definitely have lots of air inside them. Not sure what would make you say otherwise, unless it's some joke I'm missing?

          • jayknight 2 hours ago

            A box that contained a fully assembled kitchen table would contain a lot more air. I think that comment just meant IKEA designs items that can be packaged into a minimal volume.

            • embedding-shape 2 hours ago

              Ah yes, on second reading it's actually pretty obvious that is what parent meant and I was reading it too literally. Thanks for the clarification, that's certainly correct :)

    • WildGreenLeave 3 hours ago

      I've had the idea to set up an AI that automatically (re)designs a room using IKEA stuff. It would definitely help me decorate my room in a better way.

      • tsazan 3 hours ago

        That's a great use case. If you ship it, let me know!

  • usefulposter 3 hours ago

    "OP here" is the funniest tell that shows up when using an LLM to write a post for HN or Reddit.

    It's funny because it makes zero sense in the body of an initial post!

    In comments replying to people downthread - maybe. But opening a top-level post with "Original Poster here" is just silly and shows a lack of respect for community etiquette.

    https://hn.algolia.com/?dateRange=pastYear&page=0&prefix=tru...

    • dkoy 3 hours ago

      Good catch, think you’re on to something

    • tokai an hour ago

      I just understand it as lightly humorous. Like starting an anecdote with

      >be me

      Seeing it as a lack of respect is a huge stretch. And kinda conceited that you accuse someone of such, on the basis of a two word opener.