Interesting! Be sure to follow the link to the second post about what happened to the 2nd, 3rd, 22nd, and 23rd. It's simpler but still worth the read:
https://drhagen.com/blog/the-missing-23rd-of-the-month/
This is why one of my principles is to be skeptical of outliers. Often they are not real and therefore misrepresent the true data.
It's one reason the median is often preferred over the mean at the outset, as is throwing out outliers just to see what things look like.
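A quick sketch of why, with made-up numbers:

    from statistics import mean, median

    # One glitchy reading drags the mean far more than the median.
    values = [10, 11, 12, 11, 1100, 12, 11]
    print(mean(values))    # ~166.7, dominated by the outlier
    print(median(values))  # 11, barely notices it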
Similar to Twyman's Law: “Any figure that looks interesting or different is usually wrong.”
https://en.m.wikipedia.org/wiki/Twyman%27s_law
Still, you should be aware of https://en.m.wikipedia.org/wiki/Black_swan_theory
The lesson I took from this is that it is useful and important to dig into how any piece of data was sourced.
This advice is insane. Except in specific settings (where a sensor may be misbehaving, or where a survey respondent clearly just picked random choices), outliers are really just outlying values and should be kept in the analysis, or at most clipped/winsorized. When submitting to a scientific journal, admitting that outliers were removed without first inspecting why they are there can be enough for an instant rejection, and rightly so.
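For concreteness, clipping to percentiles looks something like this (the percentile limits here are arbitrary):

    import numpy as np

    # Winsorizing: pull extreme values in to a chosen percentile
    # rather than deleting them, so every observation still counts.
    x = np.array([10, 11, 12, 11, 1100, 12, 11])
    lo, hi = np.percentile(x, [5, 95])
    x_wins = np.clip(x, lo, hi)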
Twyman's law doesn't state that you should ignore those outliers; it just predicts that they are more likely to be mistakes than genuine.
I like using the Olympic style of scoring, where they lop off the top and bottom scores to account for the cranky or overly lenient judges.
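In code, that style of trimming is just (a sketch; real judging panels have more elaborate rules):

    def olympic_mean(scores):
        # Drop the single highest and lowest score, average the rest.
        # Assumes at least three scores.
        s = sorted(scores)[1:-1]
        return sum(s) / len(s)

    olympic_mean([9.5, 9.6, 9.7, 9.6, 4.0])  # the cranky 4.0 never counts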
The last time this hit HN, my hosting provider complained about the traffic, but I've since migrated the blog to GitHub Pages, so I guess that won't be an issue this time.
You can tell how much they cared about data quality because they never took the time to look at context-dependent glyph equivalencies. And some context-sensitive algorithms might not make the same mistakes as a naive “guess what characters are here” algorithm that just uses glyph shapes. You run into this a LOT with ALPR systems because some of the presses excluded some characters. O and 0 are the most common character equivalency. But only in certain places.
OCR is actually complicated if you’re trying to rely on the data for something.
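To make "only in certain places" concrete, here's a hedged sketch; the three-letters-then-three-digits plate format is hypothetical:

    # On a plate format of three letters then three digits ("ABC123"),
    # an O/0 confusion resolves itself purely by position.
    def disambiguate(plate):
        letters, digits = plate[:3], plate[3:]
        return letters.replace("0", "O") + digits.replace("O", "0")

    disambiguate("OBC12O")  # -> "OBC120": same glyph, two meanings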
I love stuff like this.
However, shouldn't every date with a "1" be less common if that is the case? Why 22 and 23?
I think 11 might be somewhat explained by scanner errors if we assume that, e.g., l2 gets corrected to 12 but ll does not get corrected to 11.
But I guess maybe 2, 3, 11, 22, 23 are less common due to people overcompensating, trying not to pick dates that look non-randomly sampled?
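That asymmetry is easy to reproduce with a naive post-correction pass (a sketch of the hypothesis, not any real OCR engine):

    import re

    # Fix a lone 'l' only when a real digit sits next to it, so 'l2'
    # becomes '12' but 'll' (no digit neighbour) is left alone.
    def fix_ells(s):
        s = re.sub(r"l(?=\d)", "1", s)
        s = re.sub(r"(?<=\d)l", "1", s)
        return s

    fix_ells("October l2")  # -> "October 12"
    fix_ells("October ll")  # -> "October ll"  (stays broken)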
1s in other ordinals were misread, but two 1s next to each other were misread wildly more often than anything else. My current theory, which I only hint at in the last paragraph, was that "nth" was in the OCR dictionary, "nth" is close to "11th" in pixel space, and no other ordinal is that similar to a dictionary word. Therefore, "11th" gets misread most often.
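A hedged sketch of that theory; the canonicalization rules and threshold are invented for illustration, standing in for real pixel-space similarity:

    from difflib import SequenceMatcher

    # Collapse visually confusable sequences to a canonical form, then
    # snap a token to the most similar dictionary word. Under the
    # (hypothetical) rule that 'l'/'1' match and '11' looks like 'n',
    # '11th' collides with the dictionary word 'nth'.
    CANON = [("l", "1"), ("11", "n")]

    def canon(s):
        for a, b in CANON:
            s = s.replace(a, b)
        return s

    def snap(token, dictionary, threshold=0.8):
        ratio = lambda w: SequenceMatcher(None, canon(token), canon(w)).ratio()
        best = max(dictionary, key=ratio)
        return best if ratio(best) >= threshold else token

    snap("11th", {"nth", "tenth", "north"})  # -> "nth"
    snap("12th", {"nth", "tenth", "north"})  # -> "12th": nothing close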
For 2, 3, 22, 23, read the second part: https://drhagen.com/blog/the-missing-23rd-of-the-month/ (short version: people used to sometimes write 2d, 3d, 22d, 23d). This also explains why 12 and 13 don't show the deficit.
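For anyone parsing old text, an ordinal pattern that admits the archaic forms might look like this (a sketch):

    import re

    # Modern parsers often expect only st/nd/rd/th; older legal style
    # also allowed a plain 'd' ("2d", "3d", "22d", "23d").
    ORDINAL = re.compile(r"\b(\d{1,2})(st|nd|rd|th|d)\b")

    ORDINAL.findall("the 2d, the 3d, and the 23rd")
    # -> [('2', 'd'), ('3', 'd'), ('23', 'rd')]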
Well, that was surprisingly fun to read!
tl;dr: It's an OCR error
Or, sometimes, not; one of the more interesting takeaways was typewritten lowercase ells instead of ones: “When the algorithm read October llth, it was far more correct than we have been giving it credit.”
The latent font designer in me balks at the thought of taking a typeface and intentionally making one character look more like another character.
Was it some technical constraint of the typewriter that caused “1” to become more like “l” come the 20th century?
> Was it some technical constraint of the typewriter that caused “1” to become more like “l” come the 20th century?
The typewriter I grew up with simply didn't have a key for it. It also didn't have a 0 or an exclamation mark or a plus sign. There were well known substitutes:
For the number 1, type lowercase letter l.
For the number 0, type uppercase letter o.
For the exclamation mark, type a period, hit backspace, and type an apostrophe / single quote.
For the plus sign, I'm not aware of a good substitute. You could maybe superimpose a slash on a hyphen, but it would look bad.
There was no division sign, and using a slash to denote division was not yet something I'd ever seen anyone do. You could probably have superimposed a hyphen and a colon to get ÷.
Oddly enough, it did have other characters which you won't find on a standard US keyboard today: ¼, ½, and ¢. The cent sign was useful, and it seems logical to me that if you're going to have $ you should have ¢ too!
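Those substitutions are mechanical enough to emulate (a sketch; '\b' is the backspace the exclamation-mark trick relies on):

    # Render text for a typewriter lacking 1, 0, and ! keys, using the
    # substitutes described above: l for 1, O for 0, and a period
    # overstruck with an apostrophe for the exclamation mark.
    def typewriterize(s):
        return (s.replace("1", "l")
                 .replace("0", "O")
                 .replace("!", ".\b'"))

    typewriterize("October 11th, 1901!")  # -> "October llth, l9Ol.\b'"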
In the UK, we also used to type '£' by typing an 'L', backspacing, and overtyping an '='.
Weirdly, I learned to type on a second-hand Selectric typewriter that my parents bought cheaply at some auction, and while it had a '1' key, the only ball we had put punctuation in that position instead, so we still needed to type 'l'.
And since I've mentioned GNU groff elsethread, here's where, in its "ascii" mode, it today uses a minus sign instead of an equals sign to do the very same thing you did on your typewriter, in "Po", the macro for producing a pound symbol.
* https://cgit.git.savannah.gnu.org/cgit/groff.git/tree/tmac/t...
Cents were a useful unit back when a dollar was defined as 1/20th of an ounce of gold (or 1/35 between the Great Depression and 1970).
The font designer in you is being influenced by computer displays and movable type. The world of typewriting had some idiosyncrasies, not even shared with (say) the world of dot-matrix printing, one of which was pressure to make the largest composable range of characters with the smallest number of physical keys. Which means common base shapes.
It's a little known fact that some parts of the computing world are faithfully reproducing this aspect of typewriters, trying to write bullet points, daggers, and currency symbols not with actual Unicode but with a very limited ASCII repertoire and overstriking, even today.
There's a table of this stuff in GNU groff right now, relying upon glyphs being visually close enough that overstriking them produces other glyphs. Here, for example, is how it currently, in its UTF-8-disabled mode (as unfortunately still used by manual page readers on several operating systems), composes a down arrow in the typewriter style, by overstriking a vertical bar with a 'v':
https://cgit.git.savannah.gnu.org/cgit/groff.git/tree/tmac/t...
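The mechanism itself is tiny; a sketch, using the glyph pairs mentioned in this thread:

    # Typewriter-style overstriking: emit glyph, backspace, glyph, so
    # both land in the same cell on a device that physically overprints.
    def overstrike(a, b):
        return a + "\b" + b

    down_arrow = overstrike("|", "v")   # grotty's "ascii"-mode down arrow
    pound      = overstrike("L", "-")   # cf. the 'Po' macro above
    bullet     = overstrike("+", "o")   # the pre-2024 bullet, see below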
Markdown writes bullet points and underlines without even overstriking.
No. It has things that aren't bullet points, and besidelines. Underlines go underneath, and asterisks are not bullet points.
In its UTF-8 mode, for bullet points GNU groff uses the actual Unicode characters that are available. Until 2024, in "ascii" mode it overstruck a plus symbol with a letter 'o', one of many such typewriting tricks, which no-one but those printing manual pages to old printers capable of the same typewriting trick would have ever seen as it was supposed to be seen. On VDUs, such bullet points just came out as the letter 'o'.
Sadly, in 2024 its developers did away with this interesting little-known quirky feature that almost no-one would have seen properly rendered. (-:
* https://savannah.gnu.org/bugs/?56015
Interestingly, I couldn't find a Unicode code point that satisfactorily represented the crossed circle that this would have been on paper. There's a mathematical operator, but it is not semantically correct.
I wonder whether, given the reams of books on Linux-based operating system administration and use, any author or publisher typesetting yet another copy of the manual pages got grotty's overstruck plus-o into print.
Because that would be one rather ironic argument for getting it an assigned Unicode code point.
I suspect that all the books were better typeset than that, though.
I agree with almost everything in your comment except the initial "no", which seems to indicate you misunderstood the comment you were replying to.
Things like `enscript` and `a2ps` can render ASCII overstrikes without the need for old printers. (PostScript and PDF have no trouble representing overstrikes!) Also xterm in Tek4014 mode, but that's a lot less useful.
The output mode of groff that overstrikes also depends on fixed-pitch fonts for proper alignment, which puts strict limits on the quality of the resulting typesetting. You could imagine alternatives that didn't (for example, allocate a character cell the maximum of the advance width of both characters and center them both in it), but groff didn't implement them.
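The alternative described would be something like this (a sketch of the geometry only; widths in font units):

    # Give an overstruck pair a cell as wide as the wider glyph and
    # center each glyph within it, instead of assuming fixed pitch.
    def overstrike_cell(width_a, width_b):
        cell = max(width_a, width_b)
        return cell, (cell - width_a) / 2, (cell - width_b) / 2

    # cell width, x-offset of glyph a, x-offset of glyph b
    overstrike_cell(600, 500)  # -> (600, 0.0, 50.0)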
You misused the word "write". Not that it makes much sense to say that Markdown "writes" anything; the best guess for what is meant when it is said that a markup language "writes" something is that it includes something as markup when one writes in that language, which in this case it definitely does not.
a2ps isn't writing ASCII and overstrikes, either. Obviously so, as its output is PostScript. It's reading ASCII and overstrikes, like pg, most, less, more, and even the humble ul command; and writing something else entirely. Reading stuff that was destined for old typewriter-like printers and turning it into something else is straying quite far from what strogonoff and I were actually talking about. That's a whole other discussion, and how bad less and more (and indeed a2ps) are at reading such stuff, even compared to what ul is capable of, is a lengthy discussion in itself.
Note that I did explicitly point to the part of groff that does the "tty" output post-processing, i.e. grotty, which handles what the groff doco calls "typewriter-like devices". PostScript and PDF and non-fixed-width fonts are the domain of grops and gropdf et al., and again not the typewriters that we were talking about.
In the context of GNU groff, what is written to those "typewriter-like devices" is what is relevant to typewriters: groff (grotty) uses the same idea for the same effect, and relies upon the same assumption that glyph shapes are not crazily different from the norm (which turns out to be a fairly shaky assumption for the middle to late 20th century). With a few judicious glyph designs for things like single quotes, just as typewriters had, it can actually construct a fairly decent subset of Latin-1 and some other bits and bobs using this technique from the typewriter days.
You said:
> in "ascii" mode it overstruck a plus symbol with a letter 'o', one of many such typewriting tricks, which no-one but those printing manual pages to old printers capable of the same typewriting trick would have ever seen
I was pointing out that you don't need an old printer to see them; `enscript` and `a2ps` are perfectly capable of showing them to you. So it is not the case that, as you said,
> Reading stuff that was destined for old typewriter-like printers (...) is straying (...) far from what strogonoff and I were (...) talking about.
The rest of your comment seems to be you getting mad that you found my comments hard to interpret sensibly.
Typewriter keys cost money, and dropping the 1 allowed them to drop a key without significantly affecting the use of it. As far as I can tell, that's effectively the entire rationale.
This wasn't meaningfully the case beforehand; a printing press would just have needed more copies of 'l' if it had dropped its 1s, and type wasn't as significant a portion of the cost of the machine anyway. And afterwards came computers, which need to distinguish between the characters even if they're displayed the same way.
> Typewriter keys cost money
They didn't just cost money. They were competing for the limited space around the typing area, which meant they were constrained to the border of a circle whose interior would be entirely filled with mechanism. In other words, the cost in money, size, and weight grew with roughly the square of the number of keys.
With limited space and resources, I wonder what other letter or number could be dropped and meaning retained. 0 and O might be worth considering?
The technical constraint still applies. My keyboard uses ' ' and '-' to represent many different symbols.
Was it that in prior years a reader could usually distinguish 1 from l by context? Even today, very few things cause me to need to te11 a 1 from a l.
(typo 0n purpose)
It matters when reading code and random strings (what we now call passwords, though back then passwords were things you could pronounce, unlike, say, ywtr466Nh%vX).
It doesn't matter for much else.
Though it did make an interesting plot twist in The Miocene Arrow.
My parents had a typewriter without a 1 or a 0. I always thought it was to provide room for two other valuable characters like the old "cents" c with a bar through it.
Naming an event after its date will have a limited run.