Engineering words for everyone
One interview question I enjoy getting as a mid-career engineer is, what’s the most impressive feat of engineering you’ve seen in your lifetime? I answer without doubt or hesitation. It’s UTF-8.
What do I find so impressive about UTF-8? Let’s go on a historical tour of the ways people convert words into signals for transmission and storage: character encodings.
Diagram showing each letter and numeral with its Morse code representation.
Morse code (1844). Initially assigned numbers to whole words using a codebook, like all of the other telegraph code systems. Switched in 1840 to assigning codes to individual letters and letting the letters spell out words on their own. Estimated letter frequency by looking at how many of each letter there were in typesetting equipment, assigning shorter codes to more common letters for faster transmission. Served as the US standard for over 100 years, emphasizing what a wildly powerful one-shot it was. International Morse, the version we use today, was standardized in 1865 based on German Morse. The OG. ★★★★☆
Photograph of an 1871 Chinese telegraph codebook, open to a table of 200 Chinese characters and their corresponding numeric codes.
Chinese telegraph code (1871). An example of a ridiculously bad encoding. Designed by foreigners. Used the codebook approach to assign each Chinese character to a number from 0000 to 9999. Sent the number using International Morse. Numbers are the slowest characters to transmit in International Morse. Just about maximally slow. ★☆☆☆☆
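To put numbers on “just about maximally slow”: in standard Morse timing a dot lasts one unit, a dash three, with one unit of silence between the elements of a character. A rough back-of-the-envelope sketch (the table and timing helper are mine, not from any telegraph manual):

```python
# Standard Morse timing: dot = 1 unit, dash = 3 units,
# 1 unit of silence between elements within a character.
MORSE = {
    "E": ".", "T": "-", "A": ".-", "N": "-.",
    "0": "-----", "1": ".----", "5": ".....", "9": "----.",
}

def units(code: str) -> int:
    """Duration of one character in dot-units, ignoring inter-character gaps."""
    return sum(3 if symbol == "-" else 1 for symbol in code) + (len(code) - 1)

for char, code in sorted(MORSE.items(), key=lambda kv: units(kv[1])):
    print(f"{char}  {code:<5}  {units(code)} units")
```

Every digit is five elements long, so it costs between 9 units (“5”) and 19 units (“0”); a four-digit code for a single Chinese character lands somewhere around 36 to 76 units before any spacing, while E and T cost 1 and 3.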
Screenshot of a Wikipedia page showing a table of EBCDIC glyphs and their corresponding hexadecimal codes.
BCDIC (1928). Stands for Binary-Coded Decimal Interchange Code. A way to represent letters on punch cards designed for numbers by assigning a two-digit code to each letter. Developed by IBM for use in electromechanical tabulating machines. Represented 48 glyphs with the numbers 0-63. Extended into EBCDIC in 1964 (whence the E), used in IBM computers once those were invented. Perfectly serviceable. ★★★☆☆
Table of the 128 ASCII characters and their corresponding decimal, hex, and binary representations.
ASCII (1963). Stands for American Standard Code for Information Interchange. An elegant, logical mapping of every key a typewriter could produce, plus the most essential machine control instructions, into 128 glyphs. Really great for English text! The (American) IBM engineers definitely knew what they were doing by now. The only problem was that sometimes, some people wanted to display text that was not in English. Served as the US standard for 25 years. Brilliant and efficient. ★★★★☆
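The “elegant, logical” part is easy to check for yourself. A small illustration of my own (not from the standard): digits carry their own value in the low nibble, and upper and lower case differ by a single bit.

```python
# ASCII needs only 7 bits, and the layout is deliberate.
for ch in ["A", "a", "0", "7"]:
    print(ch, hex(ord(ch)), format(ord(ch), "07b"))

# '0'..'9' occupy 0x30..0x39, so the low nibble is the digit's value.
assert ord("7") & 0x0F == 7

# Upper and lower case letters differ only in bit 0x20.
assert ord("a") == ord("A") | 0x20
```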
Diagram showing the 256 glyphs represented in Latin-1.
Latin-1 (1985). Often called ANSI, which actually stands for American National Standards Institute. Actually developed by the European Computer Manufacturers Association, trying to construct a single text encoding that all of Europe could use. The limit of 256 glyphs meant they had to make some tradeoffs, not fully supporting any language. Missing glyphs include French œ, German ẞ, Finnish š, and the Hungarian ő found in Erdős. Mostly solved the encoding problem for most European languages for the next 25 years. The rest of the world used a horrifying jumble of competing encodings (ask me about Shift-JIS some time). A solid working solution for an intractable problem. ★★★☆☆
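You can still feel those 256-glyph tradeoffs today. A quick sketch (the example words are mine) of what happens when one of the missing glyphs meets Latin-1:

```python
print("café".encode("latin-1"))      # fine: é has a slot at 0xE9

try:
    "œuvre".encode("latin-1")        # French œ (U+0153) never got a slot
except UnicodeEncodeError as err:
    print(err)                       # ... ordinal not in range(256)
```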
Screenshot of xkcd comic 927. Transcription available at https://www.explainxkcd.com/wiki/index.php/927:_Standards
Unicode (1991). Born of a utopian dream to develop one universal standard that covers everyone’s use cases. Aims to define a single encoding that can represent any text that any person has ever written. Governed by the Unicode Consortium. Ran into two seemingly insurmountable problems: 1. Using it doubled the size of text written in Latin-1. The fastest modems available in 1991 had a transfer rate of 1.2KiB/s, so it was critical to optimize even text transmission sizes. 2. It only had space for 65,536 glyphs. This is not enough glyphs to encode even all of modern Chinese. There was a proposal to expand the space to 4.2 billion glyphs, but at the unworkable cost of doubling the text size again. Points for originality, I guess. ★★☆☆☆
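Problem 1 is easy to reproduce today. A minimal sketch, using Python’s UTF-16 codec (little-endian, no byte order mark) as a stand-in for the fixed-width 16-bit encoding Unicode originally specified:

```python
text = "plain old English text"
print(len(text.encode("latin-1")))    # 22 bytes
print(len(text.encode("utf-16-le")))  # 44 bytes: every character now costs two
```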
UTF-8 (1993). Stands for Unicode Transformation Format (8-bit). Radical solution that made everyone happy (except for CJK languages, which got screwed over by the consortium in the 1990s. Every country with its own script now has a seat on the consortium for this reason). Varies the length of the encoding based on the glyph, at the cost of significantly increased complexity (and, for some scripts, size). English could keep its precious 8-bit characters (128 glyphs), all European languages handily fit in 16 bits (2,048 glyphs), and even Classical Chinese has more than enough space at 32 bits (1.1 million glyphs). The best character encoding so far. ★★★★★
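A minimal sketch of the variable-length trick (the sample characters are my own picks): a character only costs extra bytes if it sits further out in the code space.

```python
for ch in ["A", "é", "€", "漢", "𝄞"]:          # 1, 2, 3, 3, and 4 bytes
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):06X} {ch!r}: {len(encoded)} byte(s) -> {encoded.hex(' ')}")

# Pure ASCII text is byte-for-byte identical to its ASCII encoding,
# which is how English kept its precious 8-bit characters.
assert "hello".encode("utf-8") == "hello".encode("ascii")
```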
Screenshot of a graph showing the usage rates of character encodings on the Web each year from 2000 to 2012. UTF-8 goes from 0% to 64% while ASCII goes from 57% to 17%.
That earlier XKCD comic, which ran in 2011 when UTF-8 adoption was around 52%, specifically uses character encodings as an example of a universal standard that never actually works out. Today, 99% of all websites use UTF-8. Its wide rollout was not without its challenges, but by and large proceeded smoothly. Experienced programmers today wonder why they have to specify UTF-8 everywhere, when there hasn’t been a reason to consider anything else in the past 10 years. It’s like how every single DNS resource record since 1985 has had to specify “IN” so we know it’s on the Internet, and not some other global network it might have been on in the 1980s.
And the Internet in question? It runs on text. Text, which must have an encoding, makes up five of the seven layers that in turn make up the Internet. It’s like we took a plane in flight and swapped out all of its engines. And no one noticed. A monumental feat that no one outside the field is aware of. This is the kind of massive infrastructure upgrade that engineers everywhere should strive to emulate.
The wordification pipeline
The folks who compile the Oxford English Dictionary (henceforth OED) think a lot about English words, and what qualities are necessary to make an utterance count as a word. I’ve found it rewarding to browse the guidance and materials they release as part of this process. For example, I really enjoyed their article on word frequency bands and their characteristics. Earlier this year, this instinct led me to learn about their schema for how a weird sound that falls out of someone’s mouth, or a calculated keysmash that falls out of someone’s brain, becomes a real, certified word. (In this case, the certification involves inclusion in the OED.)
Let’s say that I, a humble blogger, write a thinkpiece about how TV seasons optimize for fictional parasocial relationships, which are good to cultivate. I jocularly coin the term “pseudoparasocial” within. They would call that a coinage. Other lexicographers might call it a hapax legomenon, a Greek term for a word that appears only once. Hapaxes and their cousins can also be a fun topic to learn about if you’re the kind of person who reads articles like this one.
If I really like the sound of my own voice when I say “pseudoparasocial” and start injecting it into conversation, that term is now part of my idiolect. (If you like the sound of my voice, could I interest you in using the term neoretrofuturism to refer to cyberpunk?) I use it, and people understand me as well as they usually do, but no one else uses it.
If the term catches on in my friend group, and has enough memetic fitness that they also start using it in their friend groups, it might eventually become a regionalism after a few successful hops. Instead of my friends and their friends, if it were my colleagues and their colleagues, the term might instead begin to be considered jargon.
Once it’s gained enough cultural cachet that other bloggers begin to write thinkpieces about the term “pseudoparasocial”, they’d call it a neologism. Most other dictionaries would include the term as a word at this point, but not the OED. A work that aspires to define every word that has ever been written in the English language, in all of history, must have stricter criteria for inclusion to avoid getting swamped. So it is only if the term has enough staying power to maintain its cachet for a few years that their elite lexicographers would deem it a word.
I find myself comparing and contrasting this wordification pipeline with the path a work takes to enter the American cultural canon. In this instance, what I mean by “cultural canon” is the set of media that everyone knows that everyone knows. Some examples from the late 1900s might be Terminator and Jurassic Park. The ruby slippers from the Wizard of Oz or Lucy in the Sky with Diamonds. It’s no mean feat for a work to become well-known enough that it enters the zeitgeist, but the vast majority of those works do not then go on to become part of the cultural canon. One thing all four of the works I mention have in common is, they were all made before a hypothetical 25-year-old American was born. Works that successfully make it into the cultural canon need to somehow cross that vast chasm, somehow convincing people it’s worth their time to watch a thing their parents watched.
Misaligned, uncontained slop
Rationalists often speculate about what might happen if a misaligned AI we created escapes containment. I think this has already happened. I claim the word “slop” is a misaligned AI we created that has escaped containment.
Richard Dawkins coined the term meme in 1976 to describe the unit of cultural evolution, in the same way genes are the unit of biological evolution. Every word is a raw unit of meaning people created to more efficiently lob compact meaning bombs at each other. People transmit, reproduce, and mutate them. Words are memes. Each word competes with other words for fitness and “wants” to perpetuate itself, in the same way we think of a gene as “wanting” to perpetuate itself despite its lack of agency.
So anthropomorphized, you can see that the word “slop” solves a variety of general problems. Its broad applicability contributes to its spread, and therefore fitness. And it learns from its environment and past actions — the way people use and spread the word “slop” depends on their cultural context and history. You wouldn’t casually sling the word “slop” around in a formal policy recommendation, or on a blog post you wrote three years ago. You might deploy it more readily today, now that it’s been awarded the prestigious Word of the Year 2025 award by both Merriam-Webster and the American Dialect Society.
If we think of the word “slop” as an AI, it’s clear that it is both misaligned and has escaped containment. The word “slop” was coined to describe unwanted, low-quality LLM output, a parallel to the word “spam” describing unwanted, low-quality emails. Widespread compounds like friendslop and slop bowl illustrate that the word’s meaning has drifted far indeed from its initial intention.
Photo of a Chipotle burrito bowl. The single-use, elliptical bowl contains chicken al pastor, corn, cheese, sour cream, lettuce, black beans, and cilantro lime white rice.
Admittedly, categorizing the word “slop” as AI implies we should categorize all words as AIs, which doesn’t feel right. Let’s try a weaker version instead and see if that lands better. I claim the United States of America is a misaligned AGI we created that has escaped containment.
Yes, AGI. Even this devil’s advocate can’t find a framing that attributes human-level intelligence to the word “slop”. On the other hand, the US is at least as intelligent as a human, as well as being capable of nearly all common intellectual tasks.
In 2017, Charles Stross gave a keynote at the Chaos Communication Congress describing corporations as really old, really slow AIs. So we can start predicting what misaligned AGIs might do by looking at what misaligned corporations have already done. The argument goes, corporations are artificial: people created them. Corporations are the result of a series of inventions over time that gradually increased their capabilities. Corporations are intelligent: they are self-aware, learn from their actions and environment, and develop complex social relationships with other corporations. And corporations are agentic. Microsoft doesn’t do what Bill Gates wants. Or Steve Ballmer, or Satya Nadella, or any specific person. No person can direct all of Microsoft’s attention. Microsoft is composed of departments and people, just as people are composed of organs and cells. But Microsoft is something more than its constituent people, just as you are something more than your constituent cells. Microsoft’s will affects the world every day, in more ways than any one person can comprehend.
What Microsoft does not have is an army and a navy. The United States has those things. Like Microsoft, it is artificial, intelligent, and agentic. We the people created the US. Its existence depends on concepts that did not exist just 400 years ago. The US is aware of its own existence, learns from its actions, and has complex foreign relations with other nations. It has an immune system that isolates and kills cancerous people before they grow and hurt their surrounding people. Its will is different from and greater than any one person’s. Not even Donald Trump can direct all of the US’s attention. The US takes actions guided by a defined set of amendable principles, and it may later regret and apologize for its past actions.
It’s straightforward to show that the US is misaligned and has escaped containment. Even early in its history, the US displayed a strong desire to grow unchecked and to make more of itself. US ideals and culture outcompeted many others, spreading all over the world. Nearly half the world’s population now lives under a democracy. We can also point to actions the US takes that seem to violate its guiding principles. A cop in isolation can’t incarcerate a dissident; the US does. A sniper in isolation can’t assassinate Qasem Soleimani; the US did.
Yet despite everything, I am still proud to be an American. Today I turn my own attention to shifting the US’s attention one tiny bit. I do this through the time-honored American tradition of indiscriminately lobbing compact meaning bombs everywhere.
Misaligned, uncontained AGIs are all around us. This has been true for over a hundred years. But humans are still alive and well. What are some of the things we did and still do to make this outcome, and good outcomes in general, more likely?
Effective pre-journaling
I went on a meditation retreat with Jhourney last year that I can fairly describe as life-changing. They take a secular, evidence-based approach to meditation that some practitioners will find distasteful and others will see as a perfect fit. In particular, they use a lot of tools for thought that I recognize from consulting. One tool I found very useful, though, doesn’t seem to originate from either consulting or psychology. I wonder if it’s something the team came up with themselves.
They call it the PLAN framework for journaling. I don’t have the text handy, so I’m not going to get it exactly right. But maybe my unintentional changes will make it more memetically transmissible.
If you have some time before doing something important, write some notes to organize your thoughts first. Use the prompts:
Purpose. What is my goal? What am I trying to accomplish and why? Are there different, easier ways to get what I actually want?
Learned. What do I know that’s related to the goal? What happened the last time I tried to do something like this? What happened when someone else tried something like this?
Action. What specific actions do I plan to take to achieve my goal? Write out the details and see if anything stands out.
Needs. What is some thing or knowledge that, if I had it, would make doing these things dramatically easier? Can I figure out how to get that?
Scan of an 1860 diagram of the parts of the brain, overlaid on a man in profile. Some parts have labels like Manners, Patriotism, Industry, and Aversion.
Moving from abstract to concrete and back helps with learning, so here’s an example:
Purpose. I want to figure out what to write for tomorrow’s post. There’s a lot of ideas in my list. I want to pick one so that I can get some thinking and writing in ahead of time.
Learned. History posts seem very compelling to me while I’m writing daily. Memoir posts seem to get a lot of traction, but the required vulnerability leaves me drained. I enjoy review posts, but feel a lot of pressure to get the details right. List posts are quick and easy on an off day. I should set an intention ahead of time if I want to try a kind of post that scares me.
Action. Go through the ideas list from most to least recent. My best ideas so far seem to happen when I can write a single post that hits two or more of them. It’s a fun puzzle to solve that sometimes hits me with related inspiration. Matching a content idea to a form restriction works well, but finding a throughline between two content ideas is even better.
Needs. What’s my schedule today and tomorrow? Do I have any plans that align with a particular topic or theme? Is there someone I can chat with who’d really inspire or inform my work on a subject? Who do I want to spend time chatting with that I haven’t had the opportunity to yet?
In my experience, writing these out when I’m faced with a thorny problem has been well worth the time spent. Each of the four prompts has led me to important realizations. In particular, the “dramatically easier” framing in Needs often helps me notice that I should do or ask something else first. That something else might lead me to never actually take the actions, but get what I want anyway. Unfortunately, skipping directly to just that prompt without going through the other ones doesn’t seem to work as well.
One hint that makes me suspect the Jhourney team created this themselves is, the mnemonic absolutely does not work for me. I can recall it now through repetition. But the first few times, I had to look up the words every time despite knowing their first letters. Needs, really? But I can’t deny the effectiveness of the questions. If you can come up with a better mnemonic, I’d love to hear it.
Infinite Lives, 2000: Diablo II
Gradually build up a character from plinking away one shot at a time, to filling the screen with projectiles while blinking all over the map. Diablo II is all about experiencing an incredible 100-hour power fantasy character arc. Like with Lord of the Rings, playing it now can feel derivative rather than prescient, just because the series established so many of the conventions for video games. Red health and blue mana? Item sets? Teleporting to waypoints? Color-coded item rarities? Gems and socketable items? Item affixes? Skill trees??
This is the second in a series of posts examining video game history by looking at one game I loved from each year, 1978–2027.
One secret to Diablo II’s success was its obsession with procedural generation. Instead of static levels you could memorize, most Diablo II levels were randomly generated. While generated levels were a staple of the roguelike genre Diablo II descended from, they weren’t common in mainstream games. Diablo II’s world generation was notable in part for using many different generators based on your biome. Wandering the randomly-generated desert would feel very different from navigating abandoned sewers. While its predecessor Diablo (1996) was also procedurally generated, it contained a single 16-level dungeon, a pale shadow of Diablo II’s expansive generated world.
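To make “many different generators based on your biome” concrete, here’s a toy sketch of the idea (nothing like Blizzard’s actual code): one generator function per biome, plus a seed so a given map is reproducible.

```python
import random

def desert(rng: random.Random) -> list[str]:
    # Wide-open sand with the occasional landmark.
    return ["open sand"] * rng.randint(6, 9) + ["tomb entrance"]

def sewers(rng: random.Random) -> list[str]:
    # Cramped corridors with the occasional junction.
    return ["corridor" if rng.random() < 0.7 else "junction" for _ in range(12)]

GENERATORS = {"desert": desert, "sewers": sewers}

def generate_level(biome: str, seed: int) -> list[str]:
    return GENERATORS[biome](random.Random(seed))

print(generate_level("desert", seed=2000))
print(generate_level("sewers", seed=2000))
```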
Diablo II inherited its rich, underexplored mechanics from Diablo, of course. Diablo was an attempt to take the roguelike Moria (1983) and add graphical, real-time action combat. Considered the first influential action RPG (ARPG), Diablo was a solid success. Its strategy of “take a niche genre that turbo-nerds love and make it pretty and legible to regular nerds” created an excellent foundation for Diablo II to expand on. Diablo II would then go on to inspire so many ARPGs that the genre is still sometimes called “diablolike” to this day.
Diablo-style item colors — blue for magic, yellow for rare, green for set, and gold for unique — would become the default for video games until its sibling World of Warcraft (2004) popularized its even more influential standard. ARPGs were such a compelling fusion that they still inspire entire new genres. Borderlands (2009) applied the formula to first-person shooters, creating looter-shooters. Vampire Survivors (2022) compacted the 100-hour character arc to 30 minutes, creating bullet heavens (or single-stick shooters, or pick-3s; the genre name hasn’t cleared its orbit yet).
Screenshot of Diablo 2 in pixelated 640x480 resolution. A sorceress in trademark green casts Blizzard, spraying snow all over the dungeon and turning a nearby sword demon blue.
The Arreat Summit was an official Diablo II website where you could look up game mechanics in unprecedented detail. Official game websites in 2000 would have a screenshot or two, some marketing copy, and maybe a link to a fansite where the real information was if you were lucky. Allocating a website maintainer to publicly document the game’s inner workings was another thing Diablo II did far ahead of its time. Another was battle.net, the free netplay service included with the game.
I did not play Diablo II in 2000. Its system requirements included a ridiculous 32MiB of RAM. When they ran an open-to-everyone load test, I gave it a shot with my 16MiB of RAM and was treated to a slideshow. I could only look on in envy, only sneaking in a few hours on a friend’s gaming rig in 2001. I eventually learned how to physically jam some RAM I ordered straight from the manufacturer into my computer’s innards by 2002 and was able to enjoy it.
It would be negligent to discuss 2000 in video games without at least mentioning The Sims. The Sims was by far the most well-known game released that year, especially popular among people who didn’t play video games. Starting with SimCity (1989), Maxis set out to make games out of simulating everything from a single building in SimTower (1994) to the entire planet in SimEarth (1990). They’d known for years that simulated people would be the most difficult and rewarding game to get right. They managed to knock it out of the park, with innovations like the needs bars and mood framework that changed the way people thought. They even managed to include same-sex relationships, despite an unfriendly media climate.