Diffstat (limited to '_posts/2023-02-12-uuid-versions.md')
-rw-r--r-- | _posts/2023-02-12-uuid-versions.md | 170 |
1 files changed, 170 insertions, 0 deletions
diff --git a/_posts/2023-02-12-uuid-versions.md b/_posts/2023-02-12-uuid-versions.md new file mode 100644 index 0000000..3f9c257 --- /dev/null +++ b/_posts/2023-02-12-uuid-versions.md @@ -0,0 +1,170 @@ +--- +title: UUID versions through the ages +--- + +UUIDs are neat. y'know, `cfbff0d1-9375-5685-968c-48ce8b15ae17` type of shit. if you're like me until a few days ago, all you know about the types of UUID is that v4 is the good one. but why are there other ones? is there a secret better one? why are the dashes asymmetrical? let's take a (roughly paraphrased from [wikipedia](https://en.wikipedia.org/w/index.php?title=Universally_unique_identifier&oldid=1136241716) and probably not quite accurate) look. + +## wait why even + +sometimes you need an ID for something you are putting in the computer, so that you have a stable way to refer to it even if all the editable fields on it change. the simplest possible approach is to give the first thing ID 1, the second thing ID 2, and so on. cohost works this way right now - as i'm editing it, this draft post has ID 1009270, meaning this is the just-over-a-millionth thing in the posts table. + +your database sits there going "the next post has ID 8. oh, new post? it has ID 8, okay the next post has ID 9." and all is well. except a year later you have a million posts and a bunch of people posting all at once, and every new post needs a new ID but they have to get created one at a time in the database so that they all get the right ID. If You're In Line (To Get The Next Post ID), Stay In Line. and the only way to know what the next post ID is is to check with the database, so you can't do things like save drafts offline with proper IDs. (staff probably doesn't want that anyway, but we need something vaguely similar at work, which is how i got here.) if you need to work at, say, Twitter's scale, or you need to be able to generate IDs without checking with the database, you need something more involved than just sequential IDs. 
+ +## version 1 + +in the early 90s, some UNIX people ran into this problem when drawing up their Distributed Computing Environment. they called [their solution](https://pubs.opengroup.org/onlinepubs/9629399/apdxa.htm) "Universal Unique Identifiers", which they call "an identifier that is unique across both space and time". it's written as a hexadecimal string, but it can be stored as just the 16 bytes that are represented by that hexadecimal string. the way they make sure it's unique across both space and time is actually pretty straightforward: part of the UUID encodes the space where it was generated, and part of the UUID encodes the time where it was generated. 
+ +the UUID format has two control fields and three data fields. the version field is pretty straightforward - it's 1 for UUIDv1, 2 for UUIDv2, etc. at this point, they only had 1 and 2, but they left room in the spec for up to 15 just in case. there's also a variant field, which says whether it's a normal UUID (`10`, hex value `8` through `b`) or some other bullshit that may or may not adhere to any of the rest of this spec. + +<details> + +<summary>other bullshit</summary> + +if the variant field is `0` then it's a UUID from Apollo Computer's Network Computing System, which had UUIDs before DCE but defined them in a slightly different way. if it's `110` then it's a UUID but the wrong endian, which Microsoft does sometimes when it makes UUIDs (it calls them GUIDs, because they're thinking too small, merely Global rather than Universal). if it's `111` then you're living in the future where they assigned a meaning to variant `111`. what's it like? how's the whole climate change thing going? + +</details> + +the data fields are + +- timestamp, which is measured since the start of the Gregorian calendar with 100ns resolution (which, with 60 bits available, repeats every 3653 years). this is how they make sure the UUID is unique across time. + +- clock sequence, which starts at a random number and goes up by one if time goes backwards, to ensure that things like clock drift or leap seconds don't lead to collisions. this is how they make sure the UUID is *actually* unique across time. + +- node, which is just the MAC address on your network card (or one of them, if you've got more than one), since different network cards already have to have different MAC addresses in order for networking to happen. this is how they make sure the UUID is unique across space. 
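to make those fields concrete, here's a minimal sketch that pulls them out of a v1 UUID with python's standard `uuid` module. the UUID value is the example v1 UUID used in this post; `GREGORIAN_EPOCH` and `created` are names made up for this sketch, and the date math assumes the timestamp really is a count of 100ns ticks since 1582-10-15, as described above.

```python
import uuid
from datetime import datetime, timedelta

# example v1 UUID from this post
u = uuid.UUID("1a8188ce-aa78-11ed-afa1-0242ac120002")

# control fields
version = u.version      # which UUID version this claims to be
variant = u.variant      # uuid.RFC_4122 for the normal "10" variant

# data fields, as exposed by the stdlib
timestamp = u.time       # 60-bit count of 100ns ticks since 1582-10-15
clock_seq = u.clock_seq  # 14-bit clock sequence
node = u.node            # 48-bit node (a MAC address, if the generator used one)

# turn the tick count into a calendar date; 10 ticks = 1 microsecond
GREGORIAN_EPOCH = datetime(1582, 10, 15)
created = GREGORIAN_EPOCH + timedelta(microseconds=timestamp // 10)

print(version, variant)
print(hex(timestamp), hex(clock_seq), f"{node:012x}")
print(created.isoformat())
```

the stdlib does the un-slicing of time_low, time_mid and time_hi for you, which is why `u.time` comes back as one integer.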
+ +this is how we get that weird hyphen asymmetry, the groups come directly from the UUID data fields: + +<div style="display: grid; grid-template-rows: 1fr 1fr;"><div style="display: grid; grid-column: 1 / 99; grid-template-columns: subgrid; text-align: center;"><span style="grid-column: 5; border-bottom: 1px solid;">version</span><span style="grid-column: 8; border-bottom: 1px solid;">variant</span></div> <div style="display: grid; grid-column: 1 / 99; grid-template-columns: subgrid; text-align: center;"><kbd style="font-weight: 600;">1a8188ce</kbd><kbd style="font-weight: 600;">-</kbd><kbd style="font-weight: 600;">aa78</kbd><kbd style="font-weight: 600;">-</kbd><kbd style="font-weight: 600;">1</kbd><kbd style="font-weight: 600;">1ed</kbd><kbd style="font-weight: 600;">-</kbd><kbd style="font-weight: 600;">a</kbd><kbd style="font-weight: 600;">fa1</kbd><kbd style="font-weight: 600;">-</kbd><kbd style="font-weight: 600;">0242ac120002</kbd></div> <div style="display: grid; grid-column: 1 / 99; grid-template-columns: subgrid; text-align: center;"><span style="grid-column: 1; border-top: 1px solid;">time_low</span><span style="grid-column: 3; border-top: 1px solid;">time_mid</span><span style="grid-column: 6; border-top: 1px solid;">time_hi</span><span style="grid-column: 8 / 10; border-top: 1px solid;">clock sequence</span><span style="grid-column: 11; border-top: 1px solid;">node</span></div></div> + +you may have noticed that the time is split into three pieces, for the low 32 bits, the middle 16 bits, and the high 12 bits. why split it up like that? well, i don't know, but i suspect it makes a lot of things easier to split the 60 bit timestamp into at least 32 and 28 (and maybe splitting the 28 into 16 and 12 makes something easier in a way i'm not seeing?). for one, 64-bit CPUs weren't mainstream yet, and for two, they had creative alternate uses for that time_low field. 
+ +## version 2 + +like a lot of multi-user operating systems, UNIX has users and groups, and allows for permissions management based on those users and groups. users and groups both have textual names and numeric IDs, so if you want to stably refer to a specific user or group, you can use its user ID or group ID. however, different computers can have different sets of users and groups, so if you're making the Distributed Computing Environment, you need a way to refer to a specific user or group on a specific machine. you're building on top of UNIX, because you're the Open Software Foundation (later The Open Group), so you have user and group IDs locally already. and you already made this UUID format, which has fields that refer to a specific machine. the other fields are already taken for time and also-time, but you didn't promise they'd *always* be time, right? + +in UUIDv2 (which the [DCE spec](https://pubs.opengroup.org/onlinepubs/9696989899/chap5.htm#tagcjh_08_02_01_01) calls the "security version"), the time_low field is literally just a UNIX user/group ID. the low byte of the clock sequence field is repurposed to specify whether it's a user or a group (or a secret third thing, an organization). + +i have several questions. for one, what about the other time fields? time_mid ticks up once every 7 minutes, if you construct your UUIDv2 out of a UUIDv1. do you just leave it at zero and let time_hi tick up every 325 days? do you leave mid and hi both at zero and party like it's 1582? for two, had they not invented MAC address spoofing yet? these days you can usually change your network card's MAC address to something else, so using that for anything security-related strikes me as highly dubious. for three,, what? just in general? why would you do this? this is some 5 Minute Crafts tier lifehackery. please refrain. + +presumably this worked well enough for DCE, but it has not withstood the test of time. 
i don't know that UUIDv2 even counts as a UUID, but it follows the UUID format and put a 2 in the version number slot, and so it lives on solely as negative space in the UUID version number range. (this is also apparently the deal with IPv5.) + +UUIDv2 may or may not have been a good idea, but the concept of "what if you had a UUID based on some specific value other than the current time" had legs. + +## version 3 + +DCE was done being written, and then it kinda died, but people kept using UUIDs. DCE was a legacy-style Proper Goddamn Specification, written by the consortium that had since become The Open Group, who also run POSIX and the Single UNIX Specification and all that jazz (?? when the posix is sus !), but that sort of doorstopper spec was overkill for the humble UUID. what it needed, as a piece of computer bullshit, was an RFC. and so in 2005 the UUID was defined again in [RFC 4122](https://datatracker.ietf.org/doc/html/rfc4122), which kept v1, reduced v2 to one sentence, and added some new versions. + +one way to think of the goal of UUIDv2 is that it's about referring to an object that already has a contextually unique ID. in v2, that object is either a user or a group, and that context is a machine. v3 is a little more flexible, but one of the contexts mentioned in the RFC is domain names, so let's look at that. + +say i want something in the format of a UUID that refers to the domain name `example.com`. one option would be to take the MD5 hash of `"example.com"`, look at the first 16 bytes, line that up with the UUID format definition, and set the version and variant to the right values. this is cool, and it basically already works for domain names, but we want flexibility. if you and i both want to do the UUIDv2 thing of referring to users on a machine, and my context is my machine and your context is your machine, and both of us have a user named `cactus`, oops, we have the same UUID, that's hardly Universally Unique. 
we need to include the context in what we're MD5ing, and we need to guarantee that different contexts have different values. and there's nothing computer people love more than recursion, so let's give the context a damn UUID. + +to make a UUIDv3, you need a name (which is just some text) and a UUID for your "name space" (which is the context in which your name is unique). take the binary representation of the namespace UUID, append the name, MD5 it, copy that into your UUID structure, set the version and variant, and you are done. + +the RFC defines some name space UUIDs already, like `6ba7b810-9dad-11d1-80b4-00c04fd430c8` for domain names, so we can check this ourself: + +```python +>>> import hashlib +>>> import uuid +>>> dns_namespace = uuid.UUID("6ba7b810-9dad-11d1-80b4-00c04fd430c8") +>>> hashlib.md5(dns_namespace.bytes + b"example.com").hexdigest() +'9073926b929fd1c26bc9fad77ae3e8eb' +>>> uuid.uuid3(dns_namespace, "example.com") +UUID('9073926b-929f-31c2-abc9-fad77ae3e8eb') +``` + +this is pretty damn neat. if you have something that's contextually unique and you want to turn it into something that's globally unique, this is a really cool way to do that. (spoilers, except for one thing, which you may have noticed if you know your hash functions.) but this is only sometimes a problem you have; other times, you don't have anything unique yet, and you want something ex nihilo. v1 is still good, if you've got a timestamp and a MAC address, but what if you're doing something like JS development where you can't exactly check the MAC address you're running on? well, satan help you. or what if that 100ns resolution isn't good enough like it was in the 90s? + +## version 4 + +set the version and variant. fill the rest with random bits. there is no step 3. + +if you had asked me to guess last week what a UUIDv4 was, i'd have just guessed it was 128 random bits (if you made me count how long it was). and i'd have been wrong, but only by six bits. 
which is neat, but also a little bit bullshit, because like. me from last week wants those bits back! + +those six bits are for compatibility with the rest of the UUID universe, but if you're just looking for some random bytes to throw in your id column, you don't need compatibility with UUIDv1, you could just make some random bytes! and 16 of them is probably overkill for your use case anyway! + +wait hang on a minute, did that say MD5 earlier? + +## version 5 + +turns out MD5 sucks. you know what's really cool? SHA-1. + +UUIDv5 is just UUIDv3 again, but with SHA-1 instead of MD5. + +```python +>>> hashlib.sha1(dns_namespace.bytes + b"example.com").hexdigest() +'cfbff0d193753685568c48ce8b15ae17d93cc34c' +>>> uuid.uuid5(dns_namespace, "example.com") +UUID('cfbff0d1-9375-5685-968c-48ce8b15ae17') +``` + +thankfully, SHA-1 is the last word in hashing algorithms, it's never had problems and it's known to be very good. now to take a big sip of my coffee and check the NIST website. + +<div class="cohost-style-embed"> +{% renderTemplate "webc" %} +<opengraph-embed + href="https://www.nist.gov/news-events/news/2022/12/nist-retires-sha-1-cryptographic-algorithm" + site-href="https://www.nist.gov" + site-favicon="https://www.nist.gov/themes/custom/nist_www/favicon.ico" + img-src="https://www.nist.gov/sites/default/files/images/2022/12/14/SecureHashAltogirthm23_Released_960x600_v4_A.png" + datetime="2022-12-15T12:00:00.000Z" +> +<span slot="title">NIST Retires SHA-1 Cryptographic Algorithm</span> +The venerable cryptographic hash function has vulnerabilities that make its further use inadvisable. 
+<span slot="site-name">NIST</span> +<span slot="site-domain">nist.gov</span> +<span slot="date">Dec 15, 2022</span> +</opengraph-embed> +{% endrenderTemplate %} +<div class="cohost-style-embed-link"><a href="https://www.nist.gov/news-events/news/2022/12/nist-retires-sha-1-cryptographic-algorithm">https://www.nist.gov/news-events/news/2022/12/nist-retires-sha-1-cryptographic-algorithm</a></div> +</div> + +oh. that seems bad. when is UUIDv6? + +## version 6 + +well. there's a [draft RFC](https://datatracker.ietf.org/doc/html/draft-peabody-dispatch-new-uuid-format) making updates to the UUID RFC, but it doesn't solve that problem, it solves different problems. + +one of the cool things about UUIDv1 is that you can decode the timestamp back out of it, and you don't need a separate field for the time when your object was created, because its ID tells you when it was created. however, the weird slicing and dicing that UUIDv1 does to the timestamp field means sorting by time is complicated, since the low 32 bits come first and the high 12 bits come last. + +UUIDv6 puts the whole timestamp in order, so that the most significant bit is first. 
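a rough sketch of that reshuffle, assuming the field layout described here (high 32 bits of the timestamp, then the middle 16, then the version, then the low 12). `v1_to_v6` is a made-up helper name for illustration, not something taken from the draft or from any library.

```python
import uuid


def v1_to_v6(u1: uuid.UUID) -> uuid.UUID:
    """Hypothetical helper: re-pack a v1 UUID's timestamp in v6 order."""
    if u1.version != 1:
        raise ValueError("expected a version 1 UUID")
    ts = u1.time                          # the full 60-bit timestamp
    time_high = (ts >> 28) & 0xFFFFFFFF   # top 32 bits go first
    time_mid = (ts >> 12) & 0xFFFF        # next 16 bits
    time_low = ts & 0x0FFF                # bottom 12 bits go last
    value = (
        (time_high << 96)
        | (time_mid << 80)
        | (0x6 << 76)                     # version nibble
        | (time_low << 64)
        | (u1.int & 0xFFFFFFFFFFFFFFFF)   # variant, clock sequence, node: unchanged
    )
    return uuid.UUID(int=value)


v1 = uuid.UUID("1a8188ce-aa78-11ed-afa1-0242ac120002")
print(v1_to_v6(v1))
```

since the most significant timestamp bits now come first, comparing two of these as plain strings or bytes should agree with comparing their timestamps.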
+ +<div style="display: grid; grid-template-rows: 1fr 1fr;"><div style="display: grid; grid-column: 1 / 99; grid-template-columns: subgrid; text-align: center;"><span style="grid-column: 5; border-bottom: 1px solid;">version</span><span style="grid-column: 8; border-bottom: 1px solid;">variant</span></div> <div style="display: grid; grid-column: 1 / 99; grid-template-columns: subgrid; text-align: center;"><kbd style="font-weight: 600;">1edaa9b4</kbd><kbd style="font-weight: 600;">-</kbd><kbd style="font-weight: 600;">e919</kbd><kbd style="font-weight: 600;">-</kbd><kbd style="font-weight: 600;">6</kbd><kbd style="font-weight: 600;">172</kbd><kbd style="font-weight: 600;">-</kbd><kbd style="font-weight: 600;">a</kbd><kbd style="font-weight: 600;">0d0</kbd><kbd style="font-weight: 600;">-</kbd><kbd style="font-weight: 600;">6721ef312724</kbd></div> <div style="display: grid; grid-column: 1 / 99; grid-template-columns: subgrid; text-align: center;"><span style="grid-column: 1; border-top: 1px solid;">time_high</span><span style="grid-column: 3; border-top: 1px solid;">time_mid</span><span style="grid-column: 6; border-top: 1px solid;">time_low</span><span style="grid-column: 8 / 10; border-top: 1px solid;">clock sequence</span><span style="grid-column: 11; border-top: 1px solid;">node</span></div></div> + +it keeps the clock sequence as-is from v1, but it explicitly recommends using random data instead of the MAC address for the node field, which is good. + +hang on, while we're messing with silly things from v1, what the hell is up with time since the gregorian calendar? + +## version 7 + +what if we just did unix timestamp and randomness, so that we had easy sorting and decoding but also uniqueness, without any bullshit. + +well, v7 is that. 48 bits of unix timestamp in milliseconds (rolls over every 8920 years), 74 random bits, 6 control bits for version and variant. 
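here's a minimal sketch of that recipe, assuming the layout just described (48 bits of milliseconds, then version, then random bits with the variant dropped in). `uuid7_sketch` is a hypothetical name, and this skips the extras a real implementation might add, like keeping IDs ordered within the same millisecond.

```python
import os
import time
import uuid


def uuid7_sketch(unix_ms=None):
    """Hypothetical helper: unix milliseconds up front, randomness after."""
    if unix_ms is None:
        unix_ms = time.time_ns() // 1_000_000
    # 80 random bits; 6 of them get overwritten by version and variant below
    rand = int.from_bytes(os.urandom(10), "big")
    value = ((unix_ms & 0xFFFFFFFFFFFF) << 80) | rand
    value &= ~(0xF << 76)   # clear the version nibble...
    value |= 0x7 << 76      # ...and set it to 7
    value &= ~(0x3 << 62)   # clear the two variant bits...
    value |= 0x2 << 62      # ...and set them to binary 10
    return uuid.UUID(int=value)


print(uuid7_sketch())
print(uuid7_sketch(0x0186443F2C00))  # fixed timestamp, random tail
```

two IDs made in different milliseconds sort in creation order, and you can read the timestamp back by taking the top 48 bits of `.int`.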
+ +<div style="display: grid; grid-template-rows: 1fr 1fr;"><div style="display: grid; grid-column: 1 / 99; grid-template-columns: subgrid; text-align: center;"><span style="grid-column: 3; border-bottom: 1px solid;">version</span><span style="grid-column: 6; border-bottom: 1px solid;">variant</span></div> <div style="display: grid; grid-column: 1 / 99; grid-template-columns: subgrid; text-align: center;"><kbd style="font-weight: 600;">0186443f-2c00</kbd><kbd style="font-weight: 600;">-</kbd><kbd style="font-weight: 600;">7</kbd><kbd style="font-weight: 600;">5fb</kbd><kbd style="font-weight: 600;">-</kbd><kbd style="font-weight: 600;">8</kbd><kbd style="font-weight: 600;">00a-7ec0f02a852d</kbd></div> <div style="display: grid; grid-column: 1 / 99; grid-template-columns: subgrid; text-align: center;"><span style="grid-column: 1; border-top: 1px solid;">time</span><span style="grid-column: 4; border-top: 1px solid;">rand_a</span><span style="grid-column: 6 / 10; border-top: 1px solid;">rand_b</span></div></div> + +this is the real good one. the only reason not to use it is that the RFC isn't approved so it isn't quite official yet, but if you don't need to care, find an implementation in your language of choice and go to fuckin town. + +but what if yolo? + +## version 8 + +if yolo, then you can use version 8. all of the version-specific fields are defined to be custom, so you can put whatever the hell nonsense you need to there. 
+ +<div style="display: grid; grid-template-rows: 1fr 1fr;"><div style="display: grid; grid-column: 1 / 99; grid-template-columns: subgrid; text-align: center;"><span style="grid-column: 3; border-bottom: 1px solid;">version</span><span style="grid-column: 6; border-bottom: 1px solid;">variant</span></div> <div style="display: grid; grid-column: 1 / 99; grid-template-columns: subgrid; text-align: center;"><kbd style="font-weight: 600;">b00b5101-6969</kbd><kbd style="font-weight: 600;">-</kbd><kbd style="font-weight: 600;">8</kbd><kbd style="font-weight: 600;">420</kbd><kbd style="font-weight: 600;">-</kbd><kbd style="font-weight: 600;">9</kbd><kbd style="font-weight: 600;">a55-676179736578</kbd></div> <div style="display: grid; grid-column: 1 / 99; grid-template-columns: subgrid; text-align: center;"><span style="grid-column: 1; border-top: 1px solid;">custom_a</span><span style="grid-column: 4; border-top: 1px solid;">custom_b</span><span style="grid-column: 6 / 10; border-top: 1px solid;">custom_c</span></div></div> + +## so what did we learn + +UUIDs are pretty neat. if you need a database identifier and you get to pick something from scratch, use v7, it gives you timestamps for free. if you have things that are contextually unique and you want to turn them into universally unique IDs with a standard format, v5 is for you. if all you need is some random shit, v4 is that in the UUID format, but if you don't need the UUID format, you can also just use random bytes directly. if what you need is very specific shit nobody's thought of before, but in the UUID format, that's v8. v6 is the worse version of v7, v3 is the worse version of v5, v1 is the worse version of v6, and v2 was a mistake. 
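if you want to see the name-based recipe from the v3 and v5 sections end to end, including the "set the version and variant" step, here's a sketch. `name_based` is a made-up helper for illustration; in real code you'd just call `uuid.uuid3` or `uuid.uuid5`, which this is checked against below.

```python
import hashlib
import uuid


def name_based(namespace: uuid.UUID, name: str, version: int) -> uuid.UUID:
    """Hypothetical helper: hash namespace bytes + name, then stamp the control bits."""
    hasher = {3: hashlib.md5, 5: hashlib.sha1}[version]
    digest = hasher(namespace.bytes + name.encode("utf-8")).digest()
    raw = bytearray(digest[:16])                # sha1 gives 20 bytes; keep the first 16
    raw[6] = (raw[6] & 0x0F) | (version << 4)   # version nibble
    raw[8] = (raw[8] & 0x3F) | 0x80             # variant bits, binary 10
    return uuid.UUID(bytes=bytes(raw))


print(name_based(uuid.NAMESPACE_DNS, "example.com", 3))
print(name_based(uuid.NAMESPACE_DNS, "example.com", 5))
```

`uuid.NAMESPACE_DNS` is the stdlib's constant for the `6ba7b810-9dad-11d1-80b4-00c04fd430c8` name space.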
+ +see also: [creative use of v1's MAC address field](https://cohost.org/twilight-sparkle/post/1010833-related-reading-on), [current RFC draft status and other notes](https://cohost.org/iliana/post/1013189-some-notes-on-furthe), [bluetooth doing crimes](https://cohost.org/vikxin/post/1014175-and-then-bluetooth-h) + +in conclusion, + +# UUIDs nuts. +