“Cryptography is how people get things done when they need one another, don’t fully trust one another, and have adversaries actively trying to screw things up.”
– Ben Adida
Our starting point into cryptocurrencies is to build a digital version of a banking system, where people can transfer money to each other electronically over the Internet without the need for a single trusted banking authority.
In the physical world, when Bob receives a check from Alice, he may be suspicious that Alice is trying to deceive him. He needs to be able to verify the following three properties.
1. Alice properly signed the check.
2. Alice possesses $1 in her bank account.
3. Alice does not double-spend the money by writing a check to someone else at the same time.
Last week, we started to explore a cryptographic primitive that will address property #1: building a digital version of signing a check. This primitive is (appropriately enough) called a digital signature.
A digital signature allows Alice to append a signature $\sigma$ to a message $M$ that satisfies two properties:

1. Anyone in the world can verify that $\sigma$ is a valid signature by Alice on the message $M$.
2. Nobody besides Alice can produce a valid signature, so a verified signature proves that Alice herself signed the message.
To satisfy the latter goal, Alice must have something that the rest of the world does not. We call this special something a "secret key."
Informally, a digital signature allows Alice to "lock" her message using her secret key. Everyone else in the world can unlock this message, so the lockbox isn't used for hiding. Instead, it is used to prove that Alice must have been the person who locked the message in the first place.
Image source: idea-instructions.com
To understand the unforgeability guarantee better, it helps to add a third person to our figure.
Suppose that Alice is trying to sign a message $M$. But, there is a meddler-in-the-middle, called Mallory, who intercepts the message.
Alice $\longrightarrow$ Mallory $\longrightarrow$ Bob
We want to ensure that any attempt by Mallory to tamper with the messages from Alice to Bob will be detected.
And we want unforgeability to hold no matter how many messages Alice sends. Even if she signs 1000 messages $M_1, M_2, \ldots, M_{1000}$, Mallory cannot forge a signature on any message that Alice hasn't signed herself.
Furthermore, it shouldn't matter which messages Alice signed. That is, for any choice of the 1000 messages $M_1, M_2, \ldots, M_{1000}$ that Alice signed, Mallory cannot create a valid signature on a 1001st message.
We are now ready to formally define a digital signature scheme.
Definition. A digital signature scheme contains three algorithms:

- $\operatorname{KeyGen}() \to (sk, pk)$: generates a secret signing key $sk$ and a public verification key $pk$.
- $\operatorname{Sign}(sk, M) \to \sigma$: uses the secret key to produce a signature $\sigma$ on the message $M$.
- $\operatorname{Verify}(pk, M, \sigma) \to$ true or false: uses the public key to check whether $\sigma$ is a valid signature on $M$.
All three algorithms should be efficiently computable. Moreover, the digital signature must satisfy two guarantees: correctness and unforgeability.
Correctness means that for any secret/public keypair produced by KeyGen, it is the case that any signature made via the Sign algorithm is subsequently accepted by Verify.
We formalize the unforgeability guarantee by considering a hypothetical game involving our adversary Mallory.
This game proceeds in two phases.

1. First, Mallory may ask Alice to sign any messages $M_1, M_2, \ldots$ of Mallory's choosing, and Mallory receives the corresponding signatures $\sigma_1, \sigma_2, \ldots$.
2. Then, Mallory outputs a candidate forgery: a message-signature pair $(M^*, \sigma^*)$.

Mallory wins the game if: $M^*$ is not one of the messages that Alice signed, and yet $\operatorname{Verify}(pk, M^*, \sigma^*)$ outputs true.
Any digital signature scheme that satisfies this definition is said to be "existentially unforgeable under a chosen message attack," often abbreviated EU-CMA.
In words, this means that no matter how many signed messages Mallory has seen Alice make in the past, and no matter what the message contents were, it is practically impossible for Mallory to forge a signature on a new message that Alice never signed herself.
It is helpful to have a definition in hand before we try to construct a digital signature, so that we have a way to test whether a candidate construction is good.
Let's try to construct a digital signature. It turns out that this is challenging to do. Remember that we need the following properties to hold:

- Producing a valid signature without the secret key must be practically impossible for Mallory, even after seeing many of Alice's signatures.
- Verifying a signature using only the public key must be easy for Bob.
So: we need to find a math problem that is difficult to solve but easy to check. Unfortunately, we do not know for certain whether any such math problems exist! If you can find one, the Clay Math Institute will pay a prize of $1 million.
So we do the next best thing: build signature schemes from math problems that we believe to have this property of "difficult to solve but easy to check." One such problem is factoring.
Rivest, Shamir, and Adleman realized that this math problem can be used as the basis of a digital signature scheme. It is named RSA in their honor. It works using modular (or clock) arithmetic.
Here is a slightly oversimplified (and incorrect) version of their signing algorithm.

- $\operatorname{KeyGen}$: choose a modulus $N$ together with two exponents $sk$ and $pk$ that "cancel out," which requires knowing the factorization of $N$.
- $\operatorname{Sign}(sk, M) = M^{sk} \bmod{N}$.
- $\operatorname{Verify}(pk, M, \sigma)$ outputs true if and only if $\sigma^{pk} = M \bmod{N}$.

Why does this (almost) work? It turns out that:
If you know the factorization of $N$, then it is easy to find two exponents $sk$ and $pk$ that "cancel out" in the sense that $(x^{sk})^{pk} = x$ for any value $x$.
On the other hand, if you don't know the factorization of $N$, it is believed to be computationally infeasible to find the value $sk$ that "cancels out" $pk$.
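As a minimal sketch of this cancellation, here is textbook RSA with hypothetical toy parameters (real RSA uses primes that are hundreds of digits long, so that $N$ cannot be factored):

```python
# Textbook RSA signatures with toy parameters -- for illustration only.
p, q = 61, 53                  # secret primes (hypothetical small values)
N = p * q                      # public modulus: 3233
phi = (p - 1) * (q - 1)        # 3120; computable only if you can factor N
pk = 17                        # public verification exponent
sk = pow(pk, -1, phi)          # secret signing exponent: pk * sk = 1 (mod phi)

def sign(sk, M):
    return pow(M, sk, N)            # sigma = M^sk mod N

def verify(pk, M, sigma):
    return pow(sigma, pk, N) == M   # accept iff sigma^pk = M mod N

M = 42
sigma = sign(sk, M)
print(verify(pk, M, sigma))        # True: the exponents cancel out
print(verify(pk, M + 1, sigma))    # False: sigma is not a signature on M+1
```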
There's a bug in the current construction though. Suppose Alice gives Mallory the signatures of two messages $M_1$ and $M_2$:
$$ \sigma_1 = M_1^{sk} \bmod{N} \qquad \sigma_2 = M_2^{sk} \bmod{N}$$

From this, Mallory could produce the signature corresponding to the message $M^* = M_1 \cdot M_2 \bmod{N}$ because:
$$ \begin{align} \operatorname{Sign}(sk, M^*) &= (M_1 \cdot M_2)^{sk} \bmod{N} \\ &= M_1^{sk} \cdot M_2^{sk} \bmod{N} \\ &= \sigma_1 \cdot \sigma_2 \bmod{N}, \end{align}$$

and the last equation is something Mallory can calculate even without knowing $sk$.
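We can check Mallory's attack directly. This sketch reuses hypothetical toy parameters ($p = 61$, $q = 53$); the forgery step itself uses only public values:

```python
# Setup (honest parties): toy RSA parameters, for illustration only
p, q = 61, 53
N, phi = p * q, (p - 1) * (q - 1)
pk = 17
sk = pow(pk, -1, phi)

# Alice signs two messages
M1, M2 = 5, 7
sigma1 = pow(M1, sk, N)
sigma2 = pow(M2, sk, N)

# Mallory multiplies the signatures -- no knowledge of sk required
M_star = (M1 * M2) % N
sigma_star = (sigma1 * sigma2) % N

# The forged signature verifies on the new message M*
print(pow(sigma_star, pk, N) == M_star)   # True
```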
To fix this problem, the RSA signature scheme works as follows.

- $\operatorname{KeyGen}$: as before, choose a modulus $N$ together with exponents $sk$ and $pk$ that "cancel out."
- $\operatorname{Sign}(sk, M) = \mathbf{H}(M)^{sk} \bmod{N}$.
- $\operatorname{Verify}(pk, M, \sigma)$ outputs true if and only if $\sigma^{pk} = \mathbf{H}(M) \bmod{N}$.

In this new construction, the Sign algorithm starts by applying a function $\mathbf{H}$ to the message $M$. This function breaks the algebraic structure that caused the problem above.
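Here is a sketch of the hash-then-sign fix, again with hypothetical toy parameters. SHA-256 (reduced mod $N$) stands in for $\mathbf{H}$; this is a simplification of how real RSA signing encodes the digest:

```python
from hashlib import sha256

# Toy RSA parameters, for illustration only
p, q = 61, 53
N, phi = p * q, (p - 1) * (q - 1)
pk = 17
sk = pow(pk, -1, phi)

def H(M):
    # Hash the message and interpret the digest as an integer mod N
    return int.from_bytes(sha256(M).digest(), 'big') % N

def sign(sk, M):
    return pow(H(M), sk, N)           # sign the digest, not M itself

def verify(pk, M, sigma):
    return pow(sigma, pk, N) == H(M)

sigma = sign(sk, b'pay Bob $1')
print(verify(pk, b'pay Bob $1', sigma))   # True
# Multiplying two signatures no longer forges anything useful, because
# H(M1) * H(M2) != H(M1 * M2) in general.
```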
This special function is called a cryptographic hash function, and we will have many more uses for it throughout the semester. We will talk about it in detail next.
Often called a cryptographer’s Swiss Army knife, a hash function can underlie many different cryptographic schemes: aside from producing a document’s digest to be digitally signed—one of the most common applications.
Let us give example applications: code-signing systems, computer forensics, [and protecting] passwords.
– Aumasson, Meier, Phan, and Henzen, The Hash Function BLAKE.
A cryptographic hash function is a mathematical function, which we typically denote by $\mathbf{H}$, that has two seemingly-contradictory requirements: it is a public, deterministic function that anyone can compute, and yet its outputs look as unpredictable as random noise.
You can think of a hash function as translating from inputs that have (mathematical or English-language) structure into a new "foreign language" that is incomprehensible to all of us on Earth.
Moreover, the outputs have a fixed, known length, no matter how short or long the input messages are.
Input | Output |
---|---|
abandon | nieh |
ability | btol |
able | ykwf |
about | fkay |
$\vdots$ | $\vdots$ |
zipper | jgxm |
zone | pnjo |
zoo | ansm |
[Note: this truth table was produced by mapping the top 3000 English words to random 4-letter strings using random.org.]
While the toy example above used English-language words as the possible inputs, in reality we want a hash function to accept any data type. Remember: for our application to digital signatures, $\mathbf{H}$ must be able to accept any message $M$ of any length with any data encoding.
Throughout this course, we will typically represent data as abstractly as possible: as a sequence of bits of a particular length.
A bitstring is an arbitrary-length sequence of bits. The set of all bitstrings of length $n$ is denoted as $\{0, 1\}^n$. The set of bitstrings of any possible length is denoted as $\{0, 1\}^*$.
(We will typically, though not always, consider bitstrings that are a multiple of 8 bits; i.e., that can be decomposed into bytes with no "leftover" bits.)
Within Python, you can represent a value of type `bytes` using a string with the letter `b` in front of it. For example, here is a byte string that holds the ASCII value of the character A.
s = b'A'
print(s)
type(s)
Python provides commands to convert between raw `bytes` and numbers represented in decimal `int`, hex, and binary formats.
from Crypto.Util.number import bytes_to_long, long_to_bytes # requires PyCryptodome
from binascii import hexlify, unhexlify
t = bytes_to_long(s)
print(t)
print(type(t))
u = hexlify(s)
print(u)
print(type(u))
v = bin(t)
print(v)
print(type(v))
Using the language of bitstrings, we can now define a hash function $H$ as we stated previously: it's a single mathematical function that can accept inputs of any size and that always returns an output of a fixed length.
Definition. A hash function $H: \{0, 1\}^* \to \{0, 1\}^n$ is an efficiently-computable function that accepts unbounded input and provides output strings of a fixed number of bits $n$.
Note that a hash function does not have a secret key; it is public knowledge and anyone can compute it.
The output $H(x)$ is often called the hash digest corresponding to the input $x$.
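For example, with SHA-256 from Python's standard library, the digest is always 32 bytes no matter the input size:

```python
from hashlib import sha256

# Inputs of wildly different lengths...
tiny = b''
huge = b'A' * 1_000_000

# ...both hash to digests of exactly 256 bits (32 bytes)
d1 = sha256(tiny).digest()
d2 = sha256(huge).digest()
print(len(d1), len(d2))   # 32 32
```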
Because $H$ has infinitely many inputs and only finitely many outputs, it must be the case that there are collisions -- that is, two inputs that map to the same output string.
But: we are going to insist that collisions are hard to find, just like they would be in our motivating example of the randomly-generated foreign language.
This guarantee should hold even if we give the code of $H$ to our adversary Mallory, since $H$ is meant to be public knowledge.
In fact, we can make several security guarantees of increasing strength.
Definition. A cryptographic hash function satisfies the following properties.

- One-wayness (preimage resistance): given an output $y$, it is practically infeasible for anyone to find an input $x$ such that $H(x) = y$.
- Collision resistance: it is practically infeasible for anyone to find two different inputs $x_1$ and $x_2$ such that $H(x_1) = H(x_2)$.
(Note: if a function is collision resistant then it must be one way, but the converse is not necessarily true. Do you see why that is?)
If an adversary [Mallory] has not explicitly queried the oracles on some point $x$, then the value of $H(x)$ is completely random... at least as far as [Mallory] is concerned.
– Jon Katz and Yehuda Lindell, Introduction to Modern Cryptography
Note that there is a limit to how infeasible it is to find a collision. After all, we already said that collisions must exist.
Exactly how difficult can we hope that the problem might be?
The key insight is that the security of a hash function must relate somehow to its output length $n$.
To take an extreme example, suppose that the output length $n = 2$ bits. Then:
To break one-wayness for (say) $y = 00$, we can just select inputs $x$ at random and compute $H(x)$. In expectation, we will find an input $x$ such that $H(x) = y$ after four trials. (That said, there is a chance that we get unlucky and it could take much longer.)
To break collision resistance, we can select two inputs $x_1$ and $x_2$ at random, and with 1/4 probability it is the case that $H(x_1) = H(x_2)$. In the worst case, we can select five inputs, and by the pigeonhole principle two of them must map to the same output.
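To make the 2-bit example concrete, here is a hypothetical toy hash built by keeping only the low 2 bits of SHA-256, together with brute-force attacks on both properties:

```python
from hashlib import sha256

def H2(x: bytes) -> int:
    # Toy hash with n = 2: keep the low 2 bits of SHA-256 (4 possible outputs)
    return sha256(x).digest()[-1] & 0b11

# Break one-wayness: search for a preimage of y = 0 (about 4 tries expected)
y, x = 0, 0
while H2(str(x).encode()) != y:
    x += 1
print('preimage of 0:', x)

# Break collision resistance: among any 5 inputs, two must collide (pigeonhole)
seen, collision = {}, None
for i in range(5):
    d = H2(str(i).encode())
    if d in seen:
        collision = (seen[d], i)
        break
    seen[d] = i
print('colliding inputs:', collision)
```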
So in order to have a secure hash function, we want $n$ to be large. For example, we might choose $n = 256$ just as we did with the private key of a digital signature scheme.
Extrapolating from the analysis above, we know that collision resistance could be broken (with certainty) and one-wayness can be broken (with good probability) by an attacker who tries $2^{256} + 1$ inputs.
But it turns out that collision resistance is a much easier task for Mallory to break than one-wayness.
To see why, consider the subtle -- yet crucial -- difference between the following two problems. Suppose that you are alone in a classroom, and then your classmates walk into the room one at a time.

1. How many people must walk in, on average, before someone shares your birthday?
2. How many people must walk in before some two people in the room are likely to share a birthday with each other?
The first question has a simple answer: 365 people (ignoring leap birthdays for now).
The second question is more challenging, and it turns out that the answer is about 25! This is much smaller than 365, and the distinction between the two answers is sometimes called the birthday paradox.
It may help to think about a related problem: if 25 people walk into the room, one at a time, what is the probability that all of them have different birthdays?
The first person's birthday can be anything; the second person must avoid 1 date, the third must avoid 2, and so on, up to the 25th person, who must avoid 24 dates. As a result, the overall probability that everybody's birthday is distinct is $\frac{365}{365} \cdot \frac{364}{365} \cdots \frac{341}{365} \approx 0.43$, already less than one half:
import math
# probability that 25 people all have distinct birthdays
math.prod(range(341, 366)) / (365**25)
The reason for the difference between problems 1 and 2 is that in order for all birthdays to be distinct, every pair of people in the room must have different birthdays. If there are $N$ people in the room, then there are ${N \choose 2} \approx N^2/2$ pairs that could possibly collide.
Generalizing from the specific case of birthdays (where there are 365 options): when drawing with replacement from a set of size $N$, the expected number of attempts until the first collision is about $\sqrt{N}$.
The upshot is: if $N = 2^{256}$, then even a guess-and-check approach will find a collision in about $\sqrt{2^{256}} = 2^{128}$ tries.
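We can check the $\sqrt{N}$ estimate empirically with a hypothetical 16-bit hash ($N = 2^{16} = 65536$ outputs), where guess-and-check should find a collision after roughly $2^8 = 256$ tries:

```python
from hashlib import sha256

def H16(x: bytes) -> bytes:
    # Toy 16-bit hash: keep the first 2 bytes of SHA-256
    return sha256(x).digest()[:2]

seen = {}          # digest -> input that produced it
tries, collision = 0, None
while collision is None:
    tries += 1
    m = str(tries).encode()
    d = H16(m)
    if d in seen:
        collision = (seen[d], m)
    else:
        seen[d] = m

# By the birthday bound, tries is typically near sqrt(2**16) = 256
print('collision after', tries, 'tries:', collision)
```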
Fortunately doing that much work is still infeasible for modern computers. But it is a lot less effort than the "energy of the sun" amount of work that $2^{256}$ tries would require.
The math of hash functions is not nearly as elegant as that of digital signatures, so unfortunately I won't show you today the inner workings of hash functions. Instead, I will just show you how to use them.
The U.S. National Institute of Standards and Technology (NIST) develops standards for all cryptographic building blocks, including digital signatures and hash functions. Officially, these standards are required for use within the U.S. federal government... unofficially, their standards are often used worldwide.
To date, NIST has developed three generations of hash function standards (and subsequently revoked one of them). These algorithms are called the Secure Hash Algorithms, or SHA for short.
SHA-1 was initially developed in 1995 by the U.S. National Security Agency. It had an output length of $n = 160$ bits, so by the birthday bound at best we can hope that it takes $2^{80}$ time to find a collision.
But through a combination of incredibly clever math to design a more clever collision-finding algorithm and a lot of computing power to execute it, researchers have broken SHA-1 and you should never use it.
from hashlib import sha1
# read the contents of two files: good.pdf (a check mark) and bad.pdf (an X)
with open('images/good.pdf', 'rb') as file:
    good = file.read()
with open('images/bad.pdf', 'rb') as file:
    bad = file.read()
# verify that the strings are different
assert(good != bad)
# outputs are 40 hex characters, or 160 bits in length
hashGood = sha1(good).hexdigest()
hashBad = sha1(bad).hexdigest()
# observe that the hash function outputs are the same
print(hashGood)
print(hashBad)
assert(hashGood == hashBad)
SHA-2 was initially developed in 2001, also by the U.S. National Security Agency. It is the most popular hash function algorithm today. It offers a choice of four possible output lengths: 224, 256, 384, or 512 bits.
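All four SHA-2 output lengths are available in Python's standard hashlib module:

```python
import hashlib

# The four SHA-2 variants differ only in their output length
for name in ('sha224', 'sha256', 'sha384', 'sha512'):
    h = hashlib.new(name, b'Hello world!')
    print(name, 'outputs', h.digest_size * 8, 'bits')
```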
SHA-3 was standardized in 2015 after a multi-year open competition. Its goal was to design a hash function in a fundamentally different way from SHA-2, in order to have a "failsafe" plan in case someone finds a collision in SHA-2.
from Crypto.Hash import SHA256
# output is 64 hex characters, or 256 bits in length
print(SHA256.new(b'Hello world!').hexdigest())