Announcements

  • If you don't have a copy of the syllabus, please pick one up before or after class
  • Lecture + lab notes and video are posted on Piazza
    • If you miss a lecture or lab, please watch the video afterward
    • Please err on the side of caution and do not attend class if you are feeling ill (link to BU public health reminders)
  • Office hours
    • Nicolas: Monday 1-3pm in CCDS room 1322
    • Mayank: Thursday 11:30am-1pm in CCDS room 1343
  • Homework 1 will be posted later this week, due Friday 2/3

Lecture 2: Cryptographic Building Blocks

“Cryptography is how people get things done when they need one another, don’t fully trust one another, and have adversaries actively trying to screw things up.”

– Ben Adida

Review from last lecture

Our starting point for cryptocurrencies is to build a digital version of a banking system, in which people can transfer money to each other electronically over the Internet without the need for a single trusted banking authority.

In the physical world, when Bob receives a check from Alice, he may be suspicious that Alice is trying to deceive him. He needs to be able to verify the following three properties.

  1. Alice properly signed the check

  2. Alice possesses $1 in her bank account

  3. Alice does not double spend the money by writing a check to someone else at the same time

Last week, we started to explore a cryptographic primitive that will address property #1: building a digital version of signing a check. This primitive is (appropriately enough) called a digital signature.

A digital signature allows Alice to append a signature $\sigma$ to a message $M$ that satisfies two properties.

  1. Correctness: Everyone in the world can verify whether Alice created the signature $\sigma$.
  2. Unforgeability: The signature $\sigma$ is bound to the message $M$. Nobody besides Alice could use it to forge a signature for any other message $M'$.

To satisfy the latter goal, Alice must have something that the rest of the world does not. We call this special something a "secret key."

Informally, a digital signature allows Alice to "lock" her message using her secret key. Everyone else in the world can unlock this message, so the lockbox isn't used for hiding. Instead, it is used to prove that Alice must have been the person who locked the message in the first place.

Informal description of a digital signature scheme

Image source: idea-instructions.com

2.1 Defining digital signatures

To understand the unforgeability guarantee better, it helps to add a third person to our figure.

Suppose that Alice is trying to sign a message $M$. But, there is a meddler-in-the-middle, called Mallory, who intercepts the message.

Alice $\longrightarrow$ Mallory $\longrightarrow$ Bob

We want to ensure that any attempt by Mallory to tamper with the messages from Alice to Bob will be detected.

And we want unforgeability to hold no matter how many messages Alice sends. Even if she sends 1000 messages $M_1, M_2, \ldots, M_{1000}$, Mallory cannot forge a signature for any message that Alice hasn't sent.

Furthermore, it shouldn't matter which messages Alice sent. That is, for any choice of the 1000 messages $M_1, M_2, \ldots, M_{1000}$ that Alice sent, Mallory cannot create a signature for a 1001st message.

Formal definitions

We are now ready to formally define a digital signature scheme.

Definition. A digital signature scheme consists of three algorithms:

  • KeyGen: takes no input, produces a secret key $sk$ and a public key $pk$
  • Sign(sk, M): produces a signature $\sigma$
  • Verify(pk, M, $\sigma$): returns true or false

All three algorithms should be efficiently computable. Moreover, the digital signature must satisfy two guarantees: correctness and unforgeability.

Correctness means that for any secret/public keypair produced by KeyGen, it is the case that any signature made via the Sign algorithm is subsequently accepted by Verify.
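To make the interface concrete, here is a minimal sketch using the PyCryptodome library (the same Crypto package that appears in the code cells later in these notes). The choice of RSA with PKCS#1 v1.5 padding is just one illustrative instantiation, and the message and key size are arbitrary; we will see how RSA signatures actually work in section 2.2.

In [ ]:
from Crypto.PublicKey import RSA
from Crypto.Signature import pkcs1_15
from Crypto.Hash import SHA256

# KeyGen: takes no input, produces a secret key sk and a public key pk
sk = RSA.generate(2048)
pk = sk.publickey()

# Sign(sk, M): produces a signature sigma
M = b'Pay Bob $1'
sigma = pkcs1_15.new(sk).sign(SHA256.new(M))

# Verify(pk, M, sigma): returns true or false
# (this library signals failure with an exception, which we turn into a boolean)
try:
    pkcs1_15.new(pk).verify(SHA256.new(M), sigma)
    print(True)    # correctness: honestly-generated signatures always verify
except ValueError:
    print(False)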

We formalize the unforgeability guarantee by considering a hypothetical game involving

  • Alice, who has run KeyGen to produce a secret/public keypair, and
  • Mallory, who only knows Alice's public key $pk$

This game proceeds in two phases.

  • In the first phase, Mallory sends a message $M$, and Alice responds with $\sigma = \operatorname{Sign}(sk, M)$. Mallory can repeat this step as often as she wants.
  • In the second phase, Mallory sends a message $M^*$ of her choice along with a claimed signature $\sigma^*$.

Mallory wins the game if:

  1. Her signature verifies, in the sense that $\operatorname{Verify}(pk, M^*, \sigma^*) = \operatorname{True}$.
  2. The message is new, meaning that $M^*$ is not a message that she previously sent in the first phase for Alice to sign.

Any digital signature scheme that satisfies this definition is said to be "existentially unforgeable under a chosen message attack," often abbreviated EU-CMA.

In words, this means that no matter how many signed messages Mallory has seen Alice produce in the past, and no matter what the contents of those messages were, it is practically impossible for Mallory to forge a signature on a new message that Alice never signed herself.
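To make the game concrete, here is a sketch of its bookkeeping in code. It reuses the PyCryptodome-based sketch from above (so RSA with PKCS#1 v1.5 padding is again just an illustrative choice), and it models Mallory as a function that receives the public key and a signing oracle and must output an attempted forgery $(M^*, \sigma^*)$.

In [ ]:
from Crypto.PublicKey import RSA
from Crypto.Signature import pkcs1_15
from Crypto.Hash import SHA256

def unforgeability_game(mallory):
    # Alice runs KeyGen; Mallory only ever sees pk
    sk = RSA.generate(2048)
    pk = sk.publickey()
    queried = set()

    # Phase 1: Mallory may ask Alice to sign messages of her choice
    def signing_oracle(M):
        queried.add(M)
        return pkcs1_15.new(sk).sign(SHA256.new(M))

    # Phase 2: Mallory outputs a message M* and a claimed signature sigma*
    M_star, sigma_star = mallory(pk, signing_oracle)

    # Mallory wins iff the signature verifies AND the message is new
    try:
        pkcs1_15.new(pk).verify(SHA256.new(M_star), sigma_star)
    except ValueError:
        return False
    return M_star not in queried

# A hopeless adversary who simply replays an old signature: she loses,
# because M* was already queried during phase 1.
def replay_mallory(pk, sign):
    return b'hello', sign(b'hello')

print(unforgeability_game(replay_mallory))   # False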

2.2 Constructing digital signatures

It is helpful to have a definition in hand before we try to construct a digital signature, so that we have a way to test whether a candidate construction is good.

Let's try to construct a digital signature. It turns out that this is challenging to do. Remember that we need the following properties to hold:

  • Given a public key, it is difficult to find the signature $\sigma$ corresponding to a message $M$.
  • Given a public key, it is easy to check whether the signature $\sigma$ corresponding to a message $M$ is correct, if someone else has already found $\sigma$ for you.

So: we need to find a math problem that is difficult to solve but easy to check. Unfortunately, we do not know for certain whether any such math problems exist -- this is essentially the famous P vs. NP question! If you can resolve it, the Clay Mathematics Institute will pay a prize of $1 million.

So we do the next best thing: build signature schemes from math problems that we believe to have this property of "difficult to solve but easy to check." One such problem is factoring.

  • Given a composite integer $N$ that is the product of two primes, it is difficult to find the primes $p$ and $q$ such that $N = pq$.
  • If I told you the prime factors, you could easily multiply them together to verify whether $p q \overset{?}{=} N$.
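Here is a toy illustration of that asymmetry. The specific primes are arbitrary and far too small to be secure; they are only meant to show that checking a claimed factorization is a single multiplication, while recovering it from $N$ alone requires search.

In [ ]:
# Two small primes, normally kept secret (real RSA uses primes ~1024 bits long)
p, q = 999983, 1000003
N = p * q            # N is made public

# Easy to check: given a claimed factorization, verifying it is one multiplication
print(p * q == N)    # True

# Hard to solve: given only N, naive trial division tries ~sqrt(N) candidates.
# That finishes quickly for these toy primes, but the work grows exponentially
# in the bit-length of N, so it is hopeless for real-world key sizes.
def factor(N):
    d = 2
    while N % d != 0:
        d += 1
    return d, N // d

print(factor(N))     # (999983, 1000003)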

Rivest, Shamir, and Adleman realized that this math problem can be used as the basis of a digital signature scheme. It is named RSA in their honor. It works using modular (or clock) arithmetic.

Here is a slightly oversimplified (and incorrect) version of their signing algorithm.

  • $\operatorname{Sign}(sk, M) = M^{sk} \bmod{N}$
  • $\operatorname{Verify}(pk, M, \sigma)$ returns true if and only if $\sigma^{pk} = M \bmod{N}$

Why does this (almost) work? It turns out that:

  • If you know the factorization of $N$, then it is easy to find two exponents $sk$ and $pk$ that "cancel out" in the sense that $(x^{sk})^{pk} = x$ for any value $x$.

  • On the other hand, if you don't know the factorization of $N$, it is believed to be computationally infeasible to find the value $sk$ that "cancels out" $pk$.
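Here is a toy numerical sketch of this "canceling out," using the classic textbook-sized primes $p = 61$ and $q = 53$ (far too small to be secure). The canceling exponent is the modular inverse of $pk$ modulo $(p-1)(q-1)$, a quantity that only someone who knows the factorization can compute.

In [ ]:
p, q = 61, 53
N = p * q                      # 3233, made public
phi = (p - 1) * (q - 1)        # 3120; computable only if you know p and q

pk = 17                        # public exponent
sk = pow(pk, -1, phi)          # 2753, the exponent that "cancels out" pk
                               # (modular inverse; requires Python 3.8+)

M = 65                         # a message, encoded as a number mod N
sigma = pow(M, sk, N)          # Sign(sk, M) = M^sk mod N
print(pow(sigma, pk, N) == M)  # Verify: sigma^pk = M (mod N)  ->  True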

There's a bug in the current construction though. Suppose Alice gives Mallory the signatures of two messages $M_1$ and $M_2$:

$$ \sigma_1 = M_1^{sk} \bmod{N} \qquad \sigma_2 = M_2^{sk} \bmod{N}$$

From this, Mallory could produce the signature corresponding to the message $M^* = M_1 \cdot M_2 \bmod{N}$ because:

$$ \begin{align} \operatorname{Sign}(sk, M^*) &= (M_1 \cdot M_2)^{sk} \bmod{N} \\ &= M_1^{sk} \cdot M_2^{sk} \bmod{N} \\ &= \sigma_1 \cdot \sigma_2 \bmod{N}, \end{align}$$

and the last equation is something Mallory can calculate even without knowing $sk$.
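We can watch this attack succeed on the toy parameters from the sketch above. Note that Mallory's forgery in the last lines never uses $sk$.

In [ ]:
N, pk, sk = 3233, 17, 2753     # toy parameters from the sketch above

M1, M2 = 65, 100
sigma1 = pow(M1, sk, N)        # Alice signs M1
sigma2 = pow(M2, sk, N)        # Alice signs M2

# Mallory's forgery on M* = M1 * M2 mod N, computed without sk
M_star = (M1 * M2) % N
sigma_star = (sigma1 * sigma2) % N

print(pow(sigma_star, pk, N) == M_star)   # True: the forged signature verifies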

To fix this problem, the RSA signature scheme works as follows.

  • KeyGen: generate two primes $p$, $q$ and calculate $N$. Then find two exponents $sk$ and $pk$ that "cancel out."
  • $\operatorname{Sign}(sk, M) = \mathbf{H}(M)^{sk} \bmod{N}$
  • $\operatorname{Verify}(pk, M, \sigma)$ returns true if and only if $\sigma^{pk} = \mathbf{H}(M) \bmod{N}$

In this new construction, the Sign algorithm starts by applying a function $\mathbf{H}$ to the message $M$. This function breaks the algebraic structure that caused the problem above.

This special function is called a cryptographic hash function, and we will have many more uses for it throughout the semester. We will talk about it in detail next.
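As a quick preview of why this helps, here is a sanity check using SHA-256 (one of the standard hash functions discussed in section 2.5) with an arbitrary encoding of numbers as bytes: the hash of a product bears no useful relation to the product of the hashes, so Mallory's trick above no longer applies.

In [ ]:
from hashlib import sha256

def H(m):
    # hash the decimal encoding of m and interpret the digest as an integer
    return int.from_bytes(sha256(str(m).encode()).digest(), 'big')

M1, M2 = 65, 100
print(H(M1 * M2) == H(M1) * H(M2))   # False: the multiplicative structure is gone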

2.3 Hash Functions

Often called a cryptographer’s Swiss Army knife, a hash function can underlie many different cryptographic schemes aside from producing a document’s digest to be digitally signed—one of the most common applications.

Let us give example applications: code-signing systems, computer forensics, [and protecting] passwords.

– Aumasson, Meier, Phan, and Henzen, The Hash Function BLAKE.

Swiss Army knife

High-level idea

A cryptographic hash function is a mathematical function, which we typically denote by $\mathbf{H}$, that has two seemingly-contradictory requirements.

  • It should be public and deterministic, so everyone can compute it quickly, and yet
  • It should be unpredictable and random-looking, so that nobody can understand how its outputs are connected to its inputs.

You can think of a hash function as translating from inputs that have (mathematical or English-language) structure into a new "foreign language" that is incomprehensible to all of us on Earth.

Moreover, the outputs have a fixed, known length, no matter how short or long the input messages are.

Input      Output
abandon    nieh
ability    btol
able       ykwf
about      fkay
$\vdots$   $\vdots$
zipper     jgxm
zone       pnjo
zoo        ansm
[Note: this truth table was produced by mapping the top 3000 English words to random 4-letter strings using random.org.]

Representations of data

While the toy example above used English-language words as the possible inputs, in reality we want a hash function to accept any data type. Remember: for our application to digital signatures, $\mathbf{H}$ must be able to accept any message $M$ of any length with any data encoding.

Throughout this course, we will typically represent data as abstractly as possible: as a sequence of bits of a particular length.

  • A bit is a single 0 or 1 value. That is, a bit is an element of the set $\{0, 1\}$.
  • A byte is a sequence of eight bits. In other words, a byte is an element of the set $\{0, 1\}^8$. (Note that there is a one-to-one mapping between bytes and integers in the range 0-255.)
  • A bitstring is an arbitrary-length sequence of bits. The set of all bitstrings of length $n$ is denoted as $\{0, 1\}^n$. The set of bitstrings of any possible length is denoted as $\{0, 1\}^*$.

    (We will typically, though not always, consider bitstrings that are a multiple of 8 bits; i.e., that can be decomposed into bytes with no "leftover" bits.)

Within Python, you can represent a value of type bytes using a string literal with the letter b in front of it. For example, here is a bytes value that holds the ASCII encoding of the character A.

In [19]:
s = b'A'
print(s)
type(s)
b'A'
Out[19]:
bytes

Python provides commands to convert between raw bytes and numbers represented in decimal int, hex, and binary formats.

In [18]:
from Crypto.Util.number import bytes_to_long, long_to_bytes # requires PyCryptodome (or the legacy PyCrypto)
from binascii import hexlify, unhexlify

t = bytes_to_long(s)
print(t)
print(type(t))
65
<class 'int'>
In [14]:
u = hexlify(s)
print(u)
print(type(u))
b'41'
<class 'bytes'>
In [15]:
v = bin(t)
print(v)
print(type(v))
0b1000001
<class 'str'>

Hash Function Definition

Using the language of bitstrings, we can now define a hash function $H$ as we stated previously: it's a single mathematical function that can accept inputs of any size and that always returns an output of a fixed length.

Definition. A hash function $H: \{0, 1\}^* \to \{0, 1\}^n$ is an efficiently-computable function that accepts unbounded input and provides output strings of a fixed number of bits $n$.

Note that a hash function does not have a secret key; it is public knowledge and anyone can compute it.

The output $H(x)$ is often called the hash digest corresponding to the input $x$.
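For example, using SHA-256 (a standard hash function with $n = 256$ that we will meet in section 2.5), the digest is always exactly 32 bytes, whether the input is empty, a single word, or a million bytes.

In [ ]:
from hashlib import sha256

for message in [b'', b'abandon', b'a' * 1_000_000]:
    digest = sha256(message).digest()
    print(len(message), 'byte input ->', len(digest), 'byte digest')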

2.4 Hash Function Security

Because $H$ has infinitely many inputs and only finitely many outputs, it must be the case that there are collisions -- that is, two inputs that map to the same output string.

But: we are going to insist that collisions are hard to find, just like they would be in our motivating example of the randomly-generated foreign language.

Collision resistance

This guarantee should hold even if we give the code of $H$ to our adversary Mallory, since $H$ is meant to be public knowledge.

In fact, we can make several security guarantees of increasing strength.

Definition. A cryptographic hash function satisfies the following properties.

  • One wayness: given $y = H(x)$, it is practically infeasible for anyone to find any preimage $x'$ such that $H(x') = y$.
  • Collision resistance: it is practically infeasible for anyone to find two different inputs $x_1$ and $x_2$ such that $H(x_1) = H(x_2)$.

    (Note: if a function is collision resistant then it must be one way, but the converse is not necessarily true. Do you see why that is?)

  • Puzzle friendliness: (this property is complicated, and we will come back to it later)
  • Random oracle: $H$ acts like a randomly-generated truth table would.

If an adversary [Mallory] has not explicitly queried the oracle on some point $x$, then the value of $H(x)$ is completely random... at least as far as [Mallory] is concerned.

– Jon Katz and Yehuda Lindell, Introduction to Modern Cryptography

Note that there is a limit to how infeasible it is to find a collision. After all, we already said that collisions must exist.

Exactly how difficult can we hope that the problem might be?

Birthday bound

The key insight is that the security of hash functions must relate somehow to the length of the output space $n$.

To take an extreme example, suppose that the output length $n = 2$ bits. Then:

  • To break one-wayness for (say) $y = 00$, we can just select inputs $x$ at random and compute $H(x)$. In expectation, we will find an input $x$ such that $H(x) = y$ after four trials. (That said, there is a chance that we get unlucky and it could take much longer.)

  • To break collision resistance, we can select two inputs $x_1$ and $x_2$ at random, and with 1/4 probability it is the case that $H(x_1) = H(x_2)$. In the worst case, we can select five inputs, and by the pigeonhole principle two of them must map to the same output.

So in order to have a secure hash function, we want $n$ to be large. For example, we might choose $n = 256$ just as we did with the private key of a digital signature scheme.

Extrapolating from the analysis above, we know that collision resistance could be broken (with certainty) and one-wayness can be broken (with good probability) by an attacker who tries $2^{256} + 1$ inputs.

But it turns out that collision resistance is a much easier task for Mallory to break than one-wayness.

To see why, consider the subtle -- yet crucial -- difference between the following two problems. Suppose that you are alone in a classroom, and then your classmates walk into the room one at a time.

  1. In expectation, how many people must be in this classroom for someone else to have the same birthday as you?
  2. In expectation, how many people must be in this classroom for two people to have the same birthday as each other?

The first question has a simple answer: 365 people (ignoring leap birthdays for now).

The second question is more challenging, and it turns out that the answer is about 25! This is much smaller than 365, and the distinction between the two answers is sometimes called the birthday paradox.

It may help to think about a related problem: if 25 people walk into the room, one at a time, what is the probability that all of them have different birthdays?

  • The first person certainly has a different birthday than everyone else in the room.
  • The second person has probability 364/365 of having a different birthday than the first person.
  • Conditioned on the first two people having distinct birthdays, the third person has probability 363/365 of having a different birthday than both of them.
  • and so on, until the 25th person to enter the room has a 341/365 probability of having a birthday that is different than the other 24 people already in the room.

As a result, the overall probability that everybody's birthday is distinct is already less than one half, so a shared birthday among just 25 people is more likely than not:

In [26]:
import math
# probability that all 25 people have distinct birthdays:
# (365/365) * (364/365) * ... * (341/365)
math.prod(range(341, 366)) / (365**25)
Out[26]:
0.4313002960305361

The reason for the difference between problems 1 and 2 is that in order for all birthdays to be distinct, every pair of people in the room must have different birthdays. If there are $N$ people in the room, then there are ${N \choose 2} \approx N^2/2$ pairs that could potentially collide.

Generalizing from the specific case of birthdays (where there are 365 options): when drawing with replacement from a set of size $N$, the expected number of attempts until the first collision is about $\sqrt{N}$.

The upshot is: if $N = 2^{256}$, then even a guess-and-check approach will find a collision in about $\sqrt{2^{256}} = 2^{128}$ tries.

Fortunately doing that much work is still infeasible for modern computers. But it is a lot less effort than the "energy of the sun" amount of work that $2^{256}$ tries would require.
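We can see the birthday bound in action by deliberately truncating SHA-256 to its first 24 bits, so that there are only $N = 2^{24}$ possible outputs. A guess-and-check search then finds a collision after roughly $\sqrt{N} = 2^{12} \approx 4000$ attempts. (The truncation length and the random 8-byte inputs are arbitrary choices for this experiment.)

In [ ]:
import os
from hashlib import sha256

def toy_hash(data):
    # keep only the first 3 bytes (24 bits) of SHA-256: an intentionally weak hash
    return sha256(data).digest()[:3]

seen = {}
attempts = 0
while True:
    x = os.urandom(8)          # a random 8-byte input
    attempts += 1
    h = toy_hash(x)
    if h in seen and seen[h] != x:
        print('collision after', attempts, 'attempts:', seen[h].hex(), 'and', x.hex())
        break
    seen[h] = x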

2.5 Hash function constructions

The math behind hash functions is not nearly as elegant as that behind digital signatures, so unfortunately I won't show you the inner workings of hash functions today. Instead, I will just show you how to use them.

The U.S. National Institute of Standards and Technology (NIST) develops standards for all cryptographic building blocks, including digital signatures and hash functions. Officially, these standards are required for use within the U.S. federal government... unofficially, their standards are often used worldwide.

To date, NIST has developed three generations of hash function standards (and has subsequently revoked one of them). These algorithms are called the Secure Hash Algorithms, or SHA for short.

  • SHA-1 was initially developed in 1995 by the U.S. National Security Agency. It had an output length of $n = 160$ bits, so by the birthday bound at best we can hope that it takes $2^{80}$ time to find a collision.

    But through a combination of incredibly clever math (to design a faster collision-finding algorithm) and a lot of computing power (to execute it), researchers have broken SHA-1, and you should never use it.

In [9]:
from hashlib import sha1

# read the contents of two files: good.pdf (a check mark) and bad.pdf (an X)
with open('images/good.pdf', 'rb') as file:
    good = file.read()
    
with open('images/bad.pdf', 'rb') as file:
    bad = file.read()
    
# verify that the strings are different
assert(good != bad)

# outputs are 40 hex characters, or 160 bits in length
hashGood = sha1(good).hexdigest()
hashBad  = sha1(bad).hexdigest()

# observe that the hash function outputs are the same
print(hashGood)
print(hashBad)
assert(hashGood == hashBad)
d00bbe65d80f6d53d5c15da7c6b4f0a655c5a86a
d00bbe65d80f6d53d5c15da7c6b4f0a655c5a86a
  • SHA-2 was initially developed in 2001, also by the U.S. National Security Agency. It is the most popular hash function algorithm today. It offers a choice of four possible output lengths: 224, 256, 384, or 512 bits.

  • SHA-3 was standardized in 2015 after a multi-year open competition. Its goal was to design a hash function in a fundamentally different way from SHA-2, in order to have a "failsafe" plan in case someone finds a collision in SHA-2.

In [67]:
from Crypto.Hash import SHA256

# output is 64 hex characters, or 256 bits in length
print(SHA256.new(b'Hello world!').hexdigest())
c0535e4be2b79ffd93291305436bf889314e4a3faec05ecffcbb7df31ad9e51a
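SHA-3 is just as easy to use. For example, Python's built-in hashlib module (version 3.6 and later) provides it directly; note that SHA3-256 and SHA-256 produce completely unrelated digests, even on the same input.

In [ ]:
from hashlib import sha3_256

# also 64 hex characters (256 bits), but a completely different value than SHA-256
print(sha3_256(b'Hello world!').hexdigest())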