Announcements

  • Homework 5 has been posted on Piazza. It is due on Friday 3/24 on Gradescope.
  • Homework 6 will be posted later this week, and will be due on Friday 3/31.

Lecture 14: Verifiable and Secure Computation

Recap: Secure multi-party computation

At a high level, cryptographic systems attempt to provide (some or all of) the following three objectives.

  • Confidentiality: keeping data or people's identities hidden from others.
  • Integrity: preventing raw data or the data science results from being altered, or at least allowing for detection of tampering.
  • Availability: ensuring that data science is censorship-resistant.

These three security objectives are sometimes referred to as the "CIA triad."

In the first two parts of this course, we have designed blockchains and broadcast protocols that provide strong integrity and availability. They are decentralized protocols that allow for outsourcing or federating data among a large number $n$ of parties (e.g., servers on the Internet).

Fundamentally, blockchains and broadcasts achieve decentralization based on the principle of replication. The idea is that if you want information to live for a long time (either on the Internet or between a set of Byzantine generals), then you should store it in many computers or many people's minds.

Replication is a powerful technique. But on its own, it seems incompatible with confidentiality. After all, if you give your data to $n$ servers, how can you protect your data against them?

In this part of the course, we are studying the concept of secure multi-party computation, or MPC.

MPC is a powerful tool that allows you to distribute your data among $n$ parties, without giving your data to any single one of them! If a threshold number $t$ of the parties come together, they can jointly reconstruct the data, but any smaller coalition cannot read your data.

MPC is based on the idea of randomness, rather than replication. The idea is to give each party a share of your secret: a random number, but chosen so that all of the shares jointly can be used to reconstruct your data.

D-20 dice

We saw a few types of secret sharing techniques last week:

  • Additive secret sharing, where a secret $x$ is split into several random numbers $x_1, x_2, \ldots, x_n$ that sum to $x$.
  • Multiplicative secret sharing, where a secret $y$ is split into random numbers that multiply to $y$.
  • Shamir secret sharing, where a secret $z$ is shared as a collection of $n$ points on a polynomial of degree $t-1$, so that any $t$ of them can be interpolated to reconstruct $z$.
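
All three schemes can be sketched in a few lines of Python. Everything below is illustrative: the prime modulus `P` and the helper names are my own choices, and a real deployment would use a vetted library rather than this sketch.

```python
import secrets

P = 2**61 - 1   # an illustrative prime modulus; every share is a uniform number mod P

def additive_share(x, n):
    """Split x into n random shares that sum to x (mod P)."""
    shares = [secrets.randbelow(P) for _ in range(n - 1)]
    shares.append((x - sum(shares)) % P)
    return shares

def multiplicative_share(y, n):
    """Split a nonzero y into n random shares that multiply to y (mod P)."""
    shares = [secrets.randbelow(P - 1) + 1 for _ in range(n - 1)]
    prod = 1
    for s in shares:
        prod = prod * s % P
    shares.append(y * pow(prod, -1, P) % P)
    return shares

def shamir_share(z, n, t):
    """n points on a random degree-(t-1) polynomial whose constant term is z."""
    coeffs = [z] + [secrets.randbelow(P) for _ in range(t - 1)]
    return [(i, sum(c * pow(i, j, P) for j, c in enumerate(coeffs)) % P)
            for i in range(1, n + 1)]

def shamir_reconstruct(points):
    """Lagrange-interpolate any t of the points at 0 to recover z."""
    total = 0
    for xi, yi in points:
        num = den = 1
        for xj, _ in points:
            if xj != xi:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P
    return total
```

For example, `shamir_reconstruct(shamir_share(7, 5, 3)[:3])` returns `7`, and any other 3 of the 5 points work equally well.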

In the last lecture, we showed that it is possible for the $n$ parties to collectively compute any mathematical function $f$ over secret shares of the data, and then only reconstruct the final result.

Our protocols from last time required $t = 2$ parties to reconstruct the data; any one party on their own could not. And we tolerated several types of faulty (i.e., adversarial) participants.

Recap: Crypto building blocks

Crypto is based on hard math problems. One such problem is the difficulty of computing discrete logarithms within:

  • Modular arithmetic groups, e.g., finding $x$ that solves $h = g^x \bmod{p}$ for a large prime $p$.
  • Elliptic curve groups, which conceptually work the same way but the math is slightly more complicated.

Multiplication on an elliptic curve group

[Image source: Wikipedia]

Elliptic curve discrete logarithm assumption: Given two points $g$ and $h$ such that $g^x = h$, it is computationally infeasible for anyone to find the discrete logarithm $x$.

(Note: all of the exponentiations and logarithms are done modulo a prime number, but for brevity I'm going to stop writing that explicitly from now onward.)

In previous lectures, we saw how to use the hardness of discrete logarithms to build

  • Key exchange, which allows two people who have never met before to agree upon a joint secret over a public communication channel.
  • Zero knowledge proofs, which allow a prover to convince a verifier that a particular claim is true, but without revealing the evidence for this claim.
  • Digital signature algorithms like Schnorr and (EC)DSA signatures.

I reproduce the construction of Schnorr signatures below; it is adapted from the presentation on Wikipedia.

Remember that this signature scheme is randomized: if you request to sign the same message twice, then you will likely get two different signatures back in response. Nevertheless, all possible signatures will correctly be accepted by $\operatorname{Verify}$.

To sign a message $M$:

  • Choose a random exponent $k$ from the allowed set.
  • Let $r=g^{k}$.
  • Let $e = H(r, M)$, where $H$ denotes a cryptographic hash function.
  • Let $s = k - xe$.

The signature is the pair $\sigma = (s, e)$.

To verify the claimed signature $\sigma = (s, e)$ of a message $M$:

  1. Let $r^* = g^{s}y^{e}$, where $y = g^x$ is the signer's public key.
  2. Let $e^* = H(r^*, M)$.

If $e^* = e$, then return true (i.e., that the signature is valid). Otherwise return false.
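
Here is a runnable sketch of the whole scheme. The group parameters are tiny toys (a real deployment uses a roughly 256-bit elliptic-curve group), and the way `H` encodes the pair $(r, M)$ is an arbitrary choice of mine.

```python
import hashlib, secrets

# Toy parameters: p = 2q + 1 is a safe prime and g generates the order-q subgroup.
p, q, g = 2039, 1019, 4

def H(r, M):
    """Hash the pair (r, M) down to an exponent mod q."""
    data = f"{r}|{M}".encode()
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % q

def keygen():
    x = secrets.randbelow(q)           # secret key
    return x, pow(g, x, p)             # (x, y = g^x mod p)

def sign(x, M):
    k = secrets.randbelow(q)           # one-time randomness
    r = pow(g, k, p)
    e = H(r, M)
    s = (k - x * e) % q                # linear in the secrets k and x
    return (s, e)

def verify(y, M, sig):
    s, e = sig
    r_star = pow(g, s, p) * pow(y, e, p) % p   # g^s * y^e = g^{k - xe} * g^{xe} = g^k
    return H(r_star, M) == e
```

Because `sign` draws a fresh `k` each time, signing the same message twice will almost certainly produce two different pairs $(s, e)$, and `verify` accepts both.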

14.1 Threshold Signatures

Back in Lecture 6, we constructed several "quality of life" improvements for people who use Bitcoin:

  • They can check how much money they have without running a full node.
  • They only have to remember one master key that can be written down as English-language words.
  • They can use Schnorr digital signatures to reduce the amount of data they store on the blockchain (and therefore how much money they pay).

Today, we will improve the resilience of a Bitcoin wallet.

Storing money in a Bitcoin wallet is both simple and risky. You only have to hold a single 32-byte key.

But the flip side of this is: if this master key is lost, stolen, or copied, then your money is gone.

You could write down your master key in several places, to reduce the risk of losing your key. But there's a tradeoff here: that creates more places where someone could find and copy your key.

Fortunately, we can do better than this.

Recall that Bitcoin transactions can be more complex than just a signature verification. They can actually encode a small computer program. (Not quite as much as Ethereum can with Solidity, but still it is a somewhat expressive language.)

Here is one way you can use this feature to your benefit:

  • Create many different signing keys $sk_1, sk_2, \ldots, sk_n$.
  • Store them in different places.
  • When you receive money, make sure this transaction includes a script that says that the incoming money should only be considered validly spent in the future if it includes signatures using some threshold number $t$ of the $n$ public keys.

This strategy is called a multi-signature wallet, or multisig for short.

For instance, you could create $n = 3$ signing keys and store them on your laptop, inside of a physical safe in your home, and on your best friend's computer. Then you can declare that an outgoing transaction is valid if there are signatures from any $t = 2$ of these keys.

  • This strategy provides better availability: if your laptop is lost or stolen, or if your home is broken into, then there are still 2 remaining keys that will allow you to move your money.

  • It also provides better confidentiality than, say, backing up your keys with a single organization like a cloud provider. Even though you are relying on your best friend to provide this improved availability, that person cannot steal your money on their own without your consent.

A cheaper approach

With Schnorr signatures, there's an easier way to provide $n$-out-of-$n$ multi-signatures.

The new approach will look just like an ordinary signature on the blockchain itself; nobody will know that you're even using threshold signing.

Also, it will be cheaper since we will only post one signature to the blockchain itself. (Remember that on the blockchain, you pay mining fees that are roughly proportional to the amount of storage you require.)

Multisignatures vs threshold signatures

[Image source: cointelegraph.com]

To see how this works, let's revisit the signing protocol for Schnorr signatures, given a secret key $x$.

$\operatorname{Sign}_{x} (M)$:

  1. Choose a random exponent $k$ from the allowed set.
  2. Let $r=g^{k}$.
  3. Let $e = H(r, M)$, where $H$ denotes a cryptographic hash function.
  4. Let $s = k - xe$.
  5. Output the pair $\sigma = (s, e)$.

The signer must hide their long-term secret key $x$ and one-time randomness $k$. But the other variables can be public: $s$ and $e$ are part of the signature and $r$ can be derived from them.

There are two important observations to make here.

  • It is safe to post $r$ publicly, because $k$ is still hidden due to the discrete logarithm assumption.
  • Step 4 of the protocol is a linear function (in the algebraic sense) of the secret data $k$ and $x$. And step 2 is also somewhat-linear in its secrets (in a sense that we will formalize later today).

The last observation means that we can use secret sharing here to protect the secret and public keys!

Suppose that $n$ signers each have secret keys $x_1, x_2, \ldots, x_n$ with corresponding public keys $y_1, y_2, \ldots, y_n$.

They refuse to share their secret keys with each other. Even so, they can work together to produce a single Schnorr signature (without the need for any special Bitcoin script).

This new signature will correspond to:

  • A group secret key $x = x_1 + x_2 + \cdots + x_n$. (Note that the signers will never know the value of $x$; each signer just creates their own key $x_i$ independently.)
  • A group public key $y = y_1 \times y_2 \times \cdots \times y_n = g^{x_1 + x_2 + \cdots + x_n} = g^x$.

Here's how it works.

  1. Each signer runs step 1 on their own to sample their own secret exponent $k_i$. Then they run step 2 to compute $r_i = g^{k_i}$.
  2. The signers can share the results of step 2 and use them to produce a combined $r = g^{\sum k_i} = \prod r_i$.
  3. Everyone can run step 3, since $r$ and $M$ are known to everyone.
  4. Each signer can compute $s_i = k_i - x_i \cdot e$, share only $s_i$ with the other signers, and use them to produce a combined $s = \sum s_i$.

You can check for yourself that this exactly matches the Schnorr signature protocol, thanks to linearity!
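
Steps 1 through 4 can be checked end-to-end in code. The group below is the same toy ($p = 2q+1$ with $q = 1019$) used purely for illustration; the final assertion is exactly the ordinary Schnorr verification equation.

```python
import hashlib, secrets

# Toy Schnorr group: p = 2q + 1, g generates the order-q subgroup.
p, q, g = 2039, 1019, 4

def H(r, M):
    return int.from_bytes(hashlib.sha256(f"{r}|{M}".encode()).digest(), "big") % q

n = 3
xs = [secrets.randbelow(q) for _ in range(n)]   # each signer keeps x_i private
ys = [pow(g, xi, p) for xi in xs]               # public keys y_i = g^{x_i}

# Group public key y = y_1 * ... * y_n = g^{x_1 + ... + x_n}.
# Nobody ever computes the group secret key x = sum of the x_i.
y = 1
for yi in ys:
    y = y * yi % p

M = "move 1 BTC"

# Steps 1-2: each signer samples k_i and publishes only r_i = g^{k_i};
# anyone can multiply them into the combined r = g^{sum k_i}.
ks = [secrets.randbelow(q) for _ in range(n)]
r = 1
for ki in ks:
    r = r * pow(g, ki, p) % p

# Step 3: the challenge e = H(r, M) is public; everyone computes the same value.
e = H(r, M)

# Step 4: each signer reveals only s_i = k_i - x_i * e; the sum is the s value.
s = sum(ki - xi * e for ki, xi in zip(ks, xs)) % q

# (s, e) verifies as one ordinary Schnorr signature under the group key y.
r_star = pow(g, s, p) * pow(y, e, p) % p
assert H(r_star, M) == e
```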

This is one instance of a very powerful phenomenon:

Many people, each with their own private data that they do not wish to share, can still perform data analysis collaboratively and only reveal the result!

This is the power of secure multi-party computation!

We have just constructed our first instance of an MPC protocol that is customized for a specific function. While we saw last time that it is possible to perform MPC for any function by decomposing it into a sequence of additions and multiplications, here we see an example of an MPC protocol that exploits the structure of the function at hand -- in this case, the math of the Schnorr signing algorithm.

14.2 Verifiable Secret Sharing

Secret sharing is a great tool to provide confidentiality. But as we saw in the last lecture, on its own it does not provide integrity or availability. Let's discuss and address this issue.

Let's look at additive secret sharing: say you have secret data $x$ and convert it into a collection of $n$ secret shares:

$$ x = x_1 + x_2 + \cdots + x_n \bmod{p}.$$

For now, suppose you don't even want to compute over this data; you just want to outsource the data to several cloud providers and reconstruct it at some later date.

So, you can disperse the message among the cloud providers by giving share $x_i$ to provider $i$.

Then, many months or years later, you can retrieve the secret data by fetching all shares and adding them together.

This provides excellent confidentiality: the cloud providers can perform their job of holding onto your data for a long time, but no coalition of $n-1$ or fewer of them can ever learn the data.

However, if one party provides an incorrect share $x_i^* = x_i + \Delta$, then your entire secret is going to be incorrect by $\Delta$:

$$x_1 + x_2 + \cdots + (x_i + \Delta) + \cdots + x_n = x + \Delta.$$

This is bad news for you: it means your data is corrupted, and even worse there's no way for you to pinpoint which cloud provider is to blame.
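
A tiny numeric demo of this failure mode, using an illustrative modulus:

```python
import secrets

P = 2**61 - 1     # illustrative prime modulus
x = 1234

# Additive shares of x among n = 5 providers.
shares = [secrets.randbelow(P) for _ in range(4)]
shares.append((x - sum(shares)) % P)
assert sum(shares) % P == x

# If provider 2 returns x_2 + Delta, the reconstruction is off by exactly Delta,
# and nothing in the shares themselves reveals who cheated.
delta = 99
tampered = list(shares)
tampered[2] = (tampered[2] + delta) % P
assert sum(tampered) % P == (x + delta) % P
```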

Depending on how the economics of this hypothetical "private multi-cloud outsourcing" service works, it could also be bad news for the honest cloud providers: maybe you refuse to pay them as a result of the data corruption, even though they properly did their job and held onto your data for a long time.

Basically what we want is something like filecoin.io: a "decentralized storage network" based on "an open-source cloud storage marketplace, protocol, and incentive layer."

(We're not going to explain the entirety of how filecoin works in today's lecture, but we will describe some of the main ideas.)

In other words, this problem is very similar to the Byzantine generals question. You are a leader and you're providing information to many other parties. The main differences here are that:

  • we are adding confidentiality through secret sharing, and
  • there are two stages to the protocol, a dispersal stage and a retrieval stage.

But the other trust questions from the Byzantine generals problem remain. You don't necessarily trust the other parties entirely, and they don't entirely trust you either.

Byzantine generals

Source: Wikipedia

Question. How can we provide each party with a verifiability guarantee? That is:

  • During dispersal, the cloud providers want to be convinced that they're holding onto a share of a real secret that you will pay them for.
  • During retrieval, you want to be convinced that the shares are correct (even though you have forgotten the secret yourself). If they are not, you want to be able to pinpoint which provider(s) sent you bad shares.

A construction

One way to solve this problem is to use Merkle trees. Recall that Merkle trees allow for uploading many files to the cloud, so that the cloud can prove that each file is part of the overall collection.

Generating a Merkle proof

Source: Alin Tomescu, Decentralized Thoughts blog

Verifying a Merkle proof

Source: Alin Tomescu, Decentralized Thoughts blog

Concretely, here is a protocol that you can follow to disperse the shares and later retrieve the message.

In dispersal:

  • The leader creates all secret shares such that $x = \sum x_i$, and builds a Merkle tree of all shares $x_i$. Let's call the Merkle root $h$.
  • The leader sends each cloud provider its share $x_i$, the Merkle root $h$, and a proof $\pi_i$ that connects its own share to the root $h$. (The provider can check for itself that the proof $\pi_i$ is correct, for the provided share $x_i$ and Merkle root $h$.)
  • The leader saves the (small) Merkle root $h$ and deletes the rest of its state. (After all, the purpose of cloud outsourcing is so that you don't have to keep the secret $x$ yourself.)

In retrieval:

  • Each cloud provider sends their share $x_i$ and proof $\pi_i$ to the leader.
  • The leader verifies each proof, and pays the corresponding provider if it's correct. Alternatively, if a provider has cheated, the leader detects the error and takes appropriate action (e.g., sue the cloud provider in court, or take money that the provider has placed in escrow in a smart contract, etc).
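
Here is a minimal Merkle tree sketch showing how a provider's proof $\pi_i$ is generated and checked. The padding rule (duplicating the last node on odd levels) is one common convention, not the only one.

```python
import hashlib

def H(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_tree(leaves):
    """Return the list of tree levels, bottom-up; the last level is [root]."""
    level = [H(leaf) for leaf in leaves]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:
            level = level + [level[-1]]   # pad odd levels by duplicating
        level = [H(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def prove(levels, index):
    """Collect the sibling hashes from leaf `index` up to the root."""
    proof = []
    for level in levels[:-1]:
        if len(level) % 2:
            level = level + [level[-1]]
        proof.append(level[index ^ 1])    # sibling at this level
        index //= 2
    return proof

def verify(root, leaf, index, proof):
    """Recompute the path from the leaf and compare against the root."""
    node = H(leaf)
    for sib in proof:
        node = H(node + sib) if index % 2 == 0 else H(sib + node)
        index //= 2
    return node == root

shares = [b"share-1", b"share-2", b"share-3", b"share-4"]
levels = build_tree(shares)
root = levels[-1][0]                      # the leader's small saved state h
pi = prove(levels, 2)                     # proof for provider 3's share
assert verify(root, b"share-3", 2, pi)
assert not verify(root, b"tampered", 2, pi)
```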

Question. There is actually a bug in this protocol. Can you find it?

One problem is that the leader might use different Merkle tree roots $h$ with each cloud provider. So even if the provider is honest, if the leader is malicious then they could send one cloud provider the wrong root $h^*$ and later accuse the provider of cheating during the retrieval phase.

To fix this problem, we can use the techniques from earlier parts of this course! Here are two recourses:

  • Have the leader post the Merkle root $h$ to a blockchain so that all providers can see it and know that they have a valid share $x_i$ and proof $\pi_i$ that connect to the publicly-posted $h$.
  • Have the leader run a Byzantine Broadcast protocol to disperse $h$. Broadcast ensures that each provider receives $h$, and that they are convinced that everyone else receives $h$ too. (This approach is cheaper, but would not convince an outside party like a judge in a later court case.)

So here is one way to fix the protocol.

In dispersal:

  • The leader creates all secret shares such that $x = \sum x_i$, and builds a Merkle tree of all shares $x_i$. Let's call the Merkle root $h$.
  • The leader and providers conduct a Byzantine Broadcast to disperse the Merkle root $h$ to all parties.
  • Then, the leader sends each cloud provider its share $x_i$ and a proof $\pi_i$ that connects its own share to the broadcasted Merkle root $h$. The provider refuses to participate further if the proof is incorrect.
  • The leader deletes all state.

Retrieval can work the same as before. And in fact, the leader doesn't even need to store the Merkle root $h$ because it can just request it from all providers later: if there is an honest majority or a PKI, then the leader can distinguish the correct $h$ from an incorrect one sent by a faulty cloud provider.

Computing over verifiable secret shares

So far, we have purposely avoided the discussion of computing over secrets; the technique above only disperses files and then retrieves them later.

Let's bring back the idea of computing over secrets.

Suppose the leader dispersed two secrets $x$ and $y$ using this protocol... or maybe two different leaders dispersed two different secrets, such as their company's payroll data.

Now, we only want to retrieve the overall sum $x + y$ of the two secrets, rather than retrieving $x$ and $y$ individually.

And as we saw last week, if we can take the sum of two secrets, then it's pretty easy to extend the idea in order to perform any linear algebra operation over several secrets.

Can we do this? Well... not quite.

  • The good news is that the secret sharing scheme is linear, so it works fine! The providers could just send shares $x_i + y_i$, rather than separately sending each of the two summands.
  • The bad news is that the Merkle tree is not linear, so provider $i$ cannot prove that it is sending the right value $x_i + y_i$.

So let's replace the Merkle tree with something else.

Using the discrete log problem, we can create a new way to commit and prove properties of the secrets that is amenable to computation.

14.3 Commitment schemes

A Merkle tree is a special case of a kind of cryptographic primitive that we should formalize: a commitment scheme.

Commitments are a digital analog to putting a written message inside of an envelope and placing it on the table. Everyone knows that the message is inside of the envelope, but nobody can read the message until the envelope is opened. (This is related to, but not the same thing as, encryption of a message.)

Formally, a commitment scheme has two methods.

  • Commit(m; r): given a message $m$ and (optionally) some randomness $r$, construct a commitment $c$.
  • Open(c, m, r): with the commitment $c$ and the inputs used to create it, it's possible to output the message $m$ and randomness $r$, in order to convince the receiver that $m$ is the correct message contained inside of the commitment.

Note that a commitment scheme does not use a long-term key, and as a result there is no KeyGen method. Any randomness $r$ used in the commitment is ephemeral, i.e., used only to commit a single message.

The committer (Mallory), holding a message $m$, sends the commitment $c$ to the receiver (Bob).

Commitments have two guarantees:

  • Binding: even against a malicious committer, it is infeasible for the committer to show a different message $m^*$ in the Open step than the message $m$ that it originally used during the Commit routine.
  • Hiding: even against a malicious receiver, from only the commitment $c$ it is infeasible to learn anything about the message $m$.

Commitments from Discrete Log

Let's construct a commitment scheme.

Like most cryptographic primitives, we need to rely on the hardness of some math problem. That's because we want it to be the case that the commitment is easy to compute but hard to invert.

Let's build a commitment scheme from the hardness of the discrete log problem. Remember that this problem states that it is easy to exponentiate but difficult to take logarithms.

More specifically, for a random choice of $x$, the discrete log assumption states that:

$$ (g, x) \mapsto h = g^x \text{ is easy to compute, but } (g, h) \mapsto x = \log_g(h) \text{ is hard to compute}$$

(Note: if $x$ isn't chosen uniformly at random, and as a result $h$ doesn't have the uniform distribution either, then it might be the case that it is easy to compute the logarithm. For instance, if we only choose $x \in \{1, 2\}$ then we could just try both choices of the discrete log and see which one works.)
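
In Python this asymmetry is easy to feel: `pow` computes the forward direction in a single fast call even for 255-bit numbers, while the best generic algorithms for the reverse direction take roughly $\sqrt{p}$ steps. The prime and exponent below are illustrative choices.

```python
# Easy direction: modular exponentiation is one fast built-in call.
p = 2**255 - 19                  # a large prime (illustrative choice)
g = 5
x = 0x123456789abcdef            # pretend x was sampled uniformly at random
h = pow(g, x, p)                 # computed essentially instantly

# Hard direction: recovering x from (g, h) is infeasible at this size.
# But if x is known to lie in a tiny set, brute force trivially works:
small_h = pow(g, 2, p)
recovered = next(e for e in (1, 2) if pow(g, e, p) == small_h)
assert recovered == 2
```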

Question. How can we construct a commitment scheme from discrete logs?

First construction: Feldman commitments

First, let's consider the case where the message $m$ itself is chosen uniformly at random. Admittedly this isn't usually the case -- for instance, the messages that you send to your friends over the Internet probably have a lot of structure to them, such as being phrases or sentences of English words. But let's go with this assumption for now.

Here's a commitment scheme that works, if the message $m$ is chosen uniformly at random. As a starting point, assume that all parties agree upon some base element $g$ (e.g., it is specified in some open standard).

  • Commit(m): create the commitment $c = g^m$.
  • Open(c, m): reveal $m$, so that the receiver can calculate $g^m$ itself and check that it is equal to $c$.

This commitment scheme achieves:

  • Binding because there is a one-to-one mapping between messages $m$ and commitments $c$, so it is impossible to open to any other message.
  • Hiding because inverting $c$ to recover $m$ is computationally infeasible according to the discrete log assumption.
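
A sketch of the scheme, with the same toy $p = 2q + 1$ group shape as before (a real scheme would use a large group):

```python
import secrets

# Toy group parameters: p = 2q + 1; g generates the order-q subgroup.
p, q, g = 2039, 1019, 4

def commit(m):
    return pow(g, m, p)

def open_check(c, m):
    """Receiver recomputes g^m and compares against the commitment."""
    return pow(g, m, p) == c

m = secrets.randbelow(q)      # Feldman is only hiding for (near-)uniform messages
c = commit(m)
assert open_check(c, m)
assert not open_check(c, (m + 1) % q)   # binding: no other message opens c
```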

Second construction: Pedersen commitments

Second, let's consider the case where the message $m$ itself has some structure and might not be random. For instance, maybe your message is either the word "attack" or "retreat" (appropriately converted from a string to an integer). That's not enough uncertainty for $g^m$ to be difficult to invert on its own.

Instead, this time our commitment scheme needs to use randomness in order to hide the message. This time, we assume that all parties know two constants $g_1$ and $g_2$ before the protocol starts.

  • Commit(m; r): create the commitment $c = {g_1}^m \times {g_2}^r$.
  • Open(c, m, r): reveal $m$ and $r$, so that the receiver can calculate ${g_1}^m \times {g_2}^r$ itself and check that it is equal to $c$.

This commitment scheme achieves:

  • Binding because the committer does not know the discrete log between the public constants $g_1$ and $g_2$, so it is infeasible to find an alternative opening $m^*$ and $r^*$ such that ${g_1}^m \times {g_2}^r = {g_1}^{m^*} \times {g_2}^{r^*}$ by the discrete log assumption.

  • Hiding because $g_2^r$ is a uniformly random number, and multiplying by a random number in a finite space causes the product to be uniformly random as well. In other words, given the commitment $c$, for any possible message $m^*$ there does exist some randomness $r^*$ such that $c = {g_1}^{m^*} \times {g_2}^{r^*}$. So it is impossible for the receiver to find $m$ because all choices remain possible given only the commitment $c$.

Interestingly: this is the opposite of what we found with Feldman commitments, where binding was impossible to break and hiding was computationally infeasible to break.

It turns out that you're always stuck with this choice: for any commitment scheme, at least one of binding or hiding will be at best computationally hard; you cannot construct a scheme such that both binding and hiding are impossible to break. Or in other words: you cannot build a "perfect" commitment scheme... you have to rely somewhere on the hardness of a math problem.
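
The Pedersen scheme can be sketched as follows. The toy constants $g_1, g_2$ are my own choices; in practice they must be generated so that nobody knows $\log_{g_1}(g_2)$ (e.g., by a trusted setup or by hashing into the group).

```python
import secrets

# Toy group: p = 2q + 1; g1 and g2 both generate the order-q subgroup,
# and (we assume) nobody knows the discrete log between them.
p, q = 2039, 1019
g1, g2 = 4, 9

def commit(m, r):
    return pow(g1, m, p) * pow(g2, r, p) % p

def open_check(c, m, r):
    return commit(m, r) == c

# Even a two-word message space stays hidden, thanks to the randomness r.
m = 1                          # say 0 = "retreat" and 1 = "attack"
r = secrets.randbelow(q)
c = commit(m, r)
assert open_check(c, m, r)
assert not open_check(c, 1 - m, r)   # same r, different message: rejected
```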

Additive homomorphism

We can use commitments to resolve the conundrum from before! As a reminder: we wanted to find a way to disperse several secrets, say $x$ and $y$, and then provide the option for the cloud providers to retrieve only the sum $x + y$.

Specifically, we will use Feldman commitments to construct a new verifiable secret sharing scheme. The key insights here are that:

  • Each secret share individually is chosen uniformly at random, so we can use them in Feldman commitments.
  • Feldman commitments are linear... well, sorta. Specifically, given the commitments to two messages $g^{m_1}$ and $g^{m_2}$, we can compute the commitment to their sum $g^{m_1} \times g^{m_2} = g^{m_1 + m_2}$.

We already used this "linearity" property earlier today when we constructed threshold Schnorr signatures. Technically though, this isn't linearity in the sense of a linear algebra course like DS 121, because we need to multiply the commitments in order to add the underlying messages.

So let's give this property a fancy new name: additive homomorphism. We say that Feldman and Pedersen commitments are additively homomorphic, because given several commitments it is possible (somehow) to compute the commitment to the sum of the underlying messages.
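
The homomorphism is one line to check (toy parameters as before):

```python
# Toy group: p = 2q + 1, g of order q.
p, q, g = 2039, 1019, 4

m1, m2 = 123, 456
c1, c2 = pow(g, m1, p), pow(g, m2, p)

# Multiplying the commitments adds the underlying messages.
assert c1 * c2 % p == pow(g, (m1 + m2) % q, p)
```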

Constructing verifiable secret sharing with commitments

Here is a new verifiable secret sharing scheme that we can use to disperse a secret over $n$ cloud providers. As a starting point, we assume that there are $n$ public constants $g_1, g_2, \ldots, g_n$ that everybody knows before the protocol starts; importantly, they have nothing to do with our secrets.

In dispersal:

  • The leader creates all secret shares such that $x = \sum x_i$, and calculates a commitment $c_i = {g_i}^{x_i}$ to each share. Let's call the vector of commitments $\vec{c} = [c_1, c_2, \ldots, c_n]$.
  • The leader and providers conduct a Byzantine Broadcast to disperse all commitments $\vec{c}$. (That is, every provider agrees on the commitment for everyone, not just themselves.)
  • Then, the leader sends each cloud provider its share $x_i$. The provider can check for itself that ${g_i}^{x_i}$ equals the public commitment $c_i$, and refuses to participate further if the value is incorrect.
  • The leader saves the commitment vector $\vec{c}$ and deletes everything else.

Later on, if we want to retrieve $x$ then the procedure is similar to before: each cloud provider sends their share $x_i$, and the leader can check that ${g_i}^{x_i}$ equals the appropriate commitment inside the vector. (In fact, using what you learned in yesterday's lab, you could even create a smart contract to automate payment from the leader to the cloud provider upon delivery of $x_i$!)

Now let's do some computation over secrets. Suppose that two secrets $x$ and $y$ have been dispersed to the same set of cloud providers (even by different leaders!), with corresponding commitment vectors $\vec{c}_x = [g_i^{x_i} : i \in \{1, \ldots, n\}]$ and $\vec{c}_y = [g_i^{y_i}: i \in \{1, \ldots, n\}]$.

The cloud providers can form a verifiable secret sharing of their sum $z = x + y$ by:

  • Adding their individual shares $z_i = x_i + y_i$.
  • Componentwise-multiplying $\vec{c}_x$ and $\vec{c}_y$ to form a new vector of commitments $\vec{c}_z = [g_i^{x_i} g_i^{y_i} : i \in \{1, \ldots, n\}]$.

This works because the new commitment corresponding to $z_i$ equals $g_i^{x_i} \times g_i^{y_i} = g_i^{x_i + y_i}$, which is a valid Feldman commitment that will properly open and verify during the retrieval stage.

So: we can retrieve $z$ without ever learning the summands $x$ and $y$ individually!
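
Putting the pieces together, here is a sketch of the dispersal checks and the homomorphic sum. The broadcast step is elided (the `commits` list simply stands in for the agreed-upon vector $\vec{c}$), and the group parameters are toys.

```python
import secrets

# Toy group: p = 2q + 1, with n per-provider bases g_i of order q.
p, q = 2039, 1019
n = 4
gs = [pow(b, 2, p) for b in (2, 3, 5, 7)]   # squares, hence order-q elements

def disperse(secret):
    """Leader: additive shares of `secret` mod q, plus Feldman commitments."""
    shares = [secrets.randbelow(q) for _ in range(n - 1)]
    shares.append((secret - sum(shares)) % q)
    commits = [pow(gs[i], shares[i], p) for i in range(n)]
    return shares, commits

def check(i, share, commits):
    """Provider i (and later the leader) verifies a share against c_i."""
    return pow(gs[i], share, p) == commits[i]

x, y = 700, 301                              # two dispersed secrets
xs, cx = disperse(x)
ys, cy = disperse(y)
assert all(check(i, xs[i], cx) for i in range(n))

# Computing over shares: each provider adds locally; commitments multiply.
zs = [(xs[i] + ys[i]) % q for i in range(n)]
cz = [cx[i] * cy[i] % p for i in range(n)]
assert all(check(i, zs[i], cz) for i in range(n))

# Reconstruction reveals only the sum z = x + y, never x or y individually.
assert sum(zs) % q == (x + y) % q
```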

14.4 Final thoughts

Today, we have built protocols that allow a collection of parties to perform computations with confidentiality and integrity. We started by investigating a specific protocol: threshold digital signatures. Then we constructed a generic way to calculate addition of secrets using verifiable secret sharing.

By combining this idea with the previous lecture, it is possible to multiply secrets that have been verifiably secret shared, so we can generically perform secure computation with integrity.

In the remainder of today's lecture, let me share a few assorted thoughts that (a) convey some extensions of this basic construction and (b) show how powerful the idea of verifiable secret sharing can be.

Vector commitments

One small downside of our construction is that we have to disperse a vector of commitments $\vec{c}$. The length of this vector equals the number of cloud providers $n$, independent of the data size. This is somewhat strange: if we increase the number of cloud providers to make the dispersal more secure, we have to pay for this in extra storage size that is linear in $n$.

By comparison, the Merkle tree looks nicer: the root $h$ is independent of $n$, and each individual proof $\pi_i$ has size $\log(n)$.

The good news is that it is possible to combine all of the commitments into a single element that has constant size, independent of both the message length and the number of providers! The idea is simple and Pedersen-esque. Rather than computing $\vec{c} = [g_1^{x_1}, g_2^{x_2}, \ldots, g_n^{x_n}]$, we will instead compute a single commitment:

$$c = g_1^{x_1} \times g_2^{x_2} \times \cdots \times g_n^{x_n}.$$

This is called a vector commitment. The math of how to open such a commitment is not too difficult but beyond the scope of this course; it uses something called "bilinear maps."
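
Computing such a compressed commitment (and combining two of them) is straightforward; it is only the per-position opening proofs that require bilinear maps. A toy sketch:

```python
# Toy parameters: p = 2q + 1, with n = 4 independent order-q bases.
p, q = 2039, 1019
gs = [pow(b, 2, p) for b in (2, 3, 5, 7)]

def vector_commit(xs):
    """Compress all n share commitments into a single group element."""
    c = 1
    for gi, xi in zip(gs, xs):
        c = c * pow(gi, xi, p) % p
    return c

cx = vector_commit([11, 22, 33, 44])
cy = vector_commit([5, 6, 7, 8])

# The compressed commitment is still additively homomorphic.
assert cx * cy % p == vector_commit([16, 28, 40, 52])
```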

Relatedly: it is also possible to build a polynomial commitment, which means that you commit to a polynomial with something like $c = g^{p(x)}$, and then can open the polynomial at a single point of your choice. This is useful when combined with the next comment...

Thresholds

To disperse the commitments, we relied upon Byzantine Broadcast as a building block. In other words, we've shown that we can compose secret sharing (which provides confidentiality) with broadcast (which provides integrity and availability) to get everything at once.

I've shown a protocol that uses additive secret sharing with a privacy threshold of $n$. Good news: the idea generalizes to work with Shamir secret sharing based on polynomials, so you can support an arbitrary privacy threshold of $t$. (Polynomial commitments are a helpful building block here.)

In fact, notice that there are actually two different thresholds at play here, of how many cloud providers need to collude in order to break different parts of this construction:

  • The privacy threshold $t$ of cloud providers that would need to be malicious and colluding in order to reconstruct your secret data.

  • The agreement threshold $f$ of providers that need to be faulty in order to break Byzantine Broadcast.

Without a PKI, we know that $f < n/3$. But the privacy threshold $t$ can actually be larger than this. That is, we can protect the confidentiality of the data even in circumstances where the attacker controls enough parties that agreement breaks and we can no longer guarantee correct retrieval of the data.

(This is a subtle point, and the papers on verifiable secret sharing missed this for several decades.)

Some terminology: a dual-threshold verifiable secret sharing scheme supports $t > f$, and a high-threshold scheme supports the largest privacy threshold possible.

We constructed a high-threshold scheme today: the privacy threshold was $t = n$ but the agreement threshold might be lower if we don't have a PKI.

Synchrony vs asynchrony

In today's lecture, I've discussed protocols that have clearly-defined rounds -- that is, I've assumed that the network is synchronous. It is important that the cloud providers receive their shares after a short delay, so that they can complain instantly and publicly if they haven't received shares that are consistent with the commitment.

One could ask the same question in the asynchronous setting. That is, we could ask for asynchronous verifiable secret sharing. Constructing this primitive is more subtle than what we did today, for several reasons:

  • Termination is no longer guaranteed, as with all asynchronous protocols.
  • The best possible privacy threshold is now $t = n - f$, since some of the faulty parties might never show up to reconstruction and you won't know whether they're slow or evading you.
  • The dispersal phase cannot include a step where a provider complains if their shares are inconsistent with the commitment, because asynchronous protocols are slow and the provider might not get its information for a long time. And issuing a complaint months/years later doesn't work, because we won't know whether the provider is telling the truth or is just covering up the fact that it never stored the data to begin with. Instead, the protocol must guarantee upfront that the shares are definitely consistent with the commitment.

To my knowledge, the most efficient asynchronous verifiable secret sharing scheme is one that Nicolas and I built two years ago. It's also high-threshold. If you're interested, here is the paper containing all of the details: https://eprint.iacr.org/2021/118.pdf.