CDS DS 453 / 653 — Crypto for Data Science

Syllabus

Course Overview

Instructors

Meetings

  • Lecture: Tuesdays and Thursdays at 9:30-10:45am (CCDS 164)
  • Discussion: Mondays at 9:05-9:55am or 10:10-11:00am (CCDS 164)
  • Office hours: TBD, will be posted on Piazza

Assignments

  • Homework: 7 assignments, due on Fridays unless otherwise posted
  • Project: more details after spring break

Exams

  • Midterm: in class on March 2
  • Final exam: yes, with date and time set by the university

Websites

Course description

CDS DS 453 investigates techniques for performing trustworthy data analyses without a trusted party, and for conducting data science without data sharing.

The first half of the course investigates cryptocurrencies, the blockchain technology underpinning them, and the incentives for each participant. Students will learn how to create transactions, develop smart contracts, and participate in decentralized exchanges. Then, we take a deeper dive into consensus mechanisms, historical and modern, that maintain stability if a certain fraction of the participants or computing power behaves honestly.

The second half of the course focuses on privacy and anonymity using advanced tools from cryptography. We study zero knowledge proofs and their role in preventing re-identification attacks and increasing scalability of blockchains. We also study secure multiparty computation and its role in designing private contracts and atomic swaps. The course concludes with a broader exploration into the power of conducting data science without being able to see the underlying data.

Within the undergraduate Data Sciences major, this course satisfies the DS methodology elective in the “scalable & trustworthy DS” category.

Course schedule

There are five parts to this course.

  1. Currencies without centralization: exploring the mechanics of cryptocurrencies, blockchains, and mining
  2. Consensus without trust: reaching agreement over the Internet using distributed algorithms
  3. Data analysis without data sharing: using crypto to work with data you cannot see
  4. Accountability without transparency: proving compliance with a stated rule, policy, or regulation
  5. Social, policy, legal, and regulatory impacts: discussing (in)appropriate uses of the tech we learn in this course

The full course schedule is available on Piazza. It may change slightly over time, and I welcome suggestions from you about topics that you want to see covered in this course.

Prerequisites

No prior background in cryptography is necessary.

This course requires mathematical maturity with linear algebra and probability concepts at the level of DS 122 or equivalent.

It also requires familiarity with algorithms and big-O notation at the level of DS 320, CS 330, or equivalent. This is a co-requisite: it is acceptable to take an algorithms course this semester too.

If you do not meet the prerequisites of this course, please talk with me after today's lecture.

Absence policy

This course follows BU’s policy on absences for religious observance. You can read it at https://www.bu.edu/academics/policies/absence-for-religious-reasons/.

Also due to ongoing public health concerns, please err on the side of caution and do not attend class if you show even light symptoms of COVID-19 or other illnesses. For now the BU Chief Health Officer recommends that "everyone in our community, regardless of symptoms, mask in crowded indoor areas including classrooms"

Classes will be recorded whenever possible, with video links posted on Piazza.

Office hours

Both Nicolas and I will hold office hours, and students are welcome and encouraged to visit. The exact time of our office hours is still TBD; we will decide in the next few days and post on Piazza the date, time, and location of each instructor’s office hours. Office hours will be held in a hybrid format; you are welcome to join over Zoom as well.

If you want to meet but cannot make the scheduled times, then send the instructors a private Piazza note with at least 2 suggestions for times that you are available and we will find a time to meet.

Course websites

There are 2 websites for this course, Piazza and Gradescope.

  • The course schedule and homework assignments will be posted on Piazza at https://piazza.com/bu/spring2023/ds453653/info. You are responsible for any announcements on Piazza, so please register using an email address that you check regularly.

    Also, please only email the instructors for administrative matters. Scientific questions must be sent via Piazza, possibly marked as private (e.g., if the question reveals a partial solution to a homework assignment).

Assignments & Late work policy

There will be 8 assignments in this course: 7 homeworks and 1 project. Assignments are typically due on a Friday, and they will be posted on Piazza at least one week prior to the due date.

You have two opportunities to submit each assignment: on the due date, and 1 week after the initial grades are posted. Resubmissions must include a short description of changes.

Late submissions are not accepted, except in cases of long-term emergencies (e.g., family, medical); if this applies to you, inform us as soon as you are able.

Grading assignments

This course uses specifications grading on the assignments (https://eric.ed.gov/?id=EJ717675). Each assignment includes a rubric that specifies the minimum requirements and optional extensions. Grades are based on the extent to which the solution:

  1. Meets the expectations stated in the assignment,
  2. Demonstrates mastery of the relevant technical skills and concepts, and
  3. Showcases your ability to apply knowledge learned from class, discussion, and the textbooks to new problems and settings.

Homeworks and projects are mostly programming-oriented, and they are assessed on the correctness and clarity of your solutions. For this reason, before submitting an assignment you should:

  • crop down to the pertinent work,
  • document all code,
  • describe your thought process,
  • display any relevant outputs, and
  • explain your findings.

For work that is assessed to be complete and well-communicated, you will receive a score of E or M.

  • E: Exceeds the expectations of the assignment. Mastery of the concepts is evident, and there are no nontrivial errors. Explanations of code or concepts are clear and complete.
  • M: Meets the expectations of the assignment. Demonstrates understanding of the concepts, and no significant gaps or errors are present. Communication is sufficiently explained.

Otherwise, you will receive a score of R or N as described below.

  • R: Revision needed. Partial understanding is evident from the submitted work. But significant gaps remain, either in displaying the concepts required of the assignment or communicating the ideas. Needs further work.
  • N: Not assessable. Incomplete or insubstantial response that does not include enough information to assess whether there is an understanding of the concepts. The work contains significant omissions, or many errors.

Regrading assignments

You may request a regrade of any assignment via Gradescope if you identify a factual error in our assessment. (Note that a regrade is different than a resubmission.)

Exams

There will be two exams in this course: an in-class midterm held on Thursday, March 2, and a final exam held at the date and time specified by the university. Reserve these dates on your calendar now!

Exams may cover any material from lectures, discussion sections, homework, and the required textbook reading.

Make-up exam policy

If you have a valid conflict with an exam date, you must send me a written note as soon as you are aware, and with a minimum of 1 week notice (unless there are extenuating circumstances) so we can arrange a make-up test.

Overall course letter grade

Your letter grade in this course will be determined in two parts.

First, your base grade is based on your completion of the assignments.

Base grade Scores on the 8 assignments
A- All M or higher, at least 3 scores of E
B At least 1 E, at most 1 R, rest are M
C+ At most 3 scores of R or 1 score of N
C- Anything else

For graduate students registered in DS 653, a project score of E is required in order to receive a base grade of A-.

Then, it is adjusted based on your scores on each of the two exams.

Base grade Scores on the 8 assignments
93-100 Raise 2 steps (e.g., B to A-)
85-92 Raise 1 step (e.g., B to B+)
75-84 No adjustment
60-74 Lower 1 step (e.g., B to B-)
59 or lower Lower 1 letter (e.g., B to C)

Academic honesty policy

You must adhere to BU’s Academic Conduct Code at all times. Please be sure to read it here: https://www.bu.edu/academics/policies/academic-conduct-code.

In particular: cheating on an exam, passing off another student’s work as your own, or plagiarism of writing or code will result in _receiving a score of N on the current assignment without the possibility of a resubmission, and may be grounds for referral to BU’s Academic Conduct Committee.

If you have any questions about the policy, please ask me in person or via a private Piazza note before taking an action that might be a violation.

Collaboration policy

The goal of homework and project assignments is to learn. Hence, I encourage you to use any and all resources that can help you to learn the material: computers/calculators, Piazza, lecture notes, textbooks, other websites, and your fellow classmates. That said, please always obey the following rules:

  1. In your submission, you must list (a) names of classmates you worked with, (b) any websites you used besides the ones listed in the lecture notes or textbooks, and (c) any code that you used from other sources. Taking ideas without attribution will be considered plagiarism. You are graded on your original work.
  2. You cannot copy solutions from anyone else, or give your solutions to a classmate to copy.

The goal of the exams is for you to show me what you have learned. As a result, any collaboration is strictly prohibited; exams must reflect solo work. (But I encourage you to work with classmates in preparing for the exams.)

Accommodations

This course follows all BU policies regarding accommodations for students with documented disabilities. If you are a student with a disability or believe you might have a disability that requires accommodations, please contact the Office for Disability Services (ODS) at (617) 353-3658 or access@bu.edu to coordinate any reasonable accommodation requests. ODS is located at 25 Buick Street on the 3rd floor.

Learning environment

Please respect your fellow classmates and contribute toward a positive learning environment for everyone. For instance, I actively encourage discussion and debate on ideas, but I won’t tolerate criticism of other people. Also, while you can use a computer for note-taking, do not use your laptop in class for web surfing, sending messages, or anything else that can cause a distraction to your fellow classmates.

Lecture 1: Introduction to Crypto for Data Science

“Cryptography is how people get things done when they need one another, don’t fully trust one another, and have adversaries actively trying to screw things up.”

– Ben Adida

1.1 Introduction to Crypto(graphy)

This is a course about crypto... which is admittedly a term with many different meanings to different people.

We will get to the topic of cryptocurrencies in a few minutes. But first, here is a question to start us off in this course: what is cryptology?

Cryptology comes from two Greek word parts:

  • kryptos, meaning secret or hidden
  • logy, meaning the study of

So the goal of cryptology is to study the art of keeping secrets.

In its modern form, crypto uses hard math problems as a way to encode data in order to restrict who can use it, how they can use it, and sometimes even when and where they can access it.

Cryptology can be divided into two sub-fields

  • Cryptography: the art of making codes.
  • Cryptanalysis: the art of breaking codes.

"Anyone, from the most clueless amateur to the best cryptographer, can create an algorithm that he himself can't break."

– Bruce Schneier

In this course we will almost exclusively focus on the defensive, cryptographic side. Even though I won't say much about cryptanalysis in this course, it is important to know that all of the building blocks I will show you in today's lecture have been vetted through decades of cryptanalysts.

"Don't roll your own crypto."

Crypto is a scientific field at the intersection of many disciplines. To accomplish their goals, cryptographers use:

  • Algorithms techniques in their protocol designs.
  • Engineering techniques to produce vetted, secure software implementations of their protocols.
  • Complexity theory to demonstrate security reductions from new protocols to more well-studied ones.
  • Mathematics for cryptanalysis.

This class will primarily make use of the first two skills: algorithms and software engineering.

While we won't delve into the type of math used in cryptanalysis, nevertheless we will often use concepts from algebra and probability.

Crypto and data science

This is also a course about how crypto can be used within data science.

It's likely that you have already used crypto in all of your data science projects, perhaps without explicitly thinking about it.

Lock icon on kaggle.com

In this image, the lock icon in the web browser means that your computer has a protected connection to the kaggle.com servers.

At a high level, this means that nobody else on the internet can

  • see which dataset you're downloading from kaggle.com, or
  • tamper with the dataset in transit.

(Note though that the kaggle.com servers can see what you are doing on their website. We will see later in the semester how even this can potentially be avoided.)

Using crypto is an essential part of providing security over the Internet, because it has always been designed as an open network. There is no wire that directly connects your computer with kaggle.com. Instead, the data flows through several other computers along the way.

Here is a picture of all of the computer servers connected to the Internet in 1968.

The Internet in 1968

Source: pwnallthethings on Twitter

Of course, the Internet is much larger now.

Growth of the Internet over time

Source: Wikipedia

But the design of the Internet has fundamentally remained the same.

“The Internet is just the world passing notes in a classroom.”

– Jon Stewart

Getting the crypto right is essential here.

  • If the system is vulnerable to decoding or tampering by unauthorized parties, then it can lead to universal, covert breaches of security. Moreover, everyone will have a false sense of security.

  • If the system is too onerous or slow to use, then people will default to other, less secure options.

Goals of cryptography

We will see throughout this course that crypto can be used in many ways besides network transmissions on the Internet.

Still though, this example highlights a general theme of all of our applications:

crypto is a social science that masquerades as a mathematical science

Our goal is to use the tools from math, computer science, and data science in order to design systems that help people.

At a high level, cryptographic systems attempt to provide (some or all of) the following three objectives.

  • Confidentiality: keeping data or people's identities hidden from others.
  • Integrity: preventing raw data or the data science results from being altered, or at least allowing for detection of tampering.
  • Availability: ensuring that data science is censorship-resistant.

These three security objectives are sometimes referred to as the "CIA triad."

Admittedly, these three objectives are both underspecified (lacking mathematical rigor) and overloaded (the terms have many different sub-meanings).

There are many different flavors of confidentiality, integrity, and availability that cryptography can provide... in fact, far more than we have time to cover in this course.

(Note: BU courses CS 538 and CS 558 also cover different aspects of cryptography and its application in modern digital systems. This course has a partial overlap with them, but we will cover many topics that they don't and vice-versa. You are welcome to register for those courses as well.)

Crypto and policy

Because crypto is a scientific field that has social impacts, it is probably inevitable that crypto has clashed with law and policy over the past five decades.

  • On the one hand, crypto offers ways to upend existing norms, laws, and rules... in both good and bad ways.
  • On the other hand, the use of crypto is influenced and regulated by the law and politicians.

"Cryptography rearranges power: it configures who can do what, from what. This makes cryptography an inherently political tool, and it confers on the field an intrinsically moral dimension."

– Phillip Rogaway

We will see aspects of the two-way connection between crypto and policy throughout the semester, and we will spend the last ~2 weeks focusing exclusively on it.

1.2 Introduction to Crypto(currencies)

[Note: This section of the lecture is based on Johns Hopkins CS 601.641 by Prof. Abishek Jain.]

Pausing the cryptography discussion for a moment, let's provide an overview of cryptocurrencies like Bitcoin and Ethereum. We will spend the next few weeks delving into their design.

From the start, it is worthwhile to state what this course is NOT about.

This is not a finance course on cryptocurrencies. You should not expect to be taught how to invest in cryptocurrencies or how to become a billionaire overnight.

In fact, the cryptocurrency market is still in its infancy and is incredibly volatile. Investing in cryptocurrencies is a good way to lose money. So this course doesn't offer any investment advice, except in this sentence: I advise you not to invest in cryptocurrencies.

Still though, cryptocurrencies are a worthy topic of study in order to understand their capabilities, potential, and limitations.

In the first unit of this course, we will explore the following topics about cryptocurrencies.

  • Understanding the mechanics of blockchains
  • Understanding why current implementations work
  • Understanding the necessary cryptographic background
  • Exploring applications of blockchains to cryptocurrencies and beyond
  • Understanding limitations of current blockchains

Let's start at the top of the list. What is a blockchain?

It is:

  • A distributed ledger or database
  • Used for building decentralized cryptocurrencies such as Bitcoin
  • Capable of improving many other applications such as distributed Domain Name system (DNS), Public-Key Infrastructure (PKI), stock trade databases, etc.
  • The basis of lots of exciting academic research and industry startups

Transferring digital money

Blockchains are a data structure to keep track of transactions.

Here is the motivating scenario for this unit: there are two people, who we will call Alice and Bob.

Alice would like to transfer $1 to Bob.

Let's think about

  • the functionality of how this could work, i.e., how it can work if everyone is honest
  • the security concerns, i.e., ways this could go wrong if someone is malicious
Alice Bob
Alice $\longrightarrow$ Bob

Let's start by exploring how money transfers work in the world of physical banking. Suppose Alice writes a check to Bob. What happens?

Physical check

If everyone is honest, the money is transferred because a bank performs the bookkeeping task of recording every account and its balance.

19th century ledger

Source: Wikipedia

Now let's think about what happens if each party is malicious. For this single transaction, Alice doesn't really care whether Bob is malicious or not; she isn't expecting to receive anything of value.

In the other direction: even if Bob trusts the bank (for now), he may be suspicious that Alice is trying to deceive him. He needs to be able to verify the following three properties.

  1. Alice properly signed the check: To accomplish this goal, Bob can verify the signature himself, say by comparing the check against Alice's identification card.

  2. Alice possesses $1 in her bank account: Bob can ask the bank to verify that Alice's account balance is sufficient for the check to clear. Note that Bob doesn't even need to learn Alice's balance, just the result of the binary predicate.

  3. Alice does not double spend the money by writing a check to someone else at the same time: This is the toughest property to achieve, because neither Bob nor the bank know yet whether Alice has overdrafted her account. They will only discover this later; if so, Alice and Bob can use the bank and the legal system to settle their dispute.

By contrast, the digital world is diverse and international. There is no single entity that everyone in the world trusts, nor is there a single legal system to identify and prosecute thieves and money launderers.

So, if we want a digital system of money, then we need a different approach altogether.

It will take us a few lectures to build up the tools for a digital system of transferring money. First, we need to understand two essential crypto tools: digital signatures and hash functions.

1.3 Introduction to Digital Signatures

Today we only focus on property #1: building a digital version of signing a check.

Appropriately, the crypto primitive we need is called a digital signature scheme. The most common standard is called the digital signature algorithm, or DSA.

A digital signature allows Alice to append a signature $\sigma$ to a message $M$ that satisfies two properties.

  1. Correctness: Everyone in the world can verify whether Alice created the signature $\sigma$.
  2. Unforgeability: The signature $\sigma$ is bound to the message $M$. Nobody besides Alice could use it to forge a signature for any other message $M'$.

Also, there are (at least) two properties that digital signatures do not achieve on their own.

  • Receiver authentication: Because a signature can be verified by anyone, on its own it will not tell Bob that he was the intended target of the message. If Alice wants to convey that Bob is the intended target, she should do so in the message contents $M$.

  • Thwarting replay attacks: If Bob has a digital signature, he can show it to others (e.g., a bank) as many times as he wants. To solve this issue, each message $M$ can contain a unique serial number, so the bank can detect if the same check is being cashed twice.

Cryptographic keys

To satisfy the unforgeability goal, Alice must have something that the rest of the world does not. We call this special something a "secret key."

Informally, a digital signature allows Alice to "lock" her message using her secret key.

To satisfy the correctness goal, there is a corresponding public key that anyone can use to verify whether a message was previously "locked" by Alice.

Image source: idea-instructions.com

Informal description of a digital signature scheme

In more detail, a secret key is a long, random sequence of bytes. One common choice is to pick a key that is 32 bytes (aka 256 bits) in length. For example, here is a secret key:

In [75]:
# first install PyCryptoDome with, e.g., "pip install pycryptodome"
from Crypto.PublicKey import ECC
secretKey = ECC.generate(curve='P-256')

# warning: this is just an illustrative example
# do not export a secret key that you are using for any legitimate purpose!
print(secretKey.export_key(format='PEM'))
-----BEGIN PRIVATE KEY-----
MIGHAgEAMBMGByqGSM49AgEGCCqGSM49AwEHBG0wawIBAQQgvZ3Kis3judSdWDcK
W/WS2bnHqzi3Ip94rT+Mg3qi13ShRANCAARHh3A+8Kn9OIFwzuC0jhS59o/8sL/3
PAR7PqD2dbTbj4OUOMg5q/R1WB747UEfB+ryO8ajoaY9W6mRJd9urzjV
-----END PRIVATE KEY-----

From a secretKey, it should be possible to produce the corresponding publicKey.

But we want this operation to be one way: nobody can go in the other direction of publicKey $\to$ secretKey.

In [76]:
publicKey = secretKey.public_key()

# this key is safe to share with the world
print(publicKey.export_key(format='PEM'))
-----BEGIN PUBLIC KEY-----
MFkwEwYHKoZIzj0CAQYIKoZIzj0DAQcDQgAER4dwPvCp/TiBcM7gtI4UufaP/LC/
9zwEez6g9nW024+DlDjIOav0dVge+O1BHwfq8jvGo6GmPVupkSXfbq841Q==
-----END PUBLIC KEY-----

As the name suggests, only Alice knows the key; she never shares her secret key with anyone. (Hence my warning in the code snippet above).

This key is long enough that it is incredibly unlikely that anyone else in the world will be able to guess it by chance. As a result, knowing the key identifies Alice, in the sense that if you meet a stranger on the Internet who can has the ability to sign a new message that can be verified with Alice's public key, then this Internet stranger must be Alice! (...Or someone who has compromised Alice's computer and stolen her keys.)

To give you a sense of the scale here, let's imagine that I tried to forge Alice's key in a brute-force manner, simply by guessing every single 32-byte long string and trying to use it to "lock" the message just as Alice did.

Concretely, suppose I create a for loop that iterates from:

In [60]:
from binascii import hexlify
print(hexlify(32 * b'\x00')) # taking 32 copies of the byte value 0 (printed in hexadecimal format)
b'0000000000000000000000000000000000000000000000000000000000000000'

to this:

In [63]:
print(hexlify(32 * b'\xff')) # taking 32 copies of the byte value 255 (printed in hexadecimal format)
b'ffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffff'

Because the set of possible secret keys is finite, in theory it is possible to enumerate over all of them.

But it isn't practically feasible to do so. Running this for loop would require more time than the length of the universe. And simply flipping all of these bits back and forth in RAM would consume approximately the energy of the sun.

Granted, surely there might be more effective ways for an attacker to try to break our digital signature scheme (i.e., to forge a signature) that are smarter than this guess-and-check method. But our hope is that any method of forging signatures is infeasible in practice.

We will discuss next week how to formalize this guarantee, and then show a few ways of achieving it.