Instructors
Meetings
Assignments
Exams
Websites
CDS DS 453 investigates techniques for performing trustworthy data analyses without a trusted party, and for conducting data science without data sharing.
The first half of the course investigates cryptocurrencies, the blockchain technology underpinning them, and the incentives for each participant. Students will learn how to create transactions, develop smart contracts, and participate in decentralized exchanges. Then, we take a deeper dive into consensus mechanisms, historical and modern, that maintain stability if a certain fraction of the participants or computing power behaves honestly.
The second half of the course focuses on privacy and anonymity using advanced tools from cryptography. We study zero knowledge proofs and their role in preventing re-identification attacks and increasing scalability of blockchains. We also study secure multiparty computation and its role in designing private contracts and atomic swaps. The course concludes with a broader exploration into the power of conducting data science without being able to see the underlying data.
Within the undergraduate Data Sciences major, this course satisfies the DS methodology elective in the “scalable & trustworthy DS” category.
There are five parts to this course.
The full course schedule is available on Piazza. It may change slightly over time, and I welcome suggestions from you about topics that you want to see covered in this course.
No prior background in cryptography is necessary.
This course requires mathematical maturity with linear algebra and probability concepts at the level of DS 122 or equivalent.
It also requires familiarity with algorithms and big-O notation at the level of DS 320, CS 330, or equivalent. This is a co-requisite: it is acceptable to take an algorithms course this semester too.
If you do not meet the prerequisites of this course, please talk with me after today's lecture.
This course follows BU’s policy on absences for religious observance. You can read it at https://www.bu.edu/academics/policies/absence-for-religious-reasons/.
Also due to ongoing public health concerns, please err on the side of caution and do not attend class if you show even light symptoms of COVID-19 or other illnesses. For now the BU Chief Health Officer recommends that "everyone in our community, regardless of symptoms, mask in crowded indoor areas including classrooms"
Classes will be recorded whenever possible, with video links posted on Piazza.
Both Nicolas and I will hold office hours, and students are welcome and encouraged to visit. The exact time of our office hours is still TBD; we will decide in the next few days and post on Piazza the date, time, and location of each instructor’s office hours. Office hours will be held in a hybrid format; you are welcome to join over Zoom as well.
If you want to meet but cannot make the scheduled times, then send the instructors a private Piazza note with at least 2 suggestions for times that you are available and we will find a time to meet.
There are 2 websites for this course, Piazza and Gradescope.
The course schedule and homework assignments will be posted on Piazza at https://piazza.com/bu/spring2023/ds453653/info. You are responsible for any announcements on Piazza, so please register using an email address that you check regularly.
Also, please only email the instructors for administrative matters. Scientific questions must be sent via Piazza, possibly marked as private (e.g., if the question reveals a partial solution to a homework assignment).
There will be 8 assignments in this course: 7 homeworks and 1 project. Assignments are typically due on a Friday, and they will be posted on Piazza at least one week prior to the due date.
You have two opportunities to submit each assignment: on the due date, and 1 week after the initial grades are posted. Resubmissions must include a short description of changes.
Late submissions are not accepted, except in cases of long-term emergencies (e.g., family, medical); if this applies to you, inform us as soon as you are able.
This course uses specifications grading on the assignments (https://eric.ed.gov/?id=EJ717675). Each assignment includes a rubric that specifies the minimum requirements and optional extensions. Grades are based on the extent to which the solution:
Homeworks and projects are mostly programming-oriented, and they are assessed on the correctness and clarity of your solutions. For this reason, before submitting an assignment you should:
For work that is assessed to be complete and well-communicated, you will receive a score of E or M.
Otherwise, you will receive a score of R or N as described below.
You may request a regrade of any assignment via Gradescope if you identify a factual error in our assessment. (Note that a regrade is different than a resubmission.)
There will be two exams in this course: an in-class midterm held on Thursday, March 2, and a final exam held at the date and time specified by the university. Reserve these dates on your calendar now!
Exams may cover any material from lectures, discussion sections, homework, and the required textbook reading.
If you have a valid conflict with an exam date, you must send me a written note as soon as you are aware, and with a minimum of 1 week notice (unless there are extenuating circumstances) so we can arrange a make-up test.
Your letter grade in this course will be determined in two parts.
First, your base grade is based on your completion of the assignments.
Base grade | Scores on the 8 assignments |
---|---|
A- | All M or higher, at least 3 scores of E |
B | At least 1 E, at most 1 R, rest are M |
C+ | At most 3 scores of R or 1 score of N |
C- | Anything else |
For graduate students registered in DS 653, a project score of E is required in order to receive a base grade of A-.
Then, it is adjusted based on your scores on each of the two exams.
Base grade | Scores on the 8 assignments |
---|---|
93-100 | Raise 2 steps (e.g., B to A-) |
85-92 | Raise 1 step (e.g., B to B+) |
75-84 | No adjustment |
60-74 | Lower 1 step (e.g., B to B-) |
59 or lower | Lower 1 letter (e.g., B to C) |
You must adhere to BU’s Academic Conduct Code at all times. Please be sure to read it here: https://www.bu.edu/academics/policies/academic-conduct-code.
In particular: cheating on an exam, passing off another student’s work as your own, or plagiarism of writing or code will result in _receiving a score of N on the current assignment without the possibility of a resubmission, and may be grounds for referral to BU’s Academic Conduct Committee.
If you have any questions about the policy, please ask me in person or via a private Piazza note before taking an action that might be a violation.
The goal of homework and project assignments is to learn. Hence, I encourage you to use any and all resources that can help you to learn the material: computers/calculators, Piazza, lecture notes, textbooks, other websites, and your fellow classmates. That said, please always obey the following rules:
The goal of the exams is for you to show me what you have learned. As a result, any collaboration is strictly prohibited; exams must reflect solo work. (But I encourage you to work with classmates in preparing for the exams.)
This course follows all BU policies regarding accommodations for students with documented disabilities. If you are a student with a disability or believe you might have a disability that requires accommodations, please contact the Office for Disability Services (ODS) at (617) 353-3658 or access@bu.edu to coordinate any reasonable accommodation requests. ODS is located at 25 Buick Street on the 3rd floor.
Please respect your fellow classmates and contribute toward a positive learning environment for everyone. For instance, I actively encourage discussion and debate on ideas, but I won’t tolerate criticism of other people. Also, while you can use a computer for note-taking, do not use your laptop in class for web surfing, sending messages, or anything else that can cause a distraction to your fellow classmates.
“Cryptography is how people get things done when they need one another, don’t fully trust one another, and have adversaries actively trying to screw things up.”
– Ben Adida
This is a course about crypto... which is admittedly a term with many different meanings to different people.
We will get to the topic of cryptocurrencies in a few minutes. But first, here is a question to start us off in this course: what is cryptology?
Cryptology comes from two Greek word parts:
So the goal of cryptology is to study the art of keeping secrets.
In its modern form, crypto uses hard math problems as a way to encode data in order to restrict who can use it, how they can use it, and sometimes even when and where they can access it.
Cryptology can be divided into two sub-fields
"Anyone, from the most clueless amateur to the best cryptographer, can create an algorithm that he himself can't break."
– Bruce Schneier
In this course we will almost exclusively focus on the defensive, cryptographic side. Even though I won't say much about cryptanalysis in this course, it is important to know that all of the building blocks I will show you in today's lecture have been vetted through decades of cryptanalysts.
"Don't roll your own crypto."
Crypto is a scientific field at the intersection of many disciplines. To accomplish their goals, cryptographers use:
This class will primarily make use of the first two skills: algorithms and software engineering.
While we won't delve into the type of math used in cryptanalysis, nevertheless we will often use concepts from algebra and probability.
This is also a course about how crypto can be used within data science.
It's likely that you have already used crypto in all of your data science projects, perhaps without explicitly thinking about it.
In this image, the lock icon in the web browser means that your computer has a protected connection to the kaggle.com servers.
At a high level, this means that nobody else on the internet can
(Note though that the kaggle.com servers can see what you are doing on their website. We will see later in the semester how even this can potentially be avoided.)
Using crypto is an essential part of providing security over the Internet, because it has always been designed as an open network. There is no wire that directly connects your computer with kaggle.com. Instead, the data flows through several other computers along the way.
Here is a picture of all of the computer servers connected to the Internet in 1968.
Source: pwnallthethings on Twitter
But the design of the Internet has fundamentally remained the same.
“The Internet is just the world passing notes in a classroom.”
– Jon Stewart
Getting the crypto right is essential here.
If the system is vulnerable to decoding or tampering by unauthorized parties, then it can lead to universal, covert breaches of security. Moreover, everyone will have a false sense of security.
If the system is too onerous or slow to use, then people will default to other, less secure options.
We will see throughout this course that crypto can be used in many ways besides network transmissions on the Internet.
Still though, this example highlights a general theme of all of our applications:
crypto is a social science that masquerades as a mathematical science
Our goal is to use the tools from math, computer science, and data science in order to design systems that help people.
At a high level, cryptographic systems attempt to provide (some or all of) the following three objectives.
These three security objectives are sometimes referred to as the "CIA triad."
Admittedly, these three objectives are both underspecified (lacking mathematical rigor) and overloaded (the terms have many different sub-meanings).
There are many different flavors of confidentiality, integrity, and availability that cryptography can provide... in fact, far more than we have time to cover in this course.
Source: Handbook of Applied Cryptography, page 4
(Note: BU courses CS 538 and CS 558 also cover different aspects of cryptography and its application in modern digital systems. This course has a partial overlap with them, but we will cover many topics that they don't and vice-versa. You are welcome to register for those courses as well.)
Because crypto is a scientific field that has social impacts, it is probably inevitable that crypto has clashed with law and policy over the past five decades.
"Cryptography rearranges power: it configures who can do what, from what. This makes cryptography an inherently political tool, and it confers on the field an intrinsically moral dimension."
– Phillip Rogaway
We will see aspects of the two-way connection between crypto and policy throughout the semester, and we will spend the last ~2 weeks focusing exclusively on it.
[Note: This section of the lecture is based on Johns Hopkins CS 601.641 by Prof. Abishek Jain.]
Pausing the cryptography discussion for a moment, let's provide an overview of cryptocurrencies like Bitcoin and Ethereum. We will spend the next few weeks delving into their design.
From the start, it is worthwhile to state what this course is NOT about.
This is not a finance course on cryptocurrencies. You should not expect to be taught how to invest in cryptocurrencies or how to become a billionaire overnight.
In fact, the cryptocurrency market is still in its infancy and is incredibly volatile. Investing in cryptocurrencies is a good way to lose money. So this course doesn't offer any investment advice, except in this sentence: I advise you not to invest in cryptocurrencies.
Still though, cryptocurrencies are a worthy topic of study in order to understand their capabilities, potential, and limitations.
In the first unit of this course, we will explore the following topics about cryptocurrencies.
Let's start at the top of the list. What is a blockchain?
It is:
Blockchains are a data structure to keep track of transactions.
Here is the motivating scenario for this unit: there are two people, who we will call Alice and Bob.
Alice would like to transfer $1 to Bob.
Let's think about
Alice | Bob | |
---|---|---|
$\longrightarrow$ |
Let's start by exploring how money transfers work in the world of physical banking. Suppose Alice writes a check to Bob. What happens?
If everyone is honest, the money is transferred because a bank performs the bookkeeping task of recording every account and its balance.
Source: Wikipedia
Now let's think about what happens if each party is malicious. For this single transaction, Alice doesn't really care whether Bob is malicious or not; she isn't expecting to receive anything of value.
In the other direction: even if Bob trusts the bank (for now), he may be suspicious that Alice is trying to deceive him. He needs to be able to verify the following three properties.
Alice properly signed the check: To accomplish this goal, Bob can verify the signature himself, say by comparing the check against Alice's identification card.
Alice possesses $1 in her bank account: Bob can ask the bank to verify that Alice's account balance is sufficient for the check to clear. Note that Bob doesn't even need to learn Alice's balance, just the result of the binary predicate.
Alice does not double spend the money by writing a check to someone else at the same time: This is the toughest property to achieve, because neither Bob nor the bank know yet whether Alice has overdrafted her account. They will only discover this later; if so, Alice and Bob can use the bank and the legal system to settle their dispute.
By contrast, the digital world is diverse and international. There is no single entity that everyone in the world trusts, nor is there a single legal system to identify and prosecute thieves and money launderers.
So, if we want a digital system of money, then we need a different approach altogether.
It will take us a few lectures to build up the tools for a digital system of transferring money. First, we need to understand two essential crypto tools: digital signatures and hash functions.
Today we only focus on property #1: building a digital version of signing a check.
Appropriately, the crypto primitive we need is called a digital signature scheme. The most common standard is called the digital signature algorithm, or DSA.
A digital signature allows Alice to append a signature $\sigma$ to a message $M$ that satisfies two properties.
Also, there are (at least) two properties that digital signatures do not achieve on their own.
Receiver authentication: Because a signature can be verified by anyone, on its own it will not tell Bob that he was the intended target of the message. If Alice wants to convey that Bob is the intended target, she should do so in the message contents $M$.
Thwarting replay attacks: If Bob has a digital signature, he can show it to others (e.g., a bank) as many times as he wants. To solve this issue, each message $M$ can contain a unique serial number, so the bank can detect if the same check is being cashed twice.
To satisfy the unforgeability goal, Alice must have something that the rest of the world does not. We call this special something a "secret key."
Informally, a digital signature allows Alice to "lock" her message using her secret key.
To satisfy the correctness goal, there is a corresponding public key that anyone can use to verify whether a message was previously "locked" by Alice.
Image source: idea-instructions.com
In more detail, a secret key is a long, random sequence of bytes. One common choice is to pick a key that is 32 bytes (aka 256 bits) in length. For example, here is a secret key:
# first install PyCryptoDome with, e.g., "pip install pycryptodome"
from Crypto.PublicKey import ECC
secretKey = ECC.generate(curve='P-256')
# warning: this is just an illustrative example
# do not export a secret key that you are using for any legitimate purpose!
print(secretKey.export_key(format='PEM'))
From a secretKey
, it should be possible to produce the corresponding publicKey
.
But we want this operation to be one way: nobody can go in the other direction of publicKey
$\to$ secretKey
.
publicKey = secretKey.public_key()
# this key is safe to share with the world
print(publicKey.export_key(format='PEM'))
As the name suggests, only Alice knows the key; she never shares her secret key with anyone. (Hence my warning in the code snippet above).
This key is long enough that it is incredibly unlikely that anyone else in the world will be able to guess it by chance. As a result, knowing the key identifies Alice, in the sense that if you meet a stranger on the Internet who can has the ability to sign a new message that can be verified with Alice's public key, then this Internet stranger must be Alice! (...Or someone who has compromised Alice's computer and stolen her keys.)
To give you a sense of the scale here, let's imagine that I tried to forge Alice's key in a brute-force manner, simply by guessing every single 32-byte long string and trying to use it to "lock" the message just as Alice did.
Concretely, suppose I create a for
loop that iterates from:
from binascii import hexlify
print(hexlify(32 * b'\x00')) # taking 32 copies of the byte value 0 (printed in hexadecimal format)
to this:
print(hexlify(32 * b'\xff')) # taking 32 copies of the byte value 255 (printed in hexadecimal format)
Because the set of possible secret keys is finite, in theory it is possible to enumerate over all of them.
But it isn't practically feasible to do so. Running this for
loop would require more time than the length of the universe. And simply flipping all of these bits back and forth in RAM would consume approximately the energy of the sun.
Granted, surely there might be more effective ways for an attacker to try to break our digital signature scheme (i.e., to forge a signature) that are smarter than this guess-and-check method. But our hope is that any method of forging signatures is infeasible in practice.
We will discuss next week how to formalize this guarantee, and then show a few ways of achieving it.