bls12_381

introduction

BLS12-381 is a pairing-friendly elliptic curve. Curve BLS12-381 was designed by Sean Bowe in early 2017 as the foundation for an upgrade to the Zcash protocol. It is both pairing-friendly (making it efficient for digital signatures) and effective for constructing zkSnarks. BLS12-381 is part of a family of curves described by Barreto, Lynn, and Scott (BLS). The 12 is the embedding degree of the curve: neither too low, nor too high. The 381 is the number of bits needed to represent coordinates on the curve: the field modulus, \(q\). 381 is a fairly handy number as we can use 48 bytes per field element, with 3 bits left over for useful flags or arithmetic optimisations. This size of this number is guided both by security requirements and implementation efficiency.

curve equation and parameters

The basic equation of the BLS12-381 curve is \( y^2 = x^3 + 4\)
The key parameters for a BLS curve are set using a single parameter \(\mathbf{x}\) (different from the in the curve equation!). BLS12-381 is derived from the \( k \equiv 0 \mod 6\) case of Construction 6.6 in the taxonomy.

Specific design goals for BLS12-381 are:

  • \(\mathbf{x}\) has “low hamming weight”, meaning that it has very few bits set to 1. This is particularly important for the efficiency of the algorithm that calculates pairings (the Miller loop).
  • The field modulus \(q\) mentioned above is prime and has 383 bits or fewer, which makes 64-bit or 32-bit arithmetic on it more efficient.
  • The order \(r\) of the subgroups we use is prime and has 255 bits or fewer, which is good for the same reason as above.
  • To support zkSnark schemes, we want to have a large power of two root of unity in the field. This means we want \(2^n\) to be a factor of \(r-1\), for some biggish \(n\). (Making \(\mathbf{x}\) a multiple of \(2^{n/2}\) will achieve this.) This property is key to being able to use fast Fourier transforms for interesting things like polynomial multiplication.

The value \(\mathbf{x}\) = -0xd201000000010000 (hexadecimal, note that it is negative) gives the largest and the lowest Hamming weight meeting these criteria. With this value we have,

parameters equation value(hex) comments
Field modulus \(q\) \(\frac{1}{3}(\mathbf{x}-1)^2(\mathbf{x}^4-\mathbf{x}^2+1) + \mathbf{x}\) 0x1a0111ea397fe69a4b1ba7b6434bacd764774b84f3
8512bf6730d2a0f6b0f6241eabfffeb153ffffb9feffffffffaaab
381 bits, prime
Subgroup size \(r\) \((\mathbf{x}^4-\mathbf{x}^2+1)\) 0x73eda753299d7d483339d80809a1d80553
bda402fffe5bfeffffffff00000001
255 bits, prime

Field extensions

Field extensions are fundamental to elliptic curve pairings. The “12” is BLS12-381 is not only the embedding degree, it is also (relatedly) the degree of field extension that we will need to use.
For example, The complex numbers are a quadratic extension of the real numbers (\(x^2 = 1\)). Complex numbers can’t be extended any further because there are no irreducible polynomials over the complex numbers. But for finite fields, if we can find an irreducible \(k\)-degree polynomial in our field \(F_q\), and we often can, then we are able to extend the field to \(F_{q^k}\), , and represent the elements of the extended field as degree \(k-1\) polynomials, \(a_0 + a_1 x + … + a_{k-1}x^{k-1}\). we can represent this compactly as (\(a_0, a_1, …, a_{k-1}\))
In practice, large extension fields like \(F_{q^{12}}\)are implemented as towers of smaller extensions.

the curves

BLS12-381 is really dealing with two curves. Both curves share more-or-less the same curve equation, but are defined over different fields.
The simpler one is over the finite field \(F_q\), which is just the integers mod \(q\). So the curve has points only where the equation \(y^2 = x^3 +4\) has solutions with \(x\) and \(y\) both integers less than \(q\). We shall call this curve \(E(F_q)\).
The other curve is defined over an extension of \(F_q\)to \(F_{q^2}\) (think complex numbers). In this case, the curve equation is slightly modified to be \(y^2 = x^3+4(1+i)\), and we call the curve \(E’(F_{q^2})\)

the subgroups

A pairing is a bilinear map. This means that it takes as input two points, each from a group of the same order, \(r\). for rather technical reasons, these two groups need to be distinct. Let’s call them
\(G_1\) and \(G_2\).
Unfortunately, our simple curve \(E(F_q)\) has only a single large subgroup of order \(r\), so we can’t define a pairing based solely on \(E(F_q)\). However, if we keep extending the field over which
\(E\) is defined, it can be proved that we eventually find a curve that has more than one subgroup of order \(r\). that is, for some \(k\), \(E(F_{q^k})\) contains other subgroups of order \(r\) that we can use. One of these subgroups contains only points having a trace of zero[1], and we choose that subgroup to be \(G_2\)

This number \(k\), the amount that we need to extend the base field by to find the new group, is called the embedding degree of the curve, which in our case is the “12” in BLS12-381. For completeness, note that each of \(G_1\) and \(G_2\) shares with its containing curve the “point at infinity”. This is the identity element of the elliptic curve arithmetic group, often denoted \(\mathcal O\)
. For any point \(P\), \( P + \mathcal O = \mathcal O + P = P\).

Twists

But there’s another challenge. As discussed earlier, doing arithmetic in \(F_{q^{12}}\) is horribly complicated and inefficient. And curve operations need a lot of arithmetic. A twist is something like a coordinate transformation. Rather wonderfully, this can be used to transform our \(E(F_{q^{12}})\) curve into a curve defined over a lower degree field that still has an order \(r\) subgroup. Moreover, this subgroup has a simple mapping to and from our \(G_2\) group

BLS12-381 uses a “sextic twist”. This means that it reduces the degree of the extension field by a factor of six. So \(G_2\) on the twisted curve can be defined over \(F_{q^2}\) instead of \(F_{q^{12}}\)
I haven’t seen this written down anywhere—but attempting to decode section 3 of this—if we find a \(u\) such that \(u^6 = (1+i)^{-1}\), then we can define our twisting transformation as \( (x,y) -> (x/u^2, y/u^3)\). This transforms our original curve \(E: y^2 = x^3 + 4\) into the curve \(E’: y^2 + 4/u^6 = x^3 + 4(1+i)\). \(E\) and \(E’\) look different, but are actually the same object presented with respect to coefficinets in different base fields.

So these are the two groups we will be using:

  • \(G_1 \subset E(F_q)\), where \(E: y^2 = x^3 + 4\)
  • \(G_2 \subset E(F_{q^2})\), where \(E: y^2 = x^3 + 4(1+i)\)
    Note that coordinates of points in the \(G_1\) group are pairs of integers, and coordinates of points in the \(G_2\) group are pairs of complex integers.

Parings

s far as BLS12-381 is concerned, a pairing simply takes a point \(P \in G_1 \subset E(F_q)\), and a point \(Q \in G_2 \subset E’(F_{q^2})\) and outputs a point from a group \(G_T \subset F_{q^{12}}\). That is, for a paring \(e\), \(e: G_1 \times G_2 \rightarrow G_T\)

properties of pairing
\( e(P, Q+R) = e(P,Q) \cdot e(P,R) \),
note as a way of memory, can think it as \(P^{Q+R} = P^Q \cdot P^R\)
\( e(P+S, R) = e(P,R) \cdot e(S,R)\)
From this, we can deduce that all of the following identities hold:
\(e([a]P, [b]Q) = e(P, [b]Q)^a = e(P,Q)^{ab} = e(P, [a]Q)^b = e([b]P, [a]Q)\)

Embedding degree

The embedding degree, \(k\), is calculated as the smallest positive integer such that \(r\) divides \(q^k -1\). So, in the case of BLS12-381, \(r\) is a factor of \(q^{12} -1\), but not of any lower power.
The choice of an embedding degree is a balance between security and efficiency (as ever). On the security front, the embedding degree is also known as the security multiplier: a higher embedding degree makes the discrete logarithm problem harder to solve in
. However, a high embedding degree means we have to do field operations in high-degree extensions, which is clunky and inefficient.

cofactor

A subgroup’s cofactor is the ratio of the size of the whole group to the size of the subgroup.

Group Cofactor Equation value(hex)
\(G_1\) \(h_1\) \(\frac{\mathbf{x-1}^2}{3}\) 0x396c8c005555e
1568c00aaab0000aaab
\(G_2\) \(h_2\) \(\frac{\mathbf{x}^8 -4\mathbf{x}^7+5\mathbf{x}^6-4\mathbf{x}^4+6\mathbf{x}^3-4\mathbf{x}^2-4\mathbf{x}+13}{9}\) 0x5d543a95414e7f1091d50792876a202cd91de
4547085abaa68a205b2e5a7ddfa628f1cb4d9e82ef2
1537e293a6691ae1616ec6e786f0c70cf1c38e31c7238e5

note multiplying by the cofactor turns out to be a straightforward way to map any arbitrary point on the elliptic curve into the respective subgroup \(G_1\) or \(G_2\)

Generators

\(G_1\) and \(G_2\) are cyclic groups of prime order, so any point (except the identity/point at infinity) is a generator. Thus, picking generators is just a matter of convention.
Generator points \(G_1\) and \(G_2\) are specified in decimal here

These were chosen as follows:

The generators of \(G_1\) and \(G_2\) are computed by finding the lexicographically smallest valid x-coordinate, and its lexicographically smallest > y-coordinate and scaling it by the cofactor such that the result is not the point at infinity.

Foot notes

[1] The “trace zero subgroup” qualifies as an obscure incantation. Basically, the trace of a point is \(\sum_{i=0}^{k-1}(x^{q^i}, y^{q^i})\), where \(k=12\) in our case. Understanding this involves stuff like the Frobenius endomorphism.

references

key concepts in cuda

key concepts

  • Warps: In CUDA, a warp is a fundamental execution unit in the GPU. It represents a group of threads that are executed together in a SIMD (Single Instruction, Multiple Data) fashion on a single GPU core. In CUDA, each group of 32 consecutive threads is called a warp. A Warp is the primary unit of execution in an SM. Once a thread block is allocated to an SM, it will be further divided into a set of warps for execution.
  • Stream Multiprocessor (SM)
    The CUDA architecture is built around a scalable array of multithreaded Streaming Multiprocessors (SMs). When a CUDA program on the host CPU invokes a kernel grid, the blocks of the grid are enumerated and distributed to multiprocessors with available execution capacity. The threads of a thread block execute concurrently on one multiprocessor, and multiple thread blocks can execute concurrently on one multiprocessor. As thread blocks terminate, new blocks are launched on the vacated multiprocessors
    Each SM contains 8 CUDA cores, and at any one time they’re executing a single warp of 32 threads - so it takes 4 clock cycles to issue a single instruction for the whole warp. You can assume that threads in any given warp execute in lock-step, but to synchronise across warps, you need to use __syncthreads()

memory

stream

A sequence of operations that execute in issue-order on the GPU

  • CUDA operations in different streams may run concurrently
  • CUDA operations from different streams may be interleaved
    cuda stream example

event

events are synchronization mechanisms used to measure and manage time, particularly for timing purposes and performance profiling.

  • event creation
    1
    2
    3
    4
    5
    #include <cuda_runtime.h>

    cudaEvent_t startEvent, stopEvent;
    cudaEventCreate(&startEvent);
    cudaEventCreate(&stopEvent);
  • record event
    1
    2
    3
    cudaEventRecord(startEvent, 0);
    // ... GPU code to be timed ...
    cudaEventRecord(stopEvent, 0);
    Here, cudaEventRecord records the current time in the events startEvent and stopEvent. The second argument, 0, specifies the stream in which to record the event (0 means the default stream).
  • synchonize events
    1
    cudaEventSynchronize(stopEvent);
    This function call makes the CPU wait until the GPU has completed all tasks associated with the stopEvent.
  • Calculate Elapsed Time
    1
    2
    float milliseconds = 0;
    cudaEventElapsedTime(&milliseconds, startEvent, stopEvent);
  • destroy event
    1
    2
    cudaEventDestroy(startEvent);
    cudaEventDestroy(stopEvent);

keywords

  • global functions that are executed on the GPU, called from CPU

  • device functions that are executed on the GPU, called within GPU

  • shared

  • align is a compiler directive that can be used to specify alignment requirements for variables in device code.

  • __launch_bounds__
    the launch_bounds attribute is used to specify launch bounds for a kernel function. The launch bounds control the number of threads per block and the maximum number of blocks per multiprocessor for a specific kernel.

    1
    __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor)

macros

  • CUDACC is a predefined macro in CUDA C/C++ that is used to determine whether the code is being compiled by the NVIDIA CUDA C/C++ compiler (nvcc).
  • NVCC is a predefined macro that is automatically defined by the NVIDIA CUDA compiler (nvcc). It allows you to conditionally compile code based on whether the code is being processed by the CUDA compiler or a regular C++ compiler

usage

the triple-chevron notation <<<...>>> is used to specify the execution configuration of a CUDA kernel launch.

1
kernel<<<gridDim, blockDim, sharedMem, stream>>>(...);
  • gridDim: Specifies the dimensions of the grid (number of blocks). It can be a 1D, 2D, or 3D configuration, specified as (gridDimX, gridDimY, gridDimZ).

  • blockDim: Specifies the dimensions of each block (number of threads per block). It can also be a 1D, 2D, or 3D configuration, specified as (blockDimX, blockDimY, blockDimZ).

  • sharedMem: Optional parameter specifying the dynamic shared memory per block in bytes. If not specified, it defaults to 0.

  • stream: Optional parameter specifying the CUDA stream in which the kernel should be executed. If not specified, it defaults to the default stream (0).

cuda standard api & tools

API

  • __syncthreads(): A thread’s execution can only proceed past a __syncthreads() after all threads in its block have executed the __syncthreads()
  • __syncwarp() is an intrinsic function that provides a warp-level synchronization point within a warp. A warp is a group of threads in CUDA that execute instructions in lockstep.

Tool

cmd

  • nvidia-smi

references

cryptography arithmetic tricks

  • get the next m*SIZE
    1
    2
    # define WARP_SZ 32
    size_t n = (nelems+WARP_SZ-1) & ((size_t)0-WARP_SZ)
    0-WARP_SIZE = 0xffffffffffffffe0
    eo = 0b11100000
    it will select multiples of WARP_SZ
    nelems+WARP_SZ-1 will counter for the removed numbers by 00000

n will be [32, 64, 96, …]

  • make a value to be 1<<N
    1
    2
    3
    uiint32_t a;
    while(a & (a-1)) // equal to 0 only when a is 1<<N
    a -= (a&(0-a)); // subtract the first 1 bit

classic implementations of arithematic of multi precision big integers

multiple-precision integer arithmetic

If \(b ≥ 2\) is an integer, then any positive integer \(a\) can be expressed uniquely as \(a = a_n \cdot b^n+a_{n−1}\cdot b^{n−1}+···+a_1 \cdot b+a_0\), where \(a_i\) is an integer with \(0≤a_i <b \) for \(0≤ i ≤ n \), and \(a_n \ne 0\).

The representation of a positive integer \(a\) as a sum of multiples of powers of \(b\), as given above, is called the base \(b\) or \(radix b\) representation of \(a\), which is usually written as
\(a = (a_n a_{n−1} ···a_1 a_0)_b\)

If \( (a_n a_{n−1} ···a_1 a_0)_b\) is the base b representation of \(a\) and \(a_n \ne 0\), then the precision
or length of a is n + 1. If n = 0, then a is called a single-precision integer; otherwise, a is a multiple-precision integer.

The division algorithm for integers provides an efficient method for determining the base b representation of a non-negative integer
radix_b_repesentation

if \( (a_{n} a_{n−1} ···a_1 a_0)_b \) is the base b representation of a and k is a positive integer, then

\( (u_{l} u_{l−1} ···u_1 u_0)_{b^k} \) is the base \( b^k \) representation of a, where \( l =\lceil (n+1)/k \rceil -1 \),

\( u_i = \sum_{j=0}^{k-1} a_{ik+j}b^j \) for \( 0 \le i \le l-1 \), and
\( u_l = \sum_{j=0}^{n-lk} a_{lk+j}b^j\)

Representing negative numbers

complement representation.

addition and subtraction

multiple_precision_addition
subraction is quite similar to addition. (omitted here)

multiplication

Let x and y be integers expressed in radix b representation:
\( x = (x_n x_{n−1} ··· x_1 x_0)_b \) and

\( y = (y_t y_{t−1} ··· y_1 y_0)_b \). The product x · y will have at most (n + t + 2) base b digits.

modular reduction

Barrett

Mongtgomery

references

cpp generics

generics

variadic template parameter

1
2
3
4
5
template<typename... Types>
void printValues(const Types&... values) {
// Fold expression (C++17 and later) to print all values
(std::cout << ... << values) << std::endl;
}

macros

examples

1
2
3
#if defined(_WIN32)
#else
#endif
  • __cplusplus macro is a predefined macro in C++ that is used to indicate the version of the C++ standard being used by the compiler. The value of __cplusplus is an integer that represents the version.
  • FILE
    __FILE__ is a predefined macro in C and C++ that expands to the name of the current source file as a string literal.
  • LINE
    __LINE__ is a predefined macro in C and C++ that expands to the current line number in the source code.

cpp key workds

  • decltype
    decltype is a keyword that is used to obtain the type of an expression or an entity

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    #include <iostream>

    int main() {
    int x = 42;
    double y = 3.14;

    // Using decltype to declare a variable with the same type as another variable
    decltype(x) result1 = x; // result1 has the type int
    decltype(y) result2 = y; // result2 has the type double

    // Using decltype with an expression
    int a = 10;
    int b = 20;
    decltype(a + b) sum = a + b; // sum has the type int

    return 0;
    }
  • friend
    In C++, the friend keyword is used to grant non-member functions or other classes access to the private and protected members of a class. When a function or class is declared as a friend of another class, it is allowed to access private and protected members of that class as if it were a member of that class.

  • const

    1
    2
    3
    4
    const int var = 1;
    int* a = new int;
    *a = 2; // change the pointed int
    a = (int*)&var; // change pointer to pointing another int
1
2
3
4
const int var = 1;
const int* a = new int; // cannot change the value pointed by a; same as written `int const* a = new int`
*a = 2; // illegal
a = (int*)&var; // valid
1
2
3
4
5
6
7
8
9
10
const int var = 1;
int* const a = new int; // cannot change the pointer address, but can change the pointed value;
*a = 2; // valid
a = (int*)&var; // invalid

```cpp
const int var = 1;
const int* const a = new int;
*a = 2; // invalid
a = (int*)&var; // invalid
1
2
3
4
5
6
7
8
9
10
11
class Entity
{
private:
int m_x, m_y;
public:
int GetX() const // const here means the method will not change state of the object
{
return m_x;
}

}
  • const is a promise, you can bypass it actually.

cpp std

snprintf

snprintf is a function that formats a string and writes the resulting characters to a character array.

1
int snprintf(char *str, size_t size, const char *format, ...);
  • str: A pointer to the destination buffer where the resulting string is stored.
  • size: The size of the destination buffer (including the null-terminating character).
  • format: A format string that specifies the desired format of the output.
  • ...: Additional arguments corresponding to the placeholders in the format string.
    1
    2
    3
    4
    5
    int main() {
    size_t n = std::snprintf(nullptr, 0, "%s", "hello, world");
    std::cout << n << std::endl; // print 12
    return 0;
    }

string

std::string(n, '\0')

creates a string with a length of n and fills it with null characters (‘\0’).

1
2
size_t n = 10;
std::string myString(n, '\0');

strerror_r

is a function that is commonly used for thread-safe retrieval of error messages. The purpose of this function is to convert an error number (an integer typically set by a system call or library function to indicate an error) into a human-readable error message

1
2
3
#include <cstring>

int strerror_r(int errnum, char *buf, size_t buflen);
  • errnum: The error number for which you want to retrieve the error message.
  • buf: A pointer to the buffer where the error message will be stored.
  • buflen: The size of the buffer pointed to by buf.

others

  • strncpy: Copy no more than N characters of SRC to DEST.
  • std::strlen: Return the length of String.
  • std::strstr is a standard C library function that is used to find the first occurrence of a substring within a string. The function returns a pointer to the first occurrence of the substring in the given string, or NULL if the substring is not found.