Abstract—Large-scale quantum computers will break classical public-key cryptography protocols using quantum algorithms such as Shor’s algorithm. Hence, designing quantum-safe cryptosystems to replace current classical algorithms is crucial. Several post-quantum candidates are assumed to be resistant against future attacks from quantum computers, and NIST is considering them for standardization. Among these candidates, lattice-based cryptography is particularly attractive due to its performance results as well as the confidence in its security. There are few works in the literature evaluating the performance of lattice-based cryptography in hardware. In this paper, we focus on the Cryptographic Suite for Algebraic Lattices (CRYSTALS) key encapsulation mechanism (KEM) known as Kyber, provide an instruction-set hardware architecture, and implement it on a Xilinx Artix-7 FPGA for performance evaluation and testing. Our proposed architecture provides an efficient and high-performance set of components to perform polynomial sampling, number-theoretic transform (NTT), and point-wise multiplication to speed up lattice-based post-quantum cryptography (PQC). This architecture implemented on ASIC outperforms state-of-the-art implementations.

Index Terms—ASIC, FPGA, hardware architecture, Kyber, lattice-based cryptography, post-quantum cryptography.

I. INTRODUCTION

QUANTUM computing development poses a significant threat to classical public-key cryptography protocols because of Shor’s algorithm [1]. Most current cryptosystems, e.g., RSA and Elliptic Curve Cryptography (ECC), are expected to be broken once large quantum computers are built. Thus, designing lattice-based cryptosystems, one of the most promising families in Post-Quantum Cryptography (PQC) and based on alternative mathematical problems, has become a fundamental research topic.

Recently, the National Institute of Standards and Technology (NIST) announced the third-round finalists, which include four key encapsulation mechanisms (KEMs) and three signature schemes [2]. Among these KEM schemes, CRYSTALS-Kyber shares a common framework with the CRYSTALS-Dilithium signature scheme [2]. This scheme also supports efficient matrix-vector and vector-vector multiplication over a polynomial ring using the fast number-theoretic transform (NTT) [3]. Although the optimization of NTT-based multiplication is not a new idea and is used in countless applications, particularly in signal processing, it is still a performance bottleneck in lattice-based cryptography implementations. Thus, several works have optimized the NTT from different perspectives, such as resource utilization, performance, efficiency, and energy consumption.

Implementations of lattice-based cryptography have been investigated on various platforms. While software (SW) implementations offer programming capabilities, flexibility, and a shorter design cycle, hardware (HW) platforms accelerate the computations and result in significantly higher throughput. There have also been considerable efforts to implement cryptosystems using hardware-software (HW/SW) co-design. This method makes the design smaller, slower, and more controllable/programmable compared to pure HW schemes at the cost of implementing a software-based processor. Furthermore, a HW/SW co-design requires a shorter design period; nevertheless, this method may not lead to the best performance. On the other hand, pure hardware implementations can be significantly accelerated using well-known optimization strategies, including register balancing, parallelization, and resource sharing, to increase the overall throughput of the hardware architectures. The main difficulty of this strategy is that a hand-optimized design takes longer to develop and may come at the cost of flexibility.

To transition to PQC, hybrid cryptosystems must be developed to comply with industry or government regulations while PQC updates are being thoroughly applied. Therefore, classical cryptosystems, e.g., ECC, cannot be eliminated even if PQC is significantly developed. The instruction-set processor builds an appropriate platform for accelerated implementation compared to SW and HW/SW, while the architecture remains flexible compared to highly optimized HW. Specifically, the flexible HW architecture is a promising solution for integrating classic cryptosystems and PQC to move towards hybrid systems.

Kyber is notable for high speed and constant-time implementations. It has to be implemented on various platforms subject to the performance requirements. However, Kyber has not been sufficiently studied in terms of hardware implementation. Therefore, investigation of the hardware implementation is required, considering the advantages of FPGA-based architectural designs to exploit parallelism, which leads to improvements in the efficiency of the overall system. In this paper, we implement a pure hardware design since it is faster and could be integrated into any HW/SW co-design solution.

A. Related Work

Software implementation of Kyber has been studied by Botros et al. in [3], proposing a memory-efficient high-speed implementation on Cortex-M4. Recently, several PQC schemes have been implemented, targeting HW/SW co-design. The work of [4] was one of the first initiatives of post-quantum acceleration using high-level synthesis (HLS). Furthermore, Banerjee et al. in [5] proposed a flexible ASIC crypto-processor to support several lattice-based algorithms in a RISC-V architecture, including Frodo, NewHope, qTESLA, and CRYSTALS-Kyber/Dilithium. This work is extended in [6] to show FPGA validation results. Their design strategy targets reducing power consumption. The authors in [7] employ the RISC-V processor integrated with a finite field multiplier to accelerate polynomial multiplications in a lightweight architecture of NewHope and Kyber. In [8], vectorized modular arithmetic and NTT computations employing RISC-V are proposed for NewHope, Kyber, and Saber. The vector processor architecture based on the extensible RISC-V architecture has been studied in [9], which shows a remarkable speedup while occupying 979k gate equivalents (GE) in ASIC implementations.

The pure hardware architectures of Kyber are proposed in [10]–[13]. The work of [10] heavily relies on BlockRAM primitives between components to perform arithmetic tasks and store intermediate results. We addressed the high-performance implementation of Kyber in our previous work [13] as the fastest Kyber design in the literature. The authors in [14] proposed a Kyber processor for computing NTT and point-wise multiplication. An instruction-set coprocessor for Saber is presented in [15] to design a flexible hardware architecture using the quadratic-complexity schoolbook polynomial multiplication algorithm. Schoolbook polynomial multiplication is also employed in [16].

Since NTT plays a central role in lattice-based cryptography, several hardware implementations focus on NTT from performance, efficiency, and flexibility perspectives. The work of [17], [18] introduced a scalable NTT architecture that can be used for various lattice-based schemes. Furthermore, the authors in [19] proposed a RISC-V architecture to increase efficiency and flexibility for NTT computation used in NewHope, qTESLA, CRYSTALS-Kyber, CRYSTALS-Dilithium, and Falcon. Additionally, Fitzmann and Sepúlveda [20] proposed an efficient and low-power NTT, which reduces the number of clock cycles to $n \log(n)$ cycles. The authors in [21] proposed a low-complexity NTT/INTT in the architecture of NewHope-NIST.

The proposed architecture combines the NTT, INTT, and point-wise multiplication architectures in an efficient way to utilize significantly fewer resources and improve the overall performance. To do so, using the Cooley-Tukey (CT) butterfly for the NTT and the Gentleman-Sande (GS) butterfly for the INTT [22], [23] is a well-known trick in the literature. Moreover, the resource sharing technique from [5], [24] is extended by using compact storage for pre-computed twiddle factors from [25] and the doubled bandwidth scheme from [14], [21] to account for the high-performance architecture.

B. Our Contributions

To the best of our knowledge, there appear to be very few pure hardware implementations that focus only on the Kyber cryptosystem and make the best of all its features. This paper proposes an efficient hardware implementation of the module lattice-based post-quantum KEM CRYSTALS-Kyber on a Xilinx Artix-7 FPGA (as recommended by NIST) and the application-specific integrated circuit (ASIC) platform. Our proposed architecture provides an efficient and high-performance set of components, including polynomial sampling, NTT, and point-wise multiplication, to accelerate lattice-based PQC while using fewer resources. The contributions of this paper are itemized in the following:

1) We propose a new approach for implementing a resource-efficient reconfigurable butterfly core on FPGA. We reduce the execution time for Kyber NTT computation from $\frac{n}{2} \log_2 \frac{n}{2} + 2N$ to $\frac{n}{2} \log_2 \frac{n}{2}$ by doubling the transform throughput and merging the pre-processing into NTT algorithm. We also customize a memory addressing strategy to implement a high-speed polynomial multiplier on the target platform.

2) We highly parallelize the operations in polynomial sampling cores through tightly coupling with Keccak core to decrease the required cycles. The performance of proposed parallel scheduling for binomial sampler indicates a significant improvement, while our rejection sampler latency can be completely absorbed by the Keccak core.

3) Our fast and scalable architecture provides a constant-time implementation over three different quantum security levels. To enhance our HW accelerator from a flexibility point of view, we design a set of customized high-level instruction codes to run the protocol. Hence, this set identifies the control flow of the proposed components and provides flexibility for integration with host processors.

4) We employ various optimization techniques to achieve an overall optimization in terms of efficiency, including parallelization, resource sharing, utilizing distributed RAM and ROM blocks, which significantly improve the area-time product. The proposed implementation is constant-time and is resistant to known timing attacks.

The rest of the paper is organized as follows. In Sec. II, we discuss the preliminaries. In Sec. III, our proposed algorithms and architectures are discussed. We discuss our results and compare them to the counterparts in Sec. IV. Finally, we conclude the paper in Sec. V.

II. PRELIMINARIES

A. Symbol Definition

To make the paper more readable, Table I provides the list of notations used in this paper. The polynomial ring \( \mathcal{R}_q = \mathbb{Z}_q[X]/(X^n + 1) \) is defined over the field \( \mathbb{Z}_q = \mathbb{Z}/q\mathbb{Z} \), in which \( n = 2^{a-1} \) is the dimension (\( n = 256 \) for Kyber) and \( q \) is the prime modulus.

<table>
<thead>
<tr>
<th>Symbol</th>
<th>Definition</th>
</tr>
</thead>
<tbody>
<tr>
<td>regular font lower-case letter (\( a / \hat{a} \))</td>
<td>Polynomial in normal/NTT domain</td>
</tr>
<tr>
<td>bold lower-case letter (\( \mathbf{a} / \hat{\mathbf{a}} \))</td>
<td>Polynomial vector in normal/NTT domain</td>
</tr>
<tr>
<td>bold upper-case letter (\( \mathbf{A} / \hat{\mathbf{A}} \))</td>
<td>Polynomial matrix in normal/NTT domain</td>
</tr>
<tr>
<td>\( \mathbf{a}^T / \mathbf{A}^T \)</td>
<td>Transpose of vector/matrix</td>
</tr>
<tr>
<td>\( a \circ b \)</td>
<td>Point-wise multiplication</td>
</tr>
</tbody>
</table>

Table II

PARAMETER SETS FOR KYBER IMPLEMENTATION [26]

<table>
<thead>
<tr>
<th>Algorithm</th>
<th>NIST Level</th>
<th>\( (\eta_1, \eta_2) \)</th>
<th>sk (bytes)</th>
<th>pk (bytes)</th>
<th>ct (bytes)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Kyber-512</td>
<td>1</td>
<td>(3,2)</td>
<td>1,632</td>
<td>800</td>
<td>768</td>
</tr>
<tr>
<td>Kyber-768</td>
<td>3</td>
<td>(2,2)</td>
<td>2,400</td>
<td>1,184</td>
<td>1,088</td>
</tr>
<tr>
<td>Kyber-1024</td>
<td>5</td>
<td>(2,2)</td>
<td>3,168</td>
<td>1,568</td>
<td>1,568</td>
</tr>
</tbody>
</table>

B. Kyber Algorithms

Kyber [26] is an IND-CCA secure KEM based on hardness assumptions over module learning with errors (Module-LWE) [27]. NIST has recently announced the 3rd round PQC standardization candidates, and Kyber was among the chosen algorithms as a finalist [2]. Kyber provides three post-quantum security levels, and its parameter sets are reported in Table II.

Kyber cryptosystem uses a uniformly random ring element \( \rho \). The Kyber KEM is defined as follows where \( sk \) stands for secret key, \( pk \) for public key, and \( ct \) for ciphertext:

1. **KeyGen()**: This function returns \( (sk, pk) \) by choosing \( s \) and \( e \) from a binomial distribution and \( \hat{A} \) from a uniform distribution. \( pk = (\rho, \hat{t}) \) and \( sk = \hat{s} \), where \( \hat{t} = \hat{A} \circ \hat{s} + \hat{e} \).
2. **Enc(pk, m, μ)**: Using the seed \( \mu \), binomial sampling is employed to choose \( r \), \( e_1 \), and \( e_2 \). Furthermore, \( \hat{A}^T \) is sampled from a uniform distribution. Computing \( u = \text{INTT}(\hat{A}^T \circ \hat{r}) + e_1 \) and \( v = \text{INTT}(\hat{t}^T \circ \hat{r}) + e_2 + m \) constructs the ciphertext such that \( ct = (\text{Compress}(u), \text{Compress}(v)) \).
3. **Dec(sk, ct)**: The message \( m \) is computed as \( m = \text{Compress}(v - \text{INTT}(\hat{s}^T \circ \hat{u})) \), where \( u \) and \( v \) are extracted from \( ct \) and \( \hat{u} = \text{NTT}(u) \).

2) **Sampling Units**: Rejection sampling generates the matrix from a uniform distribution by accepting only samples smaller than \( q \). The public matrix \( \hat{A} \) is sampled directly in the NTT domain. In the updated Kyber v3 specification, the rejection probability, calculated as \( 1 - q/2^{\lceil \log_2(q) \rceil} \), is increased from 3.48% to 18.7%.
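As a small illustration of the rejection step, the following Python sketch computes the rejection probability for 12-bit candidates and filters a byte stream. The byte-to-candidate packing (two 12-bit values per three bytes) follows the public Kyber specification; the function names are hypothetical and the code is not a model of the hardware core.

```python
Q = 3329  # Kyber modulus

def rejection_probability(q: int, bits: int) -> float:
    """Probability that a uniform `bits`-bit candidate is >= q and is rejected."""
    return 1 - q / (1 << bits)

def rejection_sample(stream: bytes, q: int = Q) -> list[int]:
    """Split every 3 bytes into two 12-bit candidates and keep those below q."""
    accepted = []
    for i in range(0, len(stream) - 2, 3):
        b0, b1, b2 = stream[i], stream[i + 1], stream[i + 2]
        d1 = b0 | ((b1 & 0x0F) << 8)   # low 12 bits of the 24-bit group
        d2 = (b1 >> 4) | (b2 << 4)     # high 12 bits of the 24-bit group
        accepted.extend(d for d in (d1, d2) if d < q)
    return accepted

BITS = Q.bit_length()                   # ceil(log2(q)) = 12 for q = 3329
P_REJECT = rejection_probability(Q, BITS)
```

With these parameters `P_REJECT` evaluates to 767/4096, i.e., about 18.7%.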

Noise sampling is performed from a centered binomial distribution (CBD) based on the difference of the Hamming weights of two \( \eta \)-bit chunks. Let \( \beta \) be the Keccak output; the coefficients are computed as follows:

\[
\epsilon_i = \sum_{j=0}^{\eta-1} \beta_{2i\eta + j} - \sum_{j=0}^{\eta-1} \beta_{2i\eta + \eta + j} \tag{1}
\]

which turns uniformly distributed samples into binomial distribution. According to Table II, in Kyber-512 architecture, two different samplers are implemented, i.e., \( \eta = 2 \) and \( \eta = 3 \), while binomial sampling units in Kyber-768 and Kyber-1024 work only with \( \eta = 2 \).
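Equation (1) can be sketched in a few lines of Python. The bit ordering inside each byte and the use of SHAKE-256 from `hashlib` as the bit source are assumptions made for this example only.

```python
import hashlib

def bytes_to_bits(data: bytes) -> list[int]:
    """Little-endian bit order within each byte (an assumption of this sketch)."""
    return [(byte >> k) & 1 for byte in data for k in range(8)]

def cbd(bits: list[int], eta: int) -> list[int]:
    """Centered binomial samples: Hamming weight of one eta-bit chunk minus the next."""
    samples = []
    for i in range(len(bits) // (2 * eta)):
        base = 2 * i * eta
        first = sum(bits[base : base + eta])
        second = sum(bits[base + eta : base + 2 * eta])
        samples.append(first - second)
    return samples

# 128 bytes = 1,024 bits give 256 coefficients for eta = 2.
stream = hashlib.shake_256(b"example seed").digest(128)
noise = cbd(bytes_to_bits(stream), eta=2)
```

Every coefficient in `noise` lies in \([-\eta, \eta]\); for \( \eta = 3 \) the same function consumes 6 bits per coefficient, so 192 bytes would be needed for one polynomial.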

3) **NTT and Multiplication**: The centerpiece of KEM is NTT which is a fast Fourier transform (FFT) applied in a finite field. Fig. 1 illustrates the butterfly diagram for 8-point NTT. Let \( a \) be a polynomial as follows:

\[
a(x) = (a_0, a_1, \ldots, a_{255}) \in \mathcal{R}_q \tag{2}
\]

NTT\( (a) \) is defined as \( \hat{a} = (\hat{a}_0 + \hat{a}_1 X, \hat{a}_2 + \hat{a}_3 X, \ldots, \hat{a}_{254} + \hat{a}_{255} X) \) such that \( \hat{a}_{2i} = \sum_{j=0}^{127} a_{2j} \zeta^{(2br_7(i)+1)j} \) and \( \hat{a}_{2i+1} = \sum_{j=0}^{127} a_{2j+1} \zeta^{(2br_7(i)+1)j} \), where \( \zeta = 17 \) is the first primitive 256-th root of unity modulo \( q \), and \( br_7 \) is the 7-bit bit-reversal function. The pseudo-code of the iterative NTT is shown in Algorithm 1. The INTT is similar to the NTT, while \( \omega_n^{-1} \) is used instead of \( \omega_n \), and the resulting coefficients of \( a(x) \) are divided by \( n \).

However, the original computation of the NTT and INTT needs pre-processing and post-processing, respectively. A point-wise multiplication includes 128 multiplications of degree-one polynomials modulo \( X^2 - \zeta^{2br_7(i)+1} \).

Algorithm 1 Iterative In-Place NTT Algorithm Based on Cooley-Tukey Butterfly [25]

Input: a polynomial \( a(x) \in \mathbb{Z}_q[X]/(X^n + 1) \), \( n \)-th primitive root of unity \( \omega_n \in \mathbb{Z}_q \), \( n = 2^l \)

Output: \( \hat{a}(x) = \text{NTT}_{\omega_n}(a) \in \mathbb{Z}_q[X]/(X^n + 1) \)

1: \( \hat{a} \leftarrow \text{bit-reverse}(a) \)
2: for \( i \) from 1 to \( l \) do
3: \( m = 2^{l-i} \)
4: for \( j \) from 0 to \( 2^{l-i} - 1 \) do
5: \( W \leftarrow \omega_n^{-j} \)
6: for \( k \) from 0 to \( m - 1 \) do
7: \( T \leftarrow W \cdot \hat{a}[2 \cdot j \cdot m + k + m] \mod q \)
8: \( U \leftarrow \hat{a}[2 \cdot j \cdot m + k] \)
9: \( \hat{a}[2 \cdot j \cdot m + k] = U + T \mod q \)
10: \( \hat{a}[2 \cdot j \cdot m + k + m] = U - T \mod q \)
11: end for
12: end for
13: end for
14: return \( \hat{a}(x) \)
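For a concrete, checkable instance of the Kyber transform, the sketch below implements the seven Cooley-Tukey layers in Python and compares them with a direct evaluation of the NTT definition. It follows the loop structure of the public Kyber reference software (natural-order input, bit-reversed output, twiddle factors indexed in bit-reversed order) rather than the exact indexing of Algorithm 1, and all function names are illustrative.

```python
Q = 3329      # Kyber modulus
ZETA = 17     # primitive 256-th root of unity modulo Q

def br7(x: int) -> int:
    """Reverse the 7-bit binary representation of x."""
    return int(format(x, "07b")[::-1], 2)

ZETAS = [pow(ZETA, br7(i), Q) for i in range(128)]

def ntt(a: list[int]) -> list[int]:
    """Seven layers of Cooley-Tukey butterflies on 256 coefficients."""
    r = list(a)
    k, length = 1, 128
    while length >= 2:
        for start in range(0, 256, 2 * length):
            zeta = ZETAS[k]
            k += 1
            for j in range(start, start + length):
                t = zeta * r[j + length] % Q
                r[j + length] = (r[j] - t) % Q
                r[j] = (r[j] + t) % Q
        length //= 2
    return r

def ntt_by_definition(a: list[int]) -> list[int]:
    """Direct evaluation: a mod (X^2 - zeta^(2*br7(i)+1)) for i = 0..127."""
    out = [0] * 256
    for i in range(128):
        root = pow(ZETA, 2 * br7(i) + 1, Q)
        out[2 * i] = sum(a[2 * j] * pow(root, j, Q) for j in range(128)) % Q
        out[2 * i + 1] = sum(a[2 * j + 1] * pow(root, j, Q) for j in range(128)) % Q
    return out

poly = [(7 * i * i + 3 * i + 1) % Q for i in range(256)]
```

Each of the seven layers performs 128 butterflies, which is where the \( 7 \times 128 \) multiplication count for one transform comes from.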

Algorithm 2 Barrett Reduction Modulo \( q = 3{,}329 \) [29]

Input: \( q = 3{,}329 \), \( m = \lfloor 2^{24}/q \rfloor = 5{,}039 \), \( x \in [0, q^2) \)

Output: \( z = x \mod q \)

1: \( u \leftarrow x \cdot m \)
2: \( u \leftarrow u \gg 24 \)
3: \( u \leftarrow x - u \cdot q \)
4: \( t \leftarrow u - q \)
5: if \( t \geq 0 \) then
6: \( z = t \)
7: else
8: \( z = u \)
9: end if
10: return \( z \)
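A direct Python transcription of the Barrett steps is given below as a sanity check of the constants. It is a software sketch only; the variable names are illustrative and nothing here reflects the pipelining of the hardware unit.

```python
Q = 3329
SHIFT = 24
M = (1 << SHIFT) // Q   # floor(2^24 / q) = 5039

def barrett_reduce(x: int) -> int:
    """Return x mod Q for 0 <= x < Q*Q with one multiply, one shift, one correction."""
    u = (x * M) >> SHIFT   # quotient estimate, at most one below floor(x / Q)
    r = x - u * Q          # candidate remainder in [0, 2Q)
    t = r - Q
    return t if t >= 0 else r
```

Because the quotient estimate is off by at most one for inputs below \( q^2 \), a single conditional subtraction is enough to bring the result into \([0, q)\).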

The matrix-vector multiplication \( \hat{A} \circ \hat{s} \) in the NTT domain for Kyber-512 is shown in (3), while each point-wise multiplication \( \hat{A}_{j,k} \circ \hat{s}_k \) can be performed pair by pair as shown in (4), where \( \hat{a}_{2i}, \hat{a}_{2i+1} \) and \( \hat{s}_{2i}, \hat{s}_{2i+1} \) denote the \( i \)-th coefficient pairs of the two polynomials being multiplied.

\[
\hat{A} \circ \hat{s} = \begin{bmatrix}
\hat{A}_{00} & \hat{A}_{01} \\
\hat{A}_{10} & \hat{A}_{11}
\end{bmatrix} \circ \begin{bmatrix}
\hat{s}_0 \\
\hat{s}_1
\end{bmatrix}
= \begin{bmatrix}
\hat{A}_{00} \circ \hat{s}_0 + \hat{A}_{01} \circ \hat{s}_1 \\
\hat{A}_{10} \circ \hat{s}_0 + \hat{A}_{11} \circ \hat{s}_1
\end{bmatrix} \tag{3}
\]

\[
(\hat{a}_{2i} + \hat{a}_{2i+1}X) \cdot (\hat{s}_{2i} + \hat{s}_{2i+1}X) \bmod (X^2 - \zeta^{2br_7(i)+1}) = (\hat{a}_{2i}\hat{s}_{2i} + \zeta^{2br_7(i)+1}\hat{a}_{2i+1}\hat{s}_{2i+1}) + (\hat{a}_{2i+1}\hat{s}_{2i} + \hat{a}_{2i}\hat{s}_{2i+1})X \tag{4}
\]
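The product of one coefficient pair modulo \( X^2 - \zeta^{2br_7(i)+1} \) can be sketched as follows. The helper names are hypothetical, and the code uses plain modular arithmetic instead of any particular reduction circuit.

```python
Q = 3329
ZETA = 17

def br7(x: int) -> int:
    """Reverse the 7-bit binary representation of x."""
    return int(format(x, "07b")[::-1], 2)

def basemul(a0: int, a1: int, s0: int, s1: int, gamma: int) -> tuple[int, int]:
    """(a0 + a1*X) * (s0 + s1*X) mod (X^2 - gamma), coefficients mod Q."""
    r0 = (a0 * s0 + a1 * s1 % Q * gamma) % Q   # constant term: X^2 is replaced by gamma
    r1 = (a1 * s0 + a0 * s1) % Q               # coefficient of X
    return r0, r1

def pointwise_mul(a_hat: list[int], s_hat: list[int]) -> list[int]:
    """Apply basemul to all 128 coefficient pairs of two NTT-domain polynomials."""
    out = [0] * 256
    for i in range(128):
        gamma = pow(ZETA, 2 * br7(i) + 1, Q)
        out[2 * i], out[2 * i + 1] = basemul(
            a_hat[2 * i], a_hat[2 * i + 1], s_hat[2 * i], s_hat[2 * i + 1], gamma
        )
    return out
```

Each pair costs five modular multiplications (four coefficient products plus the multiplication by the twiddle factor), which gives \( 5 \times 128 = 640 \) per polynomial product.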

III. HIGH-SPEED KYBER ARCHITECTURE

The top-level architecture of Kyber is designed and presented in Fig. 2.

A. High-Level Architecture

A full HW methodology enhances the performance of the architecture over a HW/SW co-design scheme at the cost of a longer design cycle and reduced flexibility, and it demands customized data paths for different protocol-level operations. In contrast, using an instruction-set processor makes the design smaller, simpler, slower, and more controllable/programmable. A customized instruction set can be a plausible option to achieve fine-tuned hardware acceleration with a low to moderate logic overhead. In a full HW architecture, cascading computation units in a customized data flow reduces the required latency significantly, but the design becomes inflexible. In this paper, we implement all computation blocks in hardware; meanwhile, unlike existing HW architectures, our implementation remains flexible enough to be extended, which is vital for a fast-evolving field like PQC.
To enhance the proposed architecture from a flexibility point of view, we design 20 different customized high-level instruction codes to perform the protocol. In particular, each line of the program ROM is 25 bits wide: 5 bits for the instruction code and two 10-bit operand addresses. The instruction memory is located within the controller and stores instructions for all required operations, including arithmetic, Keccak, and various memory operations. For example, Table III summarizes our proposed hashing instructions for different hash types. As one can see, our instructions can be easily used for integration with classic cryptosystems, e.g., the Ed448 digital signature scheme [30], in a hybrid architecture, which is beyond the scope of this work. The data memory can share data with other modules through a data bus handled by the controller. To perform the KEM, the required parameters should be pre-loaded into the memory.
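The 25-bit instruction word can be illustrated with a short packing routine. The field order (instruction code in the most significant bits, followed by the two operand addresses) and the numeric opcode used in the usage note are assumptions made for this sketch; the text only fixes the field widths.

```python
OPCODE_BITS = 5
ADDR_BITS = 10
WORD_BITS = OPCODE_BITS + 2 * ADDR_BITS   # 25 bits per program ROM line

def encode(opcode: int, addr_a: int, addr_b: int) -> int:
    """Pack one instruction; the field order is an assumption of this sketch."""
    if not 0 <= opcode < (1 << OPCODE_BITS):
        raise ValueError("opcode does not fit in 5 bits")
    if not (0 <= addr_a < (1 << ADDR_BITS) and 0 <= addr_b < (1 << ADDR_BITS)):
        raise ValueError("operand address does not fit in 10 bits")
    return (opcode << (2 * ADDR_BITS)) | (addr_a << ADDR_BITS) | addr_b

def decode(word: int) -> tuple[int, int, int]:
    """Unpack an instruction word into (opcode, addr_a, addr_b)."""
    mask = (1 << ADDR_BITS) - 1
    return word >> (2 * ADDR_BITS), (word >> ADDR_BITS) & mask, word & mask
```

For example, `encode(19, 512, 3)` produces a word that `decode` maps back to `(19, 512, 3)`. A 5-bit code leaves room for 32 instructions, so the 20 instructions mentioned above fit with spare encodings.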

B. Keccak Core

The Keccak unit is configured to perform four functions during the KEM: SHA3-256, SHA3-512, SHAKE-128, and SHAKE-256. To design a high-performance core, we modify the high-speed core implementation of Keccak provided by the Keccak team [31]. We develop a dedicated buffer for interfacing with the Keccak core. This dedicated buffer reads/writes data in 64-bit words from/to the memory unit. The buffer length is adjusted to the longest required block, i.e., 1344 bits for SHAKE-128. Therefore, the buffer interfacing needs a maximum of 21 cycles, which can be hidden behind the Keccak sponge function computation, i.e., 24 cycles.
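The 21-cycle figure follows from the sponge rates of the four functions. The sketch below derives the number of 64-bit transfers per block from the standard capacities of SHA-3 and SHAKE; the dictionary and function names are illustrative.

```python
STATE_BITS = 1600    # Keccak-f[1600] state width
WORD_BITS = 64       # buffer interface width
PERMUTATION_ROUNDS = 24

# Capacity in bits for each function (rate = state - capacity).
CAPACITY = {"SHA3-256": 512, "SHA3-512": 1024, "SHAKE-128": 256, "SHAKE-256": 512}

def rate_bits(name: str) -> int:
    """Number of bits absorbed or squeezed per permutation call."""
    return STATE_BITS - CAPACITY[name]

def buffer_cycles(name: str) -> int:
    """64-bit transfers needed to move one full rate block."""
    return rate_bits(name) // WORD_BITS

CYCLES = {name: buffer_cycles(name) for name in CAPACITY}
```

`CYCLES` evaluates to 17, 9, 21, and 17 transfers for SHA3-256, SHA3-512, SHAKE-128, and SHAKE-256, respectively, so every block transfer is shorter than the 24 permutation rounds.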

C. Rejection Sampling

Since the 64-bit data path can be matched with the Keccak core, the rejection data path is set to 64 bits. To design a high-performance rejection core, we implement six parallel cores in this module fed by the Keccak results. A buffer is added to store the accepted samples. When the number of buffered samples is more than three, 64 bits of the buffer, i.e., four accepted samples, are stored in the RAM.

As shown in Fig. 2, a 64-bit word is read from memory. Since a 64-bit input is not a multiple of a 12-bit integer, the input buffer is extended to 80 bits to hold the part of the input that is carried over to the next cycles. In the first cycle, only four samples are generated in parallel, and 16 bits of the input are postponed to the next cycle. In the second cycle, all six cores work on 72 bits of the buffer, of which 16 bits are kept from the first iteration and 56 bits are extracted from the second input. Hence, 8 bits of the input are postponed and concatenated with the 64 bits of the third cycle, which are processed by the six rejection cores. A specific flag for each core shows whether the input is valid or not.
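The three-cycle pattern described above can be reproduced with a simple bit-budget model. The model assumes that candidates are consumed in complete 24-bit groups (two 12-bit candidates each), which is an interpretation suggested by the 4/6/6 pattern in the text rather than a documented detail of the core; the function name is hypothetical.

```python
def schedule(num_words: int, word_bits: int = 64, group_bits: int = 24) -> list[tuple[int, int]]:
    """Return (candidates processed, bits carried over) for each input word."""
    carried = 0
    trace = []
    for _ in range(num_words):
        available = carried + word_bits          # never exceeds 80 bits in this model
        groups = available // group_bits         # complete 3-byte groups this cycle
        carried = available - groups * group_bits
        trace.append((2 * groups, carried))      # two 12-bit candidates per group
    return trace
```

Calling `schedule(3)` returns `[(4, 16), (6, 8), (6, 0)]`: four candidates with 16 bits postponed, then six candidates with 8 bits postponed, then six candidates with an empty buffer, after which the pattern repeats.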

Table III

PROPOSED INSTRUCTION FOR HASHING

<table>
<thead>
<tr>
<th>Instruction</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>MST_Keccak</td>
<td>Reset the Keccak buffer</td>
</tr>
<tr>
<td>EN_Keccak</td>
<td>Enable Keccak</td>
</tr>
<tr>
<td>PD_Keccak A</td>
<td>Padding Keccak hash for type A</td>
</tr>
<tr>
<td>LDKeccak_CONST #B</td>
<td>Load Keccak buffer with value A in B-width</td>
</tr>
<tr>
<td>LDKeccak_MEM A, B</td>
<td>Load Keccak buffer with address A in B-width</td>
</tr>
</tbody>
</table>

D. Binomial Sampling

Fig. 2 illustrates the datapath of the binomial sampler. Since this module is inherently lightweight, we implement 16 parallel combinational cores. Then, 16 consecutive samples are generated in parallel and stored in a buffer register. Although the resulting samples, which are in \([-\eta, \eta]\), can be represented in 3 bits, we use a 4-bit representation to simplify the addressing. The main difference in implementing the CBD core with \(\eta = 3\) is an input buffer that keeps data for concatenation with the input in the next cycles. In this mode, three consecutive 64-bit words are read to generate 32 samples, which are stored in two words.
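The word counts in this paragraph can be checked with a short calculation, together with one possible 4-bit packing of the samples. The two's-complement nibble encoding is an assumption of this sketch; the text only states that four bits per sample are used.

```python
def samples_per_read(eta: int, words: int, word_bits: int = 64) -> int:
    """Samples produced from `words` input words, at 2*eta bits per sample."""
    return words * word_bits // (2 * eta)

def pack_nibbles(samples: list[int]) -> int:
    """Pack signed samples into 4-bit fields (two's complement, an assumption)."""
    word = 0
    for idx, s in enumerate(samples):
        word |= (s & 0xF) << (4 * idx)
    return word

def unpack_nibbles(word: int, count: int) -> list[int]:
    """Inverse of pack_nibbles for `count` samples."""
    out = []
    for idx in range(count):
        nib = (word >> (4 * idx)) & 0xF
        out.append(nib - 16 if nib >= 8 else nib)
    return out
```

For \( \eta = 2 \), one 64-bit word yields 16 samples, which occupy exactly one 64-bit output word at four bits each. For \( \eta = 3 \), three words yield 32 samples, i.e., 128 bits or two output words.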

E. Butterfly Unit

The main configurations of our butterfly unit are detailed in Fig. 3. We employ hand-crafted resource sharing techniques to implement this core with optimized resources. There is only one modular multiplier in our butterfly architecture. In addition, we use only one reduction unit in the middle of the butterfly operation and employ a modular adder/subtractor in the proposed configurations. Implementing Montgomery reduction, by contrast, would require more resources due to converting back from the Montgomery domain and would demand more clock cycles. Moreover, our proposed modular reduction is constant-time and takes two cycles, as illustrated in Fig. 3. As one can see, the architecture is pipelined to avoid any delay in the butterfly operation.

1) **Speeding up the NTT/INTT**: An \(n\)-point NTT requires \(n/2\) independent butterfly operations per stage. As a result, the naive implementation of polynomial multiplication requires 4,352 modular multiplications, of which \(2 \times (7 \times 128 + 256) = 2,304\) are for performing the NTT twice, \(5 \times 128 = 640\) are for point-wise multiplication, and \(7 \times 128 + 2 \times 256 = 1,408\) are for the INTT. To avoid the bit-reverse permutation in Algorithm 1, two different butterfly configurations, i.e., CT and GS, are required for NTT and INTT, respectively, as follows:

\[ f \cdot g = \text{INTT}^{\text{GS}}(\text{NTT}^{\text{CT}}(f) \circ \text{NTT}^{\text{CT}}(g)) \tag{5} \]
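The identity in (5) can be checked end to end in software. The sketch below pairs a Cooley-Tukey forward transform with a Gentleman-Sande inverse, following the loop structure of the public Kyber reference code, and compares the result against schoolbook multiplication in \( \mathbb{Z}_q[X]/(X^{256}+1) \). Function names are illustrative and no bit-reverse permutation is performed anywhere.

```python
Q, ZETA = 3329, 17

def br7(x: int) -> int:
    return int(format(x, "07b")[::-1], 2)

ZETAS = [pow(ZETA, br7(i), Q) for i in range(128)]

def ntt_ct(a):
    """Forward transform with Cooley-Tukey butterflies (bit-reversed output)."""
    r, k, length = list(a), 1, 128
    while length >= 2:
        for start in range(0, 256, 2 * length):
            zeta = ZETAS[k]
            k += 1
            for j in range(start, start + length):
                t = zeta * r[j + length] % Q
                r[j + length] = (r[j] - t) % Q
                r[j] = (r[j] + t) % Q
        length //= 2
    return r

def intt_gs(a_hat):
    """Inverse transform with Gentleman-Sande butterflies (normal-order output)."""
    r, k, length = list(a_hat), 127, 2
    while length <= 128:
        for start in range(0, 256, 2 * length):
            zeta = ZETAS[k]
            k -= 1
            for j in range(start, start + length):
                t = r[j]
                r[j] = (t + r[j + length]) % Q
                r[j + length] = zeta * (r[j + length] - t) % Q
        length *= 2
    inv128 = pow(128, -1, Q)              # final scaling by 128^-1 mod Q
    return [x * inv128 % Q for x in r]

def pointwise(a_hat, b_hat):
    """128 products of coefficient pairs modulo X^2 - zeta^(2*br7(i)+1)."""
    out = [0] * 256
    for i in range(128):
        g = pow(ZETA, 2 * br7(i) + 1, Q)
        a0, a1, b0, b1 = a_hat[2 * i], a_hat[2 * i + 1], b_hat[2 * i], b_hat[2 * i + 1]
        out[2 * i] = (a0 * b0 + a1 * b1 % Q * g) % Q
        out[2 * i + 1] = (a1 * b0 + a0 * b1) % Q
    return out

def schoolbook(f, g):
    """Reference product modulo X^256 + 1 (negacyclic wrap-around)."""
    r = [0] * 256
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            if i + j < 256:
                r[i + j] = (r[i + j] + fi * gj) % Q
            else:
                r[i + j - 256] = (r[i + j - 256] - fi * gj) % Q
    return r

f = [(3 * i + 1) % Q for i in range(256)]
g = [(i * i + 5) % Q for i in range(256)]
product = intt_gs(pointwise(ntt_ct(f), ntt_ct(g)))
```

The inverse walks the twiddle table backwards, which absorbs the sign change needed to undo each forward butterfly, so the same table serves both directions.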

To be consistent with standard software implementation, the input polynomials in normal order are transformed to the NTT domain in bit-reverse order employing CT configuration, while twiddle factors are absorbed in bit-reversed order. The point-wise multiplication is performed in bit-reverse order and transformed back using GS configuration in normal order. However, the required twiddle factors are absorbed in the bit-reversed order.

We observe that an efficient implementation of polynomial multiplication requires 3,584 modular multiplications, reducing the complexity by 18% compared to the naive implementation. According to Fig. 3, for the NTT operation, the butterfly is arranged based on the CT configuration, while in the INTT, it is reconfigured to match the GS configuration. In NTT/INTT, once the pipeline is filled, the butterfly unit can read two inputs and write two outputs in each clock cycle.
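The multiplication counts quoted in this subsection can be reproduced with simple arithmetic. The breakdown of the naive count follows the text; the optimized total of 3,584 is taken as given, since its decomposition is not spelled out here.

```python
PAIRS = 128     # butterflies per stage and coefficient pairs per polynomial
STAGES = 7
N = 256

naive_ntt = STAGES * PAIRS + N           # butterflies plus pre-processing
naive_pointwise = 5 * PAIRS
naive_intt = STAGES * PAIRS + 2 * N      # butterflies plus post-processing
naive_total = 2 * naive_ntt + naive_pointwise + naive_intt

efficient_total = 3584                   # figure quoted in the text
saving = 1 - efficient_total / naive_total
```

The naive total is 4,352 and the saving is about 17.6%, which rounds to the quoted 18%. The difference of 768 multiplications equals three passes over 256 coefficients, which would be consistent with merging the pre-processing and post-processing passes into the butterflies.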

The most crucial bottleneck in implementing NTT core is memory access because memory access patterns change during each operation stage [15], [32]. Therefore, designing efficient memory management is critical to avoid memory conflicts and achieve high throughput. On the other hand, memory bandwidth limits the efficiency of the butterfly operation. Hence, we use two memory units to provide double bandwidth during NTT operation to reduce latency. In the first round, the results are stored in NTT RAM 0. After completing the first round, the input coefficients are read from NTT RAM 0, and the butterfly outputs are stored in NTT RAM 1. This scenario is repeated for seven rounds until NTT is computed.

In this method, two coefficients are fetched from the first RAM block at a time and fed into a butterfly unit. Then, the butterfly output is prepared and written into the second RAM block after the pipeline stages, i.e., five cycles. Employing the ping-pong strategy, after 128 cycles, all coefficients have been fed into the butterfly core, and five additional cycles are required to complete a round of NTT/INTT computation. In the next round, the input coefficients are fetched from the second RAM block, and the outputs are stored in the first RAM block. This computation continues until all seven required rounds of the NTT are completed. To optimize the memory utilization in this method, different vectors are stored in the same RAM block. For example, \( s_0 \) and \( s_1 \) are located in the same memory, where in each address the lower column stores \( s_0 \) and the higher column stores \( s_1 \) coefficients. In each clock cycle, two memory addresses (e.g., \( i \) and \( j \)) are read, which contain four coefficients, i.e., \( s_{0,i} \) and \( s_{1,i} \) from address \( i \), and \( s_{0,j} \) and \( s_{1,j} \) from address \( j \). Then, \( s_{0,i} \) and \( s_{0,j} \) are fed into the first butterfly, while \( s_{1,i} \) and \( s_{1,j} \) are used by the second core. The results of these cores are stored in the same fashion in the second RAM. Fig. 4 shows the address flow of our proposed NTT architecture using RAM0 and RAM1.
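A rough cycle model of the ping-pong scheme is sketched below. It assumes that consecutive rounds do not overlap, so each round costs the 128 feed cycles plus the five pipeline cycles mentioned above; the real controller may schedule rounds differently, and the function names are hypothetical.

```python
def round_cycles(n: int = 256, coeffs_per_cycle: int = 2, pipeline_depth: int = 5) -> int:
    """Cycles for one NTT round: feed all coefficients, then drain the pipeline."""
    return n // coeffs_per_cycle + pipeline_depth

def total_cycles(rounds: int = 7) -> int:
    """Total NTT latency, assuming rounds run strictly one after another."""
    return rounds * round_cycles()

def write_ram(round_index: int) -> int:
    """RAM block written in a given round (1-based); blocks alternate each round."""
    return (round_index - 1) % 2
```

Under these assumptions one round takes 133 cycles and a seven-round transform takes 931 cycles, with the final result landing in the same RAM block that was written in the first round.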

To implement a highly parallel architecture, we implement multiple butterfly units matching the number of polynomials in the vector \( s \), i.e., two, three, and four units for Kyber-512, Kyber-768, and Kyber-1024, respectively.

Our first method reduces the NTT execution time from \( \frac{N}{2} \log_2 N + 2N \) to \( \frac{N}{2} \log_2 N \) compared with the naive implementation. In our second method, we take advantage of the NTT definition in the Kyber scheme to perform two independent NTT computations for odd and even coefficients. Hence, we employ two butterfly cores in parallel to compute the NTT, which halves the execution time to \( \frac{N}{4} \log_2 N \). In this method, each memory address stores two consecutive coefficients, i.e., \( s_{i,j} \) and \( s_{i,j+1} \). Then, two memory addresses, which contain four coefficients, i.e., \( s_{i,j} \) and \( s_{i,j+1} \) from address \( j \), and \( s_{i,k} \) and \( s_{i,k+1} \) from address \( k \), are fed into two butterfly cores. So, \( s_{i,j} \) and \( s_{i,k} \) are used by the first butterfly and are processed independently from \( s_{i,j+1} \) and \( s_{i,k+1} \) in the second core. Similar to the previous method, the results are stored in the same fashion in the second RAM. Although this method does not improve the efficiency, since it doubles the resources to halve the latency, it accelerates the computations to target high-performance architectures.
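The independence of the even and odd coefficient streams, which the second method relies on, can be verified in software. The sketch below runs the same seven butterfly layers on each 128-coefficient stream separately and compares the interleaved result with a direct evaluation of the Kyber NTT definition. It is a functional check under the usual Kyber parameters, with illustrative function names, and does not model the memory layout.

```python
Q, ZETA = 3329, 17

def br7(x: int) -> int:
    return int(format(x, "07b")[::-1], 2)

ZETAS = [pow(ZETA, br7(i), Q) for i in range(128)]

def ntt128(c: list[int]) -> list[int]:
    """Seven Cooley-Tukey layers on one 128-coefficient stream (even or odd)."""
    r, k, length = list(c), 1, 64
    while length >= 1:
        for start in range(0, 128, 2 * length):
            zeta = ZETAS[k]
            k += 1
            for j in range(start, start + length):
                t = zeta * r[j + length] % Q
                r[j + length] = (r[j] - t) % Q
                r[j] = (r[j] + t) % Q
        length //= 2
    return r

def ntt_split(a: list[int]) -> list[int]:
    """Transform even and odd coefficients independently, then interleave."""
    out = [0] * 256
    out[0::2] = ntt128(a[0::2])
    out[1::2] = ntt128(a[1::2])
    return out

def pair_by_definition(a: list[int], i: int) -> tuple[int, int]:
    """Direct evaluation of output pair i from the NTT definition."""
    root = pow(ZETA, 2 * br7(i) + 1, Q)
    even = sum(a[2 * j] * pow(root, j, Q) for j in range(128)) % Q
    odd = sum(a[2 * j + 1] * pow(root, j, Q) for j in range(128)) % Q
    return even, odd

poly = [(5 * i * i + i + 9) % Q for i in range(256)]
result = ntt_split(poly)
```

Because the two calls to `ntt128` share no data, they can run on two butterfly cores at the same time, which is the source of the halved latency.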

2) Optimizing Point-Wise Multiplication: To implement an optimized high-throughput point-wise multiplication core, we use a specific memory pattern for the coefficients of matrix \( \hat{A} \). In our proposed memory pattern for \( \hat{A} \), four consecutive coefficients are stored together, i.e., \((\hat{A}_{00}(3), \hat{A}_{00}(2), \hat{A}_{00}(1), \hat{A}_{00}(0)), \ldots, (\hat{A}_{11}(255), \hat{A}_{11}(254), \hat{A}_{11}(253), \hat{A}_{11}(252))\). Further, two parallel butterfly cores are employed to accelerate the polynomial multiplication. The number of pipeline stages is set to five to design a high-throughput architecture for point-wise multiplication, i.e., four coefficients per five cycles. In other words, based on detailed scheduling and our proposed memory scheme, this design results in higher throughput while limiting the maximum operating frequency. It is observed that the path from the reduction output to the multiplier is the critical path. Nevertheless, increasing the pipeline latency improves the critical path delay at the cost of decreasing the point-wise multiplication throughput.

Fig. 4. The proposed address flow of our NTT memory architecture in the first two stages. (Butterfly inputs are in white and outputs are in black.)

Let \( \hat{R}_{00} = \hat{A}_{00} \circ \hat{S}_0 \); hence, based on (4), the \( \hat{R}_{00} \) coefficients can be computed as follows:

\[
\hat{R}_{00}(2i) = \zeta_i \hat{A}_{00}(2i + 1) \hat{S}_0(2i + 1) + \hat{A}_{00}(2i) \hat{S}_0(2i) \quad (6)
\]

\[
\hat{R}_{00}(2i + 1) = \hat{A}_{00}(2i + 1) \hat{S}_0(2i) + \hat{A}_{00}(2i) \hat{S}_0(2i + 1) \quad (7)
\]
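Equations (6) and (7) are the product of two degree-one polynomials modulo \( X^2 - \zeta_i \). A minimal Python sketch, assuming \( q = 3329 \) and plain integer arithmetic instead of the hardware reduction unit, makes this explicit. The function names and the `zetas` argument are hypothetical and serve only as illustration.

```python
Q = 3329  # Kyber modulus

def basemul(a0, a1, s0, s1, zeta):
    """Eqs. (6) and (7): (a0 + a1*X) * (s0 + s1*X) mod (X^2 - zeta)."""
    r0 = (zeta * a1 * s1 + a0 * s0) % Q  # Eq. (6)
    r1 = (a1 * s0 + a0 * s1) % Q         # Eq. (7)
    return r0, r1

def schoolbook_mod(a0, a1, s0, s1, zeta):
    """Reference: multiply out, then substitute X^2 = zeta."""
    c0 = a0 * s0
    c1 = a0 * s1 + a1 * s0
    c2 = a1 * s1
    return (c0 + zeta * c2) % Q, c1 % Q

def pointwise_mul(a_hat, s_hat, zetas):
    """Apply basemul to every coefficient pair; zetas[i] belongs to pair i."""
    r_hat = []
    for i, zeta in enumerate(zetas):
        r0, r1 = basemul(a_hat[2 * i], a_hat[2 * i + 1],
                         s_hat[2 * i], s_hat[2 * i + 1], zeta)
        r_hat += [r0, r1]
    return r_hat

print(pointwise_mul([1, 2, 3, 4], [5, 6, 7, 8], [17, Q - 17]))
```

Each pair costs five multiplications (four coefficient products plus one by \( \zeta_i \)), which is the workload that the step-by-step schedule distributes over the multiplier and reduction unit.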

Hence, we use the first core for \( \hat{R}_{00}(4i) \) and \( \hat{R}_{00}(4i + 1) \), while the second core works on \( \hat{R}_{00}(4i + 2) \) and \( \hat{R}_{00}(4i + 3) \). The operations in each step are described for one core as follows:

**Step 1:** \( \hat{S}_0(2i + 1) \) is multiplied by \( \hat{A}_{00}(2i + 1) \). Furthermore, the previous multiplication result is passed into the modular reduction unit.

**Step 2:** \( \hat{S}_0(2i) \) is multiplied by \( \hat{A}_{00}(2i) \). The second term of \( \hat{R}_{00}(2i) \), i.e., \( \hat{A}_{00}(2i) \circ \hat{S}_0(2i) \), is entered sequentially into the pipeline stages. Moreover, the next coefficients are read from the memories to start from Step 1.

**Step 5:** The second term of \( \hat{R}_{00}(2i + 1) \), i.e., \( \hat{A}_{00}(2i) \circ \hat{S}_0(2i + 1) \), is multiplied. The reduced result of Step 2, i.e., \( \hat{A}_{00}(2i) \circ \hat{S}_0(2i) \), is entered into the pipeline stages.

**Steps 6-7:** The reduction outputs, i.e., \( \hat{A}_{00}(2i + 1) \circ \hat{S}_0(2i) \) and \( \zeta_i \hat{A}_{00}(2i + 1) \circ \hat{S}_0(2i + 1) \), are entered sequentially into the pipeline stages. Moreover, the next coefficients are read from the memories to start from Step 1.

**Step 8:** The modular addition computes \( \zeta_i \hat{A}_{00}(2i + 1) \circ \hat{S}_0(2i + 1) + \hat{A}_{00}(2i) \circ \hat{S}_0(2i) \). Furthermore, \( \hat{A}_{00}(2i) \circ \hat{S}_0(2i + 1) \) is passed from the reduction unit into the pipeline stages.

**Step 9:** The previous addition result, i.e., \( \hat{R}_{00}(2i) \), is buffered in the next register, while the modular addition computes \( \hat{A}_{00}(2i + 1) \circ \hat{S}_0(2i) + \hat{A}_{00}(2i) \circ \hat{S}_0(2i + 1) \).

**Step 10:** The \( \hat{R}_{00}(2i) \) and \( \hat{R}_{00}(2i + 1) \), which are already buffered in the output registers, are stored in the memory.

Since the memory \( \hat{A} \) includes four coefficients per address, the addition between \( \hat{A}_{00} \circ \hat{S}_0 \) and \( \hat{A}_{01} \circ \hat{S}_1 \) can be performed by a 64-bit addition. In the described scenario, one port of the memory is always in read mode to feed the cores. The second port is used for accumulating the results.
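The packed accumulation can be illustrated with a short sketch. It assumes that the four coefficients of a 64-bit word occupy 16-bit lanes and that each lane is corrected by a conditional subtraction of \( q \) after the wide addition. Neither detail is spelled out in the text, so both are assumptions, and the helper names are hypothetical.

```python
Q = 3329
LANE_BITS = 16                     # assumed lane width: 64 bits / 4 coefficients
LANE_MASK = (1 << LANE_BITS) - 1
WORD_MASK = (1 << 64) - 1

def pack4(coeffs):
    """Pack four coefficients into one 64-bit word, coefficient 0 in the low lane."""
    word = 0
    for lane, value in enumerate(coeffs):
        word |= value << (LANE_BITS * lane)
    return word

def unpack4(word):
    """Split a 64-bit word back into its four 16-bit lanes."""
    return [(word >> (LANE_BITS * lane)) & LANE_MASK for lane in range(4)]

def packed_add_mod_q(x, y):
    """One 64-bit addition followed by a per-lane conditional subtraction of Q."""
    raw = (x + y) & WORD_MASK      # lanes cannot overflow: 2 * (Q - 1) < 2**16
    return pack4([v - Q if v >= Q else v for v in unpack4(raw)])

a = pack4([3328, 0, 1, 2000])
b = pack4([3328, 5, 3328, 2000])
print(unpack4(packed_add_mod_q(a, b)))
```

Since both operands are already reduced, each lane sum stays below \( 2q \), so no carry crosses a lane boundary and a single wide adder is enough for all four coefficients.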

**F. Scalability**

The proposed architecture for NTT computation employing two butterfly cores for Kyber-512 achieves high performance with reasonable resource utilization. However, different hardware configurations can be explored to achieve a desirable area-time trade-off from various optimization perspectives. For example, to reduce the required cycles, the number of butterfly cores can be increased to four. Conversely, resources can be saved by implementing only one butterfly core, at the cost of increasing the total latency. It should be noted that increasing the number of butterfly cores changes the memory access patterns, and some modifications are needed to feed all cores. Hence, a high-performance design requires complex memory access management to reduce the access overhead.

Besides, a high-performance Keccak core occupies almost 25% of the total area. We can implement different architectures of this core and achieve scalability through area versus latency trade-offs.

This architecture can be easily scaled to match a higher or lower security level. To scale up the architecture, the same structure can be applied, while the number of butterfly cores should be increased. Moreover, the depth of Data RAM and RAM(A) needs to be increased. The main difference between these architectures is the use of two separate CBD circuits for Kyber-512, which requires additional resources to provide a dedicated sampler. Hence, a general core that uses the resources of the highest security level, with an additional CBD core for \( \eta = 3 \), can be used to provide a scalable Kyber cryptosystem.

**IV. EXPERIMENTAL RESULTS AND COMPARISON**

In this section, we provide implementation results and compare them to their counterparts available in the open literature. Because the implementations target different platforms, a fair and meaningful comparison with previous work is not straightforward. Nevertheless, we place our results in the context of existing implementations to give the reader a quick overview of other designs and architectures.

**A. Results for Keccak and Polynomial Sampling**

Tables IV and V report the required FPGA and ASIC resources and latency specifications for the Keccak, CBD, and rejection sampling cores in our design and in other state-of-the-art implementations. As one can see, the software implementation of Keccak runs in thousands of clock cycles, which can be significantly reduced when implemented in hardware. A lightweight Keccak core presented in [34] uses 359 LUTs to perform a round of Keccak-f[1600] in 1,665 cycles, while in [35], the authors proposed an architecture that takes 12 clock cycles at the cost of almost 10k LUTs. In our proposed design, a Keccak-f[1600] is performed in 24 cycles at the cost of 4.4k LUTs or 24k GEs. Additionally, further decreasing the latency of the Keccak core would not considerably improve the performance because of the interfacing cost, which requires 21 clock cycles for a 1,344-bit output.

The reported results show that our binomial and rejection samplers outperform the sampling units of previous works [6], [8], [9]. Our proposed implementation takes advantage of parallel computation between our sampling units and the Keccak core. Our binomial sampler requires 68 clock cycles to generate four polynomials of degree 256, i.e., 1,024 samples. The rejection sampler in our proposed scheme works simultaneously with the Keccak core. Therefore, the latency it requires for generating matrix $\hat{A}$ is hidden behind the Keccak computation.
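For reference, the function computed by the binomial sampler can be sketched in a few lines of Python. This is a software model under the assumption that the sampler follows the centered binomial distribution (CBD) of the Kyber specification, with SHAKE-256 over the seed and a one-byte nonce as the pseudorandom function. The names `cbd` and `sample_noise` are illustrative and do not describe the hardware interface.

```python
import hashlib

def bytes_to_bits(data):
    """Little-endian bit order within each byte, as in the Kyber specification."""
    return [(byte >> i) & 1 for byte in data for i in range(8)]

def cbd(data, eta, n=256):
    """Coefficient i is the sum of eta bits minus the sum of the next eta bits."""
    bits = bytes_to_bits(data)
    if len(bits) < 2 * eta * n:
        raise ValueError("need at least 64 * eta bytes of input")
    coeffs = []
    for i in range(n):
        base = 2 * eta * i
        a = sum(bits[base:base + eta])
        b = sum(bits[base + eta:base + 2 * eta])
        coeffs.append(a - b)
    return coeffs

def sample_noise(seed, nonce, eta):
    """Assumed PRF: SHAKE-256(seed || nonce) squeezed to 64 * eta bytes."""
    stream = hashlib.shake_256(seed + bytes([nonce])).digest(64 * eta)
    return cbd(stream, eta)

poly = sample_noise(bytes(32), 0, 3)
print(len(poly), min(poly) >= -3, max(poly) <= 3)
```

With \( \eta = 3 \), one polynomial consumes 192 bytes, which exceeds the 136-byte rate of SHAKE-256. Under the stated assumption, this is consistent with two Keccak rounds being needed per polynomial.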

Fig. 5 shows the proposed scheduling for the sampling units in Kyber-512. The rejection sampler works in parallel with the Keccak core, and therefore its latency, i.e., 108 cycles, is absorbed completely. The accepted samples are stored in RAM(A), shown in Fig. 2. For binomial sampling of a polynomial of degree 256 with $\eta = 3$, two rounds of Keccak are required. The result of each Keccak round is processed in 17 cycles by the binomial sampler. However, processing the second-round result cannot be overlapped with the next CBD because of the memory bandwidth limitation.
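The rejection sampling step can likewise be modeled in software. The sketch below assumes the parsing rule of the Kyber specification, in which three bytes yield two 12-bit candidates and a candidate is accepted if it is smaller than \( q \), with SHAKE-128 as the extendable-output function. The input ordering `rho || j || i`, the initial request of three blocks, and the function names are assumptions made for illustration.

```python
import hashlib

Q = 3329
SHAKE128_RATE = 168  # bytes per squeezed block, i.e., 1,344 bits

def rej_uniform(stream, n=256):
    """Parse 3 bytes into two 12-bit candidates and keep those below Q."""
    coeffs = []
    for i in range(0, len(stream) - 2, 3):
        b0, b1, b2 = stream[i], stream[i + 1], stream[i + 2]
        d1 = b0 | ((b1 & 0x0F) << 8)
        d2 = (b1 >> 4) | (b2 << 4)
        for d in (d1, d2):
            if d < Q and len(coeffs) < n:
                coeffs.append(d)
        if len(coeffs) == n:
            break
    return coeffs

def sample_matrix_entry(rho, i, j, n=256):
    """Squeeze SHAKE-128 blocks until n coefficients have been accepted."""
    blocks = 3  # assumed starting point; more blocks are requested if needed
    while True:
        stream = hashlib.shake_128(rho + bytes([j, i])).digest(blocks * SHAKE128_RATE)
        coeffs = rej_uniform(stream, n)
        if len(coeffs) == n:
            return coeffs
        blocks += 1

entry = sample_matrix_entry(bytes(32), 0, 0)
print(len(entry), max(entry) < Q)
```

Each candidate is accepted with probability \( 3329/4096 \approx 0.81 \), so the number of Keccak blocks consumed per polynomial varies slightly from one call to the next.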

### TABLE VI

**FPGA IMPLEMENTATION RESULTS FOR OUR NTT CORE AND COMPARISON WITH STATE-OF-THE-ART ($n = 256$)**

<table>
<thead>
<tr>
<th>Work</th>
<th>Method</th>
<th>Platform</th>
<th>Parameter</th>
<th>Area</th>
<th>Cycles</th>
<th>NTT Area x Cycle Ratio</th>
</tr>
</thead>
<tbody>
<tr>
<td>[3]</td>
<td>SW</td>
<td>Cortex-M4</td>
<td>$q = 3,323$</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td>[19]</td>
<td>HW/SW</td>
<td>VIRTEX-7</td>
<td>$q = 3,323$</td>
<td>417</td>
<td>462</td>
<td>43,756</td>
</tr>
<tr>
<td>[8]</td>
<td>HW/SW</td>
<td>Zynq-7000</td>
<td>$q = 3,323$</td>
<td>2,908</td>
<td>170</td>
<td>1,935</td>
</tr>
<tr>
<td>[7]</td>
<td>HW/SW</td>
<td>Artix-7</td>
<td>$q = 3,323$</td>
<td>NA</td>
<td>NA</td>
<td>59</td>
</tr>
<tr>
<td>[6]</td>
<td>HW/SW</td>
<td>Artix-7</td>
<td>$q = 3,323$</td>
<td>NA</td>
<td>NA</td>
<td>59</td>
</tr>
<tr>
<td>[10]</td>
<td>HW</td>
<td>Artix-7</td>
<td>$q = 3,323$</td>
<td>NA</td>
<td>NA</td>
<td>155</td>
</tr>
<tr>
<td>[20]</td>
<td>HW</td>
<td>Zynq-7000</td>
<td>$q = 3,323$</td>
<td>980</td>
<td>395</td>
<td>2,056</td>
</tr>
<tr>
<td>[34]</td>
<td>HW</td>
<td>Artix-7</td>
<td>$q = 3,323$</td>
<td>442</td>
<td>237</td>
<td>2,055</td>
</tr>
<tr>
<td>[37]</td>
<td>HW</td>
<td>Artix-7</td>
<td>$q = 3,323$</td>
<td>472</td>
<td>472</td>
<td>4,108</td>
</tr>
<tr>
<td>[38]</td>
<td>HW</td>
<td>Artix-7</td>
<td>$q = 3,323$</td>
<td>609</td>
<td>640</td>
<td>490</td>
</tr>
<tr>
<td>[13]</td>
<td>HW</td>
<td>Artix-7</td>
<td>$q = 3,323$</td>
<td>801</td>
<td>717</td>
<td>324</td>
</tr>
<tr>
<td>This work</td>
<td>HW</td>
<td>Artix-7</td>
<td>$q = 3,329$</td>
<td>340</td>
<td>145</td>
<td>940</td>
</tr>
<tr>
<td>This work</td>
<td>HW</td>
<td>Artix-7</td>
<td>$q = 3,329$</td>
<td>737</td>
<td>290</td>
<td>474</td>
</tr>
</tbody>
</table>

### TABLE VII

**ASIC RESULTS FOR NTT AND COMPARISON WITH STATE-OF-THE-ART**

<table>
<thead>
<tr>
<th>Work</th>
<th>Parameter</th>
<th>Area</th>
<th>Cycles</th>
<th>NTT [kGis]</th>
<th>INTT [kGis]</th>
<th>Point-wise [kGis]</th>
</tr>
</thead>
<tbody>
<tr>
<td>[6]</td>
<td>$q = 7,681$</td>
<td>NA</td>
<td>1,289</td>
<td>NA</td>
<td>NA</td>
<td>NA</td>
</tr>
<tr>
<td>[9]</td>
<td>$q = 3,329$</td>
<td>512</td>
<td>200</td>
<td>41</td>
<td>NA</td>
<td>NA</td>
</tr>
<tr>
<td>[20]</td>
<td>$q = 7,681$</td>
<td>14</td>
<td>25</td>
<td>2056</td>
<td>NA</td>
<td>NA</td>
</tr>
<tr>
<td>This work</td>
<td>$q = 3,329$</td>
<td>1.9</td>
<td>200</td>
<td>940</td>
<td>1,289</td>
<td>1,289</td>
</tr>
<tr>
<td>This work</td>
<td>$q = 3,329$</td>
<td>3.8</td>
<td>200</td>
<td>474</td>
<td>602</td>
<td>1,289</td>
</tr>
</tbody>
</table>

**B. Results for the Butterfly Core**

Tables VI and VII report the required FPGA and ASIC hardware resources and latency specifications for our proposed butterfly unit in different configurations, i.e., NTT, INTT, and point-wise multiplication, alongside other state-of-the-art implementations. We remark that the required cycle count is a more technology-independent metric. Thus, to compare the efficiency of different NTT architectures, we use the product of the required clock cycles and the area.

Our pipelined architecture employing our first method requires 133 cycles for performing one round of 256-point NTT; hence, a full NTT with seven rounds requires 940 cycles. Computing INTT requires 263 additional clock cycles for post-processing. Moreover, point-wise multiplication between two polynomials of degree 256 requires 1,289 clock cycles. Our proposed architecture, which occupies 360 LUTs, 145 FFs, 187 Slices, 3 DSPs, and 2 BRAMs, is significantly smaller than the best previous works.

Our pipelined architecture employing our second method requires 474 cycles for performing a full NTT with two parallel butterfly cores. Hence, this method yields a significant speedup by halving the cycle count compared to other NTT implementations for Kyber. Although the efficiency of both methods is the same, they offer a trade-off between area and time.

The authors in [19] presented a flexible NTT architecture over RISC-V, which consumes significantly more cycles. In [7], a 3-layer merged NTT for NewHope was proposed. The works of [24] and [13] implemented a 2-layer merged NTT using the KRED algorithm, although this reduction algorithm needs a special prime form. In [10], Montgomery reduction was employed. From a resource sharing perspective, we use a
TABLE VIII
FPGA IMPLEMENTATION RESULTS AND COMPARISON WITH STATE-OF-THE-ART

<table>
<thead>
<tr>
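<th>Work</th>
<th>Method</th>
<th>Platform</th>
<th>LUTs</th>
<th>FFs</th>
<th>Slices</th>
<th>DSPs</th>
<th>BRAMs</th>
<th>Freq. [MHz]</th>
<th>KeyGen [kCycles]</th>
<th>Encaps [kCycles]</th>
<th>Decaps [kCycles]</th>
<th>Time [μs]</th>
<th>Time ratio</th>
<th>A×T ratio</th>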
</tr>
</thead>
<tbody>
<tr>
<td>Botros et al. [3]</td>
<td>SW</td>
<td>Cortex-M4</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>100</td>
<td>499</td>
<td>634</td>
<td>597</td>
<td>12,400</td>
<td>83.9</td>
<td>-</td>
</tr>
<tr>
<td>Banerjee et al. [6]</td>
<td>HW/SW</td>
<td>Artix-7</td>
<td>13K</td>
<td>3K</td>
<td>4K</td>
<td>11</td>
<td>14</td>
<td>25</td>
<td>75</td>
<td>132</td>
<td>142</td>
<td>10,960</td>
<td>74.1</td>
<td>61.8</td>
</tr>
<tr>
<td>Fritzmann et al. [8]</td>
<td>HW/SW</td>
<td>Zynq-7000</td>
<td>24K</td>
<td>11K</td>
<td>NA</td>
<td>21</td>
<td>32</td>
<td>150</td>
<td>193</td>
<td>205</td>
<td>223</td>
<td>8,710</td>
<td>211.1</td>
<td>25.5</td>
</tr>
<tr>
<td>Alkim et al. [7]</td>
<td>HW/SW</td>
<td>Artix-7</td>
<td>2K</td>
<td>2K</td>
<td>NA</td>
<td>5</td>
<td>34</td>
<td>59</td>
<td>210</td>
<td>971</td>
<td>870</td>
<td>31,203</td>
<td>51.6</td>
<td>0.19</td>
</tr>
<tr>
<td>Banu et al. [13]</td>
<td>HW</td>
<td>Virtex-7</td>
<td>1,979K</td>
<td>194K</td>
<td>NA</td>
<td>0</td>
<td>0</td>
<td>67</td>
<td>-</td>
<td>32</td>
<td>43</td>
<td>1,119</td>
<td>7</td>
<td>832.1</td>
</tr>
<tr>
<td>Huang et al. [10]</td>
<td>HW</td>
<td>Virtex-7</td>
<td>79K</td>
<td>NA</td>
<td>NA</td>
<td>354</td>
<td>202</td>
<td>155</td>
<td>-</td>
<td>49</td>
<td>69</td>
<td>761</td>
<td>5</td>
<td>23.5</td>
</tr>
<tr>
<td>Xing et al. [12]</td>
<td>HW</td>
<td>Artix-7</td>
<td>7K</td>
<td>5K</td>
<td>2K</td>
<td>2</td>
<td>3</td>
<td>161</td>
<td>5</td>
<td>7</td>
<td>72</td>
<td>0.48</td>
<td>0.19</td>
<td></td>
</tr>
<tr>
<td>Dang et al. [11]</td>
<td>HW</td>
<td>Artix-7</td>
<td>12K</td>
<td>10K</td>
<td>4K</td>
<td>8</td>
<td>15</td>
<td>210</td>
<td>-</td>
<td>4</td>
<td>6</td>
<td>46</td>
<td>0.22</td>
<td>0.25</td>
</tr>
<tr>
<td>Bisheh-Niasar et al. [13]</td>
<td>HW</td>
<td>Artix-7</td>
<td>11K</td>
<td>10K</td>
<td>4K</td>
<td>8</td>
<td>13</td>
<td>200</td>
<td>2</td>
<td>2</td>
<td>4</td>
<td>31</td>
<td>0.20</td>
<td>0.14</td>
</tr>
<tr>
<td>This work</td>
<td>HW</td>
<td>Artix-7</td>
<td>16K</td>
<td>6K</td>
<td>4K</td>
<td>9</td>
<td>16</td>
<td>115</td>
<td>4</td>
<td>7</td>
<td>10</td>
<td>148</td>
<td>1</td>
<td>1.0</td>
</tr>
</tbody>
</table>

TABLE IX
ASIC RESULTS AND COMPARISON WITH STATE-OF-THE-ART

<table>
<thead>
<tr>
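<th>Work</th>
<th>Method</th>
<th>Platform</th>
<th>LUTs</th>
<th>FFs</th>
<th>Slices</th>
<th>DSPs</th>
<th>BRAMs</th>
<th>Freq. [MHz]</th>
<th>KeyGen [kCycles]</th>
<th>Encaps [kCycles]</th>
<th>Decaps [kCycles]</th>
<th>Time [μs]</th>
<th>Time ratio</th>
<th>A×T ratio</th>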
</tr>
</thead>
<tbody>
<tr>
<td>Botros et al. [3]</td>
<td>SW</td>
<td>Cortex-M4</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>100</td>
<td>974</td>
<td>1,113</td>
<td>1,039</td>
<td>21,720</td>
<td>104.1</td>
<td>-</td>
</tr>
<tr>
<td>Banerjee et al. [6]</td>
<td>HW/SW</td>
<td>Artix-7</td>
<td>15K</td>
<td>3K</td>
<td>4K</td>
<td>11</td>
<td>14</td>
<td>25</td>
<td>112</td>
<td>178</td>
<td>191</td>
<td>14,760</td>
<td>70.7</td>
<td>66.3</td>
</tr>
<tr>
<td>Fritzmann et al. [8]</td>
<td>HW/SW</td>
<td>Zynq-7000</td>
<td>24K</td>
<td>11K</td>
<td>NA</td>
<td>21</td>
<td>32</td>
<td>-</td>
<td>273</td>
<td>326</td>
<td>340</td>
<td>-</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td>Huang et al. [10]</td>
<td>HW</td>
<td>Virtex-7</td>
<td>167K</td>
<td>NA</td>
<td>NA</td>
<td>292</td>
<td>202</td>
<td>155</td>
<td>-</td>
<td>77</td>
<td>102</td>
<td>1,155</td>
<td>5.5</td>
<td>57.8</td>
</tr>
<tr>
<td>Xing et al. [12]</td>
<td>HW</td>
<td>Artix-7</td>
<td>7K</td>
<td>5K</td>
<td>2K</td>
<td>2</td>
<td>3</td>
<td>161</td>
<td>6</td>
<td>8</td>
<td>10</td>
<td>111</td>
<td>0.53</td>
<td>0.23</td>
</tr>
<tr>
<td>Dang et al. [11]</td>
<td>HW</td>
<td>Artix-7</td>
<td>12K</td>
<td>10K</td>
<td>4K</td>
<td>8</td>
<td>15</td>
<td>210</td>
<td>-</td>
<td>4</td>
<td>6</td>
<td>63</td>
<td>0.30</td>
<td>0.24</td>
</tr>
<tr>
<td>Bisheh-Niasar et al. [13]</td>
<td>HW</td>
<td>Artix-7</td>
<td>12K</td>
<td>10K</td>
<td>4K</td>
<td>12</td>
<td>14</td>
<td>200</td>
<td>3</td>
<td>3</td>
<td>5</td>
<td>40</td>
<td>0.19</td>
<td>0.14</td>
</tr>
<tr>
<td>This work</td>
<td>HW</td>
<td>Artix-7</td>
<td>16K</td>
<td>6K</td>
<td>4K</td>
<td>9</td>
<td>16</td>
<td>115</td>
<td>7</td>
<td>10</td>
<td>14</td>
<td>209</td>
<td>1</td>
<td>1.0</td>
</tr>
</tbody>
</table>

Note that the NTT can also run in parallel with the sampling unit to reduce the total latency; however, applying this parallelization in this work would diminish the flexibility and increase the required memory units. To achieve both high speed and instruction-level flexibility, we do not follow this approach, so that the design remains flexible enough to add or modify instructions.

**C. FPGA Implementations**

Our proposed architecture for different NIST security levels is synthesized with Xilinx Vivado 2019.2 and implemented on a Xilinx Artix-7 XC7A100T-3 FPGA. All given results are obtained after place-and-route (PAR). We report the area, timing, and area-time trade-off (number of LUTs × time in μs) results of the design in Table VIII. In some previous works, each DSP is considered equivalent to 100 Slices [39]. However, no single FPGA element can be accurately expressed in terms of other elements; hence, DSPs and BRAMs are not included in the area A. For a fair comparison, we also evaluate the performance of the proposed design on the platforms targeted by the state-of-the-art works, which changes the performance by a factor of 1.35×, 1.4×, and 0.68× on Zynq-7000, Virtex-7, and Virtex-6, respectively, compared to Artix-7.

We compare the results of our architecture to the best SW design on the ARM Cortex-M4 chip, as well as to HW implementations and HW/SW co-designs. The total latency is the sum of key encapsulation and key decapsulation (Encaps + Decaps), as the key generation can be done offline. As one can see, for NIST level 1 security, our proposed scheme occupies 18k LUTs, 5k FFs, 6 DSPs, and 15 BRAMs. It runs at 115 MHz and performs the whole Kyber protocol in 148 µs. Our design achieves a speedup factor of 83.9× and 74.1× compared to the leading SW and HW/SW designs, respectively. Furthermore, our architecture, which employs the various optimization techniques, is highly efficient, with the area-time product improved by about 98% compared to [6]. Note that the HW/SW co-designs [6]–[9] are complete designs covering all Kyber security levels. The same improvement can be observed for the remaining security levels. Compared to pure HW architectures, our proposed design takes 5× longer than our previous work [13], resulting in an A×T product that is greater by a factor of 7. Our design is also 2× slower and 2.5× larger than [12]. However, this overhead is the price of keeping the customized instruction-set design flexible compared to highly parallel [13] or highly compact [12] architectures. Hardware designed specifically for a single scheme may lack flexibility; therefore, this work aims to achieve both high speed and flexibility for Kyber and to support extension toward a hybrid cryptosystem.

Although our implementations are constant-time, investigating side-channel analysis attacks will be part of our future work.

**D. ASIC Results**

The ASIC implementation results of our architectures, based on the 65-nm TSMC cell library and obtained using Synopsys Design Compiler, are presented in this section. All designs are synthesized with a 5 ns clock period. Table IX reports the maximum clock frequency and the number of logic cells for our proposed designs and state-of-the-art implementations. As one can see, the placed-and-routed design of our proposed Kyber-1024 consists of 104 kGE for logic and 190 KB of SRAM for memory, and it shows a significant speedup compared to previous works.

**E. Comparison With Other Implementations**

Table X compares our proposed architecture with some existing PQC hardware implementations targeting NIST security level 5. It should be noted that, because of the varying technologies of different FPGA generations, a fully fair comparison is not possible.

In [15], a fast architecture for Saber is proposed using a high-speed instruction-set coprocessor on a Xilinx ZCU102 board. That work uses a non-NTT-based approach, taking advantage of the power-of-2 moduli in the Saber scheme, which results in an execution time of 153 µs. Employing multiply-and-accumulate units provides the required trade-off between area and time for different applications. However, this design needs more hardware resources than ours, which results in a 1.1× larger area-time product.

We also compare our work with FrodoKEM-1344, which is based on the standard learning with errors problem. To the best of our knowledge, there is no pure HW work for FrodoKEM targeting security level 5; hence, the results in [6], which uses a HW/SW approach, are reported. As one can see, the FrodoKEM scheme requires a considerable number of cycles compared to other PQC schemes because it performs expensive matrix-vector multiplications. Our implementation of Kyber-1024 is almost 26,000 times faster while occupying almost the same resources as [6].

SIKE [40], an isogeny-based PQC scheme, requires significantly more DSP resources to implement a parallel Montgomery multiplier architecture over a large prime. Although this scheme outperforms the FrodoKEM implementation, our Kyber-1024 design shows a 155 times better area-time product compared to it.

It should be noted that there is a large body of work on optimizing PQC schemes on a variety of platforms. For example, the works of [21] and [28] implement NewHope on a Xilinx XC7Z020 and a Zynq-7000, respectively. The architecture of NewHope is very similar to that of Kyber; however, this scheme has not been selected to continue into the third round of NIST. In [21], a low-complexity architecture of NewHope is introduced, with performance competitive with our design. Hence, taking advantage of this architecture to improve the total performance of Kyber is left for future work.

Although one of the drawbacks of various post-quantum cryptosystems is that they require larger key sizes and more computational power than current pre-quantum algorithms, our proposed implementation already achieves performance comparable to or even significantly better than pre-quantum algorithms [30], [41], [42].

**V. CONCLUSION**

The threat from large-scale quantum computers is real, and we need to act now, as the deployment, integration, and migration to quantum-safe security systems take several years. In this paper, we have presented an instruction-set post-quantum cryptosystem for CRYSTALS-Kyber. Our proposed architecture is synthesized for a Xilinx Artix-7 FPGA (a NIST-recommended platform for prototyping) and an ASIC. Implementing efficient components, including sampling cores, NTT, and point-wise multiplication architectures, increases the performance compared to the state-of-the-art SW and HW/SW implementations. More specifically, our proposed architecture performs the Kyber-512, Kyber-768, and Kyber-1024 protocols in only 148, 209, and 286 μs on an Artix-7 FPGA, respectively. Our future work will focus on side-channel resistance and the development of countermeasures against such attacks.

**ACKNOWLEDGMENT**

The authors would like to thank the reviewers for their comments.

**REFERENCES**


Mojtaba Bisheh-Niasar (Student Member, IEEE) received the B.Sc. degree from Amirkabir University of Technology in 2011 and the M.Sc. degree in electrical engineering from Iran University of Science and Technology in 2015. He is currently pursuing the Ph.D. degree in computer engineering with Florida Atlantic University under the supervision of Dr. Azarderakhsh. He is also a Research Assistant with I-SENSE Lab. He is a Research Intern in Azure Hardware Security Architecture (AHSA) at Microsoft, Redmond, Washington. His research interests include applied cryptography, post-quantum cryptography, and efficient implementation of cryptographic algorithms.

Reza Azarderakhsh (Member, IEEE) received the Ph.D. degree in electrical and computer engineering from Western University in 2011. He has worked at the Center for Applied Cryptographic Research and the Department of Combinatorics and Optimization, University of Waterloo. He is currently an Associate Professor with the Department of Electrical and Computer Engineering, Florida Atlantic University. His current research interests include finite field and its application, elliptic curve cryptography, isogenies on elliptic curves, and lattice-based post-quantum cryptography. He was a recipient of the NSERC Post-Doctoral Research Fellowship. He is serving as an Associate Editor for the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS.

Mehran Mozaffari-Kermani (Senior Member, IEEE) received the B.Sc. degree from the University of Tehran, Iran, and the M.E.Sc. and Ph.D. degrees from the University of Western Ontario, London, Canada, in 2007 and 2011, respectively. In 2012, he joined the Department of Electrical Engineering, Princeton University, NJ, USA, as an NSERC Post-Doctoral Research Fellow. From 2013 to 2017, he was an Assistant Professor with Rochester Institute of Technology. In 2017, he joined the Department of Computer Science and Engineering, University of South Florida, where he is currently an Associate Professor. He has been a TPC Member for a number of conferences, including HOST (publications chair), CCS (publications chair), DAC, DATE, RFIDSec, LightSec, WAIFI, FDTC, and DFT. He is serving as an Associate Editor for the IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, the Transactions on Embedded Computing Systems (ACM), and the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS. He has been a Guest Editor of the IEEE TRANSACTIONS ON DEPENDABLE AND SECURE COMPUTING, the IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, and the IEEE TRANSACTIONS ON EMERGING TOPICS IN COMPUTING for special issues on security.