Primitive Component: SRAM Scrambler

Overview

The scrambling primitive prim_ram_1p_scr employs a reduced-round (7 instead of 11) PRINCE block cipher in CTR mode to scramble the data. The PRINCE lightweight block cipher was selected for its low latency and low area; see prim_prince for more information on PRINCE. The number of rounds is reduced to 7 to ease timing pressure and ensure single-cycle operation (the number of rounds can be increased if there turns out to be enough timing slack).

In CTR mode, the block cipher encrypts a 64-bit IV with the scrambling key to create a 64-bit keystream block. This block is bitwise XORed with the data to transform plaintext into ciphertext and vice versa. The IV is assembled by concatenating a nonce with the word address.
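The CTR data path can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the hardware: PRINCE is replaced by a keyed BLAKE2b hash truncated to 64 bits (toy_block_cipher, a hypothetical stand-in; CTR mode only runs the cipher in the forward direction, so any keyed pseudorandom function shows the data flow), the address width is fixed at 9 bits as for a 512-word memory, and placing the address in the low IV bits is an assumption.

```python
import hashlib

ADDR_WIDTH = 9    # assumption: 512-word memory
BLOCK_BITS = 64   # PRINCE block size
DATA_WIDTH = 32   # example data width (< 64, so the keystream is truncated)


def toy_block_cipher(key: bytes, block: int) -> int:
    """Hypothetical stand-in for PRINCE: a keyed 64-bit PRF (NOT PRINCE)."""
    digest = hashlib.blake2b(block.to_bytes(8, "little"), key=key,
                             digest_size=8).digest()
    return int.from_bytes(digest, "little")


def make_iv(nonce: int, word_addr: int) -> int:
    """Concatenate {nonce, word address} into a 64-bit IV.

    Putting the address in the low bits is an assumption of this sketch.
    """
    nonce_bits = BLOCK_BITS - ADDR_WIDTH
    assert 0 <= nonce < (1 << nonce_bits)
    assert 0 <= word_addr < (1 << ADDR_WIDTH)
    return (nonce << ADDR_WIDTH) | word_addr


def ctr_xor(key: bytes, nonce: int, word_addr: int, data: int) -> int:
    """Encrypt or decrypt one word: both directions are the same XOR."""
    keystream = toy_block_cipher(key, make_iv(nonce, word_addr))
    keystream &= (1 << DATA_WIDTH) - 1  # truncate the 64-bit block to the data width
    return data ^ keystream


key = bytes(range(16))  # 128-bit scrambling key (example value)
ct = ctr_xor(key, 0x1234, 5, 0xDEADBEEF)
assert ctr_xor(key, 0x1234, 5, ct) == 0xDEADBEEF  # applying the XOR twice restores the data
```

Since the keystream depends only on the key, nonce, and address, it can in principle be computed before the data itself is available, which is one reason CTR mode fits a low-latency memory path.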

If the word width of the scrambled memory is smaller than 64 bits, the keystream block is truncated to the data width. If the word width is wider than 64 bits, the scrambling primitive by default instantiates multiple PRINCE primitives to create a unique keystream for the full data width. For area-constrained settings, the parameter ReplicateKeyStream in prim_ram_1p_scr can be set to 1 to replicate the keystream block generated by a single primitive instead of using multiple parallel PRINCE instances (note that this lowers the level of security).
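The two ways of widening the keystream can be compared in a short Python sketch. The stand-in toy_prince is hypothetical (a keyed BLAKE2b hash, not PRINCE), and folding the instance index into its input so that parallel instances yield different blocks is an assumption; how the hardware differentiates its instances is not described here.

```python
import hashlib

BLOCK_BITS = 64


def toy_prince(key: bytes, iv: int, instance: int) -> int:
    """Hypothetical stand-in for one PRINCE instance (NOT PRINCE).

    The instance index is mixed into the input so that parallel instances
    produce different blocks (assumption of this sketch).
    """
    msg = iv.to_bytes(8, "little") + instance.to_bytes(2, "little")
    return int.from_bytes(
        hashlib.blake2b(msg, key=key, digest_size=8).digest(), "little")


def keystream(key: bytes, iv: int, width: int, replicate: bool = False) -> int:
    """Build a keystream of `width` bits from 64-bit blocks."""
    num_blocks = -(-width // BLOCK_BITS)  # ceil(width / 64)
    if replicate:
        # ReplicateKeyStream = 1: a single instance, its block repeated
        blocks = [toy_prince(key, iv, 0)] * num_blocks
    else:
        # ReplicateKeyStream = 0: one instance per 64-bit slice
        blocks = [toy_prince(key, iv, i) for i in range(num_blocks)]
    ks = 0
    for i, block in enumerate(blocks):
        ks |= block << (BLOCK_BITS * i)
    return ks & ((1 << width) - 1)  # truncate to the data width
```

One concrete way replication weakens security: for a 128-bit word, both 64-bit halves are XORed with the same keystream, so XORing the two ciphertext halves cancels the keystream and reveals the XOR of the two plaintext halves.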

To break the linear address space, the CTR mode is augmented with a substitution-permutation (S&P) network that non-linearly remaps the SRAM address, as shown in the block diagram above. The S&P network is similar to the one used in PRESENT and is explained in more detail further below. This address scrambling network additionally XORs in a nonce that has the same width as the address.
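A minimal Python sketch of the address remapping, assuming the address nonce serves as the S&P key (key_i in the pseudocode further below) and the default of two rounds (NumAddrScrRounds). Every step (key XOR, nibble-wise Sbox4, bit reversal, even/odd permutation) is a bijection on AddrWidth bits, so the network permutes all 2^AddrWidth addresses. This is why Depth must be a power of 2 when address scrambling is enabled: otherwise a valid address could be mapped past the end of the memory. The helper scramble_addr is hypothetical.

```python
PRESENT_SBOX4 = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
                 0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]


def bit(x: int, i: int) -> int:
    return (x >> i) & 1


def subst_perm(data: int, key: int, width: int, num_rounds: int) -> int:
    """Forward S&P network following the pseudocode further below."""
    mask = (1 << width) - 1
    state, key = data & mask, key & mask
    for _ in range(num_rounds):
        state ^= key
        # PRESENT Sbox4 on every full nibble; leftover upper bits unchanged
        for i in range(width // 4):
            nib = (state >> (4 * i)) & 0xF
            state = (state & ~(0xF << (4 * i))) | (PRESENT_SBOX4[nib] << (4 * i))
        # Reverse the bit order
        state = sum(bit(state, width - 1 - i) << i for i in range(width))
        # Even bits to the lower half, odd bits to the upper half;
        # for an odd width the top bit stays in place
        half = width // 2
        perm = state & (1 << (width - 1)) if width % 2 else 0
        for i in range(half):
            perm |= bit(state, 2 * i) << i
            perm |= bit(state, 2 * i + 1) << (i + half)
        state = perm
    return state ^ key


def scramble_addr(addr: int, addr_nonce: int, addr_width: int,
                  num_rounds: int = 2) -> int:
    """Hypothetical helper: remap a word address, using the nonce as the key."""
    return subst_perm(addr, addr_nonce, addr_width, num_rounds)


# Depth = 512 -> AddrWidth = 9: every address maps to a unique in-range address
remapped = [scramble_addr(a, 0x1A5, 9) for a in range(512)]
assert sorted(remapped) == list(range(512))
```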

Optionally, the scheme can be augmented by passing each individual data word through a substitution-permutation (S&P) network implemented with the prim_subst_perm primitive to diffuse the data bits. The number of diffusion rounds and the diffusion chunk width can be configured via the NumDiffRounds and DiffWidth parameters, respectively. The same S&P network that is used for address scrambling is leveraged for the data diffusion; for details, see below. If individual bytes need to be writable without a read-modify-write operation, the diffusion chunk width should be set to 8. Since this optional data diffusion can interfere with end-to-end bus and memory integrity schemes, it is disabled by default.
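The effect of the chunk width on partial writes can be sketched in Python. The sketch uses a hypothetical stand-in, mix_chunk, in place of prim_subst_perm (multiplication by an odd constant followed by a constant XOR, which is a bijection on any chunk width but is not the S&P network); only the chunking logic matters here. It also assumes the word width is a multiple of DiffWidth. With DiffWidth = 8, each diffused byte depends on exactly one input byte, so a single byte can be diffused and written on its own.

```python
def mix_chunk(chunk: int, chunk_width: int) -> int:
    """Hypothetical stand-in for prim_subst_perm on one chunk (NOT the real network).

    Multiplying by an odd constant modulo 2**chunk_width and XORing a constant
    are both bijections, so distinct chunk values stay distinct.
    """
    mask = (1 << chunk_width) - 1
    return ((chunk * 0x9D) & mask) ^ (0x5A5A5A5A & mask)


def diffuse(word: int, width: int, diff_width: int) -> int:
    """Diffuse a data word chunk by chunk (width assumed divisible by diff_width)."""
    assert width % diff_width == 0
    mask = (1 << diff_width) - 1
    out = 0
    for i in range(width // diff_width):
        chunk = (word >> (i * diff_width)) & mask
        out |= mix_chunk(chunk, diff_width) << (i * diff_width)
    return out


def changed_bytes(a: int, b: int, width: int) -> list:
    """Indices of the bytes that differ between two words."""
    return [i for i in range(width // 8)
            if ((a >> (8 * i)) & 0xFF) != ((b >> (8 * i)) & 0xFF)]


old = 0x11223344
new = 0x11AA3344  # only byte 2 is rewritten
assert changed_bytes(diffuse(old, 32, 8), diffuse(new, 32, 8), 32) == [2]
```

With DiffWidth = 32, the whole word would form a single chunk, so writing one byte would first require reading the other three bytes, which is exactly the read-modify-write that a chunk width of 8 avoids.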

Parameters

The following table lists the instantiation parameters of the prim_ram_1p_scr primitive. These are not exposed in the sram_ctrl IP but must be set directly when instantiating prim_ram_1p_scr in the top level.

Parameter	Default (Max)	Top Earlgrey	Description
`Depth`	512	multiple	SRAM depth, needs to be a power of 2 if `NumAddrScrRounds` > 0.
`Width`	32	32	Effective SRAM width without redundancy.
`DataBitsPerMask`	8	8	Number of data bits per write mask.
`EnableParity`	1	1	This parameter enables byte parity.
`CfgWidth`	8	8	Width of SRAM attributes field.
`NumPrinceRoundsHalf`	3 (5)	3	Number of PRINCE half-rounds.
`NumDiffRounds`	0	0	Number of additional diffusion rounds, set to 0 to disable.
`DiffWidth`	8	8	Width of the diffusion chunks; set to 8 for intra-byte diffusion.
`NumAddrScrRounds`	2	2	Number of address scrambling rounds, set to 0 to disable.
`ReplicateKeyStream`	0 (1)	0	If set to 1, the same 64-bit keystream block is replicated if the data port is wider than 64 bits. Otherwise, multiple PRINCE primitives are employed to generate a unique keystream for the full data width.
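The constraints in the table can be captured in a small configuration check. This Python sketch is hypothetical: the class and function names are invented, AddrWidth is assumed to be ceil(log2(Depth)), the lower bound of 1 on NumPrinceRoundsHalf is an assumption, and the relation rounds = 2 * NumPrinceRoundsHalf + 1 is inferred only from the figures in the overview (3 half-rounds for 7 rounds, 5 for the full 11).

```python
from dataclasses import dataclass


@dataclass
class ScrParams:
    """Defaults from the table above (subset relevant to the derived values)."""
    Depth: int = 512
    Width: int = 32
    NumPrinceRoundsHalf: int = 3
    NumAddrScrRounds: int = 2
    ReplicateKeyStream: int = 0


def derive(p: ScrParams) -> dict:
    """Validate the table's constraints and compute a few derived quantities."""
    if p.Depth < 1:
        raise ValueError("Depth must be positive")
    if p.NumAddrScrRounds > 0 and p.Depth & (p.Depth - 1):
        raise ValueError("Depth must be a power of 2 when NumAddrScrRounds > 0")
    if not 1 <= p.NumPrinceRoundsHalf <= 5:  # lower bound of 1 is an assumption
        raise ValueError("NumPrinceRoundsHalf must be in 1..5")
    if p.ReplicateKeyStream not in (0, 1):
        raise ValueError("ReplicateKeyStream must be 0 or 1")
    return {
        # ceil(log2(Depth)); the exact helper used in the RTL is an assumption
        "addr_width": (p.Depth - 1).bit_length(),
        # One PRINCE instance per 64-bit slice unless the keystream is replicated
        "prince_instances": 1 if p.ReplicateKeyStream else -(-p.Width // 64),
        # Assumed relation: 3 half-rounds -> 7 rounds, 5 -> 11 (full PRINCE)
        "prince_rounds": 2 * p.NumPrinceRoundsHalf + 1,
    }
```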

Signal Interfaces

Signal	Direction	Type	Description
`key_valid_i`	`input`	`logic`	Indicates whether the key and nonce are considered valid. New memory requests are blocked if this is set to 0.
`key_i`	`input`	`logic [127:0]`	Scrambling key.
`nonce_i`	`input`	`logic [NonceWidth-1:0]`	Scrambling nonce.
`req_i`	`input`	`logic`	Memory request indication signal (from TL-UL SRAM adapter).
`gnt_o`	`output`	`logic`	Grant signal for memory request (to TL-UL SRAM adapter).
`write_i`	`input`	`logic`	Indicates that this is a write operation (from TL-UL SRAM adapter).
`addr_i`	`input`	`logic [AddrWidth-1:0]`	Address for memory op (from TL-UL SRAM adapter).
`wdata_i`	`input`	`logic [Width-1:0]`	Write data (from TL-UL SRAM adapter).
`wmask_i`	`input`	`logic [Width-1:0]`	Write mask (from TL-UL SRAM adapter).
`intg_error_i`	`input`	`logic`	Indicates whether the incoming transaction has an integrity error.
`rdata_o`	`output`	`logic [Width-1:0]`	Read data output (to TL-UL SRAM adapter).
`rvalid_o`	`output`	`logic`	Read data valid indication (to TL-UL SRAM adapter).
`rerror_o`	`output`	`logic [1:0]`	Error indication (to TL-UL SRAM adapter). Bit 0 indicates a correctable and bit 1 an uncorrectable error. Note that at this time, only uncorrectable errors are reported, since the scrambling device only supports byte parity.
`raddr_o`	`output`	`logic [31:0]`	Address of the faulty read operation.
`cfg_i`	`input`	`logic [CfgWidth-1:0]`	Attributes for physical memory macro.
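The error reporting described for rerror_o can be illustrated with a small Python model of byte parity. It is a sketch under assumptions: the parity polarity (even here), where the parity is computed, and the helper names byte_parity and rerror are hypothetical; only the bit assignment of rerror_o comes from the table above.

```python
def byte_parity(word: int, width: int = 32) -> int:
    """One parity bit per byte (even parity chosen arbitrarily for this sketch)."""
    bits = 0
    for i in range(width // 8):
        byte = (word >> (8 * i)) & 0xFF
        bits |= (bin(byte).count("1") & 1) << i
    return bits


def rerror(word: int, parity: int, width: int = 32) -> int:
    """2-bit rerror_o value: bit 0 correctable, bit 1 uncorrectable."""
    # Parity detects a flipped bit but cannot locate it, so a mismatch can only
    # be reported as uncorrectable; bit 0 is never set.
    return 0b00 if byte_parity(word, width) == parity else 0b10


stored = 0x12345678
par = byte_parity(stored)
assert rerror(stored, par) == 0b00
assert rerror(stored ^ (1 << 9), par) == 0b10  # single-bit flip in byte 1
```

Note that two bit flips within the same byte leave its parity unchanged and therefore go undetected by this scheme.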

Custom Substitution Permutation Network

In addition to the PRINCE primitive, prim_ram_1p_scr employs a custom S&P network for byte diffusion and address scrambling. Its structure is similar to the one used in PRESENT, but it uses a modified permutation function that makes it possible to parameterize the network to arbitrary data widths, as shown in the pseudocode below.


const int NUM_ROUNDS = 2;
const int DATA_WIDTH = 8; // bit width of the data

// Apply PRESENT Sbox4 on all nibbles, leave uppermost bits unchanged
// if the width is not divisible by 4.
state_t sbox4_layer(state_t state) {
    for (int i = 0; i < DATA_WIDTH/4; i++) {
        nibble_t nibble = get_nibble(state, i);
        nibble = present_sbox4(nibble);
        set_nibble(state, i, nibble);
    }
    return state;
}

// Reverses the bit order.
state_t flip_vector(state_t state) {
    state_t state_flipped;
    for (int i = 0; i < DATA_WIDTH; i++) {
        state_flipped[i] = state[DATA_WIDTH-1-i];
    }
    return state_flipped;
}

// Gather all even bits and put them into the lower half.
// Gather all odd bits and put them into the upper half.
state_t perm_layer(state_t state) {
    // Initialize with input state.
    // If the number of bits is odd, the uppermost bit
    // will stay in position, as intended.
    state_t state_perm = state;
    for (int i = 0; i < DATA_WIDTH/2; i++) {
        state_perm[i]                = state[i * 2];
        state_perm[i + DATA_WIDTH/2] = state[i * 2 + 1];
    }
    return state_perm;
}

state_t prim_subst_perm(state_t data_i, state_t key_i) {

    state_t state = data_i;
    for (int i = 0; i < NUM_ROUNDS; i++) {
        state ^= key_i;
        state = sbox4_layer(state);
        // The vector flip and permutation operations have the
        // combined effect that all bits will be passed through an
        // Sbox4 eventually, even if the number of bits in data_i
        // is not aligned with 4.
        state = flip_vector(state);
        state = perm_layer(state);
    }

    return state ^ key_i;
}
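For experimentation, the pseudocode translates directly into runnable Python. This is a sketch, not the RTL: states are plain integers with bit 0 as the least significant bit, and the data width is a function argument instead of a constant. The decrypt flag, which undoes the round steps in reverse order as a read path would need to in order to reverse data diffusion, is an addition of this sketch and does not appear in the pseudocode.

```python
PRESENT_SBOX4 = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
                 0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]
PRESENT_SBOX4_INV = [PRESENT_SBOX4.index(v) for v in range(16)]


def sbox4_layer(state: int, width: int, sbox: list) -> int:
    """Apply an Sbox4 to all full nibbles; leftover upper bits stay unchanged."""
    for i in range(width // 4):
        nibble = (state >> (4 * i)) & 0xF
        state = (state & ~(0xF << (4 * i))) | (sbox[nibble] << (4 * i))
    return state


def flip_vector(state: int, width: int) -> int:
    """Reverse the bit order (the operation is its own inverse)."""
    return sum(((state >> (width - 1 - i)) & 1) << i for i in range(width))


def perm_layer(state: int, width: int, inverse: bool = False) -> int:
    """Even bits to the lower half, odd bits to the upper half (or the inverse)."""
    half = width // 2
    out = state & (1 << (width - 1)) if width % 2 else 0  # odd top bit stays
    for i in range(half):
        for src, dst in ((2 * i, i), (2 * i + 1, i + half)):
            if inverse:
                src, dst = dst, src
            out |= ((state >> src) & 1) << dst
    return out


def prim_subst_perm(data: int, key: int, width: int, num_rounds: int,
                    decrypt: bool = False) -> int:
    mask = (1 << width) - 1
    state, key = data & mask, key & mask
    for _ in range(num_rounds):
        state ^= key
        if not decrypt:
            state = sbox4_layer(state, width, PRESENT_SBOX4)
            state = flip_vector(state, width)
            state = perm_layer(state, width)
        else:
            # Undo the forward round steps in reverse order
            state = perm_layer(state, width, inverse=True)
            state = flip_vector(state, width)
            state = sbox4_layer(state, width, PRESENT_SBOX4_INV)
    return state ^ key


# 8-bit diffusion with two rounds, followed by the inverse
c = prim_subst_perm(0xA5, 0x3C, 8, 2)
assert prim_subst_perm(c, 0x3C, 8, 2, decrypt=True) == 0xA5
```

Trying odd widths such as 5 or 9 shows the effect described in the comment inside prim_subst_perm: the bit left out of the Sbox4 layer in one round is moved into a full nibble by the flip and permutation steps.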