Implementing MMKV in Rust: Data Encoding and Decoding

Encoding and Decoding Abstraction

The previous post implemented mmap. This post adds data validation and encryption on top of that basic functionality.

Serialization and deserialization are abstracted into traits:

// Encoding. The Send bound is needed because later code moves ownership of the Encoder across threads
pub trait Encoder: Send {
    fn encode_to_bytes(&self, raw_buffer: &Buffer, position: u32) -> Result<Vec<u8>>;
}

// Wrapper for the result of decoding
pub struct DecodeResult {
    // The decoded data
    pub buffer: Option<Buffer>,
    // Number of bytes consumed by decoding
    pub len: u32,
}

// Decoding. Send is required for the same reason as for Encoder
pub trait Decoder: Send {
    fn decode_bytes(&self, data: &[u8], position: u32) -> Result<DecodeResult>;
}
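To make the contract concrete, here is a minimal sketch of a codec that satisfies both traits. `Buffer`, `Error`, `PlainCodec` and `encode_on_thread` are stand-ins invented for this example, not the project's real types; the codec only writes a 4-byte big-endian length in front of the payload.

```rust
use std::convert::TryInto;

// Stand-in for the project's `Buffer`; it only holds raw bytes (assumption).
#[derive(Debug, Clone, PartialEq)]
pub struct Buffer(pub Vec<u8>);

// Stand-in error type (assumption).
#[derive(Debug, PartialEq)]
pub enum Error {
    DataInvalid,
}
pub type Result<T> = std::result::Result<T, Error>;

pub trait Encoder: Send {
    fn encode_to_bytes(&self, raw_buffer: &Buffer, position: u32) -> Result<Vec<u8>>;
}

pub struct DecodeResult {
    pub buffer: Option<Buffer>,
    pub len: u32,
}

pub trait Decoder: Send {
    fn decode_bytes(&self, data: &[u8], position: u32) -> Result<DecodeResult>;
}

// Hypothetical codec: 4-byte big-endian length followed by the payload.
pub struct PlainCodec;

impl Encoder for PlainCodec {
    fn encode_to_bytes(&self, raw_buffer: &Buffer, _: u32) -> Result<Vec<u8>> {
        let mut out = (raw_buffer.0.len() as u32).to_be_bytes().to_vec();
        out.extend_from_slice(&raw_buffer.0);
        Ok(out)
    }
}

impl Decoder for PlainCodec {
    fn decode_bytes(&self, data: &[u8], _: u32) -> Result<DecodeResult> {
        // A missing or truncated header is reported as an error.
        let head = data.get(0..4).ok_or(Error::DataInvalid)?;
        let len = u32::from_be_bytes(head.try_into().map_err(|_| Error::DataInvalid)?) as usize;
        let body = data.get(4..4 + len).ok_or(Error::DataInvalid)?;
        Ok(DecodeResult {
            buffer: Some(Buffer(body.to_vec())),
            len: (4 + len) as u32,
        })
    }
}

// Moving a boxed encoder into another thread compiles only because of `Send`.
pub fn encode_on_thread(encoder: Box<dyn Encoder>, buffer: Buffer) -> Result<Vec<u8>> {
    std::thread::spawn(move || encoder.encode_to_bytes(&buffer, 0))
        .join()
        .expect("worker thread panicked")
}
```

Calling `encode_on_thread(Box::new(PlainCodec), Buffer(vec![1, 2, 3]))` produces the 4-byte header followed by the three payload bytes, and `decode_bytes` reports a consumed length of 7 for that frame.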

Data Validation

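First, the idea. Following the original MMKV, validation uses a CRC algorithm. The original MMKV, however, computes one CRC over the whole file, and if that check fails at initialization it discards the entire file (the original does also have a recovery strategy). That approach seems inefficient, and throwing away all data is too aggressive, so this post takes a different route: every time a record is appended, a checksum is computed for that record and written together with the data. That way only the records that fail validation are discarded, and the rest remain usable. Weighing efficiency against data size, I chose CRC8: a single record is not expected to be large, so an 8-bit checksum should be enough (this is an estimate from experience and has not been verified), and it adds only one byte of storage per record.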
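The per-record layout can be sketched without any dependency. The bitwise `crc8` below uses the CRC-8/AUTOSAR parameters (polynomial 0x2F, initial value 0xFF, final XOR 0xFF, no reflection); it is written by hand for illustration and is assumed, not verified here, to agree with the `crc` crate. `frame` and `verify` are hypothetical helper names.

```rust
// Bitwise CRC-8 with the CRC-8/AUTOSAR parameters:
// polynomial 0x2F, initial value 0xFF, final XOR 0xFF, no bit reflection.
pub fn crc8(data: &[u8]) -> u8 {
    let mut crc: u8 = 0xFF;
    for &byte in data {
        crc ^= byte;
        for _ in 0..8 {
            crc = if crc & 0x80 != 0 {
                (crc << 1) ^ 0x2F
            } else {
                crc << 1
            };
        }
    }
    crc ^ 0xFF
}

// Frames one record as [u32 BE length][payload][1-byte CRC].
// The length counts the payload plus the checksum byte.
pub fn frame(payload: &[u8]) -> Vec<u8> {
    let mut out = (payload.len() as u32 + 1).to_be_bytes().to_vec();
    out.extend_from_slice(payload);
    out.push(crc8(payload));
    out
}

// Returns true when the trailing checksum matches the payload of a frame
// produced by `frame`.
pub fn verify(frame: &[u8]) -> bool {
    if frame.len() < 5 {
        return false;
    }
    let (body, sum) = frame[4..].split_at(frame.len() - 5);
    crc8(body) == sum[0]
}
```

A 5-byte payload becomes a 10-byte frame, and flipping any single bit of the payload makes `verify` return false, so only that record needs to be dropped.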

The validation logic is wrapped in the crc mod:

// Initialize the CRC8 instance
const CRC8: Crc<u8> = Crc::<u8>::new(&CRC_8_AUTOSAR);

// Wrap Buffer in a tuple struct
pub struct CrcBuffer(Option<Buffer>);

// Implement Encoder
impl Encoder for CrcBuffer {
    fn encode_to_bytes(&self, raw_buffer: &Buffer, _: u32) -> Result<Vec<u8>> {
        // Serialize the buffer passed in by the caller into bytes
        let bytes_to_write = raw_buffer.to_bytes();
        // Compute the checksum; `sum` is a u8 and takes one byte
        let sum = CRC8.checksum(bytes_to_write.as_slice());
        // The full length is the payload length plus one checksum byte
        let len = bytes_to_write.len() as u32 + 1;
        // Write the length at the head, then the payload, then the checksum
        let mut data = len.to_be_bytes().to_vec();
        data.extend_from_slice(bytes_to_write.as_slice());
        data.push(sum);
        Ok(data)
    }
}

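// Implement Decoder
impl Decoder for CrcBuffer {
    fn decode_bytes(&self, data: &[u8], _: u32) -> Result<DecodeResult> {
        // Read the length first. If even this fails, the data is too damaged to
        // continue reading, so return an Error
        let offset = size_of::<u32>();
        let len_bytes = data.get(0..offset).ok_or(DataInvalid)?;
        let item_len = u32::from_be_bytes(len_bytes.try_into().map_err(|_| DataInvalid)?);
        // The checksum is the last byte of the record. A zero length or one that
        // points past the end of the data means the header itself is corrupted
        let sum_index = offset + item_len as usize - 1;
        if item_len == 0 || sum_index >= data.len() {
            return Err(DataInvalid);
        }
        // The payload runs from byte 4 up to, but not including, the checksum
        // (the first 4 bytes hold the length, the last byte holds the checksum)
        let bytes_to_decode = &data[offset..sum_index];
        let read_len = offset as u32 + item_len;
        // Verify the checksum
        let sum = data[sum_index];
        let result = if CRC8.checksum(bytes_to_decode) == sum {
            Buffer::from_encoded_bytes(bytes_to_decode)
        } else {
            Err(DecodeFailed("CRC check failed".to_string()))
        };
        let buffer = match result {
            Ok(data) => Some(data),
            Err(e) => {
                error!(LOG_TAG, "Failed to decode data, reason: {:?}", e);
                None
            }
        };
        // Return the number of bytes read whether or not validation succeeded,
        // so that decoding can continue with the next record
        Ok(DecodeResult {
            buffer,
            len: read_len,
        })
    }
}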

Next, the iterator needs to be reworked so that it depends on the abstraction rather than on a concrete implementation.

pub struct Iter<'a, F>
where
    F: Fn(&[u8], u32) -> crate::Result<DecodeResult>,
{
    mm: &'a MemoryMap,
    pub position: u32,
    start: usize,
    end: usize,
    decode: F,
}

impl MemoryMap {
    // A closure is passed in when the iterator is built, isolating the decoding implementation
    pub fn iter<F>(&self, decode: F) -> Iter<F>
    where
        F: Fn(&[u8], u32) -> crate::Result<DecodeResult>,
    {
        let start = LEN_OFFSET;
        let end = self.offset();
        Iter {
            mm: self,
            position: 0,
            start,
            end,
            decode,
        }
    }
}

// Implement Iterator for Iter
impl<'a, F> Iterator for Iter<'a, F>
where
    F: Fn(&[u8], u32) -> crate::Result<DecodeResult>,
{
    // The associated type is now an Option, because a record may be readable yet discarded for failing validation
    type Item = Option<Buffer>;

    fn next(&mut self) -> Option<Self::Item> {
        if self.start >= self.end {
            return None;
        }
        let bytes = self.mm.read(self.start..self.end);
        // Call the closure to decode the data
        let decode_result = (self.decode)(bytes, self.position);
        self.position += 1;
        match decode_result {
            Ok(result) => {
                self.start += result.len as usize;
                Some(result.buffer)
            }
            Err(e) => {
                error!(LOG_TAG, "Failed to iter memory map, reason: {:?}", e);
                None
            }
        }
    }
}

impl<'a, F> Iter<'a, F>
where
    F: Fn(&[u8], u32) -> crate::Result<DecodeResult>,
{
    // Helper method. The caller obtains the decoded map with:
    // let (kv_map, decoded_position) = mm
    //     .iter(|bytes, position| decoder.decode_bytes(bytes, position))
    //     .into_map();
    pub fn into_map(self) -> (HashMap<String, Buffer>, u32) {
        let mut iter_count = 0;
        let mut map = HashMap::new();
        self.for_each(|buffer| {
            iter_count += 1;
            if let Some(data) = buffer {
                if data.is_deleting() {
                    map.remove(data.key());
                } else {
                    map.insert(data.key().to_string(), data);
                }
            }
        });
        (map, iter_count)
    }
}
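The iterator and `into_map` above can be exercised end to end with a self-contained sketch over a plain byte slice. Everything here is a stand-in: `Frames` plays the role of `Iter`, `Record` replaces `Buffer`, and `toy_decode` uses a made-up fixed 3-byte frame (`[tag, key, value]`, tag 0 = put, tag 1 = delete, anything else = failed validation) instead of the CRC format.

```rust
use std::collections::HashMap;

// Stand-in for `Buffer`: a key, a value and a deletion marker (assumption).
#[derive(Debug, Clone, PartialEq)]
pub struct Record {
    pub key: u8,
    pub value: u8,
    pub deleting: bool,
}

// `record` is None when the frame was readable but rejected;
// `len` is the number of bytes the frame occupied.
pub struct Decoded {
    pub record: Option<Record>,
    pub len: usize,
}

// Iterator over frames that delegates decoding to a closure.
pub struct Frames<'a, F>
where
    F: Fn(&[u8], u32) -> Result<Decoded, String>,
{
    data: &'a [u8],
    position: u32,
    start: usize,
    decode: F,
}

impl<'a, F> Frames<'a, F>
where
    F: Fn(&[u8], u32) -> Result<Decoded, String>,
{
    pub fn new(data: &'a [u8], decode: F) -> Self {
        Frames { data, position: 0, start: 0, decode }
    }

    // Replays records in write order: later writes win, deletions remove keys.
    // Rejected records are skipped but still counted.
    pub fn into_map(self) -> (HashMap<u8, Record>, u32) {
        let mut count = 0;
        let mut map = HashMap::new();
        for item in self {
            count += 1;
            if let Some(record) = item {
                if record.deleting {
                    map.remove(&record.key);
                } else {
                    map.insert(record.key, record);
                }
            }
        }
        (map, count)
    }
}

impl<'a, F> Iterator for Frames<'a, F>
where
    F: Fn(&[u8], u32) -> Result<Decoded, String>,
{
    type Item = Option<Record>;

    fn next(&mut self) -> Option<Self::Item> {
        if self.start >= self.data.len() {
            return None;
        }
        let result = (self.decode)(&self.data[self.start..], self.position);
        self.position += 1;
        match result {
            // A rejected record still advances the cursor, so later frames stay reachable.
            Ok(decoded) => {
                self.start += decoded.len;
                Some(decoded.record)
            }
            // An unreadable frame stops the walk.
            Err(_) => None,
        }
    }
}

// Toy decoder for the made-up 3-byte frame described above.
pub fn toy_decode(bytes: &[u8], _position: u32) -> Result<Decoded, String> {
    let frame = bytes.get(0..3).ok_or("truncated frame")?;
    let record = match frame[0] {
        0 => Some(Record { key: frame[1], value: frame[2], deleting: false }),
        1 => Some(Record { key: frame[1], value: frame[2], deleting: true }),
        _ => None,
    };
    Ok(Decoded { record, len: 3 })
}
```

With `Frames::new(&data, toy_decode).into_map()`, a rejected frame in the middle of `data` shows up as a `None` item and is counted, while the records written after it are still applied to the map.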

Data Encryption

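Encryption is a common requirement on Android (root restrictions on Android are relatively loose; without encryption, an app that has been granted root access can read almost any file). Following the original MMKV, encryption uses AES. As with CRC, each record is encrypted individually, and the storage overhead should stay small, so EAX mode is used: it supports streaming encryption, needs no padding, and allows a custom tag length. The tag length is set to 8 bytes to limit storage overhead (the ciphertext is the plaintext length plus the tag length, so encryption costs 8 extra bytes per record).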
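The storage cost of the two layouts can be compared with a few lines of arithmetic. This is only a sketch: the constants mirror the numbers given in this post (a 4-byte length prefix, a 1-byte CRC, an 8-byte EAX tag), and the function names are made up for the example.

```rust
// Length prefix written before every record (a big-endian u32).
const LEN_PREFIX: usize = 4;
// One checksum byte per record in the CRC8 layout.
const CRC_LEN: usize = 1;
// Authentication tag length chosen for EAX in this post.
const TAG_LEN: usize = 8;

// On-disk size of a record in the CRC8 layout.
pub fn crc_frame_size(payload: usize) -> usize {
    LEN_PREFIX + payload + CRC_LEN
}

// On-disk size of a record in the encrypted layout.
pub fn encrypted_frame_size(payload: usize) -> usize {
    LEN_PREFIX + payload + TAG_LEN
}
```

For a 20-byte payload this gives 25 bytes with CRC8 and 32 bytes with encryption, a fixed difference of 7 bytes per record.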

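Encryption is not an essential feature: it noticeably increases binary size, reduces performance, and increases storage usage. It also doubles as data validation, because AES-EAX is an authenticated mode whose tag verifies the integrity of each record, so there is no need to stack CRC on top of it. Encryption is therefore an optional feature, and the related code is compiled only when that feature is selected: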

[dependencies]
......
eax = { version = "0.5.0", features = ["stream"], optional = true }
aes = { version = "0.8.3", optional = true }
hex = {version = "0.4.3", optional = true }

[features]
default = []
encryption = ["dep:eax", "dep:aes", "dep:hex"]

The encryption and decryption logic lives in the encrypt mod, which is compiled only when the feature is enabled:

// core/mod.rs
#[cfg(feature = "encryption")]
mod encrypt;
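The same `cfg` gate can also choose between two implementations at compile time. The sketch below assumes that the CRC codec is the fallback when the feature is off; the inline modules and `codec_name`/`active_codec` are illustrative names, not the project's API.

```rust
// Compiled only when the `encryption` feature is enabled.
#[cfg(feature = "encryption")]
mod encrypt {
    pub fn codec_name() -> &'static str {
        "aes-eax"
    }
}

// Compiled only when the feature is off (assumed fallback).
#[cfg(not(feature = "encryption"))]
mod crc {
    pub fn codec_name() -> &'static str {
        "crc8"
    }
}

// Exactly one of these two definitions survives compilation.
#[cfg(feature = "encryption")]
pub fn active_codec() -> &'static str {
    encrypt::codec_name()
}

#[cfg(not(feature = "encryption"))]
pub fn active_codec() -> &'static str {
    crc::codec_name()
}
```

Built without `--features encryption`, `active_codec()` resolves to the CRC variant and none of the encryption code is compiled or linked.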

The implementation:

// EAX with a 16-byte key and an 8-byte tag
type Aes128Eax = Eax<Aes128, U8>;
type Stream = StreamBE32<Aes128Eax>;

// Struct that wraps the encryption and decryption logic
#[derive(Clone)]
pub struct Encryptor {
    meta_file_path: PathBuf,
    encryptor: Arc<StreamWrapper>,
}

// Implement Encoder
impl Encoder for EncryptBuffer {
    fn encode_to_bytes(&self, raw_buffer: &Buffer, position: u32) -> Result<Vec<u8>> {
        let bytes_to_write = raw_buffer.to_bytes();
        // Encrypt the bytes
        let crypt_bytes = self.encryptor.encrypt(bytes_to_write, position)?;
        let len = crypt_bytes.len() as u32;
        let mut data = len.to_be_bytes().to_vec();
        // The first 4 bytes hold the length; the ciphertext follows
        data.extend_from_slice(crypt_bytes.as_slice());
        Ok(data)
    }
}

// Implement Decoder
impl Decoder for EncryptBuffer {
    fn decode_bytes(&self, data: &[u8], position: u32) -> Result<DecodeResult> {
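        let data_offset = size_of::<u32>();
        // Read the length first. If even this fails, the data is too damaged to
        // continue reading, so return an Error
        let len_bytes = data.get(0..data_offset).ok_or(DataInvalid)?;
        let item_len = u32::from_be_bytes(len_bytes.try_into().map_err(|_| DataInvalid)?);
        // A corrupted length may point past the end of the data
        let bytes_to_decode = data
            .get(data_offset..(data_offset + item_len as usize))
            .ok_or(DataInvalid)?;
        let read_len = data_offset as u32 + item_len;
        // Decrypt the bytes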
        let result = self
            .encryptor
            .decrypt(bytes_to_decode.to_vec(), position)
            .and_then(|vec| Buffer::from_encoded_bytes(vec.as_slice()));
        let buffer = match result {
            Ok(data) => Some(data),
            Err(e) => {
                error!(LOG_TAG, "Failed to decode data, reason: {:?}", e);
                None
            }
        };
        Ok(DecodeResult {
            buffer,
            len: read_len,
        })
    }
}

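Encryptor and StreamWrapper wrap the details of initializing the cipher and calling the encryption and decryption APIs. They are not covered here; see the code repository and the documentation of the AEAD crate.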

Data validation and encryption are now complete. The next post covers the MMKV implementation itself.