Bugs encountered during the network upgrade (Part 2)

Continuing from the previous post: the problems I ran into during this upgrade deserve a bit more detail.

Future Cancellation safety

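When suspicion narrowed to the select macro, I wondered for a while whether messages were being lost because of a Future cancellation safety problem. The concept is a bit abstract; understanding it fully requires knowing how both the select macro and Futures work.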

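The select macro was dissected in the previous post. What it really does is: "call the poll methods of the Futures supplied by the user in random order; if one is ready, return immediately, otherwise return pending". So what can go wrong here, and why is there a cancellation safety restriction?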

To answer that, we have to start with how a Future runs and what its intermediate state is:

pub trait Future {
    type Output;

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output>;
}

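Every time a Future runs, the entry point is the poll function. So if a Future has intermediate state and the implementation does nothing special to preserve it, that state is lost, and we say the Future is not cancellation safe.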

A simple example:

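use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

struct Yield {
    // Records whether poll has already been called once.
    polled: bool,
}

impl Future for Yield {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        if self.polled {
            return Poll::Ready(());
        }
        self.polled = true;
        // Ask the executor to poll again right away.
        cx.waker().wake_by_ref();
        Poll::Pending
    }
}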

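This is a simple yield implementation. It returns pending the first time poll runs and ready the second time, so it has to keep a piece of state recording how many times it has been polled. That retained state is the Future's intermediate state, and the implementation above is correct.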
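To watch that state do its job without an executor, here is a minimal sketch that polls the future by hand. `CountingWaker` and `poll_to_completion` are hypothetical helpers invented for this illustration; they use only the standard library and are not part of tokio or of the code discussed in this post.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Same shape as the yield future: Pending on the first poll, Ready on the second.
struct Yield {
    polled: bool,
}

impl Future for Yield {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.polled {
            return Poll::Ready(());
        }
        self.polled = true;
        cx.waker().wake_by_ref();
        Poll::Pending
    }
}

/// Hypothetical waker that only counts how often it was woken.
struct CountingWaker(AtomicUsize);

impl Wake for CountingWaker {
    fn wake(self: Arc<Self>) {
        self.0.fetch_add(1, Ordering::SeqCst);
    }
}

/// Polls `fut` in a loop until it is ready.
/// Returns (number of polls, number of wake-ups requested by the future).
fn poll_to_completion<F: Future + Unpin>(mut fut: F) -> (usize, usize) {
    let counter = Arc::new(CountingWaker(AtomicUsize::new(0)));
    let waker = Waker::from(counter.clone());
    let mut cx = Context::from_waker(&waker);
    let mut polls = 0;
    loop {
        polls += 1;
        if Pin::new(&mut fut).poll(&mut cx).is_ready() {
            return (polls, counter.0.load(Ordering::SeqCst));
        }
    }
}
```

Calling `poll_to_completion(Yield { polled: false })` takes two polls and sees one wake-up, because the `polled` flag survives between the two calls to poll.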

Now suppose we change it to this:

struct Yield {}

impl Future for Yield {
    type Output = ();
    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        // Bug: this local is re-created on every call, so it is always false here.
        let mut polled = false;
        if polled {
            return Poll::Ready(());
        }
        polled = true;
        cx.waker().wake_by_ref();
        Poll::Pending
    }
}

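This is a serious mistake: the Future resets itself to its initial state on every poll, so it can never become Ready and keeps getting scheduled forever. A Future like this is simply an implementation bug.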

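There is another kind of error, which comes from async fn, or more directly, from the lifetime-bound Future that an async fn returns. Once that Future goes out of scope and is dropped, its intermediate state disappears completely, so a Future of this kind is not cancellation safe.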

tokio also gives a short explanation:

To determine whether your own methods are cancellation safe, look for the location of uses of .await. This is because when an asynchronous method is cancelled, that always happens at an .await. If your function behaves correctly even if it is restarted while waiting at an .await, then it is cancellation safe.

This is aimed more at fully async code, especially at several awaits called one after another inside a single async context.
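The lost-state case can be sketched with the standard library alone. Everything below is a hypothetical illustration, not code from tentacle or tokio: `forward_one` takes a message out of an inbox, awaits once, and only then records the message. Polling it once and dropping it, which is what happens to a select branch that loses, discards the message held in its local variable.

```rust
use std::cell::RefCell;
use std::collections::VecDeque;
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Pending on the first poll, Ready on the second.
struct YieldOnce(bool);

impl Future for YieldOnce {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.0 {
            return Poll::Ready(());
        }
        self.0 = true;
        cx.waker().wake_by_ref();
        Poll::Pending
    }
}

/// Waker that does nothing; enough for polling by hand.
struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Hypothetical handler: the message lives only in this future's local state
/// between the pop and the push.
async fn forward_one(inbox: &RefCell<VecDeque<u32>>, outbox: &RefCell<Vec<u32>>) {
    let msg = inbox.borrow_mut().pop_front();
    YieldOnce(false).await; // cancellation can only happen at this await
    if let Some(m) = msg {
        outbox.borrow_mut().push(m);
    }
}

/// Polls `forward_one` exactly once, then drops it.
/// Returns (messages left in the inbox, messages that reached the outbox).
fn cancel_after_first_poll() -> (usize, usize) {
    let inbox = RefCell::new(VecDeque::from(vec![1, 2, 3]));
    let outbox = RefCell::new(Vec::new());
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    {
        let mut fut = Box::pin(forward_one(&inbox, &outbox));
        assert!(fut.as_mut().poll(&mut cx).is_pending());
    } // `fut` is dropped here, together with the popped message
    let remaining = inbox.borrow().len();
    let forwarded = outbox.borrow().len();
    (remaining, forwarded)
}
```

After `cancel_after_first_poll()` the inbox holds two messages and the outbox none: the first message was taken but never delivered. Restarting `forward_one` would pick up the second message, so by tokio's criterion this function is not cancellation safe.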

Reuse Address

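Another problem that took a long time to debug was a difference in behavior caused by different settings on macOS and Linux. As everyone knows, I work on Arch Linux day to day, and I neither own nor plan to own a Mac. Debugging this meant temporarily spinning up a cloud machine, and also learning a little about how macOS differs. That said, this problem was not really caused by a difference between the systems; it was more of an accident.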

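In this tentacle PR I introduced the tcp config feature, which hands every configuration option over to the user, and I also changed how listen is initialized. That caused a problem: the original tcp listen went through mio's default path, which on unix systems automatically sets reuse_addr. I had not noticed this, and that is what produced the difference in behavior. In the end I restored that default in the PR.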

Uninit Memory

Working with uninit memory in Rust is a real hassle. To leave room for possible future changes, the language simply stipulates:

As a consequence, zero-initializing a variable of reference type causes instantaneous undefined behavior, no matter whether that reference ever gets used to access memory:

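In other words, if you have a block of uninitialized memory, you may only operate on it through raw pointers; any read, or any passing of it by reference, is immediate UB.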

What does passing by reference mean? Something like this:

pub fn copy_from_slice(&mut self, src: &[T])
where
    T: Copy,

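We know that slices have a copy_from_slice method, and we also know that it is ultimately implemented with copy_nonoverlapping, a pure pointer operation. Yet using copy_from_slice to initialize an uninit slice is UB, because it goes through a reference.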

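This is a very strict rule. It all but rules out passing uninit memory across function calls, and across crates in particular, because crates usually wrap their methods up and are careful about exposing unsafe interfaces.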
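For contrast, here is a minimal sketch of the pointer-only route, using nothing but the standard library. `fill_from` is a hypothetical helper written for this post. The spare capacity of a Vec is exposed as `MaybeUninit<u8>`, so no `&mut [u8]` to uninitialized bytes is ever created, and the copy goes through `ptr::copy_nonoverlapping` directly.

```rust
use std::mem::MaybeUninit;
use std::ptr;

/// Copies `src` into a freshly allocated Vec without ever forming
/// a `&mut [u8]` that points at uninitialized bytes.
fn fill_from(src: &[u8]) -> Vec<u8> {
    let mut buf: Vec<u8> = Vec::with_capacity(src.len());
    // A reference to `MaybeUninit<u8>` is fine: that type is allowed to be uninitialized.
    let spare: &mut [MaybeUninit<u8>] = buf.spare_capacity_mut();
    unsafe {
        // Raw-pointer write into the uninitialized region.
        // `with_capacity` guarantees room for at least `src.len()` bytes.
        ptr::copy_nonoverlapping(src.as_ptr(), spare.as_mut_ptr() as *mut u8, src.len());
        // Only now are the first `src.len()` bytes initialized, so the length may be set.
        buf.set_len(src.len());
    }
    buf
}
```

For example, `fill_from(&[1, 2, 3])` returns a Vec equal to `vec![1, 2, 3]`. The unsafe block stays inside one function, which is exactly the kind of wrapping that makes it hard to hand uninit memory to someone else's safe API.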

Finally

OK, that's all for now.