Lecture 3 Ideal vs Real Distributed Systems

RPC Transport

Local procedure call: 1 invocation = 1 execution

Trivial to emulate in ideal distributed system

How to guarantee this in spite of system flakiness?

Two approaches to handling this flakiness

Approach 1: Outsource! Why suffer this headache? In other words, rely on a mature library or framework.

Approach 2: The buck has to stop somewhere! Do it yourself

Approach 1 : Outsource pain

Use TCP as foundation

layer RPC on top of it

simpler code (Project 1)

TCP = “Transmission Control Protocol” guarantees

reliable delivery (no data is ever lost or corrupted)

in-order delivery (bytes arrive in the exact order they were sent)

unlimited data size (feel free to ship a GB if you want)

abstraction of continuous pipeline between sender and receiver

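Handing transport off to TCP avoids the cost of implementing reliable delivery yourself.

TCP does not guarantee that a message stays in one piece: the data passed to a single write() may be split up and arrive in several pieces.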

What is NOT guaranteed (price paid for using TCP)

that data inserted in chunks of a certain size comes out in chunks of that size

no preservation of write() boundaries

aka “data is re-framed in transit”

read() may return fewer than number of bytes requested

This part is mainly about TCP and the price of using it. The issues to watch for are messages being coalesced (the “sticky packet” problem) and messages being split.
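Because write() boundaries are not preserved, an RPC layer on top of TCP has to add its own framing. A minimal sketch, assuming a 4-byte big-endian length prefix as the wire format (the names `send_msg`, `recv_msg`, and `recv_exact` are made up for illustration, not taken from the lecture):

```python
import socket
import struct

# Assumed wire format: 4-byte big-endian length, then the payload.
HEADER = struct.Struct("!I")


def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Loop until exactly n bytes have arrived, since recv() may return fewer."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection mid-message")
        buf.extend(chunk)
    return bytes(buf)


def send_msg(sock: socket.socket, payload: bytes) -> None:
    """Send one framed message: length prefix followed by the payload."""
    sock.sendall(HEADER.pack(len(payload)) + payload)


def recv_msg(sock: socket.socket) -> bytes:
    """Receive one framed message, however TCP chose to chunk the bytes."""
    (length,) = HEADER.unpack(recv_exact(sock, HEADER.size))
    return recv_exact(sock, length)


if __name__ == "__main__":
    a, b = socket.socketpair()   # a connected pair standing in for a TCP connection
    send_msg(a, b"hello")
    send_msg(a, b"world")        # may be coalesced with the first message in transit
    print(recv_msg(b), recv_msg(b))
    a.close()
    b.close()
```

The receiver recovers message boundaries from the length prefix, so it does not matter whether the two messages arrive glued together or in fragments.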

Approach 2: Do it Yourself

Basic idea → Retransmission

packets lost for transient reasons are common

giving up too soon is pessimistic

(maybe server never received your request)

Implementation

send request packet, then start timer

if reply has not arrived when timer goes off, retransmit and restart timer

… and again… and again… and again… and again

finally give up and declare failure
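The steps above can be sketched as a small loop. This is an illustration only: the transport is abstracted as a hypothetical `send` callable plus a queue of incoming replies, and the timeout and attempt count are arbitrary values, not ones given in the lecture.

```python
import queue
from typing import Callable, Optional


def call_with_retransmit(
    send: Callable[[bytes], None],
    replies: "queue.Queue[bytes]",
    request: bytes,
    timeout_s: float = 0.05,
    max_attempts: int = 5,
) -> Optional[bytes]:
    """Send, wait up to timeout_s for a reply, retransmit, and finally give up."""
    for _ in range(max_attempts):
        send(request)                               # (re)transmit the request
        try:
            return replies.get(timeout=timeout_s)   # the bounded wait plays the role of the timer
        except queue.Empty:
            continue                                # timer went off with no reply
    return None                                     # declare failure


if __name__ == "__main__":
    inbox: "queue.Queue[bytes]" = queue.Queue()
    sent = []

    def lossy_send(req: bytes) -> None:
        sent.append(req)
        if len(sent) >= 3:                          # pretend the first two packets were lost
            inbox.put(b"reply to " + req)

    print(call_with_retransmit(lossy_send, inbox, b"req-1"), "after", len(sent), "sends")
```

Note that this loop retransmits blindly: it cannot tell a lost request from a slow server or a lost reply.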

Problem with blind retransmission

perhaps server is still computing or perhaps it is overloaded

or perhaps it sent a reply and this was lost

duplicate execution violates RPC semantics

Solution: Duplicate Elimination (using Sequence Numbers)

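Note: TCP implements retransmission and duplicate elimination

Sequence numbers solve the duplicate problem that retransmission creates (in effect, they make the call idempotent).

Different processes do not share a TCP connection.

A do-it-yourself implementation is usually built on UDP.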

How TCP Ensures Delivery

TCP is a streaming protocol (aka “byte stream” protocol)

ACKs refer to byte number rather than packet number

breakup of byte sequence into packets happens at lower layer

📔

TCP sliding window and acknowledgments
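To make "ACKs refer to byte number" concrete, here is a toy receiver that acknowledges the next byte it expects, whatever the packet boundaries were. It is a simplified model for illustration, not real TCP: there is no window, no sequence-number wraparound, and no checksum, and the class name is hypothetical.

```python
from typing import Dict


class ByteStreamReceiver:
    """Toy receiver: reassembles segments and ACKs the next byte it expects."""

    def __init__(self) -> None:
        self.next_expected = 0                 # byte number of the next in-order byte
        self.delivered = bytearray()           # in-order bytes handed to the application
        self.pending: Dict[int, bytes] = {}    # out-of-order segments keyed by start byte

    def on_segment(self, start: int, data: bytes) -> int:
        """Accept a segment starting at byte `start`; return the cumulative ACK."""
        if start + len(data) > self.next_expected:    # pure duplicates are ignored
            self.pending[start] = data
        while True:
            ready = [s for s in self.pending if s <= self.next_expected]
            if not ready:
                break
            for s in sorted(ready):
                seg = self.pending.pop(s)
                fresh = seg[self.next_expected - s:]  # skip bytes already delivered
                self.delivered.extend(fresh)
                self.next_expected += len(fresh)
        return self.next_expected


if __name__ == "__main__":
    rx = ByteStreamReceiver()
    for start, data in [(0, b"hello"), (10, b"!!"), (5, b"world"), (0, b"hello")]:
        print("segment at", start, "-> ACK", rx.on_segment(start, data))
    print(bytes(rx.delivered))
```

The out-of-order segment at byte 10 does not advance the ACK until the gap at bytes 5 to 9 is filled, and the retransmitted first segment is recognized as a duplicate purely from its byte range.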

Timeouts in Distributed Systems

How do you pick a perfect timeout value?

in the worst case, no perfect value exists

at best, using known statistics, one can pick a “reasonable” value

can be wrong, sometimes giving up too soon

no matter what value is picked, it could be “too soon”

reply could arrive just after you give up

Latency can be measured, and end-to-end response time can be estimated from its probability distribution: pick the mean plus one, two, or three standard deviations. This is what is done in practice. Even with such a setting, the chosen value can still turn out to be unsatisfactory in the worst case.
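The "mean plus k standard deviations" rule can be written down directly. A minimal sketch, assuming response times have already been collected in milliseconds (the function name and the sample values are invented for illustration):

```python
import statistics
from typing import Sequence


def pick_timeout(samples_ms: Sequence[float], k: float = 2.0) -> float:
    """Return mean + k * sample standard deviation of observed response times."""
    mean = statistics.mean(samples_ms)
    stdev = statistics.stdev(samples_ms)   # sample standard deviation, needs >= 2 samples
    return mean + k * stdev


if __name__ == "__main__":
    observed = [10.0, 12.0, 14.0, 11.0, 13.0, 95.0]   # one slow outlier
    for k in (1, 2, 3):
        timeout = pick_timeout(observed, k)
        too_soon = [s for s in observed if s > timeout]
        print(f"k={k}: timeout={timeout:.1f} ms, replies that would arrive too late: {too_soon}")
```

Whatever k is chosen, a reply slower than the computed value remains possible, which is the point made above: the value is reasonable, not perfect.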

What should server do when it sees a duplicate?

May mean any of the following possibilities happened

reply lost

reply crossed retransmitted request (the reply was still in transit when the client retransmitted)

compute time was excessive

client was too impatient

Knowledge at server is always stale relative to client and vice versa

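The best the server can do is to retransmit the reply

Replies must be preserved

only 1 reply saved per connection

cannot re-compute reply

would result in multiple computations per invocation

Saving replies is what avoids repeated computation. As for how many to keep, the server must keep at least the most recent reply.

Q: Could you explain again why replies must be saved, and why the server saves only the latest reply per connection?
A: Only the latest reply needs to be saved, because the very fact that the next request has arrived means the client must have received your reply; otherwise the client would never have moved on to the next request.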
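The "one saved reply per connection" rule can be sketched as follows. This assumes each client issues one call at a time with increasing sequence numbers; the class name, the connection ids, and the handler signature are hypothetical.

```python
from typing import Callable, Dict, Optional, Tuple


class DedupServer:
    """Keeps the last (sequence number, reply) per connection and replays it on duplicates."""

    def __init__(self, handler: Callable[[bytes], bytes]) -> None:
        self.handler = handler
        self.last: Dict[str, Tuple[int, bytes]] = {}   # connection id -> (seq, saved reply)

    def on_request(self, conn: str, seq: int, body: bytes) -> Optional[bytes]:
        saved = self.last.get(conn)
        if saved is not None:
            last_seq, last_reply = saved
            if seq == last_seq:
                return last_reply        # duplicate: resend the saved reply, do not re-execute
            if seq < last_seq:
                return None              # stale retransmission of an older call: drop it
        reply = self.handler(body)       # new request: execute it once
        self.last[conn] = (seq, reply)   # overwrite the previous reply, which is no longer needed
        return reply


if __name__ == "__main__":
    executed = []

    def handler(body: bytes) -> bytes:
        executed.append(body)
        return body.upper()

    server = DedupServer(handler)
    print(server.on_request("client-1", 1, b"deposit"))   # executed
    print(server.on_request("client-1", 1, b"deposit"))   # replayed from the saved reply
    print("executions:", len(executed))
```

A newer sequence number overwrites the saved entry, which is safe for the reason given in the answer above: the client would not have advanced unless it had received the earlier reply.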

Exactly-once Semantics

theoretical ideal

How long to keep old replies and sequence numbers?

rigorous interpretation of “RPC” → forever!

across server crashes too

they have to be saved in non-volatile memory
server response has to be after non-volatile write
disk (or flash) latency on every RPC

clean undo of partial computations before crash

The reply cannot be sent in parallel with the disk write, so RPC performance is bounded by the performance of the storage device.
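The ordering constraint "write to non-volatile storage, then reply" might look like the sketch below. It is deliberately incomplete: it ignores torn writes and does not undo partial computations after a crash, both of which the notes list as requirements. The JSON-lines log format and all names are assumptions made for this example.

```python
import json
import os
import tempfile
from typing import Callable, Dict


class DurableReplyLog:
    """Append-only log of request id -> reply, forced to stable storage before replying."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.replies: Dict[str, str] = {}
        if os.path.exists(path):                      # recovery: reload replies after a restart
            with open(path, "r", encoding="utf-8") as f:
                for line in f:
                    rec = json.loads(line)
                    self.replies[rec["id"]] = rec["reply"]

    def handle(self, req_id: str, arg: str, compute: Callable[[str], str]) -> str:
        if req_id in self.replies:
            return self.replies[req_id]               # duplicate, even across a crash
        reply = compute(arg)
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps({"id": req_id, "reply": reply}) + "\n")
            f.flush()
            os.fsync(f.fileno())                      # synchronous write on every RPC
        self.replies[req_id] = reply
        return reply                                  # the reply leaves only after the write


if __name__ == "__main__":
    log_path = os.path.join(tempfile.mkdtemp(), "replies.log")
    runs = []

    def compute(arg: str) -> str:
        runs.append(arg)
        return arg + "!"

    print(DurableReplyLog(log_path).handle("req-1", "hi", compute))
    print(DurableReplyLog(log_path).handle("req-1", "hi", compute))   # simulated restart
    print("computations:", len(runs))
```

The `os.fsync` call is where the per-RPC disk (or flash) latency comes from; removing it would make the server faster but would allow a reply to be sent that a crash then forgets.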

📖

Exactly-once

For every remote call issued by the caller, the callee guarantees that the business logic executes exactly once, even in the face of network failures, node crashes, and other faults.

In essence, these semantics are emulated at the application layer through idempotence + an atomic state machine + a persistent log. It can be thought of as a remote call emulating a local call.

True exactly-once does not exist, but “transactions + idempotence + snapshots + manual repair” can get arbitrarily close to it.

Such an RPC would have exactly-once semantics

success return from RPC call → call executed exactly once

call blocks indefinitely, no failure return

Not appropriate for many real applications

too slow because of synchronous disk writes

indefinite blocking unacceptable in many cases

application-level recovery precluded

requires transactional semantics for server actions

Exactly-once is cumbersome and inefficient, so real-world development relaxes the semantic requirements.

At-most-once Semantics

practically achievable

How to avoid indefinite blocking?

declare timeout if call takes longer than specified bound

Such an RPC has at-most-once semantics

refers to what can be inferred in the worst case

success → call executed exactly once

timeout → call executed once or not at all

Many possible reasons for RPC timeout

request and retries never got to server

server died while working on request

network broke while server working on request

server still working on request

server replied, but reply lost

server resent reply, but all copies of reply lost

Server may be sluggish or unreachable

complicates setting of timeout value

probes to check server health during long calls

server responds with busy if still working

essentially a keepalive mechanism
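The probe idea can be sketched on the client side as follows: keep waiting while the server answers "busy", and declare a timeout once it falls silent. The two callables stand in for a real transport and are hypothetical, as are the interval and probe limit.

```python
from typing import Callable, Optional


def wait_with_probes(
    poll_reply: Callable[[float], Optional[bytes]],
    probe: Callable[[], Optional[str]],
    interval_s: float = 0.05,
    max_probes: int = 20,
) -> Optional[bytes]:
    """Wait for a reply, probing the server after each quiet interval.

    poll_reply(t) returns the reply if it arrives within t seconds, else None.
    probe() returns "busy" if the server reports it is still working, else None.
    """
    for _ in range(max_probes):
        reply = poll_reply(interval_s)
        if reply is not None:
            return reply        # success: the call executed exactly once
        if probe() != "busy":
            return None         # server silent: timeout, executed once or not at all
    return None                 # still busy after max_probes: give up anyway


if __name__ == "__main__":
    polls = {"n": 0}

    def slow_server_reply(_t: float) -> Optional[bytes]:
        polls["n"] += 1
        return b"result" if polls["n"] == 4 else None   # reply shows up on the 4th wait

    print(wait_with_probes(slow_server_reply, lambda: "busy", interval_s=0.0))
    print(wait_with_probes(lambda _t: None, lambda: None, interval_s=0.0))
```

A `None` result here carries exactly the at-most-once ambiguity described above: the caller cannot tell whether the call ran.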

Orphaned Computations

Danger with at-most-once semantics

client sends request, server starts computing

network failure occurs

server continues, unaware its work is useless

server may hold resources (e.g. locks), slowing other activity

Orphan detection and extermination are difficult

typically require application-specific recovery

“Failure” closely related to “timeout value”

fundamental limitation in a distributed system

due to absence of out-of-band error detection

can’t tell server death from network failure ①

The client sends a request to the server; while the server is still computing, the client times out or crashes and never sees the server's reply. Such a request is called an “orphaned request”.

Risks of orphaned computations: wasted compute resources, deadlocks and lock contention, dirty writes, divergent (split) state, and so on.

For example, a call to an order service times out while the service is merely still processing. The caller treats the call as failed, yet the order actually goes through. With more complex business logic the inconsistency becomes visible to the user: the successful order triggers an SMS confirmation, while the app shows an order-failed message.

📖

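① Without an out-of-band error detection mechanism, server death cannot be distinguished from network failure

A heartbeat alone cannot tell whether the server has crashed or the network has failed.

Multi-dimensional health checking

Out-of-band monitoring (hardware level)
Multi-path probing

probe the target server from several network regions at the same time

Distributed consensus protocols to assist the decision

Quorum voting
Failure detectors

Φ-accrual algorithm (as used by Akka/Aeron, for example); I am not familiar with it yet, so this is just a marker for later.

Application-level design

Bounded retries + circuit breaker pattern
Have the health-check endpoint expose more information, so that clients check business metrics in the response rather than only the HTTP status code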
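As an illustration of the "bounded retries + circuit breaker" item, here is a minimal sketch. The thresholds, the injectable clock, and the class names are assumptions for this example and do not come from any particular library. In an RPC setting, each retry would also need to reuse the same sequence number so that the server can eliminate duplicates.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when the breaker is open and calls fail fast."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T], retries: int = 2) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("circuit open, failing fast")
            self.opened_at = None                  # half-open: allow a trial call
        last_exc: Optional[Exception] = None
        for _ in range(retries + 1):               # bounded retries
            try:
                result = fn()
            except Exception as exc:
                last_exc = exc
                self.failures += 1
                if self.failures >= self.failure_threshold:
                    self.opened_at = self.clock()  # trip the breaker and stop retrying
                    break
            else:
                self.failures = 0                  # any success closes the circuit
                return result
        assert last_exc is not None
        raise last_exc
```

Once the breaker is open, callers fail immediately instead of piling more retransmissions onto a server that may be sluggish or unreachable; after `reset_after_s` a single trial call decides whether the circuit closes again.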

newrain-zh
