LLM principles

📄 Introduction

source:https://stanford-cs324.github.io/winter2022/lectures/introduction/ - CS324

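These notes are excerpts from the course's lecture notes, partly restated in my own words.

Basic principles

The classic definition of a language model (LM) is a probability distribution over sequences of tokens.

Intuitively, the probability tells us how “good” a sequence of tokens is. For example, if the vocabulary is V={ate,ball,cheese,mouse,the}, the language model might assign: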

p(the,mouse,ate,the,cheese)=0.02,

p(the,cheese,ate,the,mouse)=0.01,

p(mouse,the,the,cheese,ate)=0.0001.

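Assigning these probabilities sensibly requires a great deal of syntactic knowledge and world knowledge.

What “generation” means: sample a sequence $x_{1 : L}$ with probability $p (x_{1 : L})$; how this is done in practice depends on the form of the model.

Autoregressive language models: factor the joint distribution with the chain rule of probability: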

$p (x_{1 : L}) = p (x_{1}) p (x_{2} ∣ x_{1}) p (x_{3} ∣ x_{1}, x_{2}) \dots p (x_{L} ∣ x_{1 : L - 1}) = \prod_{i = 1}^{L} p (x_{i} ∣ x_{1 : i - 1}) .$

For example: p(the, mouse, ate, the, cheese) = p(the) p(mouse ∣ the) p(ate ∣ the, mouse) p(the ∣ the, mouse, ate) p(cheese ∣ the, mouse, ate, the).

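So we can spell out how generation proceeds, one token at a time. T is a temperature parameter (the term comes from annealing) that controls how much randomness we want, i.e. how readily sampling strays from the most probable token: as T approaches 0 the most probable token is always chosen, T = 1 samples from the model's distribution as-is, and large T approaches uniform sampling. Raising probabilities to the power 1/T generally leaves them not summing to 1, so the result is renormalized before sampling.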

$\text{for } i = 1, \dots, L: \quad x_{i} \sim p (x_{i} ∣ x_{1 : i - 1})^{1/T}.$
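As a rough illustration of the temperature step, the sketch below raises a next-token distribution to the power 1/T and renormalizes it before sampling. The function names and the toy distribution are assumptions made for this example; they do not come from the course.

```python
import random


def apply_temperature(probs, T):
    """Return the distribution proportional to probs[token] ** (1 / T)."""
    if T <= 0:
        raise ValueError("T must be positive; T -> 0 approaches greedy decoding")
    scaled = {tok: p ** (1.0 / T) for tok, p in probs.items()}
    total = sum(scaled.values())  # renormalize so the result sums to 1
    return {tok: s / total for tok, s in scaled.items()}


def sample_token(probs, T=1.0, rng=random):
    """Draw one token from the temperature-adjusted distribution."""
    annealed = apply_temperature(probs, T)
    tokens = list(annealed)
    weights = [annealed[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical next-token distribution p(x_i | the, mouse, ate)
next_token_probs = {"the": 0.6, "cheese": 0.3, "ball": 0.1}
sharp = apply_temperature(next_token_probs, T=0.5)  # more peaked
flat = apply_temperature(next_token_probs, T=2.0)   # closer to uniform
token = sample_token(next_token_probs, T=0.5)
```

Lower T concentrates mass on "the", while higher T spreads it toward the other tokens; T = 1 leaves the distribution unchanged.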

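Conditional generation: more generally, we can perform conditional generation by specifying some prefix sequence $x_{1 : i}$ (called the prompt) and sampling the rest $x_{i + 1 : L}$ (called the completion). For instance, given the prompt "the, mouse, ate", a model might complete it with "the, cheese".

History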

Information theory, entropy of English, n-gram models

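This overlaps with Khan Academy's explanation of entropy¹.

In Shannon's information theory, entropy is defined as

$H (p) = \sum_{x} p (x) \log \frac{1}{p (x)} .$

Entropy measures the expected number of bits that any algorithm needs to encode (compress) a sample $x \sim p$ into a bitstring.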

The lower the entropy, the more “structured” the sequence is, and the shorter the code length.
Intuitively, $\log \frac{1}{p (x)}$ is the length of the code used to represent an element x that occurs with probability p(x).
If $p (x) = \frac{1}{8}$, we should allocate $\log_{2} (8) = 3$ bits (equivalently, $\log (8) = 2.08$ nats).
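A small sketch of the entropy formula may help here. The two distributions are invented for illustration, and `entropy` is a helper name chosen for this example.

```python
import math


def entropy(p, base=2.0):
    """H(p) = sum_x p(x) * log(1 / p(x)); outcomes with p(x) = 0 contribute 0."""
    return sum(px * math.log(1.0 / px, base) for px in p.values() if px > 0)


# Eight equally likely outcomes: each costs log2(8) bits to encode.
uniform8 = {i: 1 / 8 for i in range(8)}
# A more "structured" (skewed) distribution needs fewer bits on average.
skewed = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

h_uniform_bits = entropy(uniform8)
h_uniform_nats = entropy(uniform8, base=math.e)
h_skewed_bits = entropy(skewed)
```

The uniform case comes out at about 3 bits (about 2.08 nats), matching the p(x) = 1/8 example, while the skewed distribution has lower entropy.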

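Shannon entropy is in effect the limit of compression: a lower bound on the expected code length. Actually reaching that limit is non-trivial and is still studied in coding theory.

Shannon was particularly interested in the entropy of English. This means we imagine a “true” distribution p (its existence is questionable, but it is still a useful mathematical abstraction) that can produce samples of English text x∼p.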

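Cross entropy: it measures the expected number of bits needed to encode a sample x∼p using the compression scheme given by the model q (representing $x$ with a code of length $\log \frac{1}{q (x)}$).

$H (p, q) = \sum_{x} p (x) \log \frac{1}{q (x)} .$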

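This lets us estimate the entropy (of human language) through language modeling. The cross entropy is an upper bound on the entropy, H(p,q)≥H(p), and although the true distribution p described above is inaccessible, H(p,q) can be estimated from samples of p. So building a better model q gives a better (tighter) estimate of the entropy.

A classic example of this kind of estimate is the Shannon game, where the model q is provided by a human: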
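The bound H(p,q) ≥ H(p) can be checked numerically. This is a minimal sketch on a three-symbol toy distribution; `p_true`, `q_rough`, and `q_better` are made-up values, not data from the lecture.

```python
import math


def cross_entropy(p, q):
    """H(p, q) = sum_x p(x) * log2(1 / q(x)), in bits."""
    total = 0.0
    for x, px in p.items():
        if px == 0:
            continue  # outcomes that never occur cost nothing
        qx = q.get(x, 0.0)
        if qx == 0:
            return math.inf  # q gives zero probability to something p produces
        total += px * math.log2(1.0 / qx)
    return total


p_true = {"a": 0.5, "b": 0.25, "c": 0.25}      # stands in for the "true" p
q_rough = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}  # a poor model
q_better = {"a": 0.4, "b": 0.3, "c": 0.3}       # a closer model

h_p = cross_entropy(p_true, p_true)  # H(p, p) is just the entropy H(p)
h_rough = cross_entropy(p_true, q_rough)
h_better = cross_entropy(p_true, q_better)
```

Both models give a cross entropy above H(p), and the closer model gives the tighter upper bound.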

the mouse ate my ho_

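Humans aren't good at providing calibrated probabilities for arbitrary text, so in the Shannon game the human language model repeatedly tries to guess the next letter, and the number of guesses is recorded.

This idea found downstream applications through n-gram models, for example: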

speech recognition in the 1970s (input: acoustic signal, output: text), and
machine translation in the 1990s (input: text in a source language, output: text in a target language).

Noisy channel model. The dominant paradigm for solving these tasks then was the noisy channel model. Taking speech recognition as an example:

We posit that there is some text sampled from some distribution p.
This text becomes realized to speech (acoustic signals).
Then given the speech, we wish to recover the (most likely) text. This can be done via Bayes rule:

$p (\text{text} ∣ \text{speech}) \propto p (\text{text}) \, p (\text{speech} ∣ \text{text}),$

where $p (\text{text})$ is the language model and $p (\text{speech} ∣ \text{text})$ is the acoustic model.
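To make the Bayes-rule decoding step concrete, here is a minimal sketch: among candidate texts, pick the one that maximizes p(text) · p(speech ∣ text). The candidate strings and every probability below are invented for illustration, and `decode` is a hypothetical helper, not an actual speech recognition API.

```python
def decode(candidates, lm_prob, channel_prob):
    """Return the candidate text maximizing p(text) * p(speech | text)."""
    return max(candidates, key=lambda text: lm_prob[text] * channel_prob[text])


# Hypothetical scores for one fixed acoustic signal.
lm_prob = {  # language model: how plausible is the text on its own?
    "recognize speech": 0.010,
    "wreck a nice beach": 0.0001,
}
channel_prob = {  # acoustic model: how well does the text explain the audio?
    "recognize speech": 0.20,
    "wreck a nice beach": 0.30,
}

best = decode(list(lm_prob), lm_prob, channel_prob)
```

Here the acoustic model slightly prefers the second candidate, but the language model's prior outweighs it, so the first candidate wins.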

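N-gram models. In an n-gram model, the prediction of a token $x_{i}$ depends only on the last n−1 tokens $x_{i - (n - 1) : i - 1}$ rather than on the full history:

$p (x_{i} ∣ x_{1 : i - 1}) = p (x_{i} ∣ x_{i - (n - 1) : i - 1}) .$

The probabilities on the right-hand side are computed from the number of times various n-grams occur in a large corpus, with smoothing applied to avoid overfitting.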
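As a sketch of count-based estimation, the code below fits a bigram model (n = 2) with add-alpha smoothing. Add-alpha is only one simple smoothing choice assumed here for illustration; the notes do not say which smoothing method was used in practice, and the two-sentence corpus is a toy.

```python
from collections import Counter


def train_bigram(corpus, vocab, alpha=1.0):
    """Estimate p(cur | prev) from bigram counts with add-alpha smoothing."""
    context_counts = Counter()
    bigram_counts = Counter()
    for sentence in corpus:
        for prev, cur in zip(sentence, sentence[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1

    def prob(cur, prev):
        # Smoothing keeps unseen bigrams from getting probability 0.
        numerator = bigram_counts[(prev, cur)] + alpha
        denominator = context_counts[prev] + alpha * len(vocab)
        return numerator / denominator

    return prob


vocab = ["ate", "ball", "cheese", "mouse", "the"]
corpus = [
    ["the", "mouse", "ate", "the", "cheese"],
    ["the", "mouse", "ate", "the", "ball"],
]
bigram = train_bigram(corpus, vocab)

p_seen = bigram("mouse", "the")   # "the mouse" occurs in the corpus
p_unseen = bigram("ate", "the")   # "the ate" never occurs, yet stays nonzero
```

Seen bigrams get higher probability than unseen ones, and for any context the smoothed probabilities over the vocabulary still sum to 1.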

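Fitting n-gram models to data is extremely computationally cheap and scalable. As a result, n-gram models were trained on massive amounts of text. For example, Brants et al. (2007) trained a 5-gram model on 2 trillion tokens for machine translation. In comparison, GPT-3 was trained on only 300 billion tokens.

Fundamentally, though, n-gram models are limited. If n is too small, the model cannot capture long-range dependencies; if n is too large, it becomes statistically infeasible to get good estimates of the probabilities, and many n-grams will have a count of 0.

As a result, language models were limited to tasks such as speech recognition and machine translation, where the acoustic signal or source text provides enough information that capturing only local dependencies (and missing long-range ones) is not a big problem.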

Neural language models

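An important step forward for language models was the introduction of neural networks, where the conditional probability of the next token is given by a neural network: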

$p (\text{cheese} ∣ \text{ate}, \text{the}) = \text{some-neural-network} (\text{ate}, \text{the}, \text{cheese}) .$

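Note that the context length is still bounded by n, but it is now statistically feasible to estimate neural language models for much larger values of n.

The main challenge now was that training neural networks is much more computationally expensive. Bengio et al. (2003) trained a model on only 14 million words and showed that it outperformed n-gram models trained on the same amount of data. But since n-gram models were more scalable and data was not a bottleneck, n-gram models continued to dominate for at least another decade.

Two major developments in neural language modeling: RNNs and Transformers

Recurrent Neural Networks (RNNs), including Long Short-Term Memory networks (LSTMs), allow the conditional distribution of a token $x_{i}$ to depend on the entire context $x_{1 : i - 1}$ (effectively n=∞), but these are hard to train.
Transformers are a more recent architecture (developed for machine translation in 2017) that returns to a fixed context length n but is much easier to train (and exploits the parallelism of GPUs). Also, n can be made “large enough” for many applications (GPT-3 used n=2048).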

Summary

Language models were first studied in the context of information theory, and can be used to estimate the entropy of English.
N-gram models are extremely computationally efficient and statistically inefficient.
N-gram models are useful for short context lengths in conjunction with another model (acoustic model for speech recognition or translation model for machine translation).
Neural language models are statistically efficient but computationally inefficient.
Over time, training large neural networks has become feasible enough that neural language models have become the dominant paradigm.

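LLMs today

Today's LLMs have grown much larger, and with that scale come emergent behaviors (emergence).

LLM capabilities are striking: conditional generation (prompt⇝completion) and in-context learning.

An example of in-context learning:

Suppose we want a direct answer (a single word) rather than a full sentence like this:

Input: Where is Stanford University?
Output: Stanford University is in California.

Then we can give GPT a few examples in the prompt, and it picks up the pattern by analogy: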

Input: Where is MIT?
Output: Cambridge

Input: Where is University of Washington?
Output: Seattle

Input: Where is Stanford University?
Output: Stanford
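The few-shot prompt above is just a string assembled from example pairs followed by the open query. A minimal sketch of that assembly follows; `build_few_shot_prompt` is a hypothetical helper written for these notes, and sending the prompt to a model is left out because it depends on whichever API is used.

```python
def build_few_shot_prompt(examples, query):
    """Format (input, output) pairs, then leave the final output blank."""
    blocks = [f"Input: {question}\nOutput: {answer}" for question, answer in examples]
    blocks.append(f"Input: {query}\nOutput:")
    return "\n\n".join(blocks)


examples = [
    ("Where is MIT?", "Cambridge"),
    ("Where is University of Washington?", "Seattle"),
]
prompt = build_few_shot_prompt(examples, "Where is Stanford University?")
print(prompt)
```

The model is then asked to continue the text after the last "Output:", and the examples steer it toward a one-word answer.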

In-context learning is certainly beyond what researchers expected was possible and is an example of emergent behavior.

Applications and risks

Structure of this course


Just take me with you 😭

This course will be structured like an onion:

Behavior of large language models: We will start at the outer layer where we only have blackbox API access to the model (as we’ve had so far). Our goal is to understand the behavior of these objects called large language models, as if we were a biologist studying an organism. Many questions about capabilities and harms can be answered at this level.
Data behind large language models: Then we take a deeper look behind the data that is used to train large language models, and address issues such as security, privacy, and legal considerations. Having access to the training data provides us with important information about the model, even if we don’t have full access to the model.
Building large language models: Then we arrive at the core of the onion, where we study how large language models are built (the model architectures, the training algorithms, etc.).
[Note: this part matters most; it is worth studying closely.]
Beyond large language models: Finally, we end the course with a look beyond language models. A language model is just a distribution over a sequence of tokens. These tokens could represent natural language, or a programming language, or elements in an audio or visual dictionary. Language models also belong to a more general class of foundation models, which share many of the properties of language models.

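Entropy

https://www.khanacademy.org/computing/computer-science/informationtheory/moderninfotheory/v/information-entropy

Introducing more predictable information lowers the entropy, which means that in a “question game” we need to ask fewer questions to guess the outcome.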
