# CPM

## Overview

The CPM model was proposed in [CPM: A Large-scale Generative Chinese Pre-trained Language Model](https://huggingface.co/papers/2012.00413) by Zhengyan Zhang, Xu Han, Hao Zhou, Pei Ke, Yuxian Gu, Deming Ye, Yujia Qin,
Yusheng Su, Haozhe Ji, Jian Guan, Fanchao Qi, Xiaozhi Wang, Yanan Zheng, Guoyang Zeng, Huanqi Cao, Shengqi Chen,
Daixuan Li, Zhenbo Sun, Zhiyuan Liu, Minlie Huang, Wentao Han, Jie Tang, Juanzi Li, Xiaoyan Zhu, Maosong Sun.

The abstract from the paper is the following:

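*Pre-trained Language Models (PLMs) have proven to be beneficial for various downstream NLP tasks. Recently, GPT-3,
with 175 billion parameters and 570GB training data, drew a lot of attention due to the capacity of few-shot (even
zero-shot) learning. However, applying GPT-3 to address Chinese NLP tasks is still challenging, as the training corpus
of GPT-3 is primarily English, and the parameters are not publicly available. In this technical report, we release the
Chinese Pre-trained Language Model (CPM) with generative pre-training on large-scale Chinese training data. To the best
of our knowledge, CPM, with 2.6 billion parameters and 100GB Chinese training data, is the largest Chinese pre-trained
language model, which could facilitate several downstream Chinese NLP tasks, such as conversation, essay generation,
cloze test, and language understanding. Extensive experiments demonstrate that CPM achieves strong performance on many
NLP tasks in the settings of few-shot (even zero-shot) learning.*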

This model was contributed by [canwenxu](https://huggingface.co/canwenxu). The original implementation can be found
here: https://github.com/TsinghuaAI/CPM-Generate


<Tip>

CPM's architecture is the same as GPT-2, except for the tokenization method. Refer to the [GPT-2 documentation](openai-community/gpt2) for
API reference information.

</Tip>

## CpmTokenizer[[transformers.CpmTokenizer]]

<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


<docstring><name>class transformers.CpmTokenizer</name><anchor>transformers.CpmTokenizer</anchor><source>https://github.com/huggingface/transformers/blob/v4.57.0/src/transformers/models/cpm/tokenization_cpm.py#L35</source><parameters>[{"name": "vocab_file", "val": ""}, {"name": "do_lower_case", "val": " = False"}, {"name": "remove_space", "val": " = True"}, {"name": "keep_accents", "val": " = False"}, {"name": "bos_token", "val": " = '<s>'"}, {"name": "eos_token", "val": " = '</s>'"}, {"name": "unk_token", "val": " = '<unk>'"}, {"name": "sep_token", "val": " = '<sep>'"}, {"name": "pad_token", "val": " = '<pad>'"}, {"name": "cls_token", "val": " = '<cls>'"}, {"name": "mask_token", "val": " = '<mask>'"}, {"name": "additional_special_tokens", "val": " = ['<eop>', '<eod>']"}, {"name": "sp_model_kwargs", "val": ": typing.Optional[dict[str, typing.Any]] = None"}, {"name": "**kwargs", "val": ""}]</parameters></docstring>
Runs pre-tokenization with the Jieba-RS segmentation tool. It is used in CPM models.


<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


<docstring><name>build_inputs_with_special_tokens</name><anchor>transformers.CpmTokenizer.build_inputs_with_special_tokens</anchor><source>https://github.com/huggingface/transformers/blob/v4.57.0/src/transformers/models/cpm/tokenization_cpm.py#L241</source><parameters>[{"name": "token_ids_0", "val": ": list"}, {"name": "token_ids_1", "val": ": typing.Optional[list[int]] = None"}]</parameters><paramsdesc>- **token_ids_0** (`list[int]`) --
  List of IDs to which the special tokens will be added.
- **token_ids_1** (`list[int]`, *optional*) --
  Optional second list of IDs for sequence pairs.</paramsdesc><paramgroups>0</paramgroups><rettype>`list[int]`</rettype><retdesc>List of [input IDs](../glossary#input-ids) with the appropriate special tokens.</retdesc></docstring>

Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and
adding special tokens. An XLNet sequence has the following format:

- single sequence: `X <sep> <cls>`
- pair of sequences: `A <sep> B <sep> <cls>`
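
The layout above can be illustrated with a minimal pure-Python sketch. It is not the library's implementation: `build_inputs`, `SEP_ID`, and `CLS_ID` are hypothetical names, and the ID values are placeholders, since a real tokenizer reads the `<sep>` and `<cls>` IDs from its vocabulary.

```python
from typing import Optional

# Hypothetical placeholder IDs; a real tokenizer looks these up in its vocabulary.
SEP_ID = 4
CLS_ID = 3


def build_inputs(token_ids_0: list[int], token_ids_1: Optional[list[int]] = None) -> list[int]:
    """Append special tokens as `X <sep> <cls>` or `A <sep> B <sep> <cls>`."""
    if token_ids_1 is None:
        # Single sequence: X <sep> <cls>
        return token_ids_0 + [SEP_ID, CLS_ID]
    # Pair of sequences: A <sep> B <sep> <cls>
    return token_ids_0 + [SEP_ID] + token_ids_1 + [SEP_ID, CLS_ID]


single = build_inputs([10, 11, 12])
pair = build_inputs([10, 11], [20, 21, 22])
```

Note that, in this format, the special tokens trail the content instead of leading it.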








</div>
<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


<docstring><name>convert_tokens_to_string</name><anchor>transformers.CpmTokenizer.convert_tokens_to_string</anchor><source>https://github.com/huggingface/transformers/blob/v4.57.0/src/transformers/models/cpm/tokenization_cpm.py#L235</source><parameters>[{"name": "tokens", "val": ""}]</parameters></docstring>
Converts a sequence of tokens (strings for sub-words) into a single string.
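
As a rough illustration, joining sub-word strings might look like the sketch below. It assumes SentencePiece-style pieces in which the `▁` character (U+2581) marks a word boundary. That convention and the helper name `join_pieces` are assumptions made for this example and are not stated on this page.

```python
# Assumed word-boundary marker (SentencePiece convention), not confirmed by this page.
WORD_BOUNDARY = "\u2581"


def join_pieces(tokens: list[str]) -> str:
    """Concatenate sub-word pieces and turn boundary markers back into spaces."""
    return "".join(tokens).replace(WORD_BOUNDARY, " ").strip()


text = join_pieces(["\u2581Hello", "\u2581wor", "ld"])
```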

</div>
<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


<docstring><name>create_token_type_ids_from_sequences</name><anchor>transformers.CpmTokenizer.create_token_type_ids_from_sequences</anchor><source>https://github.com/huggingface/transformers/blob/v4.57.0/src/transformers/models/cpm/tokenization_cpm.py#L296</source><parameters>[{"name": "token_ids_0", "val": ": list"}, {"name": "token_ids_1", "val": ": typing.Optional[list[int]] = None"}]</parameters><paramsdesc>- **token_ids_0** (`list[int]`) --
  List of IDs.
- **token_ids_1** (`list[int]`, *optional*) --
  Optional second list of IDs for sequence pairs.</paramsdesc><paramgroups>0</paramgroups><rettype>`list[int]`</rettype><retdesc>List of [token type IDs](../glossary#token-type-ids) according to the given sequence(s).</retdesc></docstring>

Create a mask from the two sequences passed to be used in a sequence-pair classification task. An XLNet
sequence pair mask has the following format:

<ExampleCodeBlock anchor="transformers.CpmTokenizer.create_token_type_ids_from_sequences.example">

```
0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
| first sequence    | second sequence |
```

</ExampleCodeBlock>

If `token_ids_1` is `None`, this method only returns the first portion of the mask (0s).
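
The 0/1 pattern can be reproduced with a small sketch. It is a simplified illustration under stated assumptions: `segment_mask` is a hypothetical helper, the lengths are taken to be whole-segment lengths, and the type ID the library gives to the trailing `<cls>` position is not described here, so it is left out.

```python
from typing import Optional


def segment_mask(len_first: int, len_second: Optional[int] = None) -> list[int]:
    """Return 0 for every position of the first segment and 1 for every position of the second."""
    mask = [0] * len_first
    if len_second is not None:
        mask += [1] * len_second
    return mask


# Same shape as the diagram: 11 positions in the first segment, 9 in the second.
pair_mask = segment_mask(11, 9)
# Without a second sequence, only the leading 0s are returned.
single_mask = segment_mask(11)
```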








</div>
<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


<docstring><name>get_special_tokens_mask</name><anchor>transformers.CpmTokenizer.get_special_tokens_mask</anchor><source>https://github.com/huggingface/transformers/blob/v4.57.0/src/transformers/models/cpm/tokenization_cpm.py#L267</source><parameters>[{"name": "token_ids_0", "val": ": list"}, {"name": "token_ids_1", "val": ": typing.Optional[list[int]] = None"}, {"name": "already_has_special_tokens", "val": ": bool = False"}]</parameters><paramsdesc>- **token_ids_0** (`list[int]`) --
  List of IDs.
- **token_ids_1** (`list[int]`, *optional*) --
  Optional second list of IDs for sequence pairs.
- **already_has_special_tokens** (`bool`, *optional*, defaults to `False`) --
  Whether or not the token list is already formatted with special tokens for the model.</paramsdesc><paramgroups>0</paramgroups><rettype>`list[int]`</rettype><retdesc>A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.</retdesc></docstring>

Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
special tokens using the tokenizer `prepare_for_model` method.
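
A minimal sketch of such a mask follows, assuming the `X <sep> <cls>` and `A <sep> B <sep> <cls>` layouts documented for `build_inputs_with_special_tokens`. The function below is a hypothetical standalone helper, not the tokenizer method, and it covers only the case where special tokens have not been added yet.

```python
from typing import Optional


def special_tokens_mask(token_ids_0: list[int], token_ids_1: Optional[list[int]] = None) -> list[int]:
    """Mark positions that special tokens will occupy with 1, sequence tokens with 0."""
    mask = [0] * len(token_ids_0) + [1]  # X followed by <sep>
    if token_ids_1 is not None:
        mask += [0] * len(token_ids_1) + [1]  # B followed by <sep>
    return mask + [1]  # trailing <cls>


single_mask = special_tokens_mask([10, 11, 12])
pair_mask = special_tokens_mask([10, 11], [20])
```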








</div></div>

## CpmTokenizerFast[[transformers.CpmTokenizerFast]]

<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


<docstring><name>class transformers.CpmTokenizerFast</name><anchor>transformers.CpmTokenizerFast</anchor><source>https://github.com/huggingface/transformers/blob/v4.57.0/src/transformers/models/cpm/tokenization_cpm_fast.py#L30</source><parameters>[{"name": "vocab_file", "val": " = None"}, {"name": "tokenizer_file", "val": " = None"}, {"name": "do_lower_case", "val": " = False"}, {"name": "remove_space", "val": " = True"}, {"name": "keep_accents", "val": " = False"}, {"name": "bos_token", "val": " = '<s>'"}, {"name": "eos_token", "val": " = '</s>'"}, {"name": "unk_token", "val": " = '<unk>'"}, {"name": "sep_token", "val": " = '<sep>'"}, {"name": "pad_token", "val": " = '<pad>'"}, {"name": "cls_token", "val": " = '<cls>'"}, {"name": "mask_token", "val": " = '<mask>'"}, {"name": "additional_special_tokens", "val": " = ['<eop>', '<eod>']"}, {"name": "**kwargs", "val": ""}]</parameters></docstring>
Runs pre-tokenization with the Jieba-RS segmentation tool. It is used in CPM models.


<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


<docstring><name>build_inputs_with_special_tokens</name><anchor>transformers.CpmTokenizerFast.build_inputs_with_special_tokens</anchor><source>https://github.com/huggingface/transformers/blob/v4.57.0/src/transformers/models/cpm/tokenization_cpm_fast.py#L148</source><parameters>[{"name": "token_ids_0", "val": ": list"}, {"name": "token_ids_1", "val": ": typing.Optional[list[int]] = None"}]</parameters><paramsdesc>- **token_ids_0** (`list[int]`) --
  List of IDs to which the special tokens will be added.
- **token_ids_1** (`list[int]`, *optional*) --
  Optional second list of IDs for sequence pairs.</paramsdesc><paramgroups>0</paramgroups><rettype>`list[int]`</rettype><retdesc>List of [input IDs](../glossary#input-ids) with the appropriate special tokens.</retdesc></docstring>

Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and
adding special tokens. An XLNet sequence has the following format:

- single sequence: `X <sep> <cls>`
- pair of sequences: `A <sep> B <sep> <cls>`








</div>
<div class="docstring border-l-2 border-t-2 pl-4 pt-3.5 border-gray-100 rounded-tl-xl mb-6 mt-8">


<docstring><name>create_token_type_ids_from_sequences</name><anchor>transformers.CpmTokenizerFast.create_token_type_ids_from_sequences</anchor><source>https://github.com/huggingface/transformers/blob/v4.57.0/src/transformers/models/cpm/tokenization_cpm_fast.py#L174</source><parameters>[{"name": "token_ids_0", "val": ": list"}, {"name": "token_ids_1", "val": ": typing.Optional[list[int]] = None"}]</parameters><paramsdesc>- **token_ids_0** (`list[int]`) --
  List of IDs.
- **token_ids_1** (`list[int]`, *optional*) --
  Optional second list of IDs for sequence pairs.</paramsdesc><paramgroups>0</paramgroups><rettype>`list[int]`</rettype><retdesc>List of [token type IDs](../glossary#token-type-ids) according to the given sequence(s).</retdesc></docstring>

Create a mask from the two sequences passed to be used in a sequence-pair classification task. An XLNet
sequence pair mask has the following format:

<ExampleCodeBlock anchor="transformers.CpmTokenizerFast.create_token_type_ids_from_sequences.example">

```
0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
| first sequence    | second sequence |
```

</ExampleCodeBlock>

If `token_ids_1` is `None`, this method only returns the first portion of the mask (0s).








</div></div>
