diff --git a/docs/mindformers/docs/source_en/quick_start/source_code_start.md b/docs/mindformers/docs/source_en/quick_start/source_code_start.md
index d7a0929eeac01a9995ef095f6989b0156a783802..82d3dd913f9d57844b7e4ee3b6f04ee7a9bb3c8a
--- a/docs/mindformers/docs/source_en/quick_start/source_code_start.md
+++ b/docs/mindformers/docs/source_en/quick_start/source_code_start.md
@@ -20,11 +20,12 @@ Word list download link: [tokenizer.model](https://ascend-repo-modelzoo.obs.cn-e
 
 2. Data Preprocessing
 
+   The following command needs to be executed in the MindFormers root directory:
+
    1. Execute [mindformers/tools/dataset_preprocess/llama/alpaca_converter.py](https://gitee.com/mindspore/mindformers/blob/dev/mindformers/tools/dataset_preprocess/llama/alpaca_converter.py), and use the fastchat tool to add prompt templates to convert the raw dataset into a multi-round conversation format.
 
       ```shell
-
-      python alpaca_converter.py \
+      python mindformers/tools/dataset_preprocess/llama/alpaca_converter.py \
       --data_path /{path}/alpaca_data.json \
       --output_path /{path}/alpaca-data-conversation.json
       ```
@@ -37,8 +38,8 @@ Word list download link: [tokenizer.model](https://ascend-repo-modelzoo.obs.cn-e
    2. Execute [mindformers/tools/dataset_preprocess/llama/llama_preprocess.py](https://gitee.com/mindspore/mindformers/blob/dev/mindformers/tools/dataset_preprocess/llama/llama_preprocess.py), and generate MindRecord data and convert data with prompt templates to MindRecord format.
 
       ```shell
-      # This tool relies on the fschat toolkit to parse prompt templates, please install fschat >= 0.2.13 python = 3.9 in advance.
-      python llama_preprocess.py \
+      # This tool relies on the fschat toolkit to parse prompt templates, please install fschat >= 0.2.13 in advance.
+      python mindformers/tools/dataset_preprocess/llama/llama_preprocess.py \
       --dataset_type qa \
       --input_glob /{path}/alpaca-data-conversation.json \
       --model_file /{path}/tokenizer.model \
@@ -67,7 +68,7 @@ Word list download link: [tokenizer.model](https://ascend-repo-modelzoo.obs.cn-e
 
 ## Initiating Fine-tuning
 
-Use the `run_mindformer.py` unified script to pull up tasks:
+In the MindFormers root directory, use the `run_mindformer.py` unified script to pull up tasks:
 
 - Specify the `config` path `configs/llama2/lora_llama2_7b.yaml` via `--config`.
 - Specify dataset path `/{path}/alpaca-fastchat4096.mindrecord` via `-train_dataset_dir`.
diff --git a/docs/mindformers/docs/source_en/usage/sft_tuning.md b/docs/mindformers/docs/source_en/usage/sft_tuning.md
index 4b16db7f75bb1610232bea081b18470e73bcaf6f..adff97d2644a0dd8786e4bd8601e9ba7e4710aca
--- a/docs/mindformers/docs/source_en/usage/sft_tuning.md
+++ b/docs/mindformers/docs/source_en/usage/sft_tuning.md
@@ -102,7 +102,7 @@ The following uses the alpaca dataset as an example. After downloading the datas
    output_path: path for storing output files.
    ```
 
-2. Run the [llama_preprocess.py script](https://gitee.com/mindspore/mindformers/blob/dev/mindformers/tools/dataset_preprocess/llama/llama_preprocess.py) in MindFormers to convert the data into the MindRecord format. This operation depends on the fastchat tool package to parse the prompt template. You need to install fastchat 0.2.13 or later and Python 3.9 in advance.
+2. Run the [llama_preprocess.py script](https://gitee.com/mindspore/mindformers/blob/dev/mindformers/tools/dataset_preprocess/llama/llama_preprocess.py) in MindFormers to convert the data into the MindRecord format. This operation depends on the fastchat tool package to parse the prompt template. You need to install fastchat 0.2.13 or later in advance.
 
    ```bash
    python llama_preprocess.py \
diff --git a/docs/mindformers/docs/source_zh_cn/quick_start/source_code_start.md b/docs/mindformers/docs/source_zh_cn/quick_start/source_code_start.md
index cdf8427256458156410f1abe4553ddfe6f6bc51f..0f43ee8861e8bb85bf8d48aa64920ee243c33e9c
--- a/docs/mindformers/docs/source_zh_cn/quick_start/source_code_start.md
+++ b/docs/mindformers/docs/source_zh_cn/quick_start/source_code_start.md
@@ -20,11 +20,12 @@ MindFormers提供已经转换完成的预训练权重、词表文件用于预训
 
 2. 数据预处理
 
+   需要在MindFormers根目录下执行以下操作:
+
    1. 执行[mindformers/tools/dataset_preprocess/llama/alpaca_converter.py](https://gitee.com/mindspore/mindformers/blob/dev/mindformers/tools/dataset_preprocess/llama/alpaca_converter.py),使用fastchat工具添加prompt模板,将原始数据集转换为多轮对话格式。
 
       ```shell
-
-      python alpaca_converter.py \
+      python mindformers/tools/dataset_preprocess/llama/alpaca_converter.py \
       --data_path /{path}/alpaca_data.json \
       --output_path /{path}/alpaca-data-conversation.json
       ```
@@ -37,8 +38,8 @@ MindFormers提供已经转换完成的预训练权重、词表文件用于预训
    2. 执行[mindformers/tools/dataset_preprocess/llama/llama_preprocess.py](https://gitee.com/mindspore/mindformers/blob/dev/mindformers/tools/dataset_preprocess/llama/llama_preprocess.py),生成MindRecord数据,将带有prompt模板的数据转换为MindRecord格式。
 
       ```shell
-      # 此工具依赖fschat工具包解析prompt模板,请提前安装fschat >= 0.2.13 python = 3.9
-      python llama_preprocess.py \
+      # 此工具依赖fschat工具包解析prompt模板,请提前安装fschat >= 0.2.13
+      python mindformers/tools/dataset_preprocess/llama/llama_preprocess.py \
       --dataset_type qa \
       --input_glob /{path}/alpaca-data-conversation.json \
       --model_file /{path}/tokenizer.model \
@@ -67,7 +68,7 @@ MindFormers提供已经转换完成的预训练权重、词表文件用于预训
 
 ## 启动微调
 
-使用`run_mindformer.py`统一脚本拉起任务:
+在MindFormers根目录下,使用`run_mindformer.py`统一脚本拉起任务:
 
 - 通过 `--config` 指定`config`路径 `configs/llama2/lora_llama2_7b.yaml`。
 - 通过 `--train_dataset_dir` 指定数据集路径 `/{path}/alpaca-fastchat4096.mindrecord`。
diff --git a/docs/mindformers/docs/source_zh_cn/usage/sft_tuning.md b/docs/mindformers/docs/source_zh_cn/usage/sft_tuning.md
index bc1172ef82f3c69dce510df6e255d36141cd9473..3df756a9b147cb81cda9341aa48ffa10fa3692ee
--- a/docs/mindformers/docs/source_zh_cn/usage/sft_tuning.md
+++ b/docs/mindformers/docs/source_zh_cn/usage/sft_tuning.md
@@ -102,7 +102,7 @@ MindFormers提供**WikiText2**作为预训练数据集,**alpaca**作为微调
    output_path: 输出文件的保存路径
    ```
 
-2. 执行MindFormers中的[llama_preprocess.py脚本](https://gitee.com/mindspore/mindformers/blob/dev/mindformers/tools/dataset_preprocess/llama/llama_preprocess.py),将数据转换为MindRecord格式。该操作依赖fastchat工具包解析prompt模板, 请提前安装fastchat >= 0.2.13 python = 3.9。
+2. 执行MindFormers中的[llama_preprocess.py脚本](https://gitee.com/mindspore/mindformers/blob/dev/mindformers/tools/dataset_preprocess/llama/llama_preprocess.py),将数据转换为MindRecord格式。该操作依赖fastchat工具包解析prompt模板, 请提前安装fastchat >= 0.2.13。
 
    ```bash
    python llama_preprocess.py \