{
  "cells": [
    {
      "cell_type": "markdown",
      "id": "0151820b-1ed6-431d-8a87-23e24ad2dcde",
      "metadata": {
        "id": "0151820b-1ed6-431d-8a87-23e24ad2dcde"
      },
      "source": [
        "# <center> Готовим данные для файнтюнинга LLM. 🦙 </center>\n",
        "\n",
        "<img src='https://github.com/a-milenkin/LLM_practical_course/blob/main/images/lama_docs.webp?raw=1' align=\"right\" width=\"500\" height=\"428\" >\n",
        "\n",
        "В этом ноутбуке подготовим датасет для файнтюнинга LLM и загрузим его на `HuggingFace`. <br>\n",
        "\n",
        "В качестве источника данных скачаем все посты из Telegram-канала [Datafeeling](https://t.me/datafeeling), через который многие нашли этот курс. В дальнейшем, с помощью этого датасета зафайнтюним модель писать посты в стиле Алерона. <br>\n",
        "\n",
        "**Чтобы скачать все посты любого публичного TG-канала:**\n",
        "* в десктоп приложении Telegram переходим к канал\n",
        "* нажимаем три точки\n",
        "* выбираем - \"Экспорт истории чата\"\n",
        "* там выбираем формат (HTML или json), путь куда сохраняем и какой контент канала хотим скачать\n",
        "* получаем все посты в одном файле (в нашем случае json)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "c3e52fc5-ecb9-4a92-a943-afb96ac784c4",
      "metadata": {
        "id": "c3e52fc5-ecb9-4a92-a943-afb96ac784c4",
        "outputId": "e8c5791f-0b12-49d0-e785-bbc7d48b67f4"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\n",
            "sphinx-rtd-theme 2.0.0 requires docutils<0.21, but you have docutils 0.21.2 which is incompatible.\n",
            "mxnet 1.9.1 requires graphviz<0.9.0,>=0.8.1, but you have graphviz 0.20.3 which is incompatible.\n",
            "bentoml 0.13.2+6.g2dc5913c requires sqlalchemy<1.4.0,>=1.3.0, but you have sqlalchemy 2.0.29 which is incompatible.\n",
            "anomalib 0.7.0 requires pytorch-lightning<1.10.0,>=1.7.0, but you have pytorch-lightning 2.1.4 which is incompatible.\u001b[0m\u001b[31m\n",
            "\u001b[0m"
          ]
        }
      ],
      "source": [
        "!pip install datasets langchain langchain-openai openai -q"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "734683a0-e9ad-4e46-9537-9e4fc2e1235a",
      "metadata": {
        "jupyter": {
          "source_hidden": true
        },
        "id": "734683a0-e9ad-4e46-9537-9e4fc2e1235a"
      },
      "outputs": [],
      "source": [
        "# Для работы в колабе загрузите наш скрипт для использования ChatGPT с ключом курса и необходимые файлы!\n",
        "# !mkdir ../data/\n",
        "# !wget -P ../data https://raw.githubusercontent.com/a-milenkin/LLM_practical_course/main/data/result.json\n",
        "# !wget https://raw.githubusercontent.com/a-milenkin/LLM_practical_course/main/notebooks/utils.py"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "56e5b2d5-8c77-465a-9cef-2f58f120f4e0",
      "metadata": {
        "id": "56e5b2d5-8c77-465a-9cef-2f58f120f4e0"
      },
      "outputs": [],
      "source": [
        "import os\n",
        "from getpass import getpass\n",
        "import pandas as pd\n",
        "import json\n",
        "import warnings\n",
        "warnings.filterwarnings('ignore')"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "6cf8eb32-e0b4-4c68-9a15-8c35102f2033",
      "metadata": {
        "id": "6cf8eb32-e0b4-4c68-9a15-8c35102f2033",
        "outputId": "9d6f0bd4-ef42-4d6a-f6b0-3f75fec8ba6b"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "{'id': 3,\n",
              " 'type': 'message',\n",
              " 'date': '2021-03-21T20:26:01',\n",
              " 'date_unixtime': '1616343961',\n",
              " 'edited': '2021-04-11T00:37:28',\n",
              " 'edited_unixtime': '1618087048',\n",
              " 'from': '🏆 Data Feeling 🤹',\n",
              " 'from_id': 'channel1457597091',\n",
              " 'text': 'Всем привет!\\n\\nРешил запустить свой канал. Буду рассказывать здесь про свой опыт в Data Science и лайфхаки',\n",
              " 'text_entities': [{'type': 'plain',\n",
              "   'text': 'Всем привет!\\n\\nРешил запустить свой канал. Буду рассказывать здесь про свой опыт в Data Science и лайфхаки'}]}"
            ]
          },
          "execution_count": 10,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "# указываем путь к файлу, скаченному из телеграма\n",
        "js_path = '../data/result.json'\n",
        "\n",
        "# Load the JSON file\n",
        "with open(js_path, 'r') as file:\n",
        "    data = json.load(file)\n",
        "\n",
        "# посмотрим на первое сообщение\n",
        "data['messages'][2]"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "0ab52752-9b8f-4216-b0d8-9cc63d0b908d",
      "metadata": {
        "id": "0ab52752-9b8f-4216-b0d8-9cc63d0b908d"
      },
      "source": [
        "<div class=\"alert alert-success\">\n",
        "    \n",
        "Видим, что помимо текста поста в сообщениях содержится ещё много дополнительной информации. Для наших целей извлечём только тексты."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "cf567da8-902c-4090-b244-1ac707dc8680",
      "metadata": {
        "id": "cf567da8-902c-4090-b244-1ac707dc8680"
      },
      "outputs": [],
      "source": [
        "txt = []\n",
        "for message in data['messages']:\n",
        "    if isinstance(message['text'], str):\n",
        "        txt.append(message['text'])\n",
        "    else:\n",
        "        temp_txt = ''\n",
        "        for m in message['text']:\n",
        "            if isinstance(m, str):\n",
        "                temp_txt += m\n",
        "            else:\n",
        "                if isinstance(m, dict) and 'text' in m.keys():\n",
        "                    temp_txt += m['text']\n",
        "        txt.append(temp_txt)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "1a08b269-bb9e-4277-b457-78272c33ca69",
      "metadata": {
        "id": "1a08b269-bb9e-4277-b457-78272c33ca69",
        "outputId": "29a9bed6-9343-4610-aa0c-3d928b4140e3"
      },
      "outputs": [
        {
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>text</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td></td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td></td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>Всем привет!\\n\\nРешил запустить свой канал. Бу...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>Когда начинаешь погружаться в DS - начинаешь р...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>#лайфхаки\\nПоймал себя на периодическом подгля...</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain": [
              "                                                text\n",
              "0                                                   \n",
              "1                                                   \n",
              "2  Всем привет!\\n\\nРешил запустить свой канал. Бу...\n",
              "3  Когда начинаешь погружаться в DS - начинаешь р...\n",
              "4  #лайфхаки\\nПоймал себя на периодическом подгля..."
            ]
          },
          "execution_count": 16,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "df = pd.DataFrame({\"text\": txt})\n",
        "df.head()"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "10d805b1-bc70-4ddb-a1a9-d5f378c3e7f7",
      "metadata": {
        "id": "10d805b1-bc70-4ddb-a1a9-d5f378c3e7f7"
      },
      "source": [
        "<div class=\"alert alert-success\">\n",
        "    \n",
        "Видим, что есть посты без текстов - видео или картинки. <br>\n",
        "Преобразуем датасет в нужный формат и почистим от пропусков."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "04a42e1a-978d-4700-8d80-3d93d47b8eea",
      "metadata": {
        "id": "04a42e1a-978d-4700-8d80-3d93d47b8eea",
        "outputId": "e0266d76-9f3a-40b3-c6d4-65f0d32ce8a3"
      },
      "outputs": [
        {
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>Instruction</th>\n",
              "      <th>Input</th>\n",
              "      <th>Response</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Write a post on the following topic</td>\n",
              "      <td></td>\n",
              "      <td>Всем привет!\\n\\nРешил запустить свой канал. Бу...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>Write a post on the following topic</td>\n",
              "      <td></td>\n",
              "      <td>Когда начинаешь погружаться в DS - начинаешь р...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>Write a post on the following topic</td>\n",
              "      <td></td>\n",
              "      <td>#лайфхаки\\nПоймал себя на периодическом подгля...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>Write a post on the following topic</td>\n",
              "      <td></td>\n",
              "      <td>Speed up'ал, speed up'аю и буду speed up'ать с...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>Write a post on the following topic</td>\n",
              "      <td></td>\n",
              "      <td>За этот месяц раз 5 услышал слово \"экстраполяц...</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain": [
              "                           Instruction Input  \\\n",
              "0  Write a post on the following topic         \n",
              "1  Write a post on the following topic         \n",
              "2  Write a post on the following topic         \n",
              "3  Write a post on the following topic         \n",
              "4  Write a post on the following topic         \n",
              "\n",
              "                                            Response  \n",
              "0  Всем привет!\\n\\nРешил запустить свой канал. Бу...  \n",
              "1  Когда начинаешь погружаться в DS - начинаешь р...  \n",
              "2  #лайфхаки\\nПоймал себя на периодическом подгля...  \n",
              "3  Speed up'ал, speed up'аю и буду speed up'ать с...  \n",
              "4  За этот месяц раз 5 услышал слово \"экстраполяц...  "
            ]
          },
          "execution_count": 18,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "df = pd.DataFrame({\n",
        "    'Instruction': 'Write a post on the following topic', # Инструкция будет везде одинаковая\n",
        "    'Input': '',                                # поле с темой поста пока оставим пустым\n",
        "    'Response': [t for t in txt if len(t) != 0] # в Response записываем текст поста и убираем пустые строки\n",
        "})\n",
        "\n",
        "df.head()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "ee350083-79c0-46f2-8dfb-65b1062bd905",
      "metadata": {
        "id": "ee350083-79c0-46f2-8dfb-65b1062bd905",
        "outputId": "8782d7f3-ca9e-4910-ef31-99d2c3927c9f"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "<class 'pandas.core.frame.DataFrame'>\n",
            "RangeIndex: 587 entries, 0 to 586\n",
            "Data columns (total 3 columns):\n",
            " #   Column       Non-Null Count  Dtype \n",
            "---  ------       --------------  ----- \n",
            " 0   Instruction  587 non-null    object\n",
            " 1   Input        587 non-null    object\n",
            " 2   Response     587 non-null    object\n",
            "dtypes: object(3)\n",
            "memory usage: 13.9+ KB\n"
          ]
        }
      ],
      "source": [
        "# Получилось 587 постов с текстом.\n",
        "df.info()"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "2aa24ae9-b679-4321-9fc8-4a38b11515a8",
      "metadata": {
        "id": "2aa24ae9-b679-4321-9fc8-4a38b11515a8"
      },
      "source": [
        "## Определяем тематику поста с помощью ChatGPT 🤖\n",
        "\n",
        "Чтобы самим не заполнять поле input и не читать все 580 сообщений, определяя тему - попросим ЛЛМ сделать это за нас."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "c7baf94a-00d7-42c2-8429-69bedaa51704",
      "metadata": {
        "id": "c7baf94a-00d7-42c2-8429-69bedaa51704",
        "outputId": "96b1ae82-0596-43e2-b368-54ffaef8f8f2"
      },
      "outputs": [
        {
          "name": "stdin",
          "output_type": "stream",
          "text": [
            "Введите ваш OpenAI API ключ ········\n"
          ]
        }
      ],
      "source": [
        "from utils import ChatOpenAI\n",
        "\n",
        "#course_api_key= \"Введите ваш OpenAI API ключ\"\n",
        "course_api_key = getpass(prompt='Введите ваш OpenAI API ключ')\n",
        "\n",
        "# инициализируем языковую модель\n",
        "llm = ChatOpenAI(temperature=0.0, course_api_key=course_api_key)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "a74034dd-2b08-4d27-bb02-3346167ef88f",
      "metadata": {
        "id": "a74034dd-2b08-4d27-bb02-3346167ef88f",
        "outputId": "ec4d2a3f-4cc6-4f0c-9db4-00bc0d9c0f1f"
      },
      "outputs": [
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "100%|██████████| 587/587 [07:25<00:00,  1.32it/s]\n"
          ]
        }
      ],
      "source": [
        "from tqdm import tqdm\n",
        "tqdm.pandas()\n",
        "\n",
        "# Функция для определения тематики поста. Примени её ко всему датасету.\n",
        "def topic_extractor(text):\n",
        "\n",
        "    answer = llm.invoke(f'Extract a phrase that best describes the topic of the following post. \\n Post: {text}')\n",
        "    return str(answer.content)\n",
        "\n",
        "df['Input'] = df['Response'].progress_apply(topic_extractor)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "d539446e-1d83-43d0-a25b-d947fe4e4546",
      "metadata": {
        "id": "d539446e-1d83-43d0-a25b-d947fe4e4546",
        "outputId": "7c052616-b98d-4519-c319-2f8bb826878b"
      },
      "outputs": [
        {
          "data": {
            "text/html": [
              "<div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>Instruction</th>\n",
              "      <th>Input</th>\n",
              "      <th>Response</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>0</th>\n",
              "      <td>Write a post on the following topic</td>\n",
              "      <td>Запуск канала о Data Science</td>\n",
              "      <td>Всем привет!\\n\\nРешил запустить свой канал. Бу...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>1</th>\n",
              "      <td>Write a post on the following topic</td>\n",
              "      <td>Простые примеры кода для новых методов/алгоритмов</td>\n",
              "      <td>Когда начинаешь погружаться в DS - начинаешь р...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>2</th>\n",
              "      <td>Write a post on the following topic</td>\n",
              "      <td>Полезные расширения для работы с табличными да...</td>\n",
              "      <td>#лайфхаки\\nПоймал себя на периодическом подгля...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>3</th>\n",
              "      <td>Write a post on the following topic</td>\n",
              "      <td>Ways to speed up Pytorch training on Kaggle</td>\n",
              "      <td>Speed up'ал, speed up'аю и буду speed up'ать с...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>4</th>\n",
              "      <td>Write a post on the following topic</td>\n",
              "      <td>Интерпретация моделей в машинном обучении</td>\n",
              "      <td>За этот месяц раз 5 услышал слово \"экстраполяц...</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>"
            ],
            "text/plain": [
              "                           Instruction  \\\n",
              "0  Write a post on the following topic   \n",
              "1  Write a post on the following topic   \n",
              "2  Write a post on the following topic   \n",
              "3  Write a post on the following topic   \n",
              "4  Write a post on the following topic   \n",
              "\n",
              "                                               Input  \\\n",
              "0                       Запуск канала о Data Science   \n",
              "1  Простые примеры кода для новых методов/алгоритмов   \n",
              "2  Полезные расширения для работы с табличными да...   \n",
              "3        Ways to speed up Pytorch training on Kaggle   \n",
              "4          Интерпретация моделей в машинном обучении   \n",
              "\n",
              "                                            Response  \n",
              "0  Всем привет!\\n\\nРешил запустить свой канал. Бу...  \n",
              "1  Когда начинаешь погружаться в DS - начинаешь р...  \n",
              "2  #лайфхаки\\nПоймал себя на периодическом подгля...  \n",
              "3  Speed up'ал, speed up'аю и буду speed up'ать с...  \n",
              "4  За этот месяц раз 5 услышал слово \"экстраполяц...  "
            ]
          },
          "execution_count": 20,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "df.head()"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "9ae95235-8f78-48db-af38-32cab49f0695",
      "metadata": {
        "id": "9ae95235-8f78-48db-af38-32cab49f0695"
      },
      "source": [
        "<div class=\"alert alert-success\">\n",
        "    \n",
        "Видим, что модель отлично справилась с задачей - тематика хорошо определена."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "f844bc06-0623-43d0-9130-66551a097081",
      "metadata": {
        "id": "f844bc06-0623-43d0-9130-66551a097081"
      },
      "outputs": [],
      "source": [
        "# Сохраняем датасет\n",
        "df.to_csv('../data/prepared_dataset.csv', index=False)"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "812e95c4-56c1-49cf-8f6f-f80c853fdf13",
      "metadata": {
        "id": "812e95c4-56c1-49cf-8f6f-f80c853fdf13"
      },
      "source": [
        "# Загружаем датасет на HuggingFace 🤗"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "e01a4805-7b7c-4db7-b3e9-0d86ce0e599b",
      "metadata": {
        "id": "e01a4805-7b7c-4db7-b3e9-0d86ce0e599b"
      },
      "source": [
        "<div class=\"alert alert-success\">\n",
        "    \n",
        "Теперь переходим на [HuggingFace](https://huggingface.co) и создаем новый датасет, затем в него подгружаем получившийся файл. <br>\n",
        "Теперь датасет доступен всем по ссылке: https://huggingface.co/datasets/Ivanich/datafeeling_posts"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "e00a298c-5831-4068-bc8e-e5cb9b1380f4",
      "metadata": {
        "id": "e00a298c-5831-4068-bc8e-e5cb9b1380f4",
        "outputId": "4f3827be-f90a-4edd-d636-5c10d776e703",
        "colab": {
          "referenced_widgets": [
            "aa6557d3dc84484a82a3ede5a7cb64f5",
            "5d754fe6aab047f6b85026cd0d27ec6c",
            "3cf694ee03f140cc94889e099db74697"
          ]
        }
      },
      "outputs": [
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "aa6557d3dc84484a82a3ede5a7cb64f5",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Downloading readme:   0%|          | 0.00/157 [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "5d754fe6aab047f6b85026cd0d27ec6c",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Downloading data:   0%|          | 0.00/662k [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "3cf694ee03f140cc94889e099db74697",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Generating train split:   0%|          | 0/587 [00:00<?, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        }
      ],
      "source": [
        "# Попробуем его подгрузить прямо с HuggingFace\n",
        "from datasets import load_dataset\n",
        "\n",
        "dataset = load_dataset(\"Ivanich/datafeeling_posts\")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "1327f2a0-785b-47ba-81a5-80de41fa6549",
      "metadata": {
        "id": "1327f2a0-785b-47ba-81a5-80de41fa6549",
        "outputId": "43520c78-162b-473b-ef50-1e232fcb7789"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "DatasetDict({\n",
              "    train: Dataset({\n",
              "        features: ['Instruction', 'Input', 'Response'],\n",
              "        num_rows: 587\n",
              "    })\n",
              "})"
            ]
          },
          "execution_count": 24,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "dataset"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "56b29e2c-5f70-444c-963d-5b1b93dfda86",
      "metadata": {
        "id": "56b29e2c-5f70-444c-963d-5b1b93dfda86",
        "outputId": "4919935f-ebff-4cec-d487-c1124fee53ba"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "{'Instruction': 'Write a post on the following topic',\n",
              " 'Input': 'Запуск канала о Data Science',\n",
              " 'Response': 'Всем привет!\\n\\nРешил запустить свой канал. Буду рассказывать здесь про свой опыт в Data Science и лайфхаки'}"
            ]
          },
          "execution_count": 26,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "dataset['train'][0]"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "be699ae3-8e44-4540-9d98-c35503dad107",
      "metadata": {
        "id": "be699ae3-8e44-4540-9d98-c35503dad107"
      },
      "source": [
        "### В следующем ноутбуке c помощью этого датасета будем файнтюнить Llama 3.1"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "cv",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.10.12"
    },
    "colab": {
      "provenance": []
    }
  },
  "nbformat": 4,
  "nbformat_minor": 5
}