# Dataset Specifications and Upload Guide

> Upload and manage datasets for Friendli Dedicated Endpoints. Covers supported formats, size limits, splits, versioning, and the upload process.

export const RoundedBorderBox = ({children, caption}) => <div className="rounded-border-box">
    {children}
    {caption && <p className="text-sm text-gray-700 dark:text-gray-400">{caption}</p>}
  </div>;

## Uploading Datasets

This guide explains how to upload datasets to Friendli, either through the web interface or with the SDK.

<Tabs>
  <Tab title="Uploading via Web Interface">
    You can upload datasets through the web interface as `.jsonl` or `.parquet` files. Each dataset must follow one of the formats below.

    ### Conversation

    This is the most basic dataset format. Each record contains a `messages` list, and the `role` field of each message can be `system`, `user`, or `assistant`.

    ```jsonl theme={null}
    {"messages": [{"role": "...", "content": "..."}]}
    ```
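
    Before uploading, it can help to check each line locally against the shape described above. The following is a minimal sketch, not part of the Friendli SDK: the function name `validate_conversation_line` and the rule that `messages` must be non-empty are assumptions made for illustration.

    ```python
    import json

    # Roles listed in this guide for the Conversation format.
    ALLOWED_ROLES = ("system", "user", "assistant")


    def validate_conversation_line(line: str) -> list:
        """Return a list of problems found in one JSONL line (empty if none)."""
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            return [f"invalid JSON: {exc.msg}"]

        if not isinstance(record, dict):
            return ["record must be a JSON object"]

        messages = record.get("messages")
        # Assumption: an empty conversation is treated as an error.
        if not isinstance(messages, list) or not messages:
            return ["'messages' must be a non-empty list"]

        problems = []
        for index, message in enumerate(messages):
            if not isinstance(message, dict):
                problems.append(f"message {index}: must be an object")
                continue
            role = message.get("role")
            if role not in ALLOWED_ROLES:
                problems.append(f"message {index}: unexpected role {role!r}")
            if "content" not in message:
                problems.append(f"message {index}: missing 'content'")
        return problems
    ```

    Running this over every line of a file and printing any non-empty result is usually enough to catch malformed records before they reach the upload step.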

    ### Alpaca (Beta)

    Two Alpaca variants are supported, with and without the optional `input` field, as shown below.\
    For compatibility with the Conversation format, Alpaca records are automatically converted according to a template during upload. To avoid automatic conversion, convert your data to the Conversation format before uploading, or upload with the SDK.

    ```jsonl theme={null}
    {"instruction": "...", "output": "..."}
    {"instruction": "...", "input": "...", "output": "..."}
    ```
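
    If you prefer to convert Alpaca records yourself before uploading, a conversion along the following lines is one option. This is a sketch only: the template used here (instruction, then a blank line, then the input when present) is an assumption for illustration and is not necessarily the template Friendli applies. The function names are hypothetical.

    ```python
    import json


    def alpaca_to_conversation(record: dict) -> dict:
        """Convert one Alpaca record to the Conversation format.

        Assumed template (not Friendli's): the instruction, followed by a
        blank line and the input when one is present, becomes the user message.
        """
        prompt = record["instruction"]
        if record.get("input"):
            prompt = f"{prompt}\n\n{record['input']}"
        return {
            "messages": [
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": record["output"]},
            ]
        }


    def convert_lines(lines):
        """Yield one Conversation-format JSON string per non-empty Alpaca line."""
        for line in lines:
            if line.strip():
                yield json.dumps(alpaca_to_conversation(json.loads(line)))
    ```

    Writing each string yielded by `convert_lines` on its own line produces a `.jsonl` file in the Conversation format.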

    ### Multi-Modal (Image)

    For multi-modal inputs, the following three formats are supported for compatibility.\
    Currently, the web interface does not support local paths, base64-encoded data, or `PIL.Image` objects as image values. For these cases, upload with the SDK.

    ```jsonl theme={null}
    {"messages": [{"role": "...", "content": [{"type": "text", "text": "..."}, {"type": "image", "image": "https://example.com/image.jpg"}]}]}
    {"messages": [{"role": "...", "content": [{"type": "text", "text": "..."}, {"type": "image", "image_url": "https://example.com/image.jpg"}]}]}
    {"messages": [{"role": "...", "content": [{"type": "text", "text": "..."}, {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}]}]}
    ```
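
    The three variants differ only in where the image URL is stored. If your own tooling needs to read any of them, a small helper such as the hypothetical `extract_image_url` below can normalize the lookup. It is a sketch based only on the three shapes shown above.

    ```python
    from typing import Optional


    def extract_image_url(part: dict) -> Optional[str]:
        """Return the image URL from a content part, or None for non-image parts."""
        part_type = part.get("type")
        if part_type == "image":
            # Variants 1 and 2: the URL is under "image" or under "image_url".
            return part.get("image") or part.get("image_url")
        if part_type == "image_url":
            # Variant 3: the URL is nested as {"image_url": {"url": ...}}.
            return part["image_url"]["url"]
        return None
    ```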

    ### How to Upload a Dataset

    First, go to [Friendli Suite > Labs > Datasets](https://friendli.ai/suite/~/datasets).
    Click the **'New Dataset'** button to start the upload process.\
    From the dropdown, select the **'Upload a file directly'** option.

    <RoundedBorderBox>
      <img alt="Uploading a Dataset Step 1" src="https://mintcdn.com/friendliai/iNoQgE8-CG9tce1q/static/images/guides/dataset/dataset-step-1.png?fit=max&auto=format&n=iNoQgE8-CG9tce1q&q=85&s=788b26e54896bb4f456c0e0832c89bb6" width="1447" height="705" data-path="static/images/guides/dataset/dataset-step-1.png" />
    </RoundedBorderBox>

    Click the File Upload Area in the Dataset file section, or drag and drop the file you want to upload. Then click the **'Upload'** button to start uploading.

    <RoundedBorderBox>
      <img alt="Uploading a Dataset Step 2" src="https://mintcdn.com/friendliai/iNoQgE8-CG9tce1q/static/images/guides/dataset/dataset-step-2.png?fit=max&auto=format&n=iNoQgE8-CG9tce1q&q=85&s=6d561007cf5e6b7fa5cea96d0fa9b436" width="1447" height="705" data-path="static/images/guides/dataset/dataset-step-2.png" />
    </RoundedBorderBox>

    Friendli uploads the dataset progressively in the background. Once the upload is complete, you can rename it, add splits, and preview each split.

    <RoundedBorderBox>
      <img alt="Uploading a Dataset Step 3" src="https://mintcdn.com/friendliai/iNoQgE8-CG9tce1q/static/images/guides/dataset/dataset-step-3.png?fit=max&auto=format&n=iNoQgE8-CG9tce1q&q=85&s=1ab44e7f34124ae410928127656aa625" width="1447" height="705" data-path="static/images/guides/dataset/dataset-step-3.png" />
    </RoundedBorderBox>
  </Tab>

  <Tab title="Uploading via SDK">
    ### Prerequisites

    1. Head to [Friendli Suite](https://friendli.ai/suite) and create an account.
    2. Issue a **Personal API Key** by going to [Personal Settings > API Keys](https://friendli.ai/suite/~/setting/keys).
       Copy the key and store it in a safe place, because you won't be able to see it again after refreshing the page.\
       For detailed instructions, see [Personal API Keys](/guides/suite/personal-api-keys).

    ### Step 1. Prepare Your Dataset

    Your dataset should be a conversational dataset in `.jsonl` or `.parquet` format, where each line represents a sequence of messages. Each message in the conversation should include a `"role"` (e.g., `system`, `user`, or `assistant`) and `"content"`. For VLM fine-tuning, user content can contain both text and image data. Image data can be provided as a URL or as a Base64-encoded string.

    Here's an example of a single record. In the actual file it occupies one line; it is pretty-printed here for readability:

    ```json theme={null}
    {
      "messages": [
        {
          "role": "system",
          "content": "You are a helpful assistant."
        },
        {
          "role": "user",
          "content": [
            {
              "type": "image",
              "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"
            },
            {
              "type": "image",
              "image": "data:image/png;base64,<base64-encoded-data>"
            },
            {
              "type": "text",
              "text": "Describe this image in detail."
            }
          ]
        },
        {
          "role": "assistant",
          "content": "The image is a bee."
        }
      ]
    }
    ```
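
    Two details of the example above can be handled with the Python standard library: building a Base64 data URI from image bytes, and serializing a record back to a single line. The helper names `to_data_uri` and `to_jsonl_line` are hypothetical, and the default MIME type is an assumption that should match your actual image format.

    ```python
    import base64
    import json


    def to_data_uri(image_bytes: bytes, mime_type: str = "image/png") -> str:
        """Encode raw image bytes as a data URI like the one in the example."""
        encoded = base64.b64encode(image_bytes).decode("ascii")
        return f"data:{mime_type};base64,{encoded}"


    def to_jsonl_line(record: dict) -> str:
        """Serialize a record to one line of JSON.

        json.dumps without `indent` emits no raw newlines; newlines inside
        string values are escaped, so the record stays on a single line.
        """
        return json.dumps(record)
    ```

    For a local file, the bytes can be read with `open(path, "rb").read()` and passed to `to_data_uri`, then placed in the `"image"` field of a content part.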

    <Info>
      You can access our example datasets ['FriendliAI/gsm8k'](https://huggingface.co/datasets/FriendliAI/gsm8k) (for chat) and ['FriendliAI/sample-vision'](https://huggingface.co/datasets/FriendliAI/sample-vision) (for chat with images), and explore some of our quantized generative AI models on [our Hugging Face page](https://huggingface.co/FriendliAI).
    </Info>

    ### Step 2. Upload Your Dataset

    Once you have prepared your dataset, you can upload it to Friendli using the [Python SDK](/sdk/python-sdk).

    #### Install the Python SDK

    First, install the Friendli Python SDK:

    ```bash theme={null}
    # Using pip
    pip install friendli

    # Using poetry
    poetry add friendli
    ```

    #### Upload Your Dataset

    Use the following code to create a dataset and upload your samples:

    ```python theme={null}
    import os

    from friendli.friendli import SyncFriendli
    from friendli.models import Sample

    TEAM_ID = os.environ["FRIENDLI_TEAM_ID"]
    PROJECT_ID = os.environ["FRIENDLI_PROJECT_ID"]
    TOKEN = os.environ["API_KEY"]

    # Read the dataset file and parse each non-empty line as a Sample
    # (blank lines, such as a trailing newline, are skipped)
    with open("dataset.jsonl", "rb") as f:
        data = [Sample.model_validate_json(line) for line in f if line.strip()]

    with SyncFriendli(
        token=TOKEN,
        x_friendli_team=TEAM_ID,
    ) as friendli:
        # Create a new dataset with TEXT and IMAGE modalities
        with friendli.dataset.create(
            modality=["TEXT", "IMAGE"],
            name="my-vlm-dataset",  # name of the dataset
            project_id=PROJECT_ID,
        ) as dataset:
            # Upload samples to the dataset
            # Each line from your dataset file becomes a separate sample
            dataset.upload_samples(
                samples=data,
                split="train",  # name of the split to upload to
            )
    ```

    #### How It Works

    The Friendli Python SDK doesn't upload your dataset file as a single unit. Instead, the code above works as follows:

    1. **Reads your dataset file line by line**: Each line is parsed as a `Sample` object containing a conversation with messages.

    2. **Creates a dataset**: A new dataset is created in your Friendli project with the specified modalities (`TEXT` and `IMAGE`).

    3. **Uploads each conversation as a separate sample**: Rather than uploading the entire file, each conversation (line in the dataset file) becomes an individual sample in the dataset.

    4. **Organizes by splits**: Samples are organized into splits like "train", "validation", or "test" for different purposes.
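
    If your data is in one file and you want more than one split, the lines can be divided locally before uploading. The sketch below uses only the standard library. The function name `assign_splits`, the default ratio, and the fixed seed are assumptions for illustration, not SDK behavior.

    ```python
    import random


    def assign_splits(lines, validation_ratio=0.1, seed=0):
        """Shuffle non-empty lines and divide them into train and validation lists."""
        records = [line for line in lines if line.strip()]
        # A seeded Random instance keeps the split reproducible across runs.
        random.Random(seed).shuffle(records)
        cutoff = int(len(records) * validation_ratio)
        return {"validation": records[:cutoff], "train": records[cutoff:]}
    ```

    Each resulting list could then be parsed into `Sample` objects and passed to `dataset.upload_samples` with the matching `split` name, presumably one call per split.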

    #### Environment Variables

    Make sure to set the required environment variables:

    ```bash theme={null}
    export API_KEY="your-api-key"
    export FRIENDLI_TEAM_ID="your-team-id"
    export FRIENDLI_PROJECT_ID="your-project-id"
    ```

    You can find your Team ID and Project ID in the URL of Friendli Suite, formatted as `https://friendli.ai/<teamId>/<projectId>/...`.
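
    Because the upload script reads these variables with `os.environ[...]`, a missing variable surfaces as a bare `KeyError`. A small check like the following sketch can report all missing names at once. The helper `missing_env_vars` is hypothetical and not part of the SDK.

    ```python
    import os

    # Variable names used by the upload script in this guide.
    REQUIRED_VARS = ("API_KEY", "FRIENDLI_TEAM_ID", "FRIENDLI_PROJECT_ID")


    def missing_env_vars(names=REQUIRED_VARS, environ=None):
        """Return the names that are unset or empty in the given environment."""
        environ = os.environ if environ is None else environ
        return [name for name in names if not environ.get(name)]
    ```

    Calling `missing_env_vars()` at the top of the script and exiting with a clear message when the list is non-empty avoids a partial run.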

    #### View Your Dataset

    To view and edit the datasets you've uploaded, visit [Friendli Suite > Datasets](https://friendli.ai/suite/~/datasets).

    <RoundedBorderBox>
      <img alt="View Datasets in Friendli Suite" src="https://mintcdn.com/friendliai/_2og0Pv8F7JaN5sy/static/images/guides/tutorials/how-to-fine-tune-vlm/datasets.png?fit=max&auto=format&n=_2og0Pv8F7JaN5sy&q=85&s=81e62408addf444a08b4a1d84b4222ea" width="2940" height="1596" data-path="static/images/guides/tutorials/how-to-fine-tune-vlm/datasets.png" />
    </RoundedBorderBox>

    <br />

    <RoundedBorderBox>
      <img alt="View Dataset in Friendli Suite" src="https://mintcdn.com/friendliai/_2og0Pv8F7JaN5sy/static/images/guides/tutorials/how-to-fine-tune-vlm/dataset.png?fit=max&auto=format&n=_2og0Pv8F7JaN5sy&q=85&s=116621b89571d85412a91da1ee1cc8b6" width="2940" height="1596" data-path="static/images/guides/tutorials/how-to-fine-tune-vlm/dataset.png" />
    </RoundedBorderBox>
  </Tab>
</Tabs>
