
Introduction

This guide walks you through deploying Friendli Container as an Amazon EKS Add-on to enable real-time inference on your Kubernetes cluster. By using Friendli Container in your EKS environment, you benefit from the Friendli Engine’s speed and resource efficiency. We’ll cover how to configure GPU nodes, install the add-on, and create inference deployments using Kubernetes manifests.
Walking through this tutorial is easier with the eksctl and AWS CLI tools. Visit the eksctl documentation and the AWS CLI homepage for installation guides.
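Once the tools are installed, you can quickly check that they are available on your PATH (a minimal sanity check; version output will vary):
eksctl version
aws --version
kubectl version --client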

General Workflow

  1. Add GPU Node Group: Create a GPU-enabled node group in your EKS cluster with instances like g6.xlarge or g5.2xlarge.
  2. Configure Friendli Container EKS add-on: Subscribe to the Friendli Container add-on from the AWS Marketplace and configure IRSA for license validation.
  3. Create Friendli Deployment: Deploy your model using the FriendliDeployment custom resource.
  4. Run Inference: Send inference requests to your deployed model.

Prerequisites

  • AWS account with permissions for EKS, IAM, EC2 operations
  • eksctl and AWS CLI tools installed and configured
  • kubectl configured to access your EKS cluster
  • (Optional) Hugging Face token if deploying gated/private models. Hugging Face token docs

1. Add GPU Node Group to your EKS Cluster

You need an active Amazon EKS cluster. To create a cluster, consult the Amazon EKS documentation on creating an EKS cluster.
The Friendli Container EKS add-on requires Kubernetes version 1.29 or later.
When selecting the AWS region for your new EKS cluster, availability of GPU instances is one of the key factors to consider. You can check instance availability here.
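For example, you can check whether a particular GPU instance type is offered in a given region with the AWS CLI (the region and instance type below are placeholders; substitute your own):
aws ec2 describe-instance-type-offerings \
  --region us-east-1 \
  --location-type availability-zone \
  --filters Name=instance-type,Values=g6.2xlarge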
Supported NVIDIA devices and the corresponding AWS EC2 instance types:
  • B200: P6 instances
  • H200: P5 instances
  • H100: P5 instances
  • A100: P4 instances
  • L40S: G6e instances
  • A10G: G5 instances
  • L4: G6 instances
If you’re going to use multi-GPU VM instance types, installing the NVIDIA GPU Operator is highly recommended for proper resource management. You can consult the guide from NVIDIA GPU Operator, and an example of installing the GPU Operator using helm can be found here; a minimal sketch also follows the add-on list below.
This tutorial assumes the following EKS Add-ons are installed in your cluster. You can click the “Get more add-ons” button in the “AWS add-ons” section to install them.
  • Amazon VPC CNI
  • CoreDNS
  • kube-proxy
  • Amazon EKS Pod Identity Agent
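For reference, installing the GPU Operator with helm might look like the following (a sketch based on NVIDIA’s public helm chart; the namespace and release name here are our own choices):
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace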
Now let’s add a GPU node group to your EKS cluster.
  • Open Amazon EKS console and choose the cluster that you want to create a node group in.
  • Select the “Compute” tab and click “Add node group”.
  • Configure the new node group by entering the name, Node IAM role, and other information. You can click “Create recommended role” to create the IAM role. Click “Next”.
  • On the next page, select “Amazon Linux 2023 (x86_64) Nvidia” for AMI type.
  • Select the appropriate instance type for the GPU device of your choice.
    • Suggested instance type for this tutorial is g6.2xlarge.
  • Configure the disk size. It should be large enough to download the model you want to deploy.
    • Suggested disk size for this tutorial is 100GB.
  • Configure the desired node group size.
  • Go through the rest of the steps, review the changes and click “Create”.
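If you prefer the CLI, a roughly equivalent managed node group can be created with eksctl (a sketch; replace the placeholders and adjust the size to your needs; note that eksctl may need extra VPC/subnet configuration for clusters it did not create):
eksctl create nodegroup \
  --cluster <CLUSTER> --region <REGION> \
  --name <NODE GROUP NAME> \
  --node-type g6.2xlarge \
  --nodes 1 \
  --node-volume-size 100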

2. Configure Friendli Container EKS add-on

  • Open Amazon EKS console and choose the cluster that you want to configure.
  • Select the “Add-ons” tab and click “Get more add-ons”.
  • Scroll down and under the section “AWS Marketplace add-ons”, search and check “Friendli Container”, and click “Next”.
  • Click “Next”, review your settings, and click “Create”.
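You can confirm the add-on was installed by listing your cluster’s add-ons (the exact add-on name shown will depend on the Marketplace listing):
aws eks list-addons --region <REGION> --cluster-name <CLUSTER>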
Now you need to allow the Kubernetes ServiceAccount to contact AWS Marketplace for license validation. Execute the following commands, replacing <REGION> with the AWS region where you created the cluster and <CLUSTER> with your EKS cluster name.
eksctl utils associate-iam-oidc-provider --region <REGION> --cluster <CLUSTER> --approve

eksctl create iamserviceaccount --region <REGION> --cluster <CLUSTER> \
  --namespace default --name default \
  --role-name AWSMarketplaceMeteringAccessForFriendliContainer \
  --attach-policy-arn arn:aws:iam::aws:policy/AWSMarketplaceMeteringFullAccess \
  --approve --override-existing-serviceaccounts
The commands above configure IAM roles for service accounts (IRSA) so that the Kubernetes ServiceAccount default in the default namespace can exercise the AWSMarketplaceMeteringFullAccess policy on your behalf. Click here to learn more about IRSA.
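If the commands succeed, the ServiceAccount should carry an eks.amazonaws.com/role-arn annotation pointing at the new role, which you can verify with:
kubectl describe serviceaccount default -n default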

3. Create Friendli Deployment

You need the kubectl CLI tool configured to access your EKS cluster. Consult this guide from AWS for more details.
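If kubectl is not configured yet, the AWS CLI can write a kubeconfig entry for your cluster (replace the placeholders as before):
aws eks update-kubeconfig --region <REGION> --name <CLUSTER>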
To deploy a private or gated model from the Hugging Face model hub, you need to create a Hugging Face access token with “read” permission. Then create a Kubernetes secret:
kubectl create secret generic hf-secret --from-literal token=YOUR_TOKEN_HERE
FriendliDeployment is a Kubernetes custom resource that lets you easily create Friendli inference deployments without configuring low-level Kubernetes resources like Pods, Services, and Deployments. Below is a sample FriendliDeployment that deploys Meta Llama 3.1 8B on one g6.2xlarge instance.
apiVersion: friendli.ai/v1alpha1
kind: FriendliDeployment
metadata:
  namespace: default
  name: friendlideployment-sample
spec:
  model:
    huggingFace:
      repository: meta-llama/Llama-3.1-8B-Instruct

      # "token:" section is not needed if the model is
      # a public one.
      token:
        name: hf-secret
        key: token

  resources:

    nodeSelector:
      # Use the name of the node group you want to use.
      eks.amazonaws.com/nodegroup: <NODE GROUP NAME>

    numGPUs: 1
    requests:
      cpu: "6"
      ephemeral-storage: 30Gi
      memory: 25Gi
    limits:
      cpu: "6"
      ephemeral-storage: 30Gi
      memory: 25Gi
  deploymentStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
  service:
    inferencePort: 6000
You can modify this YAML file for your use case.
  • The “token:” section under spec.model.huggingFace refers to the Kubernetes secret you created for storing the HuggingFace access token. If accessing your model does not require an access token, you can omit the “token:” section entirely.
  • In the example above, the node selector is eks.amazonaws.com/nodegroup: <NODE GROUP NAME>. Replace <NODE GROUP NAME> with the name of your node group.
  • The CPU and memory resource requirements are tuned for the g6.2xlarge instance type; you may need to adjust these values if you use a different instance type.
If your cluster has the NVIDIA GPU Operator installed, you need to put the “nvidia.com/gpu” resource in the “requests:” and “limits:” sections, as GPU nodes will advertise the “nvidia.com/gpu” resource alongside ordinary resources like “cpu” and “memory”. In that case you can omit “numGPUs” from your FriendliDeployment. Below is the equivalent example for a GPU Operator-enabled cluster.
  resources:
    nodeSelector:
      # Use the name of the node group you want to use.
      eks.amazonaws.com/nodegroup: <NODE GROUP NAME>
    requests:
      cpu: "6"
      ephemeral-storage: 30Gi
      memory: 25Gi
      nvidia.com/gpu: "1"
    limits:
      cpu: "6"
      ephemeral-storage: 30Gi
      memory: 25Gi
      nvidia.com/gpu: "1"
Save your YAML file as “friendlideployment.yaml” and apply it with kubectl:
$ kubectl apply -f friendlideployment.yaml
friendlideployment.friendli.ai/friendlideployment-sample created

$ kubectl get pods -n default
NAME                                         READY   STATUS    RESTARTS   AGE
friendlideployment-sample-7d7b877c77-zjgqq   2/2     Running   0          3m18s

$ kubectl get services -n default
NAME                        TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
friendlideployment-sample   ClusterIP   172.20.95.224   <none>        6000/TCP   18m
kubernetes                  ClusterIP   172.20.0.1      <none>        443/TCP    28h
Now you can port-forward to the service to connect to it from your PC.
$ kubectl port-forward -n default svc/friendlideployment-sample 6000
Forwarding from 127.0.0.1:6000 -> 6000
Forwarding from [::1]:6000 -> 6000
In another terminal, use the curl tool to send an inference request.
$ curl http://localhost:6000/v1/completions -H 'Content-Type: application/json' --data-raw '{"prompt": "Hi!", "max_tokens": 10, "stream": false}'
{"choices":[{"finish_reason":"length","index":0,"seed":15349211611234757311,"text":" I'm Alex, and I'm excited to share","tokens":[358,2846,8683,11,323,358,2846,12304,311,4430]}],"id":"cmpl-b2e4b4cba711448c847ab89d763588da","object":"text_completion","usage":{"completion_tokens":10,"prompt_tokens":3,"total_tokens":13}}
For more information about Friendli Container usage, check our documentation and contact us for inquiries.

Cleaning up

You can remove the FriendliDeployment using the kubectl CLI tool.
$ kubectl delete friendlideployment -n default friendlideployment-sample
friendlideployment.friendli.ai "friendlideployment-sample" deleted
You may also want to scale down or delete your GPU node group to avoid being charged for unused GPU instances; a CLI sketch follows below.
By following these guides, you’ll be able to seamlessly deploy your models using Friendli Container as an EKS Add-on and leverage its capabilities for real-time inference on your Kubernetes cluster.
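For an eksctl-managed node group, deletion might look like this (a sketch; substitute your cluster, region, and node group names):
eksctl delete nodegroup --cluster <CLUSTER> --region <REGION> --name <NODE GROUP NAME>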