Download DeepSeek AI Models

Access DeepSeek's state-of-the-art AI models for local deployment and integration into your applications.

Available Models

Choose from our range of powerful AI models tailored for different use cases.

DeepSeek-V3-0324

The latest version of our flagship model, featuring enhanced reasoning capabilities and improved multilingual support. Released on March 24, 2025, this model represents our most advanced AI system with superior performance across a wide range of tasks.

DeepSeek-V3-0324 Models

| Model | Total Params | Activated Params | Context Length | Download |
| --- | --- | --- | --- | --- |
| DeepSeek-V3-0324 | 660B | 37B | 128K | Download |

DeepSeek-V3-0324 uses the same base model as the previous DeepSeek-V3, with improvements only in post-training methods. For private deployment, you only need to update the checkpoint and tokenizer_config.json (changes related to tool calls). The model has approximately 660B parameters, and the open-source version offers a 128K context length (while the web, app, and API provide 64K context).

How to Run Locally

DeepSeek models can be deployed locally using various hardware and open-source community software.

1. DeepSeek-V3 Deployment

DeepSeek-V3 can be deployed locally using the following hardware and open-source community software:

  1. DeepSeek-Infer Demo: DeepSeek provides a simple and lightweight demo for FP8 and BF16 inference.
  2. SGLang: Fully supports the DeepSeek-V3 model in both BF16 and FP8 inference modes, with Multi-Token Prediction coming soon.[1]
  3. LMDeploy: Enables efficient FP8 and BF16 inference for local and cloud deployment.
  4. TensorRT-LLM: Currently supports BF16 inference and INT4/8 quantization, with FP8 support coming soon.
  5. vLLM: Supports the DeepSeek-V3 model in FP8 and BF16 modes for tensor parallelism and pipeline parallelism.
  6. AMD GPU: Enables running the DeepSeek-V3 model on AMD GPUs via SGLang in both BF16 and FP8 modes.
  7. Huawei Ascend NPU: Supports running DeepSeek-V3 on Huawei Ascend devices.

Since FP8 training is natively adopted in our framework, we only provide FP8 weights. If you require BF16 weights for experimentation, you can use the provided conversion script to perform the transformation.

Here is an example of converting FP8 weights to BF16:

cd inference
python fp8_cast_bf16.py --input-fp8-hf-path /path/to/fp8_weights --output-bf16-hf-path /path/to/bf16_weights

NOTE

Hugging Face's Transformers has not been directly supported yet.

1.1 Inference with DeepSeek-Infer Demo (example only)

System Requirements

NOTE

Linux with Python 3.10 only. Mac and Windows are not supported.

Dependencies:

torch==2.4.1
triton==3.0.0
transformers==4.46.3
safetensors==0.4.5

Model Weights & Demo Code Preparation

First, clone the DeepSeek-V3 GitHub repository:

git clone https://github.com/deepseek-ai/DeepSeek-V3.git

Navigate to the `inference` folder and install the dependencies listed in `requirements.txt`. The easiest way is to use a package manager such as `conda` or `uv` to create a new virtual environment and install the dependencies.

cd DeepSeek-V3/inference
pip install -r requirements.txt
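
If you prefer an isolated environment, a conda-based setup might look like the following (a sketch only; the environment name is illustrative, and `uv` works analogously):

conda create -n deepseek-v3 python=3.10 -y
conda activate deepseek-v3
pip install -r requirements.txt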

Download the model weights from Hugging Face and place them in the `/path/to/DeepSeek-V3` folder.
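
One way to fetch the weights is with the Hugging Face CLI (a sketch, assuming `huggingface_hub` is installed; the target directory is the same path you will pass to the conversion step below):

huggingface-cli download deepseek-ai/DeepSeek-V3 --local-dir /path/to/DeepSeek-V3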

Model Weights Conversion

Convert Hugging Face model weights to a specific format:

python convert.py --hf-ckpt-path /path/to/DeepSeek-V3 --save-path /path/to/DeepSeek-V3-Demo --n-experts 256 --model-parallel 16

Run

Then you can chat with DeepSeek-V3:

torchrun --nnodes 2 --nproc-per-node 8 --node-rank $RANK --master-addr $ADDR generate.py --ckpt-path /path/to/DeepSeek-V3-Demo --config configs/config_671B.json --interactive --temperature 0.7 --max-new-tokens 200

Or batch inference on a given file:

torchrun --nnodes 2 --nproc-per-node 8 --node-rank $RANK --master-addr $ADDR generate.py --ckpt-path /path/to/DeepSeek-V3-Demo --config configs/config_671B.json --input-file $FILE

1.2 Inference with SGLang (recommended)

SGLang currently supports MLA optimizations, DP Attention, FP8 (W8A8), FP8 KV Cache, and Torch Compile, delivering state-of-the-art latency and throughput performance among open-source frameworks.[1][2][3]

Notably, SGLang v0.4.1 fully supports running DeepSeek-V3 on both NVIDIA and AMD GPUs, making it a highly versatile and robust solution.[1]

SGLang also supports multi-node tensor parallelism, enabling you to run this model on multiple network-connected machines.[1]

Multi-Token Prediction (MTP) is in development, and progress can be tracked in the optimization plan.[1]

Launch instructions are provided by the SGLang team.[1]
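
As a rough sketch only (the official commands from the SGLang team take precedence), a single-node launch across 8 GPUs might look like this; the tensor-parallel degree is an assumption about your hardware:

python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3 --tp 8 --trust-remote-code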

1.3 Inference with LMDeploy (recommended)

LMDeploy, a flexible and high-performance inference and serving framework tailored for large language models, now supports DeepSeek-V3. It offers both offline pipeline processing and online deployment capabilities, seamlessly integrating with PyTorch-based workflows.[1]

For comprehensive step-by-step instructions on running DeepSeek-V3 with LMDeploy, please refer to the LMDeploy guide.[1]
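
For orientation only (the LMDeploy guide referenced above is authoritative), an OpenAI-compatible server launch typically looks like the following; the tensor-parallel degree is an assumption:

lmdeploy serve api_server deepseek-ai/DeepSeek-V3 --tp 8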

1.4 Inference with TRT-LLM (recommended)

TensorRT-LLM now supports the DeepSeek-V3 model, offering precision options such as BF16 and INT4/INT8 weight-only. Support for FP8 is currently in progress and will be released soon. You can access the custom TRT-LLM branch with DeepSeek-V3 support to experience the new features directly.[1][2]

1.5 Inference with vLLM (recommended)

vLLM v0.6.6 supports DeepSeek-V3 inference in FP8 and BF16 modes on both NVIDIA and AMD GPUs. Aside from standard techniques, vLLM offers pipeline parallelism, allowing you to run this model on multiple network-connected machines. For detailed guidance, please refer to the vLLM instructions. Please feel free to follow the enhancement plan as well.[1][2][3]
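
As an illustrative sketch (the parallelism degrees here are assumptions that depend on your hardware), an OpenAI-compatible vLLM server can be started with:

vllm serve deepseek-ai/DeepSeek-V3 --tensor-parallel-size 8 --pipeline-parallel-size 2 --trust-remote-code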

1.6 Recommended Inference Functionality with AMD GPUs

In collaboration with the AMD team, DeepSeek has achieved Day-One support for AMD GPUs using SGLang, with full compatibility for both FP8 and BF16 precision. For detailed guidance, please refer to the SGLang instructions.[1]

1.7 Recommended Inference Functionality with Huawei Ascend NPUs

The MindIE framework from the Huawei Ascend community has successfully adapted the BF16 version of DeepSeek-V3. For step-by-step guidance on Ascend NPUs, please follow the MindIE instructions.[1][2]

2. DeepSeek-R1 Deployment

2.1 DeepSeek-R1 Models

Please visit the DeepSeek-V3 deployment section above for more information about running DeepSeek-R1 locally.

NOTE

Hugging Face's Transformers has not been directly supported yet.

2.2 DeepSeek-R1-Distill Models

DeepSeek-R1-Distill models can be utilized in the same manner as Qwen or Llama models.

For instance, you can easily start a service using vLLM:[1]

vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --tensor-parallel-size 2 --max-model-len 32768 --enforce-eager

You can also easily start a service using SGLang:[1]

python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --trust-remote-code --tp 2

2.3 Usage Recommendations

We recommend adhering to the following configurations when utilizing the DeepSeek-R1 series models, including benchmarking, to achieve the expected performance:

  1. Set the temperature within the range of 0.5-0.7 (0.6 is recommended) to prevent endless repetitions or incoherent outputs.
  2. Avoid adding a system prompt; all instructions should be contained within the user prompt.
  3. For mathematical problems, it is advisable to include a directive in your prompt such as: 'Please reason step by step, and put your final answer within \boxed{}.'
  4. When evaluating model performance, it is recommended to conduct multiple tests and average the results.

Additionally, we have observed that the DeepSeek-R1 series models tend to bypass the thinking pattern (i.e., outputting an empty <think></think> block) when responding to certain queries, which can adversely affect the model's performance. To ensure that the model engages in thorough reasoning, we recommend forcing the model to initiate its response with <think> at the beginning of every output.
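
For illustration, here is how these settings map onto an OpenAI-compatible endpoint such as the vLLM server started above (the port, model name, and math problem are placeholders): temperature is set to 0.6, there is no system message, and the \boxed{} directive is placed in the user prompt.

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
    "temperature": 0.6,
    "messages": [
      {"role": "user", "content": "Solve 3x + 5 = 20. Please reason step by step, and put your final answer within \\boxed{}."}
    ]
  }'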

3. DeepSeek-V3-0324 Deployment

DeepSeek-V3-0324 uses the same base model as the previous DeepSeek-V3, with improvements only in post-training methods. For private deployment, you only need to update the checkpoint and tokenizer_config.json (changes related to tool calls).

The deployment options and frameworks for DeepSeek-V3-0324 are identical to those for DeepSeek-V3 described in section 1. All the same toolkits (SGLang, LMDeploy, TensorRT-LLM, vLLM) support DeepSeek-V3-0324 with the same configuration options.
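
In practice, any command from section 1 applies unchanged apart from the checkpoint name. For example, the SGLang launch sketched earlier would simply point at the new weights (the tensor-parallel degree remains an assumption):

python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3-0324 --tp 8 --trust-remote-code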

License Information

Information about the licenses under which DeepSeek models are released

DeepSeek-V3-0324

MIT License

Consistent with DeepSeek-R1, our open-source repository (including model weights) uniformly adopts the MIT License and allows users to leverage model outputs and distillation methods to train other models.

View License

DeepSeek-V3

MIT License

This code repository is licensed under the MIT License. The use of DeepSeek-V3 Base/Chat models is subject to the Model License. DeepSeek-V3 series (including Base and Chat) supports commercial use.

View License

DeepSeek-R1

MIT License

This code repository and the model weights are licensed under the MIT License. The DeepSeek-R1 series supports commercial use and allows for any modifications and derivative works, including, but not limited to, distillation for training other LLMs. Please note that models such as DeepSeek-R1-Distill-Qwen and DeepSeek-R1-Distill-Llama are derived from their respective base models, which carry their own original licenses.

View License

Disclaimer

DeepSeek models are provided "as is" without any express or implied warranties. Users should use the models at their own risk and ensure compliance with relevant laws and regulations. DeepSeek is not liable for any damages resulting from the use of these models.