Kimi k1.5 is a cutting-edge large language model (LLM) developed by Moonshot AI, a Chinese AI company established in 2023.
This multimodal model excels at processing both text and visual data, offering advanced reasoning capabilities across a range of domains.
Key Features of Kimi k1.5:
- Multimodal Integration: Kimi k1.5 is trained on both text and vision data, enabling it to jointly reason over these modalities.
- Extended Context Window: With a 128K-token context window, it can process large amounts of information in a single prompt, enabling more coherent and contextually aware responses.
- Advanced Reasoning Performance: The model achieves state-of-the-art results on multiple benchmarks, including 77.5 on AIME 2024, 96.2 on MATH 500, and a 94th-percentile ranking on Codeforces, matching the performance of OpenAI’s o1.
- Reinforcement Learning Optimization: Kimi k1.5 employs improved policy optimization techniques, such as a variant of online mirror descent, to enhance its reasoning capabilities without relying on complex methods like Monte Carlo tree search.
- Simplified Framework: The training recipe achieves effective learning through long-context scaling and improved policy optimization, without more elaborate components such as value functions or process reward models.
Accessing Kimi k1.5
Kimi k1.5 is freely accessible through Moonshot AI’s platform. Users can interact with the model via the chat interface on the official website.
Kimi k1.5 represents a significant advancement in AI, offering robust multimodal processing and reasoning capabilities that rival leading models in the industry.
Kimi k1.5 Acronyms
- CoT: Chain of Thought. It refers to a sequence of intermediate steps used to bridge the problem and the solution in complex reasoning tasks.
- LLMs: Large Language Models. These are AI models trained on large amounts of text data to generate human-like text.
- RL: Reinforcement Learning. A type of machine learning where an agent learns to make decisions by taking actions in an environment to maximize some notion of cumulative reward.
- EM: Exact Match. A metric used to evaluate the performance of models, indicating the proportion of predictions that exactly match the ground truth.
- Pass@1: A metric used in some benchmarks to measure the proportion of problems the model solves correctly on its first attempt (a short worked sketch of EM and Pass@1 follows this list).
- AIME: American Invitational Mathematics Examination. A prestigious, invitation-only math contest for top high school students.
- MATH 500: A 500-problem subset of the MATH benchmark covering a wide range of mathematics topics.
- Codeforces: A competitive programming platform whose contests and rating system are widely used to evaluate models’ coding ability.
- MathVista: A benchmark that integrates challenges from a variety of mathematical and visual tasks.
- MMMU: Massive Multidiscipline Multimodal Understanding. A benchmark encompassing a collection of multimodal questions from various academic fields.
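To make the EM and Pass@1 definitions above concrete, here is a minimal, self-contained sketch (not taken from the paper; the toy data and the whitespace/case normalization are assumptions) of how the two metrics are typically computed:

```python
def exact_match(prediction: str, reference: str) -> bool:
    # EM: the prediction counts only if it equals the reference exactly
    # (after trivial whitespace/case normalization).
    return prediction.strip().lower() == reference.strip().lower()

def pass_at_1(first_attempt_correct: list[bool]) -> float:
    # Pass@1 with one attempt per problem: the fraction of problems
    # whose first sampled answer is correct.
    return sum(first_attempt_correct) / len(first_attempt_correct)

# Toy evaluation over three problems, one prediction each.
pairs = [("42", "42"), ("x = 3", "x=3"), ("7", "7")]
em = [exact_match(p, r) for p, r in pairs]
print(sum(em) / len(em))               # EM accuracy: 2/3 here, since "x = 3" != "x=3"
print(pass_at_1([True, False, True]))  # Pass@1: 2/3
```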
Approach: Reinforcement Learning with LLMs
RL Prompt Set Curation
The quality and diversity of the RL prompt set are critical for effective reinforcement learning.
The authors designed a prompt set that covers a wide array of disciplines, including STEM, coding, and general reasoning, with balanced difficulty levels.
They employed automatic filters and a tagging system to ensure diverse coverage and accurate evaluability. They then developed a model-based approach to assess the difficulty of each prompt and exclude prompts prone to reward hacking.
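As a rough illustration of that model-based curation step, the sketch below estimates a prompt’s difficulty from the pass rate of an SFT model sampled several times at high temperature, and flags prompts the model can answer correctly without producing any reasoning (a sign the reward can be gamed by guessing). The helper functions (`generate`, `grade`), the thresholds, and the no-reasoning flag are illustrative assumptions, not the paper’s actual pipeline.

```python
def difficulty(prompt, reference, generate, grade, n_samples=10, temperature=1.0):
    # Difficulty = 1 - pass rate over several high-temperature samples.
    passes = sum(grade(generate(prompt, temperature=temperature), reference)
                 for _ in range(n_samples))
    return 1.0 - passes / n_samples

def guessable(prompt, reference, generate, grade, n_guesses=8):
    # Prompts the model answers correctly *without* a chain of thought tend to
    # reward guessing rather than reasoning, so they are excluded.
    return any(grade(generate(prompt, allow_reasoning=False), reference)
               for _ in range(n_guesses))

def curate(prompts, generate, grade, lo=0.1, hi=0.9):
    # Keep prompts in a useful difficulty band that are not guessable.
    kept = []
    for prompt, reference in prompts:
        d = difficulty(prompt, reference, generate, grade)
        if lo <= d <= hi and not guessable(prompt, reference, generate, grade):
            kept.append((prompt, reference, d))
    return kept
```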
Long-CoT Supervised Fine-Tuning

The authors constructed a small, high-quality warmup dataset and used it for long-CoT supervised fine-tuning before reinforcement learning.
This approach involves generating detailed reasoning paths through prompt engineering, encapsulating key cognitive processes like planning, evaluation, reflection, and exploration.
This warmup dataset primes the model to internalize these reasoning strategies, improving its ability to generate logically coherent responses.
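Below is a minimal sketch of how such a warmup set can be assembled, assuming a `generate` sampler and a `grade` verifier (both hypothetical here, as is the prompt template): elicit long traces that explicitly plan, evaluate, reflect, and explore, then keep only traces whose final answer is verified correct.

```python
LONG_COT_TEMPLATE = (
    "Solve the problem step by step. Before giving the final answer: "
    "(1) plan your approach, (2) evaluate intermediate results, "
    "(3) reflect on possible mistakes, (4) explore an alternative if stuck.\n\n"
    "Problem: {problem}"
)

def build_warmup_set(problems, generate, grade, samples_per_problem=8):
    # Collect (prompt, long reasoning trace) pairs whose final answer is verified correct.
    warmup = []
    for problem, reference in problems:
        prompt = LONG_COT_TEMPLATE.format(problem=problem)
        for _ in range(samples_per_problem):
            trace = generate(prompt)        # long-CoT response: reasoning + final answer
            if grade(trace, reference):     # keep only verified-correct traces
                warmup.append({"prompt": prompt, "response": trace})
                break                       # one verified trace per problem for this sketch
    return warmup
```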
Reinforcement Learning
The core of Kimi k1.5’s training is reinforcement learning.
The authors formulated RL with long-CoT and employed a variant of online mirror descent for robust policy optimization.
They introduced a length penalty to control the response length and used curriculum and prioritized sampling strategies to improve training efficiency.
The model is trained to generate CoT that leads to correct answers, with the reward signal derived from the correctness of the final answer.
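Concretely, each policy update maximizes the expected reward while staying close (in KL divergence) to the previous policy, which is what the online mirror descent variant solves iteratively. The sketch below shows one plausible shape of the reward shaping with a length penalty: within a group of responses sampled for the same prompt, correct answers earn a bonus that shrinks as the response gets longer, and incorrect answers are never rewarded for brevity. The interpolation constants are illustrative assumptions, not the paper’s exact formula.

```python
def length_penalized_rewards(responses, correct):
    # Shape rewards within a group of responses sampled for the same prompt.
    lengths = [len(r) for r in responses]
    lo, hi = min(lengths), max(lengths)
    rewards = []
    for length, ok in zip(lengths, correct):
        # Interpolate from +0.5 (shortest response in the group) to -0.5 (longest).
        lam = 0.5 - (length - lo) / (hi - lo) if hi > lo else 0.0
        base = 1.0 if ok else 0.0                      # correctness of the final answer
        rewards.append(base + (lam if ok else min(0.0, lam)))
    return rewards

# Toy usage: three sampled responses to one prompt, graded for correctness.
print(length_penalized_rewards(
    ["short solution", "a somewhat longer solution", "a very long meandering solution"],
    [True, True, False],
))
```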
Long2short: Context Compression for Short-CoT Models
The authors explored methods for transferring the reasoning priors of long-CoT models to short-CoT models to improve token efficiency, evaluating model merging, shortest rejection sampling, DPO, and long2short RL. Of these, long2short RL delivered the best token efficiency.
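Shortest rejection sampling is the simplest of these to illustrate: sample the long-CoT model several times, keep the shortest response that is still correct, and use those short-but-correct traces as fine-tuning (or preference) data for the short-CoT model. The sketch below is a minimal version under that assumption; `generate` and `grade` are hypothetical helpers.

```python
def shortest_rejection_sample(prompt, reference, generate, grade, n_samples=8):
    # Return the shortest correct response out of n samples, or None if none are correct.
    correct = [r for r in (generate(prompt) for _ in range(n_samples)) if grade(r, reference)]
    return min(correct, key=len) if correct else None

def build_short_cot_data(problems, generate, grade):
    # Distillation-style dataset: short, verified-correct traces for the short-CoT model.
    data = []
    for prompt, reference in problems:
        best = shortest_rejection_sample(prompt, reference, generate, grade)
        if best is not None:
            data.append({"prompt": prompt, "response": best})
    return data
```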
Experiments and Results
Performance on Benchmarks
The Kimi k1.5 long-CoT model achieved state-of-the-art results on various benchmarks, including:
- AIME 2024: 77.5 (Pass@1)
- MATH 500: 96.2 (EM)
- Codeforces: 94th percentile
- MathVista: 74.9 (Pass@1)
The short-CoT model also outperformed existing models, with results such as:
- AIME 2024: 60.8 (Pass@1)
- MATH 500: 94.6 (EM)
- LiveCodeBench: 47.3 (Pass@1)
Long Context Scaling
The authors demonstrated that scaling the context length is crucial for improving the model’s reasoning capabilities.
They observed a strong correlation between the model’s output length and its problem-solving performance. The final run of Kimi k1.5 scaled to a 128K context length, achieving continued improvement on hard reasoning benchmarks.
Long2short Methods
The long2short RL algorithm showed the highest token efficiency of the methods compared, including DPO and model merging; the resulting k1.5 series models achieve high performance while generating fewer tokens.
RL Infrastructure
