{"model_id":"minimax-m1-80k","name":"MiniMax M1 80K","organization":{"id":"minimax","name":"MiniMax","website":"https://www.minimaxi.com/"},"description":"MiniMax-M1 is an open-source, large-scale reasoning model that uses a hybrid-attention architecture for efficient long-context processing. It supports up to a 1 million token context window and 80,000-token reasoning output, matching Gemini 2.5 Pro’s scale while being highly cost-effective. Its Lightning Attention mechanism reduces compute requirements to about 30% of DeepSeek R1’s, and a new reinforcement learning algorithm, CISPO, doubles convergence speed compared to other RL methods. Trained on 512 H800s over three weeks, M1 achieves near state-of-the-art results across software engineering, long-context, and tool-use benchmarks, outperforming most open models and rivaling top closed systems.","release_date":"2025-06-16","announcement_date":"2025-06-16","multimodal":false,"knowledge_cutoff":null,"param_count":456000000000,"training_tokens":7500000000000,"available_in_zeroeval":false,"reviews_count":0,"reviews_avg_rating":0,"license":{"name":"MIT","allow_commercial":true},"model_family":null,"fine_tuned_from":null,"tags":null,"sources":{"api_ref":"https://platform.minimax.io/docs/guides/text-vllm-deployment","playground":null,"paper":"https://arxiv.org/abs/2506.13585","scorecard_blog":"https://www.minimax.io/news/minimax-m1","repo":"https://github.com/MiniMax-AI/MiniMax-M1","weights":"https://huggingface.co/MiniMaxAI/MiniMax-M1-80k"},"benchmarks":[{"benchmark_id":"aime-2024","name":"AIME 2024","description":"American Invitational Mathematics Examination 2024, consisting of 30 challenging mathematical reasoning problems from AIME I and AIME II competitions. Each problem requires an integer answer between 0-999 and tests advanced mathematical reasoning across algebra, geometry, combinatorics, and number theory. Used as a benchmark for evaluating mathematical reasoning capabilities in large language models at Olympiad-level difficulty.","categories":["math","reasoning"],"modality":"text","max_score":1.0,"score":0.86,"normalized_score":0.86,"verified":false,"self_reported":true,"self_reported_source":null,"analysis_method":null,"verification_date":null,"verification_notes":null},{"benchmark_id":"aime-2025","name":"AIME 2025","description":"All 30 problems from the 2025 American Invitational Mathematics Examination (AIME I and AIME II), testing olympiad-level mathematical reasoning with integer answers from 000-999. Used as an AI benchmark to evaluate large language models' ability to solve complex mathematical problems requiring multi-step logical deductions and structured symbolic reasoning.","categories":["math","reasoning"],"modality":"text","max_score":1.0,"score":0.769,"normalized_score":0.769,"verified":false,"self_reported":true,"self_reported_source":null,"analysis_method":null,"verification_date":null,"verification_notes":null},{"benchmark_id":"gpqa","name":"GPQA","description":"A challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. 
Questions are Google-proof and extremely difficult, with PhD experts reaching 65% accuracy.","categories":["biology","chemistry","general","physics","reasoning"],"modality":"text","max_score":1.0,"score":0.7,"normalized_score":0.7,"verified":false,"self_reported":true,"self_reported_source":null,"analysis_method":"Diamond subset","verification_date":null,"verification_notes":null},{"benchmark_id":"humanity's-last-exam","name":"Humanity's Last Exam","description":"Humanity's Last Exam (HLE) is a multi-modal academic benchmark with 2,500 questions across mathematics, humanities, and natural sciences, designed to test LLM capabilities at the frontier of human knowledge with unambiguous, verifiable solutions","categories":["math","reasoning"],"modality":"multimodal","max_score":1.0,"score":0.084,"normalized_score":0.084,"verified":false,"self_reported":true,"self_reported_source":null,"analysis_method":"no tools","verification_date":null,"verification_notes":null},{"benchmark_id":"livecodebench","name":"LiveCodeBench","description":"LiveCodeBench is a holistic and contamination-free evaluation benchmark for large language models for code. It continuously collects new problems from programming contests (LeetCode, AtCoder, CodeForces) and evaluates four different scenarios: code generation, self-repair, code execution, and test output prediction. Problems are annotated with release dates to enable evaluation on unseen problems released after a model's training cutoff.","categories":["code","general","reasoning"],"modality":"text","max_score":1.0,"score":0.65,"normalized_score":0.65,"verified":false,"self_reported":true,"self_reported_source":null,"analysis_method":"24/8-25/5","verification_date":null,"verification_notes":null},{"benchmark_id":"longbench-v2","name":"LongBench v2","description":"LongBench v2 is a benchmark designed to assess the ability of LLMs to handle long-context problems requiring deep understanding and reasoning across real-world multitasks. It consists of 503 challenging multiple-choice questions with contexts ranging from 8k to 2M words across six major task categories: single-document QA, multi-document QA, long in-context learning, long-dialogue history understanding, code repository understanding, and long structured data understanding.","categories":["general","long_context","reasoning","structured_output"],"modality":"text","max_score":1.0,"score":0.615,"normalized_score":0.615,"verified":false,"self_reported":true,"self_reported_source":null,"analysis_method":null,"verification_date":null,"verification_notes":null},{"benchmark_id":"math-500","name":"MATH-500","description":"MATH-500 is a subset of the MATH dataset containing 500 challenging competition mathematics problems from AMC 10, AMC 12, AIME, and other mathematics competitions. 
Each problem includes full step-by-step solutions and spans multiple difficulty levels across seven mathematical subjects including Prealgebra, Algebra, Number Theory, Counting and Probability, Geometry, Intermediate Algebra, and Precalculus.","categories":["math","reasoning"],"modality":"text","max_score":1.0,"score":0.968,"normalized_score":0.968,"verified":false,"self_reported":true,"self_reported_source":null,"analysis_method":null,"verification_date":null,"verification_notes":null},{"benchmark_id":"mmlu-pro","name":"MMLU-Pro","description":"A more robust and challenging multi-task language understanding benchmark that extends MMLU by expanding multiple-choice options from 4 to 10, eliminating trivial questions, and focusing on reasoning-intensive tasks. Features over 12,000 curated questions across 14 domains and causes a 16-33% accuracy drop compared to original MMLU.","categories":["general","language","math","reasoning"],"modality":"text","max_score":1.0,"score":0.811,"normalized_score":0.811,"verified":false,"self_reported":true,"self_reported_source":null,"analysis_method":null,"verification_date":null,"verification_notes":null},{"benchmark_id":"multichallenge","name":"Multi-Challenge","description":"MultiChallenge is a realistic multi-turn conversation evaluation benchmark that challenges frontier LLMs across four key categories: instruction retention (maintaining instructions throughout conversations), inference memory (recalling and connecting details from previous turns), reliable versioned editing (adapting to evolving instructions during collaborative editing), and self-coherence (avoiding contradictions in responses). The benchmark evaluates models on sustained, contextually complex dialogues across diverse topics including travel planning, technical documentation, and professional communication.","categories":["communication","reasoning"],"modality":"text","max_score":1.0,"score":0.447,"normalized_score":0.447,"verified":false,"self_reported":true,"self_reported_source":null,"analysis_method":null,"verification_date":null,"verification_notes":null},{"benchmark_id":"openai-mrcr:-2-needle-128k","name":"OpenAI-MRCR: 2 needle 128k","description":"Multi-round Co-reference Resolution (MRCR) benchmark for evaluating an LLM's ability to distinguish between multiple needles hidden in long context. Models are given a long, multi-turn synthetic conversation and must retrieve a specific instance of a repeated request, requiring reasoning and disambiguation skills beyond simple retrieval.","categories":["long_context","reasoning"],"modality":"text","max_score":1.0,"score":0.734,"normalized_score":0.734,"verified":false,"self_reported":true,"self_reported_source":null,"analysis_method":null,"verification_date":null,"verification_notes":null},{"benchmark_id":"openai-mrcr:-2-needle-1m","name":"OpenAI-MRCR: 2 needle 1M","description":"Multi-Round Co-reference Resolution benchmark that tests an LLM's ability to distinguish between multiple similar needles hidden in long conversations. 
Models must reproduce specific instances of content (e.g., 'Return the 2nd poem about tapirs') from multi-turn synthetic conversations, requiring reasoning about context, ordering, and subtle differences between similar outputs.","categories":["long_context","reasoning"],"modality":"text","max_score":1.0,"score":0.562,"normalized_score":0.562,"verified":false,"self_reported":true,"self_reported_source":null,"analysis_method":null,"verification_date":null,"verification_notes":null},{"benchmark_id":"simpleqa","name":"SimpleQA","description":"SimpleQA is a factuality benchmark developed by OpenAI that measures the short-form factual accuracy of large language models. The benchmark contains 4,326 short, fact-seeking questions that are adversarially collected and designed to have single, indisputable answers. Questions cover diverse topics from science and technology to entertainment, and the benchmark also measures model calibration by evaluating whether models know what they know.","categories":["factuality","general","reasoning"],"modality":"text","max_score":1.0,"score":0.185,"normalized_score":0.185,"verified":false,"self_reported":true,"self_reported_source":null,"analysis_method":null,"verification_date":null,"verification_notes":null},{"benchmark_id":"swe-bench-verified","name":"SWE-Bench Verified","description":"A verified subset of 500 software engineering problems from real GitHub issues, validated by human annotators for evaluating language models' ability to resolve real-world coding issues by generating patches for Python codebases.","categories":["code","frontend_development","reasoning"],"modality":"text","max_score":1.0,"score":0.56,"normalized_score":0.56,"verified":false,"self_reported":true,"self_reported_source":null,"analysis_method":null,"verification_date":null,"verification_notes":null},{"benchmark_id":"tau-bench-airline","name":"TAU-bench Airline","description":"Part of τ-bench (TAU-bench), a benchmark for Tool-Agent-User interaction in real-world domains. The airline domain evaluates language agents' ability to interact with users through dynamic conversations while following domain-specific rules and using API tools. Agents must handle airline-related tasks and policies reliably.","categories":["communication","reasoning","tool_calling"],"modality":"text","max_score":1.0,"score":0.62,"normalized_score":0.62,"verified":false,"self_reported":true,"self_reported_source":null,"analysis_method":null,"verification_date":null,"verification_notes":null},{"benchmark_id":"tau-bench-retail","name":"TAU-bench Retail","description":"A benchmark for evaluating tool-agent-user interaction in retail environments. Tests language agents' ability to handle dynamic conversations with users while using domain-specific API tools and following policy guidelines. Evaluates agents on tasks like order cancellations, address changes, and order status checks through multi-turn conversations.","categories":["communication","reasoning","tool_calling"],"modality":"text","max_score":1.0,"score":0.635,"normalized_score":0.635,"verified":false,"self_reported":true,"self_reported_source":null,"analysis_method":null,"verification_date":null,"verification_notes":null},{"benchmark_id":"zebralogic","name":"ZebraLogic","description":"ZebraLogic is an evaluation framework for assessing large language models' logical reasoning capabilities through logic grid puzzles derived from constraint satisfaction problems (CSPs). 
The benchmark consists of 1,000 programmatically generated puzzles with controllable and quantifiable complexity, revealing a 'curse of complexity' where model accuracy declines significantly as problem complexity grows.","categories":["reasoning"],"modality":"text","max_score":1.0,"score":0.868,"normalized_score":0.868,"verified":false,"self_reported":true,"self_reported_source":null,"analysis_method":null,"verification_date":null,"verification_notes":null}],"providers":[{"provider_id":"novita","name":"Novita","website":"https://novita.ai/","deprecated":false,"deprecated_at":null,"pricing":{"input_per_million":0.55,"output_per_million":2.2},"quantization":"bf16","limits":{"max_input_tokens":1000000,"max_output_tokens":40000},"performance":{"throughput":null,"latency":null},"features":{"web_search":null,"function_calling":null,"structured_output":null,"code_execution":null,"batch_inference":null,"finetuning":null},"modalities":{"input":{"text":true,"image":false,"audio":false,"video":false},"output":{"text":true,"image":false,"audio":false,"video":false}}}],"benchmark_rankings":[{"benchmark_id":"aime-2024","benchmark_name":"AIME 2024","models":[{"model_id":"grok-3-mini","model_name":"Grok-3 Mini","score":0.958,"rank":1,"is_current_model":false},{"model_id":"o4-mini","model_name":"o4-mini","score":0.934,"rank":2,"is_current_model":false},{"model_id":"longcat-flash-thinking","model_name":"LongCat-Flash-Thinking","score":0.933,"rank":3,"is_current_model":false},{"model_id":"grok-3","model_name":"Grok-3","score":0.933,"rank":3,"is_current_model":false},{"model_id":"gemini-2.5-pro","model_name":"Gemini 2.5 Pro","score":0.92,"rank":5,"is_current_model":false},{"model_id":"o3-2025-04-16","model_name":"o3","score":0.916,"rank":6,"is_current_model":false},{"model_id":"deepseek-r1-0528","model_name":"DeepSeek-R1-0528","score":0.914,"rank":7,"is_current_model":false},{"model_id":"glm-4.5","model_name":"GLM-4.5","score":0.91,"rank":8,"is_current_model":false},{"model_id":"ministral-14b-latest","model_name":"Ministral 3 (14B Reasoning 2512)","score":0.898,"rank":9,"is_current_model":false},{"model_id":"glm-4.5-air","model_name":"GLM-4.5-Air","score":0.894,"rank":10,"is_current_model":false},{"model_id":"minimax-m1-80k","model_name":"MiniMax M1 80K","score":0.86,"rank":15,"is_current_model":true}]},{"benchmark_id":"aime-2025","benchmark_name":"AIME 2025","models":[{"model_id":"gemini-3-pro-preview","model_name":"Gemini 3 Pro","score":1.0,"rank":1,"is_current_model":false},{"model_id":"gpt-5.2-pro-2025-12-11","model_name":"GPT-5.2 Pro","score":1.0,"rank":1,"is_current_model":false},{"model_id":"gpt-5.2-2025-12-11","model_name":"GPT-5.2","score":1.0,"rank":1,"is_current_model":false},{"model_id":"kimi-k2-thinking-0905","model_name":"Kimi K2-Thinking-0905","score":1.0,"rank":1,"is_current_model":false},{"model_id":"grok-4-heavy","model_name":"Grok-4 Heavy","score":1.0,"rank":1,"is_current_model":false},{"model_id":"claude-opus-4-6","model_name":"Claude Opus 4.6","score":0.9979,"rank":6,"is_current_model":false},{"model_id":"gemini-3-flash-preview","model_name":"Gemini 3 Flash","score":0.997,"rank":7,"is_current_model":false},{"model_id":"gpt-5.1-high-2025-11-12","model_name":"GPT-5.1 High","score":0.996,"rank":8,"is_current_model":false},{"model_id":"longcat-flash-thinking-2601","model_name":"LongCat-Flash-Thinking-2601","score":0.996,"rank":8,"is_current_model":false},{"model_id":"nemotron-3-nano-30b-a3b","model_name":"Nemotron 3 Nano (30B 
A3B)","score":0.992,"rank":10,"is_current_model":false},{"model_id":"minimax-m1-80k","model_name":"MiniMax M1 80K","score":0.769,"rank":68,"is_current_model":true}]},{"benchmark_id":"gpqa","benchmark_name":"GPQA","models":[{"model_id":"gemini-3.1-pro-preview","model_name":"Gemini 3.1 Pro","score":0.943,"rank":1,"is_current_model":false},{"model_id":"gpt-5.2-pro-2025-12-11","model_name":"GPT-5.2 Pro","score":0.932,"rank":2,"is_current_model":false},{"model_id":"gpt-5.4","model_name":"GPT-5.4","score":0.928,"rank":3,"is_current_model":false},{"model_id":"gpt-5.2-2025-12-11","model_name":"GPT-5.2","score":0.924,"rank":4,"is_current_model":false},{"model_id":"gemini-3-pro-preview","model_name":"Gemini 3 Pro","score":0.919,"rank":5,"is_current_model":false},{"model_id":"claude-opus-4-6","model_name":"Claude Opus 4.6","score":0.913,"rank":6,"is_current_model":false},{"model_id":"qwen3.6-plus","model_name":"Qwen3.6 Plus","score":0.904,"rank":7,"is_current_model":false},{"model_id":"gemini-3-flash-preview","model_name":"Gemini 3 Flash","score":0.904,"rank":7,"is_current_model":false},{"model_id":"claude-sonnet-4-6","model_name":"Claude Sonnet 4.6","score":0.899,"rank":9,"is_current_model":false},{"model_id":"seed-2.0-pro","model_name":"Seed 2.0 Pro","score":0.889,"rank":10,"is_current_model":false},{"model_id":"minimax-m1-80k","model_name":"MiniMax M1 80K","score":0.7,"rank":98,"is_current_model":true}]},{"benchmark_id":"humanity's-last-exam","benchmark_name":"Humanity's Last Exam","models":[{"model_id":"claude-opus-4-6","model_name":"Claude Opus 4.6","score":0.531,"rank":1,"is_current_model":false},{"model_id":"gemini-3.1-pro-preview","model_name":"Gemini 3.1 Pro","score":0.514,"rank":2,"is_current_model":false},{"model_id":"kimi-k2-thinking-0905","model_name":"Kimi K2-Thinking-0905","score":0.51,"rank":3,"is_current_model":false},{"model_id":"grok-4-heavy","model_name":"Grok-4 Heavy","score":0.507,"rank":4,"is_current_model":false},{"model_id":"kimi-k2.5","model_name":"Kimi K2.5","score":0.502,"rank":5,"is_current_model":false},{"model_id":"claude-sonnet-4-6","model_name":"Claude Sonnet 4.6","score":0.49,"rank":6,"is_current_model":false},{"model_id":"qwen3.5-27b","model_name":"Qwen3.5-27B","score":0.485,"rank":7,"is_current_model":false},{"model_id":"qwen3.5-122b-a10b","model_name":"Qwen3.5-122B-A10B","score":0.475,"rank":8,"is_current_model":false},{"model_id":"qwen3.5-35b-a3b","model_name":"Qwen3.5-35B-A3B","score":0.474,"rank":9,"is_current_model":false},{"model_id":"gemini-3-pro-preview","model_name":"Gemini 3 Pro","score":0.458,"rank":10,"is_current_model":false},{"model_id":"minimax-m1-80k","model_name":"MiniMax M1 80K","score":0.084,"rank":55,"is_current_model":true}]},{"benchmark_id":"livecodebench","benchmark_name":"LiveCodeBench","models":[{"model_id":"deepseek-reasoner","model_name":"DeepSeek-V3.2 (Thinking)","score":0.833,"rank":1,"is_current_model":false},{"model_id":"minimax-m2","model_name":"MiniMax M2","score":0.83,"rank":2,"is_current_model":false},{"model_id":"longcat-flash-thinking-2601","model_name":"LongCat-Flash-Thinking-2601","score":0.828,"rank":3,"is_current_model":false},{"model_id":"nemotron-3-super-120b-a12b","model_name":"Nemotron 3 Super (120B A12B)","score":0.8119,"rank":4,"is_current_model":false},{"model_id":"grok-3-mini","model_name":"Grok-3 Mini","score":0.804,"rank":5,"is_current_model":false},{"model_id":"grok-4-fast","model_name":"Grok 4 
Fast","score":0.8,"rank":6,"is_current_model":false},{"model_id":"longcat-flash-thinking","model_name":"LongCat-Flash-Thinking","score":0.794,"rank":7,"is_current_model":false},{"model_id":"grok-3","model_name":"Grok-3","score":0.794,"rank":7,"is_current_model":false},{"model_id":"grok-4-heavy","model_name":"Grok-4 Heavy","score":0.794,"rank":7,"is_current_model":false},{"model_id":"grok-4","model_name":"Grok-4","score":0.79,"rank":10,"is_current_model":false},{"model_id":"minimax-m1-80k","model_name":"MiniMax M1 80K","score":0.65,"rank":22,"is_current_model":true}]},{"benchmark_id":"longbench-v2","benchmark_name":"LongBench v2","models":[{"model_id":"qwen3.5-397b-a17b","model_name":"Qwen3.5-397B-A17B","score":0.632,"rank":1,"is_current_model":false},{"model_id":"qwen3.6-plus","model_name":"Qwen3.6 Plus","score":0.62,"rank":2,"is_current_model":false},{"model_id":"minimax-m1-80k","model_name":"MiniMax M1 80K","score":0.615,"rank":3,"is_current_model":true},{"model_id":"kimi-k2.5","model_name":"Kimi K2.5","score":0.61,"rank":4,"is_current_model":false},{"model_id":"minimax-m1-40k","model_name":"MiniMax M1 40K","score":0.61,"rank":4,"is_current_model":false},{"model_id":"qwen3.5-27b","model_name":"Qwen3.5-27B","score":0.606,"rank":6,"is_current_model":false},{"model_id":"mimo-v2-flash","model_name":"MiMo-V2-Flash","score":0.606,"rank":6,"is_current_model":false},{"model_id":"qwen3.5-122b-a10b","model_name":"Qwen3.5-122B-A10B","score":0.602,"rank":8,"is_current_model":false},{"model_id":"qwen3.5-35b-a3b","model_name":"Qwen3.5-35B-A3B","score":0.59,"rank":9,"is_current_model":false},{"model_id":"qwen3.5-9b","model_name":"Qwen3.5-9B","score":0.552,"rank":10,"is_current_model":false}]},{"benchmark_id":"math-500","benchmark_name":"MATH-500","models":[{"model_id":"longcat-flash-thinking","model_name":"LongCat-Flash-Thinking","score":0.992,"rank":1,"is_current_model":false},{"model_id":"sarvam-105b","model_name":"Sarvam-105B","score":0.986,"rank":2,"is_current_model":false},{"model_id":"glm-4.5","model_name":"GLM-4.5","score":0.982,"rank":3,"is_current_model":false},{"model_id":"glm-4.5-air","model_name":"GLM-4.5-Air","score":0.981,"rank":4,"is_current_model":false},{"model_id":"nvidia-nemotron-nano-9b-v2","model_name":"Nemotron Nano 9B v2","score":0.978,"rank":5,"is_current_model":false},{"model_id":"kimi-k2-instruct","model_name":"Kimi K2 Instruct","score":0.974,"rank":6,"is_current_model":false},{"model_id":"kimi-k2-instruct-0905","model_name":"Kimi K2-Instruct-0905","score":0.974,"rank":6,"is_current_model":false},{"model_id":"sarvam-30b","model_name":"Sarvam-30B","score":0.97,"rank":8,"is_current_model":false},{"model_id":"llama-3.1-nemotron-ultra-253b-v1","model_name":"Llama 3.1 Nemotron Ultra 253B v1","score":0.97,"rank":8,"is_current_model":false},{"model_id":"longcat-flash-lite","model_name":"LongCat-Flash-Lite","score":0.968,"rank":10,"is_current_model":false},{"model_id":"minimax-m1-80k","model_name":"MiniMax M1 80K","score":0.968,"rank":10,"is_current_model":true}]},{"benchmark_id":"mmlu-pro","benchmark_name":"MMLU-Pro","models":[{"model_id":"qwen3.6-plus","model_name":"Qwen3.6 Plus","score":0.885,"rank":1,"is_current_model":false},{"model_id":"minimax-m2.1","model_name":"MiniMax M2.1","score":0.88,"rank":2,"is_current_model":false},{"model_id":"qwen3.5-397b-a17b","model_name":"Qwen3.5-397B-A17B","score":0.878,"rank":3,"is_current_model":false},{"model_id":"kimi-k2.5","model_name":"Kimi K2.5","score":0.871,"rank":4,"is_current_model":false},{"model_id":"ernie-5.0","model_name":"ERNIE 
5.0","score":0.87,"rank":5,"is_current_model":false},{"model_id":"qwen3.5-122b-a10b","model_name":"Qwen3.5-122B-A10B","score":0.867,"rank":6,"is_current_model":false},{"model_id":"qwen3.5-27b","model_name":"Qwen3.5-27B","score":0.861,"rank":7,"is_current_model":false},{"model_id":"qwen3.5-35b-a3b","model_name":"Qwen3.5-35B-A3B","score":0.853,"rank":8,"is_current_model":false},{"model_id":"gemma-4-31b-it","model_name":"Gemma 4 31B","score":0.852,"rank":9,"is_current_model":false},{"model_id":"deepseek-v3.2-exp","model_name":"DeepSeek-V3.2-Exp","score":0.85,"rank":10,"is_current_model":false},{"model_id":"deepseek-r1-0528","model_name":"DeepSeek-R1-0528","score":0.85,"rank":10,"is_current_model":false},{"model_id":"deepseek-reasoner","model_name":"DeepSeek-V3.2 (Thinking)","score":0.85,"rank":10,"is_current_model":false},{"model_id":"minimax-m1-80k","model_name":"MiniMax M1 80K","score":0.811,"rank":35,"is_current_model":true}]},{"benchmark_id":"multichallenge","benchmark_name":"Multi-Challenge","models":[{"model_id":"gpt-5-2025-08-07","model_name":"GPT-5","score":0.696,"rank":1,"is_current_model":false},{"model_id":"qwen3.5-397b-a17b","model_name":"Qwen3.5-397B-A17B","score":0.676,"rank":2,"is_current_model":false},{"model_id":"step3-vl-10b","model_name":"Step3-VL-10B","score":0.626,"rank":3,"is_current_model":false},{"model_id":"qwen3.5-122b-a10b","model_name":"Qwen3.5-122B-A10B","score":0.615,"rank":4,"is_current_model":false},{"model_id":"qwen3.5-27b","model_name":"Qwen3.5-27B","score":0.608,"rank":5,"is_current_model":false},{"model_id":"o3-2025-04-16","model_name":"o3","score":0.604,"rank":6,"is_current_model":false},{"model_id":"qwen3.5-35b-a3b","model_name":"Qwen3.5-35B-A3B","score":0.6,"rank":7,"is_current_model":false},{"model_id":"nemotron-3-super-120b-a12b","model_name":"Nemotron 3 Super (120B A12B)","score":0.5523,"rank":8,"is_current_model":false},{"model_id":"qwen3.5-9b","model_name":"Qwen3.5-9B","score":0.545,"rank":9,"is_current_model":false},{"model_id":"kimi-k2-instruct-0905","model_name":"Kimi K2-Instruct-0905","score":0.541,"rank":10,"is_current_model":false},{"model_id":"kimi-k2-instruct","model_name":"Kimi K2 Instruct","score":0.541,"rank":10,"is_current_model":false},{"model_id":"minimax-m1-80k","model_name":"MiniMax M1 80K","score":0.447,"rank":13,"is_current_model":true}]},{"benchmark_id":"simpleqa","benchmark_name":"SimpleQA","models":[{"model_id":"deepseek-v3.2-exp","model_name":"DeepSeek-V3.2-Exp","score":0.971,"rank":1,"is_current_model":false},{"model_id":"grok-4-fast","model_name":"Grok 4 Fast","score":0.95,"rank":2,"is_current_model":false},{"model_id":"deepseek-v3.1","model_name":"DeepSeek-V3.1","score":0.934,"rank":3,"is_current_model":false},{"model_id":"deepseek-r1-0528","model_name":"DeepSeek-R1-0528","score":0.923,"rank":4,"is_current_model":false},{"model_id":"ernie-5.0","model_name":"ERNIE 5.0","score":0.75,"rank":5,"is_current_model":false},{"model_id":"gemini-3-pro-preview","model_name":"Gemini 3 Pro","score":0.721,"rank":6,"is_current_model":false},{"model_id":"gemini-3-flash-preview","model_name":"Gemini 3 Flash","score":0.687,"rank":7,"is_current_model":false},{"model_id":"gpt-4.5","model_name":"GPT-4.5","score":0.625,"rank":8,"is_current_model":false},{"model_id":"qwen3-vl-32b-thinking","model_name":"Qwen3 VL 32B 
Thinking","score":0.554,"rank":9,"is_current_model":false},{"model_id":"qwen3-235b-a22b-instruct-2507","model_name":"Qwen3-235B-A22B-Instruct-2507","score":0.543,"rank":10,"is_current_model":false},{"model_id":"minimax-m1-80k","model_name":"MiniMax M1 80K","score":0.185,"rank":33,"is_current_model":true}]},{"benchmark_id":"swe-bench-verified","benchmark_name":"SWE-Bench Verified","models":[{"model_id":"claude-opus-4-5-20251101","model_name":"Claude Opus 4.5","score":0.809,"rank":1,"is_current_model":false},{"model_id":"claude-opus-4-6","model_name":"Claude Opus 4.6","score":0.808,"rank":2,"is_current_model":false},{"model_id":"gemini-3.1-pro-preview","model_name":"Gemini 3.1 Pro","score":0.806,"rank":3,"is_current_model":false},{"model_id":"minimax-m2.5","model_name":"MiniMax M2.5","score":0.802,"rank":4,"is_current_model":false},{"model_id":"gpt-5.2-2025-12-11","model_name":"GPT-5.2","score":0.8,"rank":5,"is_current_model":false},{"model_id":"claude-sonnet-4-6","model_name":"Claude Sonnet 4.6","score":0.796,"rank":6,"is_current_model":false},{"model_id":"qwen3.6-plus","model_name":"Qwen3.6 Plus","score":0.788,"rank":7,"is_current_model":false},{"model_id":"mimo-v2-pro","model_name":"MiMo-V2-Pro","score":0.78,"rank":8,"is_current_model":false},{"model_id":"gemini-3-flash-preview","model_name":"Gemini 3 Flash","score":0.78,"rank":8,"is_current_model":false},{"model_id":"glm-5","model_name":"GLM-5","score":0.778,"rank":10,"is_current_model":false},{"model_id":"minimax-m1-80k","model_name":"MiniMax M1 80K","score":0.56,"rank":58,"is_current_model":true}]},{"benchmark_id":"tau-bench-airline","benchmark_name":"TAU-bench Airline","models":[{"model_id":"claude-sonnet-4-5-20250929","model_name":"Claude Sonnet 4.5","score":0.7,"rank":1,"is_current_model":false},{"model_id":"minimax-m1-80k","model_name":"MiniMax M1 80K","score":0.62,"rank":2,"is_current_model":true},{"model_id":"glm-4.5-air","model_name":"GLM-4.5-Air","score":0.608,"rank":3,"is_current_model":false},{"model_id":"glm-4.5","model_name":"GLM-4.5","score":0.604,"rank":4,"is_current_model":false},{"model_id":"minimax-m1-40k","model_name":"MiniMax M1 40K","score":0.6,"rank":5,"is_current_model":false},{"model_id":"qwen3-coder-480b-a35b-instruct","model_name":"Qwen3-Coder 480B A35B Instruct","score":0.6,"rank":5,"is_current_model":false},{"model_id":"claude-sonnet-4-20250514","model_name":"Claude Sonnet 4","score":0.6,"rank":5,"is_current_model":false},{"model_id":"claude-opus-4-20250514","model_name":"Claude Opus 4","score":0.596,"rank":8,"is_current_model":false},{"model_id":"claude-3-7-sonnet-20250219","model_name":"Claude 3.7 Sonnet","score":0.584,"rank":9,"is_current_model":false},{"model_id":"claude-opus-4-1-20250805","model_name":"Claude Opus 4.1","score":0.56,"rank":10,"is_current_model":false}]},{"benchmark_id":"tau-bench-retail","benchmark_name":"TAU-bench Retail","models":[{"model_id":"claude-sonnet-4-5-20250929","model_name":"Claude Sonnet 4.5","score":0.862,"rank":1,"is_current_model":false},{"model_id":"claude-opus-4-1-20250805","model_name":"Claude Opus 4.1","score":0.824,"rank":2,"is_current_model":false},{"model_id":"claude-opus-4-20250514","model_name":"Claude Opus 4","score":0.814,"rank":3,"is_current_model":false},{"model_id":"claude-3-7-sonnet-20250219","model_name":"Claude 3.7 Sonnet","score":0.812,"rank":4,"is_current_model":false},{"model_id":"claude-sonnet-4-20250514","model_name":"Claude Sonnet 
4","score":0.805,"rank":5,"is_current_model":false},{"model_id":"glm-4.5","model_name":"GLM-4.5","score":0.797,"rank":6,"is_current_model":false},{"model_id":"glm-4.5-air","model_name":"GLM-4.5-Air","score":0.779,"rank":7,"is_current_model":false},{"model_id":"qwen3-coder-480b-a35b-instruct","model_name":"Qwen3-Coder 480B A35B Instruct","score":0.775,"rank":8,"is_current_model":false},{"model_id":"o4-mini","model_name":"o4-mini","score":0.718,"rank":9,"is_current_model":false},{"model_id":"o1-2024-12-17","model_name":"o1","score":0.708,"rank":10,"is_current_model":false},{"model_id":"minimax-m1-80k","model_name":"MiniMax M1 80K","score":0.635,"rank":18,"is_current_model":true}]}],"comparison_model":{"model_id":"gemini-3.1-pro-preview","name":"Gemini 3.1 Pro","organization_name":"Google","release_date":"2026-02-19","announcement_date":"2026-02-19","knowledge_cutoff":"2025-01-31","param_count":null,"multimodal":true,"license":{"name":"Proprietary","allow_commercial":false},"benchmarks":{"apex-agents":0.335,"arc-agi-v2":0.771,"browsecomp":0.859,"gdpval-aa":1317.0,"gpqa":0.943,"humanity's-last-exam":0.514,"livecodebench-pro":2887.0,"mcp-atlas":0.692,"mmmlu":0.926,"mmmu-pro":0.805,"mrcr-v2-(8-needle)":0.263,"scicode":0.59,"swe-bench-pro":0.542,"swe-bench-verified":0.806,"t2-bench":0.993,"terminal-bench-2":0.685},"provider":{"name":"Google","input_cost":2.5,"output_cost":15.0,"max_input_tokens":1048576,"max_output_tokens":65536,"modalities":{"input":{"text":false,"image":true,"audio":false,"video":false},"output":{"text":true,"image":false,"audio":false,"video":false}}}}}