opencoder:8b - Ollama framework

**OpenCoder** is an open and reproducible code LLM family that includes 1.5B and 8B models and supports both English and Chinese. Trained from scratch, OpenCoder is pretrained on 2.5 trillion tokens composed of 90% raw code and 10% code-related web data, then fine-tuned with supervision on over 4.5M high-quality SFT examples, reaching the performance of top-tier code LLMs. We provide not only model weights and inference code, but also the reproducible training data, the complete data processing pipeline, rigorous experimental ablation results, and detailed training protocols. Empowering researchers to build and innovate, OpenCoder is your open foundation for advancing code AI.

- **Complete Open Source**: OpenCoder ensures full transparency by releasing not only the model weights and forthcoming inference code but also the complete data-cleaning code for training. This release includes high-quality synthetic data, an extensive set of checkpoints, and a dataset of over 4.5 million supervised fine-tuning (SFT) entries, making OpenCoder one of the most comprehensively open-sourced models available.
- **Comprehensive Experimental Analysis**: OpenCoder is rigorously tested through extensive ablation studies on various data-cleaning strategies and training processes, including file-level and repository-level deduplication experiments, ensuring thorough exploration and validation of the model’s performance.
- **High-Quality Synthetic Data**: OpenCoder provides a fully developed synthetic data generation process and over 4.5 million SFT data entries, establishing a robust data foundation for model training and evaluation.
- **Exceptional Performance**: OpenCoder achieves high performance across multiple language model benchmarks, positioning it among the leading open-source models for code.
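Since this page covers the `opencoder:8b` tag on Ollama, the following is a minimal sketch of how the model might be queried from Python. It assumes a local Ollama server listening on its default port (11434) with the model already pulled, and it uses Ollama's `/api/generate` REST endpoint in non-streaming mode. The helper names `build_request` and `generate` are hypothetical and exist only for this illustration; they are not part of OpenCoder or Ollama.

```python
import json
import urllib.request

# Assumption: a local Ollama server on its default port, with the model
# already pulled under the tag shown in this page's title.
OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL_TAG = "opencoder:8b"


def build_request(prompt: str, model: str = MODEL_TAG) -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = MODEL_TAG, timeout: float = 120.0) -> str:
    """Send the prompt and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        # With "stream": False the server replies with a single JSON object
        # whose "response" field holds the full completion.
        return json.loads(resp.read().decode("utf-8"))["response"]
```

With a server running, calling `generate("Write a Python function that reverses a string.")` would return the model's completion as a single string. Building the request separately from sending it keeps the payload easy to inspect without network access.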

## References

[GitHub](https://github.com/OpenCoder-llm/OpenCoder-llm)

[Paper](https://arxiv.org/pdf/2411.04905)

[Hugging Face](https://hugging-face.cn/collections/infly/opencoder-672cec44bbb86c39910fb55e)
