wrjc1hod posted on 2024-8-17 15:37:18

Code-writing AI goes open source! It writes C better than Codex and has mastered 12 programming languages | CMU


    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">By Xiao Xiao, from Aofeisi</p>QbitAI | WeChat official account QbitAI
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">An AI code generation model that writes C better than Codex is now open source!</p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Writing code with AI has been hugely popular lately; the best-known examples are OpenAI's Codex and DeepMind's AlphaCode.</p>
    <div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/tos-cn-i-qvj2lq49k0/707e8bfb9ea44bfc80aca9040d3e5042~noop.image?_iz=58558&amp;from=article.pc_detail&amp;lk3s=953192f4&amp;x-expires=1723892990&amp;x-signature=QErzKRLI%2BV8XqlNQ4As6o5n9lwg%3D" style="width: 50%; margin-bottom: 20px;">
      <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">△ Copilot, built on Codex</p>
    </div>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">However, neither of these two models is open source:</p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">AlphaCode has released only some test samples, and Codex is available only through an API.</p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">In response, several researchers from CMU used GPT-2 to build an AI code generation model called <strong style="color: blue;"><span style="color: black;">PolyCoder</span></strong>, and it is <strong style="color: blue;"><span style="color: black;">open source</span></strong>.</p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">According to the researchers, although the largest PolyCoder model has only 2.7 billion parameters (versus 12 billion for Codex), the code it writes in <strong style="color: blue;"><span style="color: black;">C</span></strong> is better than Codex's.</p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">What is the secret?</p>
    <h1 style="color: black; text-align: left; margin-bottom: 10px;">Trained on code in 12 programming languages</h1>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">First, the training <strong style="color: blue;"><span style="color: black;">dataset</span></strong>, which is one of PolyCoder's most distinctive features.</p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Previously, AI code generation models such as Codex and CodeParrot were trained mainly on <strong style="color: blue;"><span style="color: black;">Python</span></strong> code.</p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">For example, HumanEval, one of the datasets used to evaluate Codex, likewise measures how well a model generates Python code.</p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">By contrast, <strong style="color: blue;"><span style="color: black;">PolyCoder</span></strong> was trained on code in <strong style="color: blue;"><span style="color: black;">multiple programming languages</span></strong>, 12 in total:</p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">C, C#, C++, Go, Java, JavaScript, PHP, Python, Ruby, Rust, Scala, and TypeScript.</p>
    <div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/tos-cn-i-qvj2lq49k0/209b208b8d00440d9395a310475c6d06~noop.image?_iz=58558&amp;from=article.pc_detail&amp;lk3s=953192f4&amp;x-expires=1723892990&amp;x-signature=yMP3lp63u1LqPvh40oo6aTOOm2w%3D" style="width: 50%; margin-bottom: 20px;"></div>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Of these, C has the most code, at 221GB, while the amount of Python code is smaller than what either Codex or CodeParrot used.</p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">PolyCoder uses public code from GitHub, drawn mainly from the more popular repositories for each language; every repository has at least 50 stars.</p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">According to the researchers, collection stopped at roughly 25k repositories per language, so that the model's output would not skew too heavily toward the most popular languages (the more popular a language, the more highly starred repositories it tends to have).</p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">After extracting the files from the repositories and applying some simple processing (including removing duplicate code), about <strong style="color: blue;"><span style="color: black;">254GB</span></strong> of data was selected for training.</p>
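    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">The selection and deduplication steps described above can be sketched roughly as follows. This is only an illustration, not the authors' pipeline: the record keys (<code>name</code>, <code>language</code>, <code>stars</code>), both function names, and the choice of exact-match SHA-256 hashing for deduplication are assumptions, and the 25k limit is treated here as a per-language cap on the number of repositories.</p>

```python
import hashlib
from collections import defaultdict


def select_repos(repos, min_stars=50, max_per_lang=25_000):
    """Keep repos with enough stars, capped per language (most-starred first).

    `repos` is an iterable of dicts with the hypothetical keys
    'name', 'language' and 'stars'.
    """
    by_lang = defaultdict(list)
    for repo in repos:
        if repo["stars"] >= min_stars:
            by_lang[repo["language"]].append(repo)

    selected = []
    for items in by_lang.values():
        items.sort(key=lambda r: r["stars"], reverse=True)
        selected.extend(items[:max_per_lang])
    return selected


def deduplicate(files):
    """Drop files whose content is byte-for-byte identical to an earlier file.

    `files` is an iterable of (path, content) pairs; the first file seen
    for each distinct content is kept.
    """
    seen = set()
    unique = []
    for path, content in files:
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append((path, content))
    return unique
```

    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Capping each language separately is what keeps a single very popular language from dominating the corpus.</p>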
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Next, the <strong style="color: blue;"><span style="color: black;">pretraining</span></strong> method.</p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">There are generally three ways to pretrain a language model.</p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">The first is a left-to-right language model, which predicts what comes next from the preceding context and suits tasks such as <strong style="color: blue;"><span style="color: black;">code generation</span></strong>; the second is a masked language model, which predicts masked spans from the surrounding context and suits tasks such as <strong style="color: blue;"><span style="color: black;">code classification</span></strong>; the third is an encoder-decoder model, which suits tasks such as <strong style="color: blue;"><span style="color: black;">code commenting</span></strong>.</p>
    <div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/tos-cn-i-qvj2lq49k0/965e2b7728114c51bac29236126d2357~noop.image?_iz=58558&amp;from=article.pc_detail&amp;lk3s=953192f4&amp;x-expires=1723892990&amp;x-signature=XwdVLjUriiztIKbxtMhD%2BHeZfcA%3D" style="width: 50%; margin-bottom: 20px;"></div>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">PolyCoder mainly uses the first of these pretraining methods.</p>
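    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">To make the difference between the first two objectives concrete, here is a toy sketch of how training examples could be built from one tokenized line of C. The function names and the <code>&lt;mask&gt;</code> placeholder are illustrative assumptions; real models work on subword token IDs and compute a loss over predicted distributions rather than building explicit lists like this.</p>

```python
def causal_lm_examples(tokens):
    """Left-to-right objective: predict each token from all tokens before it.

    Returns a list of (context, next_token) pairs.
    """
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def masked_lm_example(tokens, mask_positions, mask_token="<mask>"):
    """Masked objective: hide some positions, predict them from both sides.

    Returns (masked_input, targets) where targets maps position -> token.
    """
    positions = set(mask_positions)
    masked_input = [
        mask_token if i in positions else tok for i, tok in enumerate(tokens)
    ]
    targets = {i: tokens[i] for i in sorted(positions)}
    return masked_input, targets


tokens = ["int", "x", "=", "1", ";"]

# Left-to-right: the model only ever sees what precedes the target.
for context, target in causal_lm_examples(tokens):
    print(context, "->", target)

# Masked: the model sees context on both sides of each hidden token.
print(masked_lm_example(tokens, mask_positions=[1, 3]))
```

    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Because the left-to-right setup only ever conditions on preceding tokens, it matches how code is produced at generation time, one token after another.</p>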
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Compared with CodeParrot and Codex, which are likewise built on GPT-family architectures, PolyCoder also differs slightly in its hyperparameter settings:</p>
    <div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/tos-cn-i-qvj2lq49k0/dae4637e3dbd41388d063b89ebdfd87e~noop.image?_iz=58558&amp;from=article.pc_detail&amp;lk3s=953192f4&amp;x-expires=1723892990&amp;x-signature=jgC1oBDhC%2FNfvl9Vldmgw4l2gs8%3D" style="width: 50%; margin-bottom: 20px;"></div>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">PolyCoder comes in three model sizes, with 2.7 billion, 400 million, and 160 million parameters; researchers can choose the one that fits their needs and available training capacity.</p>
    <div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/tos-cn-i-qvj2lq49k0/336e51f7bf32498691f9699635dad98b~noop.image?_iz=58558&amp;from=article.pc_detail&amp;lk3s=953192f4&amp;x-expires=1723892990&amp;x-signature=qy2%2FthkJxWVu0ou5zQhp1IXX%2B00%3D" style="width: 50%; margin-bottom: 20px;"></div>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">So how well does the resulting model generate code?</p>
    <h1 style="color: black; text-align: left; margin-bottom: 10px;">Especially good at C, but not at Python</h1>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">The researchers compared PolyCoder with existing AI code generation models.</p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Because AlphaCode is hard to compare against (it has no public interface), the researchers focused on the following models, including GPT-Neo, CodeParrot, and Codex.</p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Blue marks the open-source models and orange marks the closed-source ones:</p>
    <div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p26-sign.toutiaoimg.com/tos-cn-i-qvj2lq49k0/d4586df4551f4ae1a5ade9394712bcaa~noop.image?_iz=58558&amp;from=article.pc_detail&amp;lk3s=953192f4&amp;x-expires=1723892990&amp;x-signature=p%2F5kwC3sjZBJavPLq%2F1cUSZ5joU%3D" style="width: 50%; margin-bottom: 20px;"></div>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">In parameter count, PolyCoder is not at the top: even its largest model, at 2.7 billion parameters, is less than a quarter the size of Codex.</p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">The researchers first compared the models on <strong style="color: blue;"><span style="color: black;">perplexity</span></strong>, a metric commonly used to evaluate language models.</p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Perplexity measures how good a language model (LM) is. The lower the perplexity, the less "confused" the model is by the code it sees, that is, the higher the probability it assigns to the actual code, and the better its generations tend to be.</p>
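    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">As a minimal sketch of the standard definition, perplexity is the exponential of the average negative log-probability the model assigns to each token of the evaluated text. The function name and the idea of passing in a plain list of per-token natural-log probabilities are assumptions made for illustration; the article does not describe how the researchers computed it.</p>

```python
import math


def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities.

    perplexity = exp(-(1/N) * sum(log p(token_i | previous tokens)))
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_neg_log_prob = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_neg_log_prob)


# A model that gives every token probability 0.9 is less "confused"
# than one that gives every token probability 0.5.
confident = perplexity([math.log(0.9)] * 10)
uncertain = perplexity([math.log(0.5)] * 10)
print(confident < uncertain)
```

    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">A model that assigned probability 1 to every token would reach the minimum perplexity of 1; a model that spreads its probability evenly over k choices at every step has perplexity k.</p>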
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">As the figure shows, PolyCoder unexpectedly achieved the best result on <strong style="color: blue;"><span style="color: black;">C</span></strong> (the lowest perplexity).</p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">The result of training PolyCoder on a large amount of C code shows that, even with the overall model design unchanged (based on GPT-2), simply changing the training corpus can yield AI code generation models that are good at different languages.</p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Unfortunately, on the other languages its results fall well short of Codex:</p>
    <div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/tos-cn-i-qvj2lq49k0/3c50be4e61544942b40320520c3eaf64~noop.image?_iz=58558&amp;from=article.pc_detail&amp;lk3s=953192f4&amp;x-expires=1723892990&amp;x-signature=T7dFoCfAxJhJ0YNAqWeX2iw4oyg%3D" style="width: 50%; margin-bottom: 20px;"></div>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">For example, on HumanEval, which mainly evaluates Python code, PolyCoder is far behind Codex:</p>
    <div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/tos-cn-i-qvj2lq49k0/92922deda3104dd1a87f80bc2787c9de~noop.image?_iz=58558&amp;from=article.pc_detail&amp;lk3s=953192f4&amp;x-expires=1723892990&amp;x-signature=cJ3baaNdBja079d9UVCeNJ7mXVw%3D" style="width: 50%; margin-bottom: 20px;"></div>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">According to the paper's analysis, this may be due to factors such as the smaller amount of Python training data and the smaller model size.</p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">The authors also note that the main goal of PolyCoder is to open-source an AI code generation model so that more people can take part in research and use it.</p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">The code is now open source; you can use it directly or try developing new models on top of it.</p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Anyone interested can give it a try.</p>
    <h1 style="color: black; text-align: left; margin-bottom: 10px;">About the authors</h1>
    <div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/tos-cn-i-qvj2lq49k0/e35773ee9ad14453b99c1fa9a01192a7~noop.image?_iz=58558&amp;from=article.pc_detail&amp;lk3s=953192f4&amp;x-expires=1723892990&amp;x-signature=ApUbcUBbe2lUsX8rBzenXekJh1s%3D" style="width: 50%; margin-bottom: 20px;"></div>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">First author Frank Xu (许方正) is a PhD student at CMU working on NLP and information extraction. He has published multiple papers at top conferences, including ICLR, ACL, and EMNLP. He received his bachelor's and master's degrees from Shanghai Jiao Tong University, where he was advised by Professor Zhu Qili (朱其立).</p>
    <div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/tos-cn-i-qvj2lq49k0/a4d68f3b52094771bb6021b72d42536b~noop.image?_iz=58558&amp;from=article.pc_detail&amp;lk3s=953192f4&amp;x-expires=1723892990&amp;x-signature=%2FDmEbfaODpNv9Whmf7tBawIYPOU%3D" style="width: 50%; margin-bottom: 20px;"></div>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Uri Alon is a postdoctoral researcher at CMU working on programming language processing (PLP), NLP, and deep learning.</p>
    <div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/tos-cn-i-qvj2lq49k0/f31ec11b2e9d4671b9ce8a6f37d261dd~noop.image?_iz=58558&amp;from=article.pc_detail&amp;lk3s=953192f4&amp;x-expires=1723892990&amp;x-signature=pk2uNA1%2FWrm2lYR7hJrOnA2tvM8%3D" style="width: 50%; margin-bottom: 20px;"></div>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Graham Neubig is an assistant professor at CMU working on NLP, machine translation, and machine-learning-based natural language understanding.</p>
    <div style="color: black; text-align: left; margin-bottom: 10px;"><img src="https://p3-sign.toutiaoimg.com/tos-cn-i-qvj2lq49k0/5047fb76495d41bea93baceae51feff5~noop.image?_iz=58558&amp;from=article.pc_detail&amp;lk3s=953192f4&amp;x-expires=1723892990&amp;x-signature=YvspWUlaY%2F38k4S%2BthBVD%2Bkn5As%3D" style="width: 50%; margin-bottom: 20px;"></div>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Vincent J. Hellendoorn is an assistant professor of computer science at CMU whose research focuses on software engineering and machine learning. He works on using intelligent methods to help software developers spend less time on tedious tasks such as debugging and program optimization.</p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">One wonders whether the authors are already using this AI to crank out code (doge).</p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Project: https://github.com/VHellendoorn/Code-LMs</p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;">Paper: https://arxiv.org/abs/2202.13169</p>



