6257rv7 posted on 2024-8-4 09:43:17

Resources | From Image Processing to Speech Recognition: 25 Open Deep Learning Datasets Every Data Scientist Should Know


    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Selected from Analytics Vidhya</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">Author: </strong></span><span style="color: black;"><strong style="color: blue;"><span style="color: black;">Pranav Dar</span></strong></span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><strong style="color: blue;"><span style="color: black;">Compiled by Jiqizhixin (机器之心)</span></strong></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">Contributors: </strong><strong style="color: blue;">Chen Yunzhu (陈韵竹), Lu (路)</strong></span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">This article introduces 25 open datasets for deep learning, covering image processing, natural language processing, speech recognition, and real-world problem datasets.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><strong style="color: blue;"><span style="color: black;">Introduction</span></strong></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">The key to getting better at deep learning (or at most fields in life) is practice. You need to practice on a variety of problems, from image processing to speech recognition. Each problem has its own nuances and its own approaches.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">But where do you get the data? Many papers today use proprietary datasets that are usually not released to the public. If you want to learn and apply your skills, not being able to obtain a suitable dataset is a problem.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">If you face this problem, this article offers a solution. It presents a list of publicly available, high-quality datasets that every deep learning enthusiast should try in order to improve their skills. Working on these datasets will make you a better data scientist, and what you learn will be invaluable in your career. We also list papers with state-of-the-art (SOTA) results that you can read and use to improve your own models.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><strong style="color: blue;"><span style="color: black;">How do you use these datasets?</span></strong></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">First, understand that these datasets are very large! Make sure you have a fast internet connection with no limit, or a very high limit, on the amount of data you can download.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">There are many ways to use these datasets, and you can apply a variety of deep learning techniques to them. You can use them to hone your skills, learn how to identify and structure each problem, think of unique use cases, and publish your findings for everyone to see!</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">The datasets are divided into three categories: image processing, natural language processing, and audio/speech processing.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Let's take a look!</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">Image Processing Datasets</strong></span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><strong style="color: blue;"><span style="color: black;">MNIST</span></strong></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><img src="https://mmbiz.qpic.cn/mmbiz_png/KmXPKA19gW9czEN7kkIJgEaTgu6aQ23icGVBjWBBSbV4r4wIC0wkia63KgNYPmInL94Jpj8ZWMoseENRto4wS3ug/640?wx_fmt=png&amp;tp=webp&amp;wxfrom=5&amp;wx_lazy=1&amp;wx_co=1" style="width: 50%; margin-bottom: 20px;"></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Link: https://datahack.analyticsvidhya.com/contest/practice-problem-identify-the-digits/</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">MNIST is one of the most popular deep learning datasets. It is a dataset of handwritten digits with a training set of 60,000 examples and a test set of 10,000 examples. It is a good database for trying out learning techniques and deep pattern recognition on real-world data while spending minimal time and effort on data preprocessing.</span></p>
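    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">To illustrate why so little preprocessing is needed, here is a minimal sketch of reading MNIST-style files. It assumes the original IDX binary distribution (big-endian header followed by raw unsigned bytes); the practice-problem link above may package the data differently. The function names and the tiny 2x2 sample are made up for illustration; real MNIST images are 28x28.</span></p>

```python
import struct

# Magic numbers used by the IDX files of the original MNIST distribution.
IDX_IMAGES_MAGIC = 2051  # 0x00000803: unsigned bytes, 3 dimensions
IDX_LABELS_MAGIC = 2049  # 0x00000801: unsigned bytes, 1 dimension


def parse_idx_images(raw):
    """Decode IDX image bytes into a list of flat pixel rows, one per image."""
    magic, count, rows, cols = struct.unpack(">IIII", raw[:16])
    if magic != IDX_IMAGES_MAGIC:
        raise ValueError("unexpected magic number %d" % magic)
    size = rows * cols
    body = raw[16:]
    if len(body) != count * size:
        raise ValueError("file length does not match header")
    return [list(body[i * size:(i + 1) * size]) for i in range(count)]


def parse_idx_labels(raw):
    """Decode IDX label bytes into a list of integer class labels."""
    magic, count = struct.unpack(">II", raw[:8])
    if magic != IDX_LABELS_MAGIC:
        raise ValueError("unexpected magic number %d" % magic)
    return list(raw[8:8 + count])


# Synthetic stand-in for a real file: two 2x2 "images" and their labels.
fake_images = struct.pack(">IIII", IDX_IMAGES_MAGIC, 2, 2, 2) + bytes(
    [0, 255, 128, 64, 1, 2, 3, 4]
)
fake_labels = struct.pack(">II", IDX_LABELS_MAGIC, 2) + bytes([7, 3])

images = parse_idx_images(fake_images)
labels = parse_idx_labels(fake_labels)
print(len(images), "images with labels", labels)
```

    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">With the real files, the same functions would be fed the decompressed bytes of the image and label files.</span></p>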
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Size: about 50 MB</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Number of records: 70,000 images in 10 classes.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">SOTA: "Dynamic Routing Between Capsules"</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Further reading:</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><a style="color: black;"><span style="color: black;">At last, Geoffrey Hinton's much-anticipated Capsule paper is public</span></a></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><a style="color: black;"><span style="color: black;">A brief analysis of the Capsule project recently proposed by Geoffrey Hinton</span></a></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><a style="color: black;"><span style="color: black;">Understand the CapsNet architecture first, then implement it in TensorFlow: possibly the most detailed tutorial yet</span></a></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><a style="color: black;"><span style="color: black;">After the official Capsule code was open-sourced, Jiqizhixin wrote a walkthrough of the core code</span></a></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><strong style="color: blue;"><span style="color: black;">MS-COCO</span></strong></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><img src="https://mmbiz.qpic.cn/mmbiz_jpg/KmXPKA19gW9czEN7kkIJgEaTgu6aQ23icBhwQW8KI4ibpmIoPCDv6e19lAzffaQR3v1baM6chl8ibzFYoO3Os2qCQ/640?wx_fmt=jpeg&amp;tp=webp&amp;wxfrom=5&amp;wx_lazy=1&amp;wx_co=1" style="width: 50%; margin-bottom: 20px;"></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Link: http://cocodataset.org/#home</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">COCO is a large-scale dataset for object detection, segmentation, and captioning. It has the following features:</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Object segmentation</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Recognition in context</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Superpixel stuff segmentation</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">330,000 images (more than 200,000 of them labeled)</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">1.5 million object instances</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">80 object categories</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">91 stuff categories</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">5 captions per image</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">250,000 people with keypoints</span></p>
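    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">The object instances and categories listed above are stored in JSON annotation files. The sketch below shows one way to count instances per category. It assumes the usual COCO layout with "images", "categories", and "annotations" lists and boxes given as [x, y, width, height]; the miniature dataset and the helper names are hypothetical.</span></p>

```python
from collections import Counter

# Hypothetical miniature annotation structure in the COCO detection layout.
coco = {
    "images": [
        {"id": 1, "file_name": "a.jpg", "width": 640, "height": 480},
        {"id": 2, "file_name": "b.jpg", "width": 640, "height": 480},
    ],
    "categories": [
        {"id": 1, "name": "person"},
        {"id": 18, "name": "dog"},
    ],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 1, "bbox": [10.0, 20.0, 50.0, 100.0]},
        {"id": 11, "image_id": 1, "category_id": 18, "bbox": [200.0, 150.0, 80.0, 60.0]},
        {"id": 12, "image_id": 2, "category_id": 1, "bbox": [5.0, 5.0, 30.0, 90.0]},
    ],
}


def instances_per_category(dataset):
    """Count annotated object instances for each category name."""
    names = {c["id"]: c["name"] for c in dataset["categories"]}
    counts = Counter(names[a["category_id"]] for a in dataset["annotations"])
    return dict(counts)


def bbox_area(bbox):
    """Area of a box given as [x, y, width, height]."""
    _x, _y, width, height = bbox
    return width * height


print(instances_per_category(coco))
print(bbox_area(coco["annotations"][0]["bbox"]))
```

    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">For a real annotation file, the dictionary would come from json.load on the downloaded file instead of being written inline.</span></p>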
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Size: about 25 GB (compressed)</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Number of records: 330,000 images, 80 object categories, 5 captions per image, 250,000 people with keypoints</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">SOTA: "Mask R-CNN"</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Further reading:</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><a style="color: black;"><span style="color: black;">Academia | New Facebook paper proposes Mask R-CNN, a general object segmentation framework: simpler, more flexible, and better-performing</span></a></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><a style="color: black;"><span style="color: black;">In depth | Convolutional neural networks for image segmentation: from R-CNN to Mask R-CNN</span></a></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><a style="color: black;"><span style="color: black;">Resources | A clever Mask R-CNN application: blocking out people as in the British series Black Mirror</span></a></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><strong style="color: blue;"><span style="color: black;">ImageNet</span></strong></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><img src="https://mmbiz.qpic.cn/mmbiz_png/KmXPKA19gW9czEN7kkIJgEaTgu6aQ23ic9dicjZoc0tenfP5jDkd47zfwiaYIX56rydlErATy0uZ4rH9EbRYEsia7Q/640?wx_fmt=png&amp;tp=webp&amp;wxfrom=5&amp;wx_lazy=1&amp;wx_co=1" style="width: 50%; margin-bottom: 20px;"></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Link: http://www.image-net.org/</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">ImageNet is an image dataset organized according to the WordNet hierarchy. WordNet contains approximately 100,000 phrases, and ImageNet provides around 1,000 images on average to illustrate each phrase.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Size: about 150 GB</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Number of records: about 1,500,000 images in total, each with multiple bounding boxes and their respective class labels.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">SOTA: "Aggregated Residual Transformations for Deep Neural Networks" (https://arxiv.org/pdf/1611.05431.pdf)</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><strong style="color: blue;"><span style="color: black;">Open Images Dataset</span></strong></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><img src="https://mmbiz.qpic.cn/mmbiz_png/KmXPKA19gW9czEN7kkIJgEaTgu6aQ23icNTWSsmzRZhdrFIYHRpx8IFsgEUgqxAc6aZDcp3D3kiawmUKIRtZ3hZw/640?wx_fmt=png&amp;tp=webp&amp;wxfrom=5&amp;wx_lazy=1&amp;wx_co=1" style="width: 50%; margin-bottom: 20px;"></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Link: https://github.com/openimages/dataset</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Open Images is a dataset of roughly 9 million image URLs. The images are annotated with image-level labels and bounding boxes spanning thousands of classes. The training set contains 9,011,219 images, the validation set 41,260 images, and the test set 125,436 images.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Size: 500 GB (compressed)</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Number of records: 9,011,219 images with more than 5,000 labels</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">SOTA: ResNet 101 image classification model (trained on V2 data):</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Model checkpoint: https://storage.googleapis.com/openimages/2017_07/oidv2-resnet_v1_101.ckpt.tar.gz</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Checkpoint readme: https://storage.googleapis.com/openimages/2017_07/oidv2-resnet_v1_101.readme.txt</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Inference code: https://github.com/openimages/dataset/blob/master/tools/classify_oidv2.py</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><strong style="color: blue;"><span style="color: black;">VisualQA</span></strong></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><img src="https://mmbiz.qpic.cn/mmbiz_jpg/KmXPKA19gW9czEN7kkIJgEaTgu6aQ23icLFhtVfPJxDd8kWicJVmHkO7SsCUibKAtvUF92h49f4G1Gkrf1Np2HCSg/640?wx_fmt=jpeg&amp;tp=webp&amp;wxfrom=5&amp;wx_lazy=1&amp;wx_co=1" style="width: 50%; margin-bottom: 20px;"></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Link: http://www.visualqa.org/</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">VQA is a dataset of open-ended questions about images. Answering these questions requires an understanding of both vision and language. The dataset has the following interesting features:</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">265,016 images (COCO and abstract scenes)</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">At least 3 questions per image (5.4 questions on average)</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">10 ground-truth answers per question</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">3 plausible (but likely incorrect) answers per question</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">An automatic evaluation metric</span></p>
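    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">The automatic evaluation metric builds on the 10 ground-truth answers per question. A commonly quoted simplified form counts a predicted answer as fully correct when at least 3 annotators gave it. The sketch below implements only that simplified form; it is not the official evaluation script, which also normalizes answer strings more thoroughly and averages over subsets of annotators. The function name is an assumption.</span></p>

```python
def vqa_accuracy(predicted, human_answers):
    """Simplified VQA accuracy: min(number of matching human answers / 3, 1)."""
    target = predicted.strip().lower()
    matches = sum(1 for answer in human_answers if answer.strip().lower() == target)
    return min(matches / 3.0, 1.0)


# Ten hypothetical annotator answers for one question.
answers = ["yes", "yes", "no", "no", "no", "no", "no", "no", "no", "no"]

print(vqa_accuracy("no", answers))
print(vqa_accuracy("yes", answers))
print(vqa_accuracy("maybe", answers))
```

    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">The design rewards answers that several humans agree on while giving partial credit to answers given by only one or two annotators.</span></p>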
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Size: 25 GB (compressed)</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Number of records: 265,016 images, at least 3 questions per image, 10 ground-truth answers per question</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">SOTA: "Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge" (https://arxiv.org/abs/1708.02711)</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><strong style="color: blue;"><span style="color: black;">The Street View House Numbers (SVHN) Dataset</span></strong></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><img src="https://mmbiz.qpic.cn/mmbiz_png/KmXPKA19gW9czEN7kkIJgEaTgu6aQ23icjqKhyAlicEgH3ibFGS2AQRm5RfY3PFMUX0DlQYibn0roiaw29GAz3uDdQw/640?wx_fmt=png&amp;tp=webp&amp;wxfrom=5&amp;wx_lazy=1&amp;wx_co=1" style="width: 50%; margin-bottom: 20px;"></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Link: http://ufldl.stanford.edu/housenumbers/</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">This is a real-world image dataset for developing object detection algorithms. It requires minimal data preprocessing. It is similar to MNIST but has far more labeled data (over 600,000 images). The data was collected from house numbers in Google Street View.</span></p>
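    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">One small preprocessing step is still worth knowing about. In the cropped-digit format described on the dataset page, labels run from 1 to 10, with 10 standing for the digit 0. The sketch below, with a hypothetical helper name, remaps such labels to the 0 to 9 range that most classifiers expect.</span></p>

```python
def svhn_label_to_digit(label):
    """Map an SVHN label (1..10, where 10 means digit 0) to the digit 0..9."""
    if not 1 <= label <= 10:
        raise ValueError("SVHN labels are expected to be in the range 1..10")
    return label % 10


raw_labels = [1, 9, 10, 5]  # hypothetical labels as stored in the dataset
digits = [svhn_label_to_digit(label) for label in raw_labels]
print(digits)
```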
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Size: 2.5 GB</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Number of records: 630,420 images in 10 classes</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">SOTA: "Distributional Smoothing With Virtual Adversarial Training" (https://arxiv.org/pdf/1507.00677.pdf)</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">In this paper, researchers at Kyoto University in Japan propose local distributional smoothness (LDS), a new notion of smoothness for statistical models. It can be used as a regularization term to promote the smoothness of the model distribution. The method not only performs well on supervised and semi-supervised learning tasks on MNIST, but also achieves test errors of 24.63 and 9.88 on SVHN and NORB, respectively. These results show that the method clearly outperforms the previous best results on semi-supervised learning tasks.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><strong style="color: blue;"><span style="color: black;">CIFAR-10</span></strong></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><img src="https://mmbiz.qpic.cn/mmbiz_png/KmXPKA19gW9czEN7kkIJgEaTgu6aQ23icmRXPUFpKClbC8TYvxYFRaYnol0PO4R6UwamQAQMb7EFU1jtickrBRXg/640?wx_fmt=png&amp;tp=webp&amp;wxfrom=5&amp;wx_lazy=1&amp;wx_co=1" style="width: 50%; margin-bottom: 20px;"></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Link: http://www.cs.toronto.edu/~kriz/cifar.html</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">This dataset is also used for image classification. It consists of 60,000 images in 10 classes (each class is shown as a row in the image above). In total there are 50,000 training images and 10,000 test images. The dataset is divided into 6 parts: 5 training batches and 1 test batch. Each batch contains 10,000 images.</span></p>
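    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">The batch structure can be sketched as follows. This assumes the Python version of the dataset described on the CIFAR page, where the six batch files are named data_batch_1 to data_batch_5 and test_batch, and each image is a row of 3,072 bytes: 1,024 red values, then 1,024 green, then 1,024 blue, for a 32x32 picture stored row by row. The helper name and the synthetic row are illustrative only.</span></p>

```python
# File names of the 5 training batches and the 1 test batch.
BATCH_FILES = ["data_batch_%d" % i for i in range(1, 6)] + ["test_batch"]

SIDE = 32            # images are 32x32 pixels
PLANE = SIDE * SIDE  # 1024 values per colour channel


def pixel_rgb(row, x, y):
    """Return the (R, G, B) value at column x, line y of one flattened image row."""
    if len(row) != 3 * PLANE:
        raise ValueError("expected %d values per image" % (3 * PLANE))
    index = y * SIDE + x
    return (row[index], row[PLANE + index], row[2 * PLANE + index])


# Synthetic image row: every pixel has R=1, G=2, B=3.
fake_row = [1] * PLANE + [2] * PLANE + [3] * PLANE

print(BATCH_FILES)
print(pixel_rgb(fake_row, 5, 7))
```

    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">With the real files, each batch would first be loaded with pickle, and the rows would come from its data array.</span></p>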
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Size: 170 MB</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Number of records: 60,000 images in 10 classes</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">SOTA: "ShakeDrop regularization" (https://openreview.net/pdf?id=S1NHaMW0b)</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><strong style="color: blue;"><span style="color: black;">Fashion-MNIST</span></strong></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><img src="https://mmbiz.qpic.cn/mmbiz_png/KmXPKA19gW9czEN7kkIJgEaTgu6aQ23icJlhZjuh9PHeFBwOXvdr0RnFibXE3YYkLGKcrFGDh0d3tNdwl6xGa3WA/640?wx_fmt=png&amp;tp=webp&amp;wxfrom=5&amp;wx_lazy=1&amp;wx_co=1" style="width: 50%; margin-bottom: 20px;"></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Link: https://github.com/zalandoresearch/fashion-mnist</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Fashion-MNIST consists of 60,000 training images and 10,000 test images. It is an MNIST-like database of fashion products. Its developers believe MNIST has been overused, so they created this dataset as a direct replacement for it. Each image is in grayscale and is associated with one label from 10 classes.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Size: 30 MB</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Number of records: 70,000 images in 10 classes</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">SOTA: "Random Erasing Data Augmentation" (https://arxiv.org/abs/1708.04896)</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">Natural Language Processing</strong></span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><strong style="color: blue;"><span style="color: black;">IMDB Movie Reviews Dataset</span></strong></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Link: http://ai.stanford.edu/~amaas/data/sentiment/</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">This is a dream dataset for movie lovers. It is meant for binary sentiment classification and contains more data than earlier datasets in this field. Besides the training and test review examples, additional unlabeled data is available. Both raw text and a preprocessed bag-of-words format are included.</span></p>
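    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">As a rough illustration of how the raw-text reviews map to binary labels, the sketch below assumes the layout of the downloadable archive: labeled reviews are stored as text files named id_rating.txt in pos and neg folders, with positive reviews rated 7 or higher and negative reviews rated 4 or lower out of 10. The function name and the sample paths are hypothetical.</span></p>

```python
import os


def parse_review_filename(path):
    """Return (review_id, rating, label) for a file named like '200_8.txt'.

    label is 1 for positive (rating >= 7) and 0 for negative (rating <= 4).
    """
    stem = os.path.splitext(os.path.basename(path))[0]
    review_id, rating = stem.split("_")
    rating = int(rating)
    if rating >= 7:
        label = 1
    elif rating <= 4:
        label = 0
    else:
        raise ValueError("rating %d is outside the labeled ranges" % rating)
    return int(review_id), rating, label


# Hypothetical paths inside the extracted archive.
print(parse_review_filename("aclImdb/train/pos/200_8.txt"))
print(parse_review_filename("aclImdb/train/neg/3_4.txt"))
```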
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Size: 80 MB</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Number of records: 25,000 highly polar movie reviews each for training and testing</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">SOTA: "Learning Structured Text Representations" (https://arxiv.org/abs/1705.09207)</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><strong style="color: blue;"><span style="color: black;">Twenty Newsgroups Dataset</span></strong></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Link: https://archive.ics.uci.edu/ml/datasets/Twenty+Newsgroups</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">As the name suggests, this dataset contains information about newsgroups. It is a collection of 20,000 newsgroup documents drawn from 20 different newsgroups (1,000 from each). The articles have typical features such as subject lines and lead-ins.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Size: 20 MB</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Number of records: 20,000 messages from 20 newsgroups</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">SOTA: "Very Deep Convolutional Networks for Text Classification" (https://arxiv.org/abs/1606.01781)</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><strong style="color: blue;"><span style="color: black;">Sentiment140</span></strong></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Link: http://help.sentiment140.com/for-students/</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Sentiment140 is a dataset for sentiment analysis. This popular dataset is a good way to start your natural language processing journey. Emoticons have already been removed from the data. The final dataset has the following six features:</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Polarity of the tweet</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">ID of the tweet</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Date of the tweet</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">The query</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Username of the tweeter</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Text of the tweet</span></p>
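    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">The six fields above can be read with a few lines of code. The sketch below assumes the data is a header-less CSV with the columns in the order listed and polarity encoded as 0 (negative), 2 (neutral), and 4 (positive), as documented on the dataset page. The column names, the helper name, and the two sample rows are made up for illustration; the real file may also need an explicit encoding when opened.</span></p>

```python
import csv
import io

# Column names chosen here for the six fields; the CSV itself has no header row.
FIELDS = ["polarity", "id", "date", "query", "user", "text"]
POLARITY = {"0": "negative", "2": "neutral", "4": "positive"}

# Two made-up rows in the same shape as the dataset.
sample = (
    '"0","1","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","user_a","stuck in traffic, again"\n'
    '"4","2","Mon Apr 06 22:20:01 PDT 2009","NO_QUERY","user_b","loving the new album"\n'
)


def read_tweets(handle):
    """Yield one dict per tweet, with a readable 'sentiment' key added."""
    for row in csv.DictReader(handle, fieldnames=FIELDS):
        row["sentiment"] = POLARITY[row["polarity"]]
        yield row


tweets = list(read_tweets(io.StringIO(sample)))
for tweet in tweets:
    print(tweet["user"], tweet["sentiment"], tweet["text"])
```

    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Using the csv module rather than splitting on commas matters here, because tweet text often contains commas of its own.</span></p>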
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Size: 80 MB (compressed)</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Number of records: 1,600,000 tweets</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">SOTA: "Assessing State-of-the-Art Sentiment Models on State-of-the-Art Sentiment Datasets" (http://www.aclweb.org/anthology/W17-5202)</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><strong style="color: blue;"><span style="color: black;">WordNet</span></strong></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">链接:https://wordnet.princeton.edu/</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">上文介绍 ImageNet 数据集时<span style="color: black;">说到</span>,WordNet 是一个大型英语 synset 数据库。Synset <span style="color: black;">亦</span><span style="color: black;">便是</span>同义词组,每组描述的概念<span style="color: black;">区别</span>。WordNet 的结构让它<span style="color: black;">作为</span> NLP 中非常有用的工具。</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><span style="color: black;">体积</span>:10 MB</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">数量:117,000 个同义词集,它们<span style="color: black;">经过</span>少量的「概念关系」与其他同义词集相互<span style="color: black;">相关</span></span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">SOTA:《Wordnets: State of the Art and Perspectives》(https://aclanthology.info/pdf/R/R11/R11-1097.pdf)</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><strong style="color: blue;"><span style="color: black;">Yelp Dataset</span></strong></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Link: https://www.yelp.com/dataset</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">This is an open dataset released by Yelp for learning purposes. It contains millions of user reviews, business attributes, and over 200,000 photos from multiple metropolitan areas. It is a very commonly used dataset for NLP challenges worldwide.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Size: 2.66 GB JSON, 2.9 GB SQL, and 7.5 GB of photos (all compressed)</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Number of records: 5,200,000 reviews, 174,000 business attributes, 200,000 photos, and 11 metropolitan areas</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">SOTA: "Attentive Convolution" (https://arxiv.org/pdf/1710.00519.pdf)</span></p>
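    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">With several gigabytes of JSON, it helps to stream the reviews instead of loading them all at once. The sketch below assumes the review file holds one JSON object per line with a stars field; that layout, the field names, and the three sample records are assumptions for illustration and should be checked against the downloaded files.</span></p>

```python
import io
import json
from collections import Counter


def star_histogram(lines):
    """Count reviews per star rating from an iterable of JSON-encoded lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        review = json.loads(line)
        counts[int(review["stars"])] += 1
    return counts


# Hypothetical stand-in for a review file: one JSON object per line.
sample = io.StringIO(
    '{"business_id": "b1", "stars": 5, "text": "Great tacos."}\n'
    '{"business_id": "b2", "stars": 2, "text": "Slow service."}\n'
    '{"business_id": "b1", "stars": 5, "text": "Would return."}\n'
)
hist = star_histogram(sample)
print(sorted(hist.items()))
```

    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Because the function only iterates over lines, an open file handle can be passed in place of the in-memory sample.</span></p>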
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><strong style="color: blue;"><span style="color: black;">Wikipedia Corpus</span></strong></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Link: http://nlp.cs.nyu.edu/wikipedia-data/</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">This dataset is a collection of the full text of Wikipedia. It contains almost 1.9 billion words from more than 4 million articles. You can search it by word, by phrase, or by paragraph, which makes it a powerful NLP dataset.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Size: 20 MB</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Number of records: 4,400,000 articles containing 1.9 billion words</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">SOTA: "Breaking the Softmax Bottleneck: A High-Rank RNN Language Model" (https://arxiv.org/pdf/1711.03953.pdf)</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><strong style="color: blue;"><span style="color: black;">Blog Authorship Corpus</span></strong></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Link: http://u.cs.biu.ac.il/~koppel/BlogCorpus.htm</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">This dataset consists of blog posts collected from thousands of bloggers on blogger.com. Each blog is provided as a separate file, and each contains at least 200 occurrences of commonly used English words.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Size: 300 MB</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Number of records: 681,288 posts with over 140 million words in total</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">SOTA: "Character-level and Multi-channel Convolutional Neural Networks for Large-scale Authorship Attribution" (https://arxiv.org/pdf/1609.06686.pdf)</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><strong style="color: blue;"><span style="color: black;">Machine Translation of European Languages</span></strong></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Link: http://statmt.org/wmt18/index.html</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">This dataset contains training data for four European languages and is intended to improve current translation methods. You can use any of the following language pairs:</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">French - English</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Spanish - English</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">German - English</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Czech - English</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Size: about 15 GB</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Number of records: about 30,000,000 sentences and their translations</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">SOTA: "Attention Is All You Need"</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Related reading:</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><a style="color: black;"><span style="color: black;">Academia | A breakthrough in machine translation: Google builds a translation architecture based entirely on</span></a> attention</p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><a style="color: black;"><span style="color: black;">Resources | A TensorFlow implementation of Transformer, Google's all-attention machine translation model</span></a></p>
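    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Sentence-level translation data is often distributed as two text files in which line i of one file translates line i of the other. Assuming that layout (an assumption, not something stated above), the sketch below pairs the lines and drops pairs that look misaligned. The function name, the max_ratio threshold, and the sample sentences are all hypothetical.</span></p>

```python
import io


def read_parallel(src_lines, tgt_lines, max_ratio=3.0):
    """Yield (source, target) sentence pairs from two line-aligned streams.

    Pairs with an empty side, or whose token counts differ by more than
    max_ratio, are dropped as likely misalignments.
    """
    for src, tgt in zip(src_lines, tgt_lines):
        src, tgt = src.strip(), tgt.strip()
        if not src or not tgt:
            continue
        n_src, n_tgt = len(src.split()), len(tgt.split())
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            continue
        yield src, tgt


# Hypothetical German and English files, four aligned lines each.
german = io.StringIO("Guten Morgen .\n\nDas ist ein Test .\nJa\n")
english = io.StringIO(
    "Good morning .\n"
    "Orphan line .\n"
    "This is a test .\n"
    "Yes , of course , that is certainly right\n"
)
pairs = list(read_parallel(german, english))
print(pairs)
```

    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Only the first and third pairs survive here: the second has an empty source line and the fourth fails the length-ratio check.</span></p>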
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;"><strong style="color: blue;">Audio/Speech Datasets</strong></span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><strong style="color: blue;"><span style="color: black;">Free Spoken Digit Dataset</span></strong></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Link: https://github.com/Jakobovski/free-spoken-digit-dataset</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">This is yet another dataset in this article inspired by MNIST! It was created for the task of identifying spoken digits in audio samples. It is an open dataset, so the hope is that it will keep growing as people contribute more samples. Currently, it has the following characteristics:</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">3 speakers</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">1,500 recordings (50 of each digit from 0 to 9 per speaker)</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">English pronunciations</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Size: 10 MB</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Number of records: 1,500 audio samples</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">SOTA: "Raw Waveform-based Audio Classification Using Sample-level CNN Architectures" (https://arxiv.org/pdf/1712.00866)</span></p>
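    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">The counts above (3 speakers, 10 digits, 50 takes each) can be checked with a short sketch that also extracts labels from file names. It assumes recordings are named digit_speaker_index.wav; that naming pattern and the speaker names are assumptions for illustration, so verify them against the repository before relying on this.</span></p>

```python
from collections import Counter


def parse_name(filename):
    """Split '<digit>_<speaker>_<index>.wav' into (digit, speaker, index)."""
    stem = filename.rsplit(".", 1)[0]
    digit, speaker, index = stem.split("_")
    return int(digit), speaker, int(index)


# Hypothetical file names following the assumed pattern.
speakers = ("alice", "bob", "carol")
names = [
    f"{digit}_{speaker}_{take}.wav"
    for speaker in speakers
    for digit in range(10)
    for take in range(50)
]

# The digit encoded in each file name is the classification label.
labels = Counter(parse_name(name)[0] for name in names)
print(len(names), labels[7])
```

    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Each digit ends up with 150 examples (3 speakers times 50 takes), so the classes are balanced.</span></p>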
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><strong style="color: blue;"><span style="color: black;">Free Music Archive (FMA)</span></strong></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Link: https://github.com/mdeff/fma</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">FMA is a dataset for music analysis. It consists of full-length HQ audio, pre-computed features, and track-level and user-level metadata. It is an open dataset for evaluating several tasks in music information retrieval (MIR). Below are the csv files the dataset contains, along with what they include:</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">tracks.csv: per-track metadata such as ID, title, artist, genres, tags, and play counts, for all 106,574 tracks.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">genres.csv: all 163 genre IDs with their names and parents (used to infer the genre hierarchy and top-level genres).</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">features.csv: common features extracted with librosa.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">echonest.csv: audio features provided by Echonest (now Spotify) for a subset of 13,129 tracks.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Size: about 1,000 GB</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Number of records: about 100,000 tracks</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">SOTA: "Learning to Recognize Musical Genre from Audio" (https://arxiv.org/pdf/1803.05337.pdf)</span></p>
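    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">The genre hierarchy mentioned for genres.csv can be inferred by following each genre's parent until a root is reached. The sketch below assumes columns named genre_id, parent, and title, and that a parent of 0 marks a top-level genre; the column names, that convention, and the four sample rows are assumptions, not values taken from the real file.</span></p>

```python
import csv
import io


def top_level_genres(rows):
    """Map every genre_id to the id of its root (top-level) ancestor."""
    parent = {int(row["genre_id"]): int(row["parent"]) for row in rows}

    def root(genre_id):
        # A parent of 0 is assumed to mean "no parent" in this sketch.
        while parent[genre_id] != 0:
            genre_id = parent[genre_id]
        return genre_id

    return {genre_id: root(genre_id) for genre_id in parent}


# Hypothetical rows in the assumed genres.csv layout.
sample = io.StringIO(
    "genre_id,parent,title\n"
    "1,0,Rock\n"
    "2,1,Punk\n"
    "3,2,Hardcore\n"
    "4,0,Jazz\n"
)
rows = list(csv.DictReader(sample))
titles = {int(row["genre_id"]): row["title"] for row in rows}
roots = top_level_genres(rows)
print({titles[g]: titles[r] for g, r in roots.items()})
```

    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Collapsing sub-genres to their root like this is a common way to turn 163 fine-grained labels into a smaller classification target.</span></p>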
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><strong style="color: blue;"><span style="color: black;">Ballroom</span></strong></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Link: http://mtg.upf.edu/ismir2004/contest/tempoContest/node5.html</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">This dataset contains ballroom dance music audio files. It provides characteristic excerpts of many dance styles in RealAudio format. Below are some characteristics of the dataset:</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Total number of instances: 698</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Duration of each excerpt: about 30 seconds</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Total duration: about 20,940 seconds</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Size: 14 GB (compressed)</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Number of records: about 700 audio samples</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">SOTA: "A Multi-Model Approach To Beat Tracking Considering Heterogeneous Music Styles" (https://pdfs.semanticscholar.org/0cc2/952bf70c84e0199fcf8e58a8680a7903521e.pdf)</span></p>
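    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">The total duration follows directly from the instance count and the excerpt length. A quick check, treating every excerpt as exactly 30 seconds (the listing only says "about"):</span></p>

```python
instances = 698        # number of excerpts listed for the dataset
seconds_each = 30      # approximate length of one excerpt, taken as exact here

total_seconds = instances * seconds_each
total_hours = total_seconds / 3600

print(total_seconds, round(total_hours, 2))
```

    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">That works out to a little under six hours of audio.</span></p>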
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><strong style="color: blue;"><span style="color: black;">Million Song Dataset</span></strong></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Link: https://labrosa.ee.columbia.edu/millionsong/</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">The Million Song Dataset is a freely available collection of audio features and metadata for a million contemporary popular music tracks. Its purposes are:</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">To encourage research on algorithms that scale to commercial sizes</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">To provide a reference dataset for evaluating research</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">To serve as a shortcut alternative to creating a large dataset with APIs (e.g. The Echo Nest API)</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">To help new researchers get started in the MIR field</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">The core of the dataset is the feature analysis and metadata for one million songs. The dataset does not include any audio, only the derived features. Sample audio can be fetched from services such as 7digital using code provided by Columbia University (https://github.com/tb2332/MSongsDB/tree/master/Tasks_Demos/Preview7digital).</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Size: 280 GB</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Number of records: a million songs!</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">SOTA: "Preliminary Study on a Recommender System for the Million Songs Dataset Challenge" (http://www.ke.tu-darmstadt.de/events/PL-12/papers/08-aiolli.pdf)</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><strong style="color: blue;"><span style="color: black;">LibriSpeech</span></strong></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Link: http://www.openslr.org/12/</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">This dataset is a large-scale corpus of around 1,000 hours of English speech. The data comes from audiobooks in the LibriVox project and has been properly segmented and aligned. If you are looking for a starting point, see http://www.kaldi-asr.org/downloads/build/6/trunk/egs/ for acoustic models trained on this dataset, and http://www.openslr.org/11/ for language models suitable for evaluation.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Size: about 60 GB</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Number of records: 1,000 hours of speech</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">SOTA: "Letter-Based Speech Recognition with Gated ConvNets" (https://arxiv.org/abs/1712.09444)</span></p>
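    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Because the audio is already segmented and aligned, each utterance comes with its transcript. The sketch below assumes transcript lines of the form "speaker-chapter-utterance TEXT"; that format, the function name, and the sample IDs and sentences are assumptions for illustration and should be checked against the downloaded archive.</span></p>

```python
def parse_transcript(lines):
    """Map utterance IDs to (speaker, chapter, text) tuples."""
    utterances = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        utt_id, text = line.split(" ", 1)       # ID first, transcript after
        speaker, chapter, _ = utt_id.split("-")  # assumed three-part ID
        utterances[utt_id] = (speaker, chapter, text)
    return utterances


# Hypothetical transcript lines in the assumed format.
sample = [
    "1001-2002-0000 THE QUICK BROWN FOX",
    "1001-2002-0001 JUMPS OVER THE LAZY DOG",
]
utts = parse_transcript(sample)
print(utts["1001-2002-0001"])
```

    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Keeping the speaker ID alongside the text makes it easy to build speaker-disjoint training and validation splits.</span></p>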
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><strong style="color: blue;"><span style="color: black;">VoxCeleb</span></strong></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Link: http://www.robots.ox.ac.uk/~vgg/data/voxceleb/</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">VoxCeleb is a large-scale speaker identification dataset. It contains around 100,000 utterances by 1,251 celebrities, extracted from YouTube videos. The data is mostly gender balanced (males make up 55%). The celebrities span different accents, professions, and ages. There is no overlap between the development and test sets. Identifying which celebrity a voice belongs to makes for an interesting task.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Size: 150 MB</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Number of records: 100,000 utterances by 1,251 celebrities</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">SOTA: "VoxCeleb: a large-scale speaker identification dataset" (https://www.robots.ox.ac.uk/~vgg/publications/2017/Nagrani17/nagrani17.pdf)</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">To help you practice, we also provide some real-life problems and datasets for hands-on work. This section lists the deep learning practice problems on the DataHack platform.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><strong style="color: blue;"><span style="color: black;">Twitter Sentiment Analysis</span></strong></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Link: https://datahack.analyticsvidhya.com/contest/practice-problem-twitter-sentiment-analysis/</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Hate speech in the form of racism and sexism has become a serious problem on Twitter, so it is important to separate such tweets from the rest. In this practice problem, we provide Twitter data that contains both ordinary and hateful tweets. Your task as a data scientist is to identify which tweets are hateful and which are not.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Size: 3 MB</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Number of records: 31,962 tweets</span></p>
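    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Before reaching for a neural network, a bag-of-words baseline is a reasonable first step for this kind of binary tweet classification. The sketch below is a multinomial naive Bayes classifier with add-one smoothing, written with the standard library only. It is a swapped-in baseline, not the method used in the contest, and the four training tweets (with the placeholder token group_x) are hypothetical.</span></p>

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace."""
    return text.lower().split()


def train(examples):
    """examples: list of (text, label), label 0 = ordinary, 1 = hateful."""
    word_counts = {0: Counter(), 1: Counter()}
    doc_counts = Counter()
    for text, label in examples:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts[0]) | set(word_counts[1])
    return word_counts, doc_counts, vocab


def predict(model, text):
    """Return the label with the highest log posterior under naive Bayes."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in (0, 1):
        score = math.log(doc_counts[label] / total_docs)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in tokenize(text):
            if word in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][word] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical training tweets; "group_x" is a neutral placeholder token.
train_set = [
    ("lovely sunny day with friends", 0),
    ("great match tonight", 0),
    ("group_x people are awful", 1),
    ("awful group_x should leave", 1),
]
model = train(train_set)
print(predict(model, "group_x are awful"), predict(model, "sunny day tonight"))
```

    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">A baseline like this gives a reference score that any deep model should beat.</span></p>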
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><strong style="color: blue;"><span style="color: black;">Age Detection of Indian Actors</span></strong></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Link: https://datahack.analyticsvidhya.com/contest/practice-problem-age-detection/</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">This is a fascinating challenge for any deep learning enthusiast. The dataset contains thousands of images of Indian actors, and your task is to identify their age. All the images were manually selected and cropped from video frames, which leads to high variability in scale, pose, expression, illumination, age, resolution, occlusion, and makeup.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Size: 48 MB (compressed)</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Number of records: 19,906 images in the training set and 6,636 images in the test set</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><strong style="color: blue;"><span style="color: black;">Urban Sound Classification</span></strong></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Link: https://datahack.analyticsvidhya.com/contest/practice-problem-urban-sound-classification/</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">This dataset consists of more than 8,000 excerpts of urban sounds from 10 classes. This practice problem is meant to introduce you to audio processing in a typical classification scenario.</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Size: training set - 3 GB (compressed), test set - 2 GB (compressed)</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Number of records: 8,732 labeled excerpts of urban sounds from 10 classes (each excerpt &lt;= 4s long)</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">Original article: https://www.analyticsvidhya.com/blog/2018/03/comprehensive-collection-deep-learning-datasets/</span></p>
    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><strong style="color: blue;"><span style="color: black;">This article was compiled by Synced (Jiqizhixin). To republish it, please contact this official account for authorization.</span></strong></p>
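    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">A useful first step in audio processing is to confirm clip lengths, since excerpts here run up to 4 seconds and models usually expect fixed-size input. The sketch below measures duration with Python's standard wave module. It assumes clips are uncompressed WAV files, and it uses a synthetic in-memory tone (make_tone is a hypothetical helper) in place of a real excerpt.</span></p>

```python
import io
import math
import struct
import wave


def clip_duration(wav_file):
    """Return the duration in seconds of a WAV file or file-like object."""
    with wave.open(wav_file, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def make_tone(seconds, rate=8000, freq=440.0):
    """Build an in-memory mono 16-bit WAV tone as a stand-in for a real clip."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        frames = b"".join(
            struct.pack("<h", int(12000 * math.sin(2 * math.pi * freq * n / rate)))
            for n in range(int(seconds * rate))
        )
        wav.writeframes(frames)
    buf.seek(0)  # rewind so the buffer can be read back
    return buf


duration = clip_duration(make_tone(2.5))
print(duration, duration <= 4.0)
```

    <p style="font-size: 16px; color: black; line-height: 40px; text-align: left; margin-bottom: 15px;"><span style="color: black;">For real files, pass the path of a clip to clip_duration; shorter clips can then be padded to a common length before feature extraction.</span></p>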




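听听海 posted on 2024-8-28 15:43:42

I completely agree with your point. Very thoughtful.

很甜的橙橙橙子 posted on 2024-9-9 10:09:31

We have had similar experiences, and I know exactly how you feel.

qzmjef posted on 2024-10-14 00:04:19

Well done, original poster! You deserve the praise!

nykek5i posted on 2024-10-18 16:11:44

Your words are as warm as spring, and I am grateful for them.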