IndoNLP: Nusantara Datasheets

The list of available datasheets can be accessed at https://indonlp.github.io/nusa-catalogue/.
Before filling in this form, please check whether the dataset is already in the list.

Email *

Dataset name *

For example: NusaSenti, IndoNLU BaPOS, Indonesian Clickbait, IndoNLG TED En-Id, etc.

Dataset URL *

Direct link to the dataset repository.

Dataset's HuggingFace URL

Link to the HuggingFace version of the dataset (if present). For example: https://huggingface.co/datasets/indonlu.

Dataset subset

The relevant subset of the dataset (only if the dataset is split by dialect). For example, for NusaSenti, fill in this field with one of: "Indonesian", "English", "Acehnese", "Balinese", "Banjarese", or "Buginese".

Dataset task *

Abstractive Summarization

Aspect Based Sentiment Analysis

Automatic Essay Scoring

Automatic Speech Recognition

Causal Commonsense Reasoning

Clickbait Detection

Constituency Parsing

Coreference Resolution

Cross-Lingual Abstractive Summarization

Dependency Parsing

Dialect Identification

Emotion Classification

Fact Checking

Hate Speech Detection

Image Captioning

Image Captioning & Generation

Keyword Extraction

Knowledge Base

Language Modeling

Legal Classification

Lexical Normalization

Machine Translation

Morphological Inflections

Multilingual Speech-To-Speech Translation

Multilingual Word Sense

Named Entity Recognition

Natural Language Inference

Next Tweet Prediction

Optical Character Recognition

Open-Domain Dialogue System

Paraphrasing

POS Tagging

Question Answering

Question Answering (Extractive)

Question Generation

Semantic Textual Similarity

Sentiment Analysis

Short Answer Grading

Speech-To-Text Translation

Spoken Language Understanding

Stance Detection

Style Transfer & Paraphrasing

Summarization

Tweet Ordering

Word Sense Disambiguation

Other:

Dataset modality *

Text

Speech

Image

Other:

Dataset domain *

Banking

Books

Commentary

General

Hotel reviews

Journalistic blog

News articles

Religion

Reviews

Social media

Transcribed audio

Other:

Dataset description *

A brief (3-4 sentences long) description of the dataset. For example, IndoNLU TermA's description is: "The TermA span-extraction dataset is collected from the hotel aggregator platform, AiryRooms. The dataset consists of thousands of hotel reviews, which each contain a span label for aspect and sentiment words representing the opinion of the reviewer on the corresponding aspect. The labels use Inside-Outside-Beginning (IOB) tagging representation with two kinds of tags, aspect and sentiment."

Dataset license *

Apache 2.0

BSD

CC-BY 2.0

CC-BY 3.0

CC-BY 4.0

CC-BY-NC 2.0

CC-BY-NC-ND 4.0

CC-BY-NC-SA 4.0

CC-BY-SA 2.0

CC-BY-SA 3.0

CC-BY-SA 4.0

CC0

CDLA-Permissive 1.0

GPL 2.0

LDC User Agreement

LGPL 3.0

MIT

ODbL 1.0

Unknown

Other:

Publish year *

Year in which the dataset or paper was published.

Dataset language *

If you choose "Other", please write the ISO 639-3 code.
You can check the language code at https://iso639-3.sil.org/code_tables/639/data

Indonesian (ind)

English (eng)

Sundanese (sun)

Lampung Nyo (abl)

Javanese (jav)

Minang (min)

Madurese (mad)

Batak Toba (bbc)

Ngaju (nij)

Buginese (bug)

Balinese (ban)

Other:

Dialect

More information on the language(s) used; the more complete, the better. For example: Jawa Ngapak, Cirebon.

Dataset collection style *

Crawling

Crawling and annotation (other)

Crawling and annotation (translation)

Crawling and machine filtering

Human translation

Human translation and annotation

Machine and human translation

Machine translation

Manual curation

Unknown

Other:

Dataset quantity/volume *

For example, IndoNLU TermA has 5k documents, so the input for this field should be: "5000".

Dataset unit *

For example, IndoNLU TermA has 5k documents, so the input for this field should be: "documents".

documents

forms

headlines

hours

others

sentence pairs

sentences

tokens

tweets

utterances

Other:

Ethical risks *

Social media datasets are considered medium risk because they might expose personal information. Datasets that might also contain hate speech are considered high risk.

Is the dataset split? *

Whether the dataset is split into train/val/test or train/val.

Yes
