IndoNLP: Nusantara Datasheets
The list of available datasheets can be accessed at https://indonlp.github.io/nusa-catalogue/.
Before filling out this form, please check whether the dataset is already in the list.
* Indicates required question
Email *
Dataset name *
For example: NusaSenti, IndoNLU BaPOS, Indonesian Clickbait, IndoNLG TED En-Id, etc.
Dataset URL *
Direct link to the dataset repository.
Dataset's HuggingFace URL
Link to the HuggingFace version of the dataset (if present). For example: https://huggingface.co/datasets/indonlu.
Dataset subset
The relevant subset of the dataset (only if the dataset is split by dialect). For example, for NusaSenti, fill in this field with one of: "Indonesian", "English", "Acehnese", "Balinese", "Banjarese", or "Buginese".
Dataset task *
Abstractive Summarization
Aspect Based Sentiment Analysis
Automatic Essay Scoring
Automatic Speech Recognition
Causal Commonsense Reasoning
Clickbait Detection
Constituency Parsing
Coreference Resolution
Cross-Lingual Abstractive Summarization
Dependency Parsing
Dialect Identification
Emotion Classification
Fact Checking
Hate Speech Detection
Image Captioning
Image Captioning & Generation
Keyword Extraction
Knowledge Base
Language Modeling
Legal Classification
Lexical Normalization
Machine Translation
Morphological Inflections
Multilingual Speech-To-Speech Translation
Multilingual Word Sense
Named Entity Recognition
Natural Language Inference
Next Tweet Prediction
Optical Character Recognition
Open-Domain Dialogue System
Paraphrasing
POS Tagging
Pos Tagging
Question Answering
Question Answering (Extractive)
Question Generation
Semantic Textual Similarity
Sentiment Analysis
Short Answer Grading
Speech-To-Text Translation
Spoken Language Understanding
Stance Detection
Style Transfer & Paraphrasing
Summarization
Tweet Ordering
Word Sense Disambiguation
Other:
Dataset modality *
Text
Speech
Image
Other:
Dataset domain *
Banking
Books
Commentary
General
Hotel reviews
Journalistic blog
News articles
Religion
Reviews
Social media
Transcribed audio
Other:
Dataset description *
A brief description of the dataset (3-4 sentences). For example, IndoNLU TermA's description is: "The TermA span-extraction dataset is collected from the hotel aggregator platform, AiryRooms. The dataset consists of thousands of hotel reviews, which each contain a span label for aspect and sentiment words representing the opinion of the reviewer on the corresponding aspect. The labels use Inside-Outside-Beginning (IOB) tagging representation with two kinds of tags, aspect and sentiment."
Dataset license *
Apache 2.0
BSD
CC-BY 2.0
CC-BY 3.0
CC-BY 4.0
CC-BY-NC 2.0
CC-BY-NC-ND 4.0
CC-BY-NC-SA 4.0
CC-BY-SA 2.0
CC-BY-SA 3.0
CC-BY-SA 4.0
CC0
CDLA-Permissive 1.0
GPL 2.0
LDC User Agreement
LGPL 3.0
MIT
ODbL 1.0
Unknown
Other:
Publish year *
Year in which the dataset or paper was published.
Dataset language *
If you choose "Other", please write the ISO 639-3 code.
You can check the language code at https://iso639-3.sil.org/code_tables/639/data
Indonesian (ind)
English (eng)
Sundanese (sun)
Lampung Nyo (abl)
Javanese (jav)
Minang (min)
Madurese (mad)
Batak Toba (bbc)
Ngaju (nij)
Buginese (bug)
Balinese (ban)
Other:
Dialect
More information on the language(s) used. The more complete, the better, e.g. Jawa Ngapak, Cirebon.
Dataset collection style *
Crawling
Crawling and annotation (other)
Crawling and annotation (translation)
Crawling and machine filtering
Human translation
Human translation and annotation
Machine and human translation
Machine translation
Manual curation
Unknown
Other:
Dataset quantity/volume *
For example, IndoNLU TermA has 5k documents, so the input for this field should be: "5000".
Dataset unit *
For example, IndoNLU TermA has 5k documents, so the input for this field should be: "documents".
documents
forms
GB
headlines
hours
MB
others
sentence pairs
sentences
TB
tokens
tweets
utterances
Other:
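The quantity and unit fields above take a plain number and a unit separately. As a rough illustration only, the sketch below splits a shorthand size such as "5k documents" into those two values. The helper name `split_size` and the supported suffixes ("k", "m") are assumptions made for this example; they are not part of the form.

```python
import re

# Hypothetical helper (not part of the form): splits a shorthand size like
# "5k documents" into the quantity and unit values the form asks for.
_MULTIPLIERS = {"": 1, "k": 1_000, "m": 1_000_000}


def split_size(text: str) -> tuple[str, str]:
    # number, optional k/m suffix, whitespace, then the unit
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([kKmM]?)\s+(.+?)\s*", text)
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    number, suffix, unit = match.groups()
    quantity = round(float(number) * _MULTIPLIERS[suffix.lower()])
    return str(quantity), unit


print(split_size("5k documents"))  # ('5000', 'documents')
```

For the IndoNLU TermA example, this yields "5000" for the quantity field and "documents" for the unit field.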
Ethical risks *
Social media datasets are considered medium risk because they might expose personal information. Datasets that might also contain hate speech are considered high risk.
Low
Medium
High
Is the dataset split? *
Whether the dataset is split into train/val/test or train/val.
Yes
No