IndoNLP: Nusantara Datasheets
The list of available datasheets can be accessed at https://indonlp.github.io/nusa-catalogue/.
Before filling in this form, please check whether the dataset is already in the list.
Email *
Dataset name *
For example: NusaSenti, IndoNLU BaPOS, Indonesian Clickbait, IndoNLG TED En-Id, etc.
Dataset URL *
Direct link to the dataset repository.
Dataset's HuggingFace URL
Link to the HuggingFace version of the dataset (if present). For example: https://huggingface.co/datasets/indonlu.
Dataset subset
The relevant subset of the dataset (only if the dataset is split by language or dialect). For example, for NusaSenti, this field can be filled in with one of: "Indonesian", "English", "Acehnese", "Balinese", "Banjarese", or "Buginese".
Dataset task *
Dataset modality *
Dataset domain *
Dataset description *
A brief (3-4 sentences long) description of the dataset. For example, IndoNLU TermA's description is: "The TermA span-extraction dataset is collected from the hotel aggregator platform, AiryRooms. The dataset consists of thousands of hotel reviews, which each contain a span label for aspect and sentiment words representing the opinion of the reviewer on the corresponding aspect. The labels use Inside-Outside-Beginning (IOB) tagging representation with two kinds of tags, aspect and sentiment."
Dataset license *
Publish year *
Year in which the dataset or paper was published.
Dataset language *
If you choose others, please write the ISO 639-3 code.
You can check the language code at https://iso639-3.sil.org/code_tables/639/data
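As a rough aid when entering a code by hand, the sketch below checks only that a string has the shape of an ISO 639-3 identifier (exactly three lowercase ASCII letters). The function name `looks_like_iso639_3` is hypothetical, and the check does not confirm that the code is actually assigned; that still requires looking it up in the code table linked above.

```python
import re

# ISO 639-3 identifiers are exactly three lowercase ASCII letters.
# This is a shape check only (an assumption-level helper for illustration);
# it does NOT verify that the code exists in the official registry.
_ISO639_3_SHAPE = re.compile(r"[a-z]{3}")


def looks_like_iso639_3(code: str) -> bool:
    """Return True if `code` has the three-lowercase-letter shape of an ISO 639-3 code."""
    return _ISO639_3_SHAPE.fullmatch(code) is not None


if __name__ == "__main__":
    # "jav" (Javanese) and "ace" (Acehnese) have the right shape;
    # "id" is a two-letter ISO 639-1 code and "JAV" is not lowercase.
    for candidate in ["jav", "ace", "id", "JAV"]:
        print(candidate, looks_like_iso639_3(candidate))
```

A two-letter code such as "id" fails the check, which is a common slip when the ISO 639-1 code is more familiar than the three-letter one.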
Dialect
More information on the language variety or varieties used. The more complete, the better. For example: Jawa Ngapak, Cirebon.
Dataset collection style *
Dataset quantity/volume *
For example, IndoNLU TermA has 5k documents, so the input for this field should be: "5000".
Dataset unit *
For example, IndoNLU TermA has 5k documents, so the input for this field should be: "documents".
Ethical risks *
Social media datasets are considered medium-risk because they might expose personal information. Datasets that might contain hate speech are considered high-risk.
Is the dataset split? *
Whether the dataset is split into train/val/test or train/val.