This workshop explores how web archives collections can be
described using the Datasheets for Datasets framework.
Significant work in web archives scholarship focuses on the
description and provenance of collections and their data. Looking beyond the
worlds of libraries, archives and cultural heritage can provide valuable
alternative approaches, which we can experiment with and use. Datasheets for
Datasets is a method from the field of machine learning for describing large
datasets. It uses a standard set of questions arranged by stages of the data
lifecycle.
During this workshop, participants will discuss how web archives
collections can be described using the Datasheets for Datasets framework,
specifically a datasheet template that is arranged into nine sections. This
template asks questions about a dataset, focusing on the specific needs of
machine learning researchers. More information on these questions can be found
here: https://www.microsoft.com/en-us/research/project/datasheets-for-datasets/
Participants will consider how these questions can be
adopted for the purposes of describing web archives datasets, assessing
how each question might be adapted and applied to describe datasets
from UK Web Archive curated collections.
After a description of the Datasheets for Datasets
framework, there will be a group card-sorting exercise. Each group will
evaluate a set of questions using the MoSCoW technique, sorting them into
categories of Must, Should, Could, and Won’t have. Groups will report back on
this task via a facilitated discussion about the priorities and resources
available for generating descriptive metadata and documentation for public web
archives datasets.
About the Instructors:
Emily Maemura is an Assistant Professor in the School of
Information Sciences at the University of Illinois Urbana-Champaign. Her
research focuses on data practices and the activities of curation, description,
characterization, and re-use of archived web data.
Helena Byrne is the Curator of Web Archives at the British
Library. She was the Lead Curator on the IIPC Content Development Group 2022,
2018 and 2016 Olympic and Paralympic collections.
These workshops will be held in person only due to the
format of the activity, and they will not be recorded.