In Effective Testing for Machine Learning Systems (https://www.jeremyjordan.me/testing-ml/), we explored how machine learning testing is different from machine learning evaluation.
While in ML evaluation we're concerned with model performance (e.g., accuracy), in ML testing we focus on the model's learned behaviour. Is it behaving the way we expect it to behave? For example:
- Sentiment classifier models should be invariant to the names of the people mentioned
- Road segmentation models should work regardless of weather conditions
- Phishing probability shouldn't go down if the URL changes from https to http
- Fraud probability on product reviews shouldn't depend on customer gender
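
Expectations like these can be written as ordinary test functions. The sketch below is only an illustration: `toy_sentiment` and `toy_phishing_score` are made-up stand-ins for real models, and the helper names, template, and URL are assumptions, not taken from any library. It shows an invariance check (the prediction should not change when only the name changes) and a directional check (the score should not drop when https becomes http).

```python
from typing import Callable, Iterable


def toy_sentiment(text: str) -> str:
    """Hypothetical stand-in for a real sentiment classifier."""
    positive = {"great", "good", "love"}
    negative = {"bad", "awful", "hate"}
    words = [w.strip(".,!?").lower() for w in text.split()]
    score = sum(w in positive for w in words) - sum(w in negative for w in words)
    return "positive" if score >= 0 else "negative"


def toy_phishing_score(url: str) -> float:
    """Hypothetical stand-in for a real phishing model; returns a value in [0, 1]."""
    score = 0.2
    if url.startswith("http://"):
        score += 0.3  # unencrypted scheme is treated as riskier
    if "login" in url:
        score += 0.2
    return min(score, 1.0)


def check_name_invariance(
    predict: Callable[[str], str], template: str, names: Iterable[str]
) -> bool:
    """Invariance test: swapping the name must not change the predicted label."""
    predictions = {predict(template.format(name=name)) for name in names}
    return len(predictions) == 1


def check_https_to_http_not_safer(
    score: Callable[[str], float], https_url: str
) -> bool:
    """Directional test: downgrading https to http must not lower the score."""
    http_url = "http://" + https_url[len("https://"):]
    return score(http_url) >= score(https_url)


# Example usage with the toy models above.
assert check_name_invariance(
    toy_sentiment, "{name} was great to work with.", ["Alice", "Bob", "Priya"]
)
assert check_https_to_http_not_safer(toy_phishing_score, "https://example.com/login")
```

With a real model, the same two helpers would take the model's prediction function in place of the toy ones. The weather and gender scenarios follow the same invariance pattern, with the perturbation applied to the image or the customer attribute instead of the name.
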
To better understand how to test for this, we're collecting such scenarios via this survey and will share what we learn.