Machine learning’s crumbling foundations; Doing ‘data science’ with bad data.
ML is rife with all forms of statistical malpractice, AND it’s being used for high-speed, high-stakes automated classification and decision-making, as if it were a proven science whose professional ethos had the sober gravitas you’d expect from, say, civil engineering.
Civil engineers spend a lot of time making sure the buildings and bridges they design don’t kill the people who use them. Machine learning?
Hundreds of ML teams built models to automate covid detection, and every single one was useless or worse.
https://pluralistic.net/2021/08/02/autoquack/#gigo
The ML models failed because their builders did not observe basic statistical rigor. One common failure mode?
Treating data that was known to be of poor quality as if it were reliable, simply because good data was not available.
Obtaining good data and/or cleaning up bad data is tedious, repetitive grunt-work. It’s unglamorous, time-consuming, and low-waged. Cleaning data is the equivalent of sterilizing surgical implements — vital, high-skilled, and invisible unless someone fails to do it.
It’s work performed by anonymous, low-waged adjuncts to the surgeon, who is the star of the show and who gets credit for the success of the operation.