Responsible Data Science Is a Kind of Stewardship

Every time someone interacts with the digital world, they leave behind a digital trail. And it accumulates: in 2019 alone, nearly 40,000 exabytes of information will be generated, hoarded, and processed by data scientists. This torrent of information is fed into enormously complex algorithms that drive credit decisions and hiring, determine the length of jail sentences, schedule clinical interventions, and feed digital addictions.

Compounding the problem, much of the information we produce is being processed by artificially intelligent (self-optimizing) algorithms too complex for any individual to understand. In effect, many high-stakes decisions are being made in a “black box” environment: no one can really explain why a decision was made; they just see the result. In response, the University of California Santa Cruz, and the Baskin School of Engineering in particular, has positioned itself at the forefront of a movement akin to the one kicked off by Rachel Carson’s Silent Spring in 1962.

“Responsible data science means understanding the algorithms we’re building—both their power and limitations—and understanding the situations in which they’ll be deployed and the potential ramifications of the ways they’ll be used,” UC Santa Cruz Computer Science and Engineering Professor Lise Getoor said.

Getoor is one of the leading data science researchers working today. She works with probabilistic modeling over richly structured networks and heterogeneous data: in short, she creates algorithms capable of interpreting enormously complicated databases without disrupting the intricate relationships nested within them.

“Aside from accuracy and better outputs, the power of our methods is that they are not as simplistic as many of the existing approaches, and they can reason about context… How this connects to ethical data science is that it is reasoning about these complex interdependencies and relational context that’s oftentimes not done in current, simplistic approaches to artificial intelligence and data science algorithms,” said Getoor.
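As a rough illustration of what relational context buys (a toy Python sketch, not Getoor’s actual methods, with invented names and weights), consider collective classification: instead of treating each person’s score as independent, a model revises it using the scores of their neighbors.

    # Hypothetical data: an independent model's risk scores plus a friendship graph.
    friends = {"ann": ["bob"], "bob": ["ann", "cal"], "cal": ["bob"]}
    scores = {"ann": 0.9, "bob": 0.5, "cal": 0.2}

    # One smoothing pass: blend each score with the average of its neighbors' scores.
    smoothed = {
        person: 0.5 * scores[person] + 0.5 * sum(scores[f] for f in fs) / len(fs)
        for person, fs in friends.items()
    }
    print(smoothed)  # bob's ambiguous 0.5 is pulled by ann (0.9) and cal (0.2)

A flat model would score bob at 0.5 and stop; the relational pass uses who bob is connected to, which is exactly the kind of context that both improves accuracy and raises new ethical questions.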

An algorithm is a set of instructions, in this case for a computer to follow (in other words, it’s software or a computer program, as opposed to a ritual, dance moves, or a menu). At a casual glance, an algorithm might seem entirely divorced from ethical considerations. After all, an algorithm takes an input, applies a series of instructions, and generates an output. However, the output of an algorithm is only as good as its input; good output can easily be misinterpreted or misunderstood, and any decision-making that results from analyzing an algorithm’s output can just as easily be taken out of context or misused. Even the act of choosing to turn something into an algorithm carries ethical considerations.
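To make that concrete, here is a minimal, hypothetical sketch of the input-instructions-output pipeline; the feature names, weights, and threshold are invented for illustration and come from no real credit model.

    def credit_decision(income, late_payments, threshold=0.5):
        """The 'instructions': a toy scoring rule with made-up weights."""
        score = 0.7 * min(income / 100_000, 1.0) - 0.3 * min(late_payments / 5, 1.0)
        return "approve" if score >= threshold else "deny"

    # The output is only as good as the input: if late_payments reflects uneven
    # record-keeping rather than actual behavior, the decision inherits that bias.
    # The threshold itself is a value judgment, not a fact.
    print(credit_decision(income=90_000, late_payments=0))  # approve
    print(credit_decision(income=60_000, late_payments=4))  # deny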

Statistics Professor Abel Rodriguez is a former lawyer turned engineering professor. Along with Getoor, he has been one of the guiding lights of UCSC’s responsible data science efforts, which include being part of the interdisciplinary team working on a CITRIS-funded project called “The Uses and Abuses of Data and Analytics in Higher Education.”

“Scientists can sometimes have a bit of a blind spot when it comes to history,” Rodriguez said. “Take image recognition. It’s become extremely powerful and extremely popular. There’s been a resurgence of some ideas that have to do with using recognition to identify certain types of behavior, sexual preferences, for example, or IQ, and it’s incredible how much credibility these ideas have in the tech world today when you think back to the 19th century and Francis Galton creating composite photos of what a criminal looks like. That was a precursor to genetic determinism, which itself was a precursor to the atrocities of World War II.”

Rodriguez says he tries to infuse the introductory statistics courses at UC Santa Cruz with case studies that describe the good and bad aspects of data science. Particularly as a former lawyer, he says, it’s important to recognize there is a significant difference between what is legal and what is ethical, especially when it comes to corporate uses of big data.

Rodriguez was recently invited to give the plenary talk at the Lawrence Livermore National Laboratory (LLNL) Data Science Institute session. There, he cast the issue in dramatic terms. “We are at the intersection of big data, powerful algorithms, and a lot of computing power while trying to use these tools in a societal context,” Rodriguez said.

“This confluence could help alleviate poverty, save the planet, or fight crime. But a lot of risks come with these tools, and we have to balance the benefits and harms.”

Some of the most eye-opening experiences for Rodriguez have come from looking at the ways educators are beginning to use data. Universities and school districts capture enormous amounts of information about their students and consequently can make predictions about their behavior.

“There’s been a push to use some of this data for different purposes,” Rodriguez said. “For example, to determine the likelihood that a student will complete their chosen degree in a given amount of time. And what do you do with this information? If the probability is low, do you tell them to quit, or do you provide interventions that help? And how do you decide who gets the help?”
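A hedged sketch of the kind of prediction Rodriguez describes: the logistic model below, its coefficients, and the 0.5 cutoff are all invented placeholders; the ethics lives in the policy attached to the number, not in the arithmetic.

    import math

    def completion_probability(gpa, credits_per_term):
        # Toy logistic model; coefficients are invented for illustration.
        z = -6.0 + 1.2 * gpa + 0.15 * credits_per_term
        return 1 / (1 + math.exp(-z))

    p = completion_probability(gpa=2.8, credits_per_term=12)
    # The same number can justify very different responses:
    action = "offer tutoring and advising" if p < 0.5 else "no intervention"
    print(f"estimated completion probability: {p:.2f} -> {action}")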

Certain campuses (this doesn’t include UC Santa Cruz, according to Rodriguez) are starting to look at location data. “For example, say you know a student isn’t going to class because you’re tracking his or her cellphone, what do you do? Do you send someone to intervene? Do you start calling them? And that’s the benign side.”

The University of Alabama notoriously began using a casino-like rewards system to track whether students would sit through entire football games, punishing those who left losing games early by restricting their access to playoff tickets.

Rodriguez concluded his talk at LLNL by encouraging data scientists to aim for transparency, trust, and reproducibility in their work. He also emphasized the importance of openness in data research, while asking data scientists to consider whether it is always the best option.

“There is a conflict between ethical science and open science—there are certain senses in which you have to be careful about what you are sharing and the consequences of what you are sharing. The obvious example is modifying viruses—do you want to share ways of making flu viruses more lethal in the public domain?” Rodriguez said. 

A more subtle consideration is that of facial recognition data sets.

“Researchers needed very large data sets to create accurate facial recognition, and those data sets are hard to come by, so the ones that were generated have been widely shared,” Rodriguez said, “without necessarily asking the people in those databases for permission to use their images. Given that this same technology has been used to create some of the surveillance technology used in China to identify and persecute Uyghurs, there are real questions that need to be asked about how much responsibility you bear when you collect data without permission and it gets picked up by someone else.”

Getoor advocates for a design-thinking approach to data science. “Right now we have predictive algorithms that take input and spit out a number. What we want are tools that provide affordances for a more constructive dialog: more what-ifs, something that would allow for a more constructive use of the system; more comparative modeling, giving us an outcome under certain assumptions rather than simply stating a deterministic, zero/one kind of answer.”
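One way to picture the contrast (a speculative sketch, not a tool Getoor has built; the scenarios and numbers are invented) is an interface that reports estimated outcomes under several explicit assumptions instead of a single verdict.

    def estimate(assumption):
        # Stand-in for a real model: returns a probability under a stated
        # assumption rather than a 0/1 verdict. All numbers are placeholders.
        base = 0.45
        bumps = {"status quo": 0.0, "with tutoring": 0.20, "reduced course load": 0.10}
        return min(1.0, base + bumps[assumption])

    for scenario in ("status quo", "with tutoring", "reduced course load"):
        print(f"{scenario}: estimated success probability {estimate(scenario):.2f}")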

She says that the time has come for a partnership between technologists, humanists and social scientists. “We need to have more humanists and social scientists at the table during the design process; it will completely reshape the way these things are designed and it’s something that UCSC can really bring to the table.”

Another important consideration for what’s sometimes referred to as data dignity would be reforming the legal system. “One of the challenges is that the legal system still has an antiquated notion of individual versus corporate data ownership, but so much of the data we generate these days has multiple players involved.”

Getoor uses genomic data as a simple example. Do a child’s parents own the right to share a child’s genome? What about the hospital? “We need new ways of conceiving data ownership and access and perhaps even ways to be rewarded if our data is being used for profit.”

Data ownership is closely tied to identity and notions of privacy. “Traditionally there’s been a centralized notion of identity, where some centralized agency controls it: think Social Security numbers or some other single ID number. The other extreme is that individuals should have complete ownership of their data. The challenge is that neither of those is tenable. The single, centralized identity can be hacked and has many points of failure, and realistically, being able to manage and control access to all of your own information is challenging.”

The key takeaway from their efforts in developing ethical data science is that human beings need to maintain their autonomy in the process. The way to move beyond fearing dystopias or fantasizing about techno-utopias is to make sure people are aware of the capabilities and limitations of algorithms, and to avoid giving up control, especially when the most difficult kinds of decisions are being made, because that is when the temptation to turn things over to the machines is greatest, and where doing so is most dangerous.

While data scientists have yet to achieve the swashbuckling status of prospecting petroleum geologists hacking their way through Amazonian jungles or grinding through thousands of feet of rock, the Silicon Valley cliché that Big Data has become the “New Big Oil” has a ring of truth. Data is a mysterious, valuable commodity that suddenly seems essential to our economic existence, yet the more it takes root, the more sinister it seems.

Perhaps the best way to engage with an increasingly data-saturated age is to treat individuals, their data, and the algorithms probing, sifting, and judging them with the dignity that ought to be afforded to any other creature we share our environment with. Strip-mining humanity’s data dignity is a short-term solution at best; the only way truly forward is careful stewardship.