FACES OF BIG DATA
How the UCLA College is harnessing a titanic new engine of equity and discovery
By Michael Agresta and Jonathan Riggs | Art by Mimi Chao J.D. ’09 / Mimochai Studio
Can an algorithm help improve outreach to prevent homelessness? A new Los Angeles County program in collaboration with UCLA data scientists at the California Policy Lab is betting $14 million that the answer is yes — the groundbreaking new Homelessness Prevention Unit uses predictive analytics to identify individuals most at risk.
Till von Wachter, professor of economics and faculty director of the California Policy Lab’s UCLA site, explains how the data scientists painstakingly linked 500 different factors of previously siloed, anonymized client data from eight different county agencies. Individuals whose resulting risk profiles closely align with those of previous clients who have become homeless are flagged to receive individualized care from social workers housed in the new L.A. County prevention unit.
“We’re looking for a needle in the haystack,” von Wachter says. “We use a big data approach, machine learning. It’s an ideal example of how cutting-edge data science research and the use of very, very large data sources can really make a difference on the ground and lead to something tangible.”
The approach represents an epochal transformation in the way science, policy work and social inquiry at the UCLA College are changing the world in the era of big data. From fighting disease to predicting wildfires to calculating the true costs of mass incarceration, complex data science touches every discipline, animates vital new conversations and illuminates long-sought discoveries.
A Data-Driven World
In the past academic year, the UCLA Department of Statistics became the UCLA Department of Statistics and Data Science — but not because of any changes at the departmental level. Rather, the new name reflects increased recognition in an evolving world of the importance of data to everyone and everything. In this sense, the world has evolved toward the department’s view.
“This is not some trend we’re reacting to — we’ve been trying to learn from data since before data was big,” says Mark S. Handcock, professor of statistics. “These questions have always existed, and we’ve always been exploring them.”
The name expansion underscores the fact that these explorations transcend academic disciplines, adds Mario Bonk, professor and chair of the department of mathematics.
“This doesn’t mean that statistics has the monopoly on data science,” he says. “It just means that there are more opportunities for interdisciplinary collaborations on a wider, more impactful scale.”
Proof positive: UCLA’s data theory major, which was established in 2019 and spans math as well as statistics and data science. Now one of the most popular majors in either department, it is primarily intended for students who wish to gain a deeper understanding of the principles underlying data science so they can build tools, theories and processes to further the field.
“It’s an innovative major that is very different from the cookie-cutter data science majors cropping up at universities all over the country. We’ve given it a more theoretical focus,” says Mason Porter, professor of mathematics. “We’ve created a lot of innovative new courses for it, including data-driven modeling of complex systems, the societal impact of data and an ‘Experience of Data Science’ capstone in which students work on teams with data from an industrial partner.”
In 2021, UCLA also established a social data science minor — created by Handcock — for students majoring in social science disciplines. The demand for such courses and expanded programming reflects the real-world need for graduates who can work with data — a need that continues to grow stronger.
“In all the rankings, like in U.S. News & World Report, statisticians and data scientists are at or near the top of the best jobs,” says Hongquan Xu, professor and chair of the UCLA Department of Statistics and Data Science. “The job opportunities are just incredible. At our ‘Data Theory in the World’ seminar, people in the industry shared with our students that for every graduate in this field, there are at least five job offers.”
It makes sense. After all, applications of data, and of the synergy between math and statistics, now touch nearly every part of society. Take, for example, the attempt to successfully launch self-driving cars.
“This has only become possible by new methods that are able to process huge data in real time,” says Bonk. “And everything that’s under the hood consists of powerful mathematical methods.”
That said, there is a tendency by the media and the general public to reduce these types of efforts, initiatives, research and breakthroughs to the simplest terms.
“I think that people use the term ‘big data’ without really knowing what they mean by it. People ought to focus on whether data is ‘good,’ ‘bad,’ ‘ugly’ or ‘useful.’ Any of these adjectives can apply to ‘big data,’ and any of them can apply to ‘small data,’” Porter continues. “Rather than focusing on whether or not data is large, we should be focusing on analyzing data well and analyzing it responsibly with respect both to scientific rigor and to societal impact.”
This thoughtful, grounded approach is especially important in light of how mighty a force data truly is.
“Data is the new electricity. It’s moved from being seen as an incredible, magical thing that we can’t possibly understand to something that we can at least observe in terms of its enormous energy and use,” says Handcock. “And then, over time, we forget about the incredible power of it. We switch on our lights, our computers; everything just runs in the background.
“We’re going through the same things with data — soon enough, we won’t be talking about big data anymore,” Handcock concludes. “It will just be part of everything that we all do.”
Public and Planetary Health
“I’m definitely very happy about this data revolution that’s happening, where people are taking data analysis seriously,” says Rick Paik Schoenberg, professor of statistics. “It’s obvious both at UCLA and beyond.”
He finds this true both in his teaching and in his research, which has involved using data to create forecasting models for earthquakes, wildfires, crime and even disease.
For some of this work, he has partnered with Andrea Bertozzi, UCLA’s Betsy Wood Knapp Chair for Innovation and Creativity, director of applied mathematics and distinguished professor of mathematics and mechanical and aerospace engineering. When the COVID-19 pandemic began, the two were asked to volunteer on a Los Angeles County Department of Public Health committee to forecast the number of hospital beds, personnel and equipment that would be needed per day.
“It was a remarkable time — thrilling but also scary — for us to see the mathematical modeling making an impact at that level,” Bertozzi says, “because of the high stakes of what the county would decide after taking the information we provided into account.”
Bertozzi and Schoenberg were able to get many graduate and even undergraduate students involved in this research. And the pair has continued to push forward, both together and individually, on other projects that use similar tools and approaches.
“I have a component of my research that involves sorting through large batches of data, sometimes incorporating active learning where a human is involved in the algorithm,” Bertozzi says. “For example, we’re working on remote sensing with scientists at Los Alamos National Lab, and they’re interested in detecting surface water in remote areas like the Arctic. This is a big issue if you’re looking at global warming — how can you take these data and predict where water and other resources will be?”
This type of research dovetails with work being done by Karen McKinnon, an assistant professor in both UCLA’s department of statistics and data science and the Institute of the Environment and Sustainability.
“There has been an explosion of tools, most using open-source software, which reveal new avenues for data analysis. The challenge is determining which tool is the right one for the scientific question of interest,” says McKinnon. “I am most excited about big data methods that have the potential to allow us to learn something new about the climate system. The most exciting future directions are at the interface of physics and machine learning, wherein physical constraints are used in tandem with purely data-driven models.”
Using statistical methods to bridge the gap between simulations and observations can provide much-needed insight into better understanding how Earth’s climate has already changed — and how it may continue to do so.
“Much of what we know about the environment is from measurements that we have taken across time and space, and statistics can help us make sense of this data,” McKinnon adds. “Across the UCLA Division of Physical Sciences and beyond, there is a huge number of faculty using data to do interesting, cutting-edge work about climate science and human responses to climate change.”
The eagerness of students to get involved in this field, and the potential they see in it to make a big difference for the future of the world, hits home with Schoenberg every day in his role as director of the department’s master of applied statistics and data science program.
“We’re getting more than 400 applications a year for this one program alone, our alumni are getting good jobs and a lot of companies are wanting to partner with us,” Schoenberg says. “It’s definitely an exciting time to be a statistician.”
New Frontiers
Big data has countless uses in our daily lives, but its reach also stretches far beyond our planet.
“Space physics has so much satellite data and observations, it’s getting harder to go through it in the traditional way,” says Jacob Bortnik, professor of atmospheric and oceanic sciences as well as the faculty director of the UCLA Space Physics and Planetary Sciences, Applications, Communication and Engineering (SPACE) Institute. “You need something a lot more sophisticated to pick up subtle patterns, and machine learning and AI are exactly those kinds of tools.”
Bortnik and his team use these tools so frequently, in fact, that he authored a how-to article in 2021 for Eos on using machine learning in Earth and space sciences. One of the tools’ main uses has been helping the team reconstruct 3D dynamic models of inner space (the area between the Earth’s upper atmosphere and geosynchronous Earth orbit) to predict and respond to space weather, which can involve electromagnetic fields that directly affect the performance of technology in space and on Earth.
For example, in the last year, about 40 SpaceX Starlink satellites were hit by a geomagnetic storm that sent them falling back to Earth in, as Bortnik puts it, “a spectacular billion-dollar display of light.” And you don’t have to be a billionaire CEO to be impacted; events like these can also affect the lives of everyday citizens via power grids, internet cables, GPS and even credit card systems.
Another way to comb through massive quantities of space-related data is to empower citizen scientists to aid in its parsing. Jean-Luc Margot, a professor of Earth, planetary, and space sciences and of physics and astronomy, is also the lead researcher of the “Are we alone in the universe?” project for the UCLA SETI group, which allows interested community members from every walk of life to join in the search for extraterrestrial intelligence by classifying radio signals.
Similarly, Emmanuel Masongsong, program manager for UCLA’s Experimental Space Physics Group, has joined forces with an international team to help launch HARP, or Heliophysics Audified: Resonances in Plasmas, where volunteers can help NASA scientists potentially discover plasma waves.
“HARP employs a simple web interface to make space weather more tangible, converting magnetic data into sound. It empowers students and the public to use their senses to pick out complex or subtle patterns in the noise, helping scientists to scour through decades of satellite observations,” he says. “While analyzing real satellite data can be messy, having a hand in authentic NASA research is inspiring and exciting.”
Describing the experience as an opportunity for these citizen scientists to respond to the “music” around Earth, Masongsong calls projects like these game-changers.
“By exposing people to the exciting dynamics of our space environment, HARP validates that anyone can make contributions to science,” he adds. “We want to empower people to focus on what they feel is exciting or notable, since this broad array of experience is valuable for analyzing novel data sets.”
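The idea of turning magnetic data into sound can be illustrated with a minimal sketch. It is not HARP's actual pipeline, which the article does not describe. It assumes a plain list of field measurements and simply rescales them into a 16-bit mono WAV waveform; the function name `audify` and the 8,000 Hz playback rate are illustrative choices, and the demo series is synthetic rather than real satellite data.

```python
import io
import math
import sys
import wave
from array import array


def audify(samples, sample_rate=8000):
    """Rescale a numeric series to 16-bit PCM and return it as WAV bytes."""
    if not samples:
        raise ValueError("need at least one sample")
    # Remove the constant offset so the waveform is centered on zero.
    mean = sum(samples) / len(samples)
    centered = [s - mean for s in samples]
    # Normalize by the largest excursion (fall back to 1.0 for a flat series).
    peak = max(abs(c) for c in centered) or 1.0
    pcm = array("h", (int(round(32767 * c / peak)) for c in centered))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV files store samples in little-endian order
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 2 bytes = 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(pcm.tobytes())
    return buf.getvalue()


# Synthetic stand-in for a magnetometer series: a slow oscillation plus an offset.
demo_series = [50.0 + 3.0 * math.sin(2 * math.pi * i / 40) for i in range(800)]
demo_wav = audify(demo_series)
```

Playing a long record back at an audio sample rate compresses hours of measurements into seconds, which is what lets a listener pick out patterns by ear.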
Just as many on Earth are using big data approaches to better understand space, many are also using space-collected data to better understand conditions on Earth—and even profit from them. Bortnik mentions how a company imaging all the Walmart parking lots in the world for several weeks was able to build a model based on the cars and traffic patterns to predict net revenue—information that can be worth billions to hedge funds.
The challenge and opportunity of big data affect everyone in every field, he adds, from the recent Hollywood writers’ strike (partly inspired by the studios’ refusal to limit the contributions of AI-generated creative content) to what it ultimately means to contribute artistically and scientifically as a human.
“Everything is changing, and we’re going to have to redefine the value-add that only humans can contribute,” Bortnik says. “Students today have a ton of big challenges ahead of them, but they have more amazing tools and more data in real time than anyone ever thought was possible. Science is evolving, and data is the catalyst.”
A Celebration of Data
First held at UCLA in 2011, the annual DataFest competition sponsored by the American Statistical Association brings together undergraduates from schools across the U.S. to compete for awards and even potential recruitment by employers. Graduating senior Bruins on some of the winning teams at 2023 ASA DataFest shared why the competition, and the field, are so special.
I have been very involved in the UCLA statistics community, so I was always curious about participating in DataFest, despite not having a major in the department. I firmly believe that our team’s strength was the diverse perspectives we brought to problem-solving, based on our unique backgrounds.
The changing world of big data has a significant impact on the majority of careers. As an economics major, I gained exposure to data analysis through my econometrics classes. However, I recognized the need to complement my knowledge with further coursework in statistics and data science. The use of statistical models and big data is everywhere, from academia to industry. Every professional, regardless of specialization, has to be educated in this regard to make sure the tools that are available today are used ethically and responsibly.
—ARMAN GHAZARYAN, business economics major with a minor in statistics and data science
My biggest takeaway from DataFest was that data science doesn’t necessarily have to entail very complex modeling. Our team felt that modeling wouldn’t answer our question, so instead we did a lot of counting and making data visualizations to illustrate our conclusions. Sometimes simpler is better!
It’s very important to me that my work has meaningful value, and data science gives me the tools to understand the current state of the world and areas we can do better in.
—AVANI KANUNGO, statistics and data science major
I originally entered UCLA planning to be pre-med. Outside the classroom, I worked at an Alzheimer’s clinic, where I collected and handled patient data. I was struck by the incongruity that many fields, like medicine, collect a great deal of data, and yet professionals in these fields tend to be deficient in the skills required to analyze it. When I started taking upper division statistics classes, I was fascinated by the increasing role of statistical applications in many different industries.
With the big data revolution, I am excited about the possibility of pulling data from diverse sources to solve complex, interdisciplinary problems. In particular, I believe this revolution can transform the health care industry in terms of speed, cost and insights. In my lifetime, I envision statisticians and data scientists using advanced tools like artificial intelligence, machine learning and predictive analytics to create widely accessible, collaborative medical data hubs from disparate internal and external data sources.
I am proud to be a statistics and data science major because it is a highly quantitative, rigorous discipline that promises to deliver incredible advances across many industries to improve the quality of human life.
—PAIGE LEE, statistics and data science major, neuroscience minor
My biggest takeaway from DataFest was the need for creativity and storytelling when it comes to data. Even though we were all provided with the same data set, each group extracted something completely unique, yet somehow intertwined with the other groups. The winners of the competition not only analyzed the data thoroughly, but could also explain why the audience should care about their specific conclusions. There is a story behind the numbers, and it takes more than just simple analysis to bring it out.
After graduation, I’ll be working in the damages litigation consulting field—I never expected a litigation company needed my skills as a data analyst/statistician. I imagine that as big data becomes more readily available, legal cases will be defined by analysis, especially those that involve vast amounts of consumer data.
—KATHY NGUYEN-LY, statistics and data science, political science double major
Ongoing developments in deep learning and natural language processing open many avenues for research. As an aspiring data scientist and computational social scientist, I look forward to refining and utilizing these tools to uncover new insights of use to social scientists, policymakers and other stakeholders.
As a language that cuts across cultures and nationalities, data science helps us bridge gaps in our understanding of each other and the communities in which we live.
—LUCAS OPHOFF, statistics and data science, political science double major
My biggest takeaway from DataFest is the crucial importance of teamwork. I learned that when tackling a data insight question with hundreds of variables, there are numerous approaches to consider.
With the exponential growth of data and recent developments in chat assistant AI, there is a growing demand for individuals skilled in analyzing and interpreting large data sets. I believe that people’s interest in big data will certainly benefit my career, but it also means constantly learning the newest techniques to efficiently analyze and draw conclusions.
—RYAN WALLACE, statistics and data science major
I have enjoyed and appreciated how my major has incorporated consulting classes to better prepare us for real-life challenges in statistics and data science. On a more personal note, I have enjoyed being able to partake in a diverse community of data enthusiasts—especially at ASA DataFest!
—SARAH ZHARI, statistics and data science major
With Big Data Comes Big Responsibility
In recent years, social scientists, including UCLA’s Safiya Noble (see sidebar), have raised the alarm that if we rely on big data for social and economic purposes without heavily regulating its use, we risk reinforcing inequities.
UCLA economics professor and California Policy Lab faculty director Till von Wachter, for his part, is highly attuned to the sensitive nature of the data his group works with. Figuring out how to use highly personal data about, say, mental health treatment in a responsible, unbiased and farsighted way is a challenge that requires not just a commitment to justice but also expertise in law and data security. That complicated work is well worth it, however, with equitable research and policy as the goal.
“We’ve paid the fixed costs to create a legal framework, to have a highly secure IT infrastructure and to clean up the data,” von Wachter says of the California Policy Lab, which has been in the headlines lately for unique data-driven studies of California’s unemployment benefits system during the COVID-19 pandemic. “We also collaborate with our community advisory board for their insights on this work and we only work with anonymized data, all of which facilitates cooperation between agencies and researchers.”
When it comes to big data, discussions of the science involved can sometimes get abstruse. Its impact, however, ranges from the individual to the global. Take climate science.
Alex Hall, professor in the department of atmospheric and oceanic sciences and the Institute of the Environment and Sustainability as well as director of the Center for Climate Science at UCLA, observes that his field was one of the first to embrace big data several decades ago. Without algorithmically assisted analyses of vast troves of data, scientists never could have developed accurate next-day weather forecasts — let alone climate models that predict conditions decades or centuries from now.
What’s changed, Hall says, is the introduction of machine-learning analysis techniques. This data technology has made it possible to conduct new research, including his work on extreme precipitation events, one of the most catastrophic effects of climate change. Using artificial intelligence to detect changes in these phenomena, Hall’s team tested whether leading climate model predictions of increasing precipitation extremes were accurate. They were: When it comes to storms, the real world is behaving according to climate change projections.
“We’re using machine learning to find pretty subtle signals that would otherwise be difficult to see,” Hall says. “We’re also experimenting with different ways to use AI to address the question of the distribution of wildfire risk and enable us to make skillful predictions.”
To tackle that and other research questions, Hall can count on legions of new trainees; in 2018, UCLA became the first U.S. college to offer a climate science major. After all, a grounding in climate science is synonymous with a strong education in handling big data.
Seeing the Bigger Picture
According to Juliet Williams, professor of gender studies and chair of the UCLA social science interdepartmental program, the intellectually galvanizing rallying cry in her field in recent years has been the insistence that data itself is social.
“There have been those who have heralded the advent of the age of big data as one that will enable us to transcend human bias,” Williams says. “Finally, we’ll have a more direct and pure access to the truth of how the world works, so that we can solve problems non-ideologically. But, of course, what has quickly been discovered is that big data as often as not mirrors the biases of the social world.”
UCLA social scientists have been at the forefront of questioning big data practices in industry, government and finance, as well as generating new data science projects that serve equity and justice. One such example is the Million Dollar Hoods effort co-led by UCLA history, African American studies and urban planning professor (and MacArthur “genius grant” recipient) Kelly Lytle Hernández. Million Dollar Hoods aims to finally put an accurate price tag on mass incarceration by tracking, neighborhood by neighborhood, how much public money is spent locking up Los Angeles residents.
Professor of sociology and American Indian studies Desi Small-Rodriguez has also made headlines with her work on what she calls Indigenous “statistical erasure” through the U.S. census. Small-Rodriguez’s research tells an instructive story of how state power is expressed through data collection and analysis and explores how Indigenous nations might someday achieve the goal of “data sovereignty.”
One of the great promises of big data is that it can bring together previously siloed information for combined analysis by super-efficient algorithms. Ironically, however, the academic conversation around big data has itself long been siloed. Williams and her colleagues, including Darnell Hunt, UCLA’s executive vice chancellor and provost and former dean of social sciences, addressed this split with a new set of curricular offerings meant to bring data science, humanities and social science onto common ground at the UCLA College.
“We started to notice that students in fields like history, gender studies, Chicano studies and sociology had a very strong interest in social justice, but they weren’t necessarily taking any statistics beyond the minimum,” Williams says. “At the same time, we had lots of students in economics and political science who were getting very sophisticated quantitative and data-related training, but weren’t necessarily being given the theoretical concepts, tools and frameworks to query the social origins and impacts of data.”
Thus was born, in 2021, the UCLA Mellon Social Justice Curriculum, a $5-million investment in expanded curricular offerings aiming to bridge the gap between social inquiry and data science. The grant has allowed for the hiring of five new faculty and the development of a freshman-year cluster course on data, society and social justice.
Williams, who serves as faculty co-lead for the initiative, and her colleagues are also working on a one-year data and society master’s degree track and an undergraduate data justice scholarship.
“We want to make sure, as we’re training 21st-century UCLA graduates, that they have the full repertoire of tools necessary to realize transformative change,” Williams says. “We’re recognizing that as much as you have to have fluencies in social theory, you also have to understand the basics of how statistics work. You have to be able to work with data sets, because that’s increasingly the language in which public policy is being debated and formulated.”
A Human Touch
The same is true when it comes to the intersection of big data and the humanities. In fact, Williams’ Mellon faculty co-lead is Todd Presner, chair of the UCLA Department of European Languages and Transcultural Studies and special advisor to the vice chancellor for research. Presner is currently working on a book, “Ethics of the Algorithm: Computational Approaches to Holocaust History and Memory,” in which he examines the innovations made possible in the field via everything from natural language processing to machine learning to data visualizations.
These approaches have also made a difference in work tied to the ancient world, according to Chris Johanson, associate professor of classics and chair of digital humanities. For example, in aristocratic Roman funeral traditions, mourners would portray multiple generations of the deceased’s most notable ancestors, both real and mythological.
Johanson’s RomeLab project developed reconstructions of these funerals. Drawing on the Digital Prosopography of the Roman Republic, a searchable database of all known members of Roman society’s elite, as well as network graph visualizations of their family trees, Johanson and his students created visualizations for every aristocratic funeral that might have occurred during the entirety of the Roman Republic.
“RomeLab is just one microscopic example of how one can work with computationally actionable data in the humanities,” Johanson says. “But it shows how these tools allow students to connect closer to the people and materials of the past than they could have otherwise.”
This philosophy informs the division on a broad scale. For example, John Papadopoulos, professor of classical archaeology, history and culture, has incorporated light detection and ranging (LiDAR) data to create 3D models of an Athenian agora excavation project. And Ashley Sanders Garcia, vice chair of digital humanities, has used text mining and network analysis to recover the history of Algerian women who lived between 1567 and 1837.
Another exciting project involves work being done by Jessica Cook, a doctoral candidate in English writing her dissertation on how 19th-century mnemonics and poetry informed the conceptualization of modern computing, focusing in great detail on Ada Lovelace, the world’s first computer programmer. To access Lovelace’s archive — most of which is unpublished — Cook had to photograph all the papers in the archive and train an AI model to read Lovelace’s Victorian-era handwriting and then transcribe it.
Cook’s efforts have proved so successful that she is currently running her model on Lovelace’s entire corpus of writing and will take similar approaches to the handwriting of Lovelace’s important correspondents.
Arguably the most powerful takeaway is that this project will finally allow Lovelace’s entire body of work to become accessible to researchers who can ensure she receives the rightful credit many of her male contemporaries have enjoyed for centuries.
“This kind of large-scale digital humanities endeavor is an exciting demonstration of how big data and AI have transformed the field of literary study,” says Cook. “However, this particular project is especially poignant because Ada Lovelace’s contributions to the history of computing were the genesis of the very AI technologies that make this research possible. If Lovelace had not produced the very pieces of writing that I am transcribing, it is possible that the modern computer as we know it may also not have existed.”
Keeping the focus on humanity is key, these researchers agree.
“As much as technology has the ability to distance us from what it means to be human, it’s also able to bring us much closer — to allow us to connect strands and stories of the human experience more efficiently than ever before,” Johanson says. “It’s really exciting to think about what is possible at UCLA and beyond when north and south campus collaborate.”
Bench Science and Beyond
In the world of medical and biological research, likewise, there has been an overwhelming transformation of laboratory practices thanks to the advent of big data collection and sharing. Increasingly, biology research is done by analyzing publicly available data sets measured and deposited by any of the thousands of laboratories worldwide, says Alexander Hoffmann, professor of microbiology and immunology and founding director of the Institute for Quantitative and Computational Biosciences at UCLA.
“There is an increasing number of biologists who have never trained to hold a pipette, grow cells or stand at the lab bench,” he adds. “But they are trained in computational algorithms and workflows, and they have biological knowledge. That’s a huge shift in life sciences — we now have dry-lab scientists, in addition to the traditional wet-lab scientists.”
Dry-lab science can indeed have a huge impact in the real world. The work of Jingyi Jessica Li, professor of statistics, biostatistics, human genetics and computational medicine at UCLA, illustrates the complex path of such lifesaving medical research in the era of big data. Li’s expertise lies neither in the gathering of data from experiments nor in the application of findings to medicines and therapies, but in the in-between step of deciding which algorithms are best for parsing which of the enormous data sets now available to researchers.
“My role in this whole long process is to ensure that the analysis is rigorous,” says Li, “so we can give a proper confidence level to the findings we observe so that we are not overly optimistic, or we don’t miss important findings.”
Recently, Li published a study that may revolutionize the way differential gene expression is examined. When scientists want to determine which genes are expressed differently by healthy and sick patients in the case of, for example, liver disease, they need a statistical algorithm to help them flag gene expressions worthy of further study.
Until recently, they’ve relied heavily on a statistical measure known as the p-value, whose calculation can be mysterious, dubious and error-prone for non-statisticians. Li’s research shows that methods relying on ill-posed p-values are often deeply flawed: They turn up false discoveries or miss relevant genes. Li has designed a statistical framework, known as “Clipper,” that allows users to find differentially expressed genes using a new concept called contrast scores, which can be flexibly constructed using properly set up (experimental or in silico) negative control data, without relying on p-values.
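For readers curious how a contrast score might stand in for a p-value, here is a minimal sketch in Python. It is not the Clipper software or its interface. The function names, the toy numbers and the cutoff rule are all assumptions made for illustration, since the article does not describe Clipper’s internals. The idea: score each gene by how far its measurement under the condition of interest exceeds its measurement in the negative control, then use the genes with negative scores as a stand-in for false discoveries when choosing a cutoff.

```python
def contrast_scores(target, control):
    """One score per gene: target measurement minus negative-control measurement.

    Hypothetical helper; the subtraction is one simple way to build a contrast.
    """
    return [t - c for t, c in zip(target, control)]


def select_by_contrast(scores, fdr=0.05):
    """Return indices of genes whose score clears a data-driven cutoff.

    Assumed rule: take the smallest cutoff t at which
    (1 + number of scores <= -t) / max(number of scores >= t, 1) <= fdr.
    Genes with large negative scores act as an estimate of how many
    large positive scores arose by chance.
    """
    candidates = sorted({abs(s) for s in scores if s != 0})
    for t in candidates:
        negatives = sum(1 for s in scores if s <= -t)
        positives = sum(1 for s in scores if s >= t)
        if (1 + negatives) / max(positives, 1) <= fdr:
            return [i for i, s in enumerate(scores) if s >= t]
    return []  # no cutoff meets the requested error rate


# Toy data (made up): 40 genes with a clear signal, 60 genes with only noise.
target = [10.0] * 40 + [6.0] * 30 + [5.0] * 30
control = [5.0] * 40 + [5.0] * 30 + [6.0] * 30

scores = contrast_scores(target, control)
selected = select_by_contrast(scores, fdr=0.05)
print(len(selected))  # 40
```

In this toy case the 40 strong genes are flagged and the 60 noisy ones are not, and no p-value is computed at any step. Real data would call for far more care in building the negative control and the scores.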
Navigating these complexities in ways that are mutually intelligible to researchers working separately in dry labs around the globe is the path to achieving real medical breakthroughs in the big data era.
“How to distinguish signals from noise is the grand challenge in my field,” Li says. “Statistical modeling offers a way to make data analysis more transparent and interpretable.”
Training the Next Generation
Li, with her research team, makes use of UCLA’s leadership in the field of next-generation sequencing (NGS), a big data method for determining the sequences of DNA and RNA, often for research into genetic conditions and diseases. According to Hoffmann, UCLA has become a beacon for NGS research in part because the university excels in training young scientists in the analysis method. This is thanks largely to two projects Hoffmann oversees: the Collaboratory and the Bruins-In-Genomics (B.I.G.) Summer Research Program.
Led by molecular, cell and developmental biology professor Matteo Pellegrini and housed in UCLA’s Institute for Quantitative and Computational Biosciences, the QCBio Collaboratory is a postdoctoral training program but also so much more.
“A broad UCLA community of scientists learn from the postdocs how to handle big data, how to analyze it, what the computational workflows are that are state of the art,” Hoffmann says. “And when they are done taking the workshops and they apply their newfound skills to their data, they can engage the postdocs in a collaborative way for expert consulting.”
This commitment to training has paid off in a big way for researchers at UCLA.
“The Collaboratory was initiated over 10 years ago, and it’s had a tremendous impact,” Hoffmann adds. “It’s a key reason why UCLA has adopted NGS and other big data measurement approaches very, very rapidly, whereas many researchers in the field at other institutions have these data sets lying around that nobody knows what to do with. The Collaboratory has really been phenomenal in removing the bottleneck for analysis.”
For UCLA, however, leadership in this field so crucial to future medical breakthroughs isn’t about leaving other institutions in the dust. It’s about sharing knowledge and skills to empower a diverse rising generation of scientists. That’s the idea behind B.I.G. Summer, an eight-week summer institute in quantitative and computational biosciences open to applicants from UCLA and other institutions, often from underrepresented backgrounds. Successful applicants get free tuition and a living stipend to spend their summer learning and working on bioscience data sets.
“While we are pretty advanced at UCLA,” Hoffmann says, “there are also lots of students, lots of talent, in other institutions that are still in the process of making that transformation.”
When the next great breakthrough in curing genetic disease occurs, it will surprise no one if it happens at UCLA. But it also may well happen thanks to non-UCLA scientists who trained here for a postdoctoral year, or for a summer after graduating from a college in their home city. It may happen thanks to an applied biologist who used Li’s algorithms — perhaps without Li or her colleagues even knowing it.
Such is the bold new universe of collaborative knowledge creation available thanks to big data, which is transforming so many aspects of scientific progress and of our lives. Will big data turn out to be too much for us to handle? Not at the UCLA College, where, when the data gets big, so do the solutions.
A Voice for Data Justice
Professor Safiya Noble raises the alarm about big data practices that lead to big inequities
When some scholars write about the emergence of big data, they see potential for new cures, insights and beneficial social policies. Safiya Noble, a MacArthur fellow and professor of gender studies, information studies and African American studies at UCLA, is cognizant of those amazing possibilities, but she also sees great harms already taking place.
“In the big data economy that we’re living in, there are thousands of data brokers buying and selling data about the public 24/7,” Noble says. “And a lot of that data can often be used in discriminatory ways.”
For an example, Noble points to the loan application process. It is illegal for financial institutions to ask questions about gender, race, ethnicity and similar identity markers.
“And yet, social network data can expose your race, your gender or any other protected class,” Noble says. “Using that data in combination with an offer for a financial product would be discriminatory, but so many of these kinds of products just come into the marketplace without any oversight.”
Recently, Noble has met with the Federal Trade Commission, Consumer Financial Protection Bureau and members of the U.S. Congress to discuss big data products that could be harmful to consumers. She has also been in dialogue with major tech firms about “How Search Engines Reinforce Racism,” the subtitle of her blockbuster 2018 book “Algorithms of Oppression.”
“There’s no question that search companies like Google have studied my work and tried to address the concerns revealed through that research,” Noble says. “Once the research is there, they have to contend with it, and many — but not all — do.”
At UCLA, Noble recently formed a new research group, the Center on Race and Digital Justice, where she aims to shine a light on the ways data can be used to discriminatory ends and, in her words, “to advocate for the abolition of systems that are very dangerous and racially unjust in our society.”
She is also involved in shaping the curriculum for UCLA undergraduates through the DataX initiative, which began in 2019; she took the reins in 2022. This highly interdisciplinary effort seeks to bring together everyone involved in data science at UCLA to share ideas, skills and a commitment to justice and fairness in the use of data to address the most pressing issues facing society.
“If the DataX initiative is successful,” Noble says, “any student going into a field that is data-intensive will leave UCLA thinking critically about the potential for social harm to communities and to individuals.”
BRUIN CONNECTIONS
Data paints a bigger picture that reveals as much about the individual as it does about society as a whole, from the inspiration behind this feature’s artwork (created by UCLA School of Law alumna Mimi Chao) to the discoveries made by doctoral candidate Jessica Cook. Cook is using technology to conduct groundbreaking literary research that will reveal a more accurate and inclusive portrait of our world.