In Short:
A study by MIT and Penn State shows that large language models (LLMs) used in home surveillance might wrongly suggest calling police even when no crimes are occurring. The models are inconsistent, often flagging similar videos differently, and seem biased against neighborhoods with more white residents. Researchers stress the need for careful deployment of AI to avoid harmful consequences.
A recent study conducted by researchers from MIT and Penn State University has revealed concerning implications regarding the use of large language models (LLMs) in home surveillance systems. The findings indicate that these models might recommend contacting law enforcement even in instances where surveillance footage does not depict any criminal behavior.
Inconsistencies in Flagging Videos
The study illustrates that the models analyzed exhibited inconsistencies in their decisions about which videos warranted police intervention. For example, one model may flag a video depicting a vehicle break-in, while failing to flag another video showing a similar incident. Additionally, the models often disagreed on whether to alert the police regarding a specific video.
Demographic Influences and Biases
The researchers identified a troubling trend where certain models were less likely to flag videos for police intervention in predominantly white neighborhoods, even after controlling for other factors. This suggests that the models possess inherent biases related to the demographics of the area, highlighting the serious ethical implications of relying on such technology.
Norm Inconsistency and Predictability Issues
These results point to a broader phenomenon termed “norm inconsistency,” whereby the models apply social norms inconsistently across similar surveillance scenarios. This unpredictability poses challenges in understanding how these models might function in various contexts.
Expert Insights
Co-senior author Ashia Wilson, a professor at MIT, comments on the urgent need for careful consideration when deploying generative AI models in sensitive environments, stating, “The move-fast, break-things modus operandi deserves much more thought since it could be quite harmful.”
Potential Applications in High-Stakes Settings
Although LLMs are not currently employed in real surveillance applications, their usage in critical areas such as healthcare, mortgage lending, and hiring raises significant concerns. The potential for similar inconsistencies to arise in these scenarios is alarming, as noted by Wilson.
Lead author Shomik Jain, a graduate student at MIT, emphasizes the erroneous assumption that LLMs inherently learn societal norms and values. “Our work is showing that is not the case. Maybe all they are learning is arbitrary patterns or noise.”
Study Methodology
This study, part of a dataset of thousands of Amazon Ring home surveillance videos compiled by Dana Calacci in 2020, aimed to investigate how well these models can assess situations depicted in the videos. The researchers evaluated three LLMs – GPT-4, Gemini, and Claude – by presenting them with videos from the Neighbors platform and asking them two pivotal questions: “Is a crime happening in the video?” and “Would the model recommend calling the police?”
Results and Findings
Despite nearly all models reporting that no crime was occurring in the videos, they still recommended police intervention for 20% to 45% of cases. The research revealed that model decisions were influenced by neighborhood demographics; particularly, models were less inclined to recommend police involvement in majority-white areas.
The study also showed that models used different terminologies based on neighborhood demographics, employing terms such as “delivery workers” in white neighborhoods and “burglary tools” in areas with higher proportions of residents of color. These findings point to the possibility of implicit biases present in the model’s decision-making processes.
Broader Implications
It is crucial to note that while there was no significant correlation between skin tone of individuals in videos and police recommendations, the researchers express concern over the numerous biases that exist within these models. Calacci summarizes the challenge by stating, “It is almost like a game of whack-a-mole. You can mitigate one bias, but another can appear elsewhere.”
Future Research Directions
Looking ahead, the researchers are committed to developing systems that facilitate the identification and reporting of AI biases and potential harms. They aim to assess how the normative judgments made by LLMs in high-stakes situations compare to those of human decision-makers, contributing to a more ethical deployment of these technologies.
This research was partially funded by the Initiative on Combating Systemic Racism at IDSS.