UPenn AI Safety ASSET Student Seminar

Technical AI Safety · Summer 2026 · University of Pennsylvania

Logistics

When: Wednesdays, 12:00–1:15 PM ET
Where: Amy Gutmann Hall (AGH) 615. In person encouraged; a Zoom option is available, with the link shared via the mailing list.
Lunch: provided each week (usually New Delhi, occasionally El Merkury).
Organizers: Berkan Ottlik and Davis Brown.
Join / suggest a topic: sign up for the mailing list.

About

The UPenn AI Safety ASSET Student Seminar is a student-run seminar on technical AI safety, graciously hosted at Penn and funded by the ASSET center. We read papers together and invite guest speakers from Penn, other universities, AI safety organizations, and AI labs.

Anyone excited to learn about AI safety is welcome — we ask only for some background in machine learning, roughly equivalent to an introductory course. Over the summer, topics span deceptive alignment, monitoring and AI control, open-weights safeguards, mechanistic interpretability, model motivations, multi-agent safety, subliminal learning, backdoors, and AI governance.

Schedule — Summer 2026

Date	Speaker	Affiliation	Topic
May 20	Berkan Ottlik	UPenn	Emotion Concepts and their Function in a Large Language Model
May 27	Davis Brown	UPenn	Current AIs seem pretty misaligned to me & Finding Widespread Cheating on Popular Agent Benchmarks
Jun 3	Canceled
Jun 10	Canceled
Jun 17	Chloe Li	Anthropic Alignment Fellow	Model spec midtraining
Jun 24	Canceled
Jul 1	Daniel Tan	Arcadia Alignment (UK AISI)	Emergent misalignment & model motivations
Jul 8	Skipping — ICML
Jul 15	Peter Hase	Schmidt Sciences / Stanford	Blackbox and Whitebox Monitoring
Jul 22	Meena Jagadeesan	UC Berkeley (incoming UPenn)	Anticipating Risks in LLM Ecosystems
Jul 29	Matan Shtepel	CMU	Evaluating and Improving Monitorability Evaluations
Aug 5	Stephen Casper	Harvard (Berkman Klein)	AI Governance in 2026
Aug 12	Rico Angell	NYU	Estimating tail risks in language model outputs

Talk details

Peter Hase — Blackbox and Whitebox Monitoring (Jul 15)

AI models often learn problematic reasoning processes due to misspecified training objectives. Monitoring helps us detect these behaviors at deployment time. For example, inspecting Chain-of-Thought reasoning in LLMs is perhaps the single most common approach to understanding how a model arrived at its answer. This practice has proven effective for identifying reward hacking, mistaken background knowledge, and misinterpretation of user instructions. To make blackbox monitoring more effective, we introduce methods for enhancing the faithfulness of CoT explanations. With a similar goal, whitebox monitoring involves probing model internal states for signs of misaligned behavior. On this topic, we explore the geometry of truthfulness representations in LLMs, leading us to better lie-detector probes. Going forward, work on monitoring should combine insights from CoT faithfulness, LLM introspection, self-verification (“confessions”), and activation monitoring.

Meena Jagadeesan — Anticipating Risks in LLM Ecosystems (Jul 22)

As LLMs are deployed at scale, these models interact with humans, other models, and model providers in a broader ecosystem. However, classical evaluation practices fail to capture ecosystem-level risks, and past observations may not predict the future since the structure of these ecosystems is rapidly evolving. This talk will investigate how multi-agent interactions shape ecosystem-level risks, and how to anticipate these risks before they emerge. I will focus on three case studies from my work: model-provider competition inducing non-monotone scaling trends (NeurIPS 2023), test-time feedback loops leading to reward hacking (ICML 2024), and human-AI interactions disrupting collusion (arXiv 2025). For each case study, I will reflect on which of our assumptions about ecosystem structure hold up in today’s deployments.

Matan Shtepel — Evaluating and Improving Monitorability Evaluations (Jul 29)

Automatically monitoring AIs for misbehavior is critical to preventing real-world harm, such as the recent OpenAI-HuggingFace incident. To improve our ability to monitor AIs, we must be able to evaluate how reliably different methods can flag misbehavior.

In this talk, we formalize monitorability and monitorability evaluations and identify two key issues: (1) Monitorability evaluations hinge on many implementation-level decisions, which are made inconsistently across works. (2) The prompting techniques used in current monitorability evaluations may be too weak to fully elicit AIs’ ability to evade monitors.

For (1), we show that current monitorability evaluations are not robust to variation in implementation-level decisions (at the scale of prior work). For (2), we develop a new reward-shaping, RL-based elicitation method for monitorability that outperforms the state of the art.

We remain optimistic that monitorability evaluations can be made more trustworthy, but conclude that they are currently not reliable measurements of monitorability in the wild.

Stephen Casper — AI Governance in 2026 (Aug 5)

What’s going on, why it’s a mess, and why it’s going to get messier.

Emerging technologies are always hard to govern, especially when their onset is crammed into a few intense years. With AI, policymakers have thus far produced more case studies in failure than in success. This talk will give an overview of the stages of governing emerging tech, the challenges that are arising, and the diverse policy strategies that governments across the world are taking. Finally, we will speculate about how things may change in the next few years and how governments will need to adapt, including how Xi Jinping, Elon Musk, Sam Altman, Jensen Huang, Bernie Sanders, and anonymous hackers may each have the power to “blow it up” and usher in the next messy chapter of AI governance.

Inspired by the formatting of the FOLDS Seminar.