Today’s AI & Tech Briefing (June 22, 2026)

Today’s selection features 8 noteworthy AI/ML papers from arXiv, covering multi-agent bias propagation, knowledge conflict resolution, specialized LLM applications in education and security, and advances in multimodal retrieval and reinforcement learning.

1. Contagion Networks: Evaluator Bias Propagation in Multi-Agent LLM Systems

Authors: Zewen Liu | Categories: cs.LG, cs.AI, cs.MA | Link: arxiv.org/abs/2606.20493

This paper introduces Contagion Networks, a formal framework for measuring how evaluator biases spread across interacting LLM agents. In a controlled 3-agent experiment using DeepSeek-chat, the author finds that biases consistently propagate between agents (gamma in [0.157, 0.352]) even when all agents run the same model, and that increasing evaluator committee size from k=1 to k=3 reduces effective contagion by 72.4%.

Takeaway: A timely and actionable study on a critical, often-overlooked failure mode in multi-agent LLM systems, with a clear mitigation strategy (larger evaluator committees) that can be applied immediately in production deployments.

2. Navigating Unreliable Parametric and Contextual Knowledge: Explicit Knowledge Conflict Resolution for LLM Inference

Authors: Huang Peng, Jiuyang Tang, Weixin Zeng, Hao Xu, Xiang Zhao | Categories: cs.AI | Link: arxiv.org/abs/2606.20245

The authors propose MACR, a framework that moves beyond binary choice (trust model vs. trust context) by introducing an explicit conflict-resolution mechanism using multi-agent reasoning. MACR uses a modified semantic entropy measure to assess LLM confidence and employs three specialized agents to induce explicit rules, analyze conflicts, and resolve inconsistencies across both internal and external knowledge sources.

Takeaway: This addresses a fundamental limitation in current LLM usage—how to handle conflicting information when both the model and provided context may contain errors—with a principled, interpretable solution.
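
As a rough illustration of the confidence signal mentioned above, the sketch below computes plain semantic entropy: sample several answers, group the ones that mean the same thing, and take the entropy of the group frequencies. This summary does not specify the paper's modified measure, so this is the standard variant only. `same_meaning` is a hypothetical stand-in (normalized string match) for a real equivalence check such as an entailment model.

```python
import math


def same_meaning(a: str, b: str) -> bool:
    # Hypothetical placeholder: answers count as equivalent if they match
    # after trimming whitespace, lowercasing, and dropping a trailing period.
    def norm(s: str) -> str:
        return s.strip().lower().rstrip(".")
    return norm(a) == norm(b)


def semantic_entropy(answers: list[str]) -> float:
    # Greedily cluster answers by meaning, comparing against each
    # cluster's first member.
    clusters: list[list[str]] = []
    for ans in answers:
        for cluster in clusters:
            if same_meaning(ans, cluster[0]):
                cluster.append(ans)
                break
        else:
            clusters.append([ans])
    n = len(answers)
    # Shannon entropy (in nats) over the cluster frequencies.
    return -sum((len(c) / n) * math.log(len(c) / n) for c in clusters)
```

Low entropy means the sampled answers agree in meaning (higher confidence); high entropy means they scatter across different meanings.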

3. PsyScore: A Psychometrically-Aware Framework for Trait-Adaptive Essay Scoring and ZPD-Scaffolded Feedback

Authors: Wei Xia, Jin Wu, Haoran Shi, Xiangyu Wang, Chanjin Zheng | Categories: cs.CL | Link: arxiv.org/abs/2606.20287

PsyScore integrates diagnostic assessment with instructional scaffolding by embedding the Graded Partial Credit Model into a neural architecture for precise ability estimation. The framework then conditions multi-agent feedback strategies on the diagnosed ability parameter to adapt instructional focus across different proficiency levels, providing pedagogically aligned feedback.

Takeaway: A rare example of genuinely integrating educational measurement theory (Item Response Theory, the model family that partial credit models belong to) with modern LLM capabilities, moving beyond treating scoring and feedback as separate problems.
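
For readers unfamiliar with partial credit models, here is a minimal sketch of how such a model turns an ability estimate into a probability for each score level. It uses the textbook generalized partial credit form; the paper's exact parameterization is not given in this summary, and the names `theta`, `thresholds`, and `a` are illustrative assumptions.

```python
import math


def gpcm_probs(theta: float, thresholds: list[float], a: float = 1.0) -> list[float]:
    # Score 0 has logit 0; score k accumulates a * (theta - b_j) over the
    # first k step thresholds b_1..b_k.
    logits = [0.0]
    for b in thresholds:
        logits.append(logits[-1] + a * (theta - b))
    # Softmax over the cumulative logits (max subtracted for stability).
    m = max(logits)
    weights = [math.exp(z - m) for z in logits]
    total = sum(weights)
    return [w / total for w in weights]
```

With `thresholds` of length K the function returns K+1 probabilities, one per score from 0 to K, and raising `theta` shifts probability mass toward higher scores.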

4. Evaluating and Enhancing Negation Comprehension in Remote Sensing MLLMs

Authors: Haochen Han, Jue Wang, Alex Jinpeng Wang, Fangming Liu | Categories: cs.CV, cs.AI | Link: arxiv.org/abs/2606.20177

This paper introduces RS-Neg, the first benchmark to evaluate negation understanding in remote sensing MLLMs, and reveals that advanced models struggle with negation, exhibiting hallucinations and substantial performance degradation. The authors propose NeFo, a test-time learning method that uses only ~5% of the test samples, without labels, to significantly improve negation comprehension.

Takeaway: This work exposes a critical gap: models that cannot reliably understand negation are unsuitable for high-stakes remote sensing applications like emergency response, which makes the benchmark and mitigation method highly practical.

5. Multi-View Decompilation for LLM-Based Malware Classification

Authors: Bercan Turkmen, Vyas Raina | Categories: cs.CR, cs.AI | Link: arxiv.org/abs/2606.20436

The authors show that providing LLMs with decompiled pseudo-C views from two different decompilers (Ghidra and RetDec) significantly improves malicious-class F1 scores for malware classification. Their analysis reveals that the two decompilers make partially different errors, demonstrating that multi-decompiler prompting provides complementary evidence without requiring model training.

Takeaway: A simple, training-free improvement that leverages existing tools (multiple decompilers) to boost LLM-based malware triage—practical and immediately deployable in security workflows.
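
To make the idea concrete, the sketch below shows one way a two-view prompt could be assembled from the outputs of both decompilers. The prompt wording, the function name `build_multiview_prompt`, and the answer format are all hypothetical; the paper's actual prompt is not reproduced in this summary.

```python
def build_multiview_prompt(ghidra_code: str, retdec_code: str) -> str:
    # Hypothetical prompt layout: both pseudo-C views are shown side by side
    # so the model can cross-check places where one decompiler went wrong.
    return (
        "Two decompilers produced the pseudo-C views below from the same binary.\n"
        "The views may disagree where a decompiler made an error.\n"
        "Classify the binary as MALICIOUS or BENIGN and briefly justify.\n\n"
        "### View 1 (Ghidra)\n"
        f"{ghidra_code}\n\n"
        "### View 2 (RetDec)\n"
        f"{retdec_code}\n"
    )
```

The returned string can be sent to any chat or completion endpoint; no model training is involved.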

6. AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

Authors: Zepeng Li, Jie Ren, Zhanyong Tang, Jie Zheng, Zheng Wang | Categories: cs.SE, cs.AI | Link: arxiv.org/abs/2606.20373

AutoPass is a multi-agent framework that uses compiler and runtime evidence to guide LLM-generated optimization decisions, opening up the compiler’s internal state for the LLM to query. Operating in an inference-only, training-free setting, it achieves geometric-mean speedups of 1.043x and 1.117x over LLVM -O3 on x86-64 and ARM64 systems respectively.

Takeaway: Rather than treating the compiler as a black box, this work gives the LLM visibility into compiler internals—a promising direction that outperforms both expert-tuned heuristics and classical autotuning methods.
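
For context on the reported metric, a geometric-mean speedup averages per-benchmark runtime ratios multiplicatively, so one outlier benchmark cannot dominate the result. The sketch below shows the calculation; the function name and the idea of passing paired runtime lists are assumptions for illustration, not the paper's tooling.

```python
import math


def geomean_speedup(baseline_times: list[float], tuned_times: list[float]) -> float:
    # Per-benchmark speedup = baseline runtime / tuned runtime.
    ratios = [b / t for b, t in zip(baseline_times, tuned_times)]
    # Geometric mean computed in log space to avoid overflow on long lists.
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))
```

A 2x speedup on one benchmark and a 2x slowdown on another yield a geometric mean of 1.0, whereas the arithmetic mean of the two ratios would misleadingly report 1.25.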

7. ELVA: Exploring Ranking-Driven Universal Multimodal Retrieval

Authors: Yuhan Liu, Pei Fu, Hang Li, Yukun Qi, Chao Jiang et al. | Categories: cs.IR, cs.AI | Link: arxiv.org/abs/2606.20280

ELVA introduces a rule-based reinforcement learning framework that addresses “grain blindness” in multimodal retrieval—the tendency of models to overlook grain-level information in queries. By extending Reinforcement Learning with Verifiable Rewards to retrieval tasks, it jointly optimizes the ranking of negative samples and achieves a 13.1% improvement on the newly introduced MRBench benchmark.

Takeaway: A novel application of RLVR to the retrieval domain, with a new benchmark (MRBench) that better captures the complexity of real-world multimodal queries requiring multi-grain understanding.

8. Off-Policy Evaluation for Missingness-Aware Policies in MDPs with Rewards Missing Not at Random

Authors: Ziheng Wei, Annie Qu, Rui Miao | Categories: stat.ML, cs.LG | Link: arxiv.org/abs/2606.20206

This paper addresses off-policy evaluation in reinforcement learning when rewards are missing not at random (MNAR)—a common problem in healthcare and marketing where record-keeping is irregular. The authors introduce a reward-dependent propensity model and use future states as shadow variables to identify the conditional mean reward, proposing a Fitted-Q-Evaluation-style estimator with strong theoretical guarantees.

Takeaway: Addresses a realistic but understudied problem in offline RL—missing rewards that are correlated with the reward value itself—with rigorous theoretical foundations and strong empirical results.
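
As background, the sketch below shows plain tabular Fitted Q Evaluation, the baseline procedure that the paper's estimator is styled after. It assumes every reward is observed; the paper's MNAR correction (reward-dependent propensity model, shadow variables) is not implemented here, and the data layout is an illustrative assumption.

```python
from collections import defaultdict


def fitted_q_evaluation(transitions, policy, actions, gamma=0.9, iters=200):
    """Estimate Q for a target policy from logged data (tabular case).

    transitions: list of (state, action, reward, next_state, done)
    policy: dict mapping state -> {action: probability} for the target policy
    """
    q = defaultdict(float)
    for _ in range(iters):
        targets = defaultdict(list)
        for s, a, r, s_next, done in transitions:
            # Expected next-state value under the target policy.
            next_v = 0.0 if done else sum(
                policy.get(s_next, {}).get(b, 0.0) * q[(s_next, b)] for b in actions
            )
            targets[(s, a)].append(r + gamma * next_v)
        # Tabular regression step: average the bootstrapped targets per (s, a).
        q = defaultdict(float, {k: sum(v) / len(v) for k, v in targets.items()})
    return q
```

When rewards are missing in a way that depends on their value, averaging only the observed targets as above would be biased, which is the gap the paper targets.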

This content was generated with AI assistance. Paper information sourced from arXiv.
