Large Language Models | Sara Candussio

Reading Between the Tokens: Uncovering the Semantic Minima of AI Monologues

Fri, 24 Apr 2026 12:05:00 +0000

Invited talk at the CLCG Linguistics Lunch at the University of Groningen.

Chain-of-Thought prompting asks models to reason step by step — but most of what they write is filler. This talk presents work on identifying the semantic minima of AI reasoning: the tiny subset of tokens that actually carry predictive weight, detectable in real time from the model’s internal states. Erasing up to 95% of the output leaves a sparse set of words that still perfectly predicts the correct answer.

A Dialectic Pipeline for Improving LLM Robustness

Wed, 28 Jan 2026 00:00:00 +0000

Can LLMs improve their accuracy without further training, just through a dialectic way of questioning themselves — as Hegel suggested?

This was the core question behind my Master’s thesis. The short answer: yes, and by a lot.

The Idea

Inspired by Hegelian dialectics, the pipeline structures reasoning into three stages:

The thesis–antithesis–synthesis pipeline.

Thesis — the model produces an initial answer given the question, context, and options.
Antithesis — the model challenges its own answer, now also seeing the thesis.
Synthesis — the model produces a final answer, having seen both thesis and antithesis.

No fine-tuning. No domain-specific verifiers. Just structured self-dialogue.
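As a rough illustration of the three stages, here is a minimal sketch. It assumes the model is exposed as a plain `ask(prompt) -> str` callable. The prompt wording, the `dialectic_answer` function, and the `_base_prompt` helper are hypothetical and are not taken from the thesis.

```python
from typing import Callable, Dict, Sequence


def _base_prompt(question: str, context: str, options: Sequence[str]) -> str:
    # Hypothetical prompt layout; the thesis's actual templates are not shown in this post.
    opts = "\n".join(f"- {o}" for o in options)
    return f"Context:\n{context}\n\nQuestion: {question}\nOptions:\n{opts}\n"


def dialectic_answer(
    ask: Callable[[str], str],
    question: str,
    context: str,
    options: Sequence[str],
) -> Dict[str, str]:
    """Run thesis, antithesis, and synthesis as three calls to the same model."""
    base = _base_prompt(question, context, options)

    # Thesis: initial answer from question, context, and options only.
    thesis = ask(base + "\nPick one option and justify it briefly.")

    # Antithesis: the model sees its own thesis and is asked to challenge it.
    antithesis = ask(
        base
        + f"\nProposed answer:\n{thesis}\n"
        + "\nChallenge this answer: point out weaknesses or argue for another option."
    )

    # Synthesis: the model sees both earlier stages and gives the final answer.
    synthesis = ask(
        base
        + f"\nProposed answer:\n{thesis}\n"
        + f"\nCritique:\n{antithesis}\n"
        + "\nWeigh both and give the final answer as one option."
    )
    return {"thesis": thesis, "antithesis": antithesis, "synthesis": synthesis}
```

Any backend can be plugged in as `ask`, for example a wrapper around a local model's generate call. No weights are updated at any point; the only thing that changes between stages is the prompt.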

Results

The pipeline was tested on multi-hop QA benchmarks (HotpotQA, WikiHop) across five open-source models under 20B parameters (Phi-mini, Phi-medium, Gemma-2B, Gemma-9B, LLaMA-8B).

Accuracy improvements across models on HotpotQA.

From 53.4% to 80.7% on HotpotQA with Phi-mini (+27.3 percentage points).
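For readers comparing against other reported gains, the short calculation below separates the absolute gain (in percentage points) from the relative gain (as a share of the baseline). It uses only the two Phi-mini accuracies quoted above; the variable names are illustrative.

```python
baseline_acc = 53.4   # Phi-mini accuracy on HotpotQA before the pipeline (%)
pipeline_acc = 80.7   # Phi-mini accuracy on HotpotQA with the pipeline (%)

# Absolute gain: difference between the two accuracies, in percentage points.
absolute_gain = round(pipeline_acc - baseline_acc, 1)

# Relative gain: the same difference expressed as a percentage of the baseline.
relative_gain = round(100 * (pipeline_acc - baseline_acc) / baseline_acc, 1)

print(f"absolute: +{absolute_gain} points, relative: +{relative_gain}%")
```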

Improvements of up to 30% on complex multi-hop questions — beating standard Chain-of-Thought prompting.

CoT vs. pipeline on WikiHop across all models.

Key Takeaways

Self-debating is the main driver: letting models reflect on and contrast their own reasoning significantly boosts performance, especially as question complexity increases.
Instruction following matters: models that strictly follow instructions (LLaMA, Phi) benefit more than those that get “too creative” (Gemma-2).
Smart filtering > summarization: when dealing with long contexts, filtering for relevant information beats summarization, which can hurt deductive reasoning.
Avoid overthinking: for simpler tasks, too much deliberation can introduce errors. A touch of “impulsivity” sometimes helps.
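To make the filtering takeaway concrete, here is a minimal sketch of one possible filter. It is an assumption for illustration only: it ranks passages by word overlap with the question, which is a simple stand-in and not the filtering method used in the thesis. The `filter_context` function, the `STOPWORDS` set, and the `keep` parameter are hypothetical.

```python
import re
from typing import List, Sequence, Set

# Tiny illustrative stopword list; a real setup would likely use a fuller one.
STOPWORDS = {"the", "in", "is", "of", "a", "an", "was", "which", "has", "its", "where"}


def _content_words(text: str) -> Set[str]:
    # Lowercase, split on non-alphanumerics, drop stopwords.
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def filter_context(question: str, passages: Sequence[str], keep: int = 2) -> List[str]:
    """Keep the `keep` passages sharing the most content words with the question.

    Kept passages are returned verbatim and in their original order.
    Passages with no overlap at all are always dropped.
    """
    q_words = _content_words(question)
    scored = [(len(q_words & _content_words(p)), i) for i, p in enumerate(passages)]
    top = sorted(scored, key=lambda s: (-s[0], s[1]))[:keep]
    kept = sorted(i for score, i in top if score > 0)
    return [passages[i] for i in kept]


passages = [
    "Marie Curie was born in Warsaw.",
    "The Eiffel Tower is in Paris.",
    "Warsaw is the capital of Poland.",
]
question = "Which country has its capital in the city where Marie Curie was born?"
selected = filter_context(question, passages, keep=2)
```

Because the surviving passages are passed on word for word, the intermediate facts needed for a multi-hop deduction are still present, whereas a summary may paraphrase them away.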

Original vs. summarized vs. filtered context on WikiHop.

This work also received an Honorable Mention at the Emanuele Pianta Award (AILC) for the best Italian NLP Master’s thesis at CLiC-it 2025. 🏆

CLiC-it 2025, Cagliari.

If you’re interested in agentic reasoning, small language models, or multi-hop QA — feel free to reach out!

Large Language Models: Potential, Limits, and Multi-Agent Systems

Mon, 15 Dec 2025 00:00:00 +0000

Workshop at Novalia, Trieste — December 2025.

Part of a two-session seminar series on digital transformation, co-organized with IP4FVG and the University of Trieste.

This session covered the inner workings of Large Language Models, multi-agent architectures, and fine-tuning strategies — with an eye toward practical business applications and a frank discussion of current limitations.

The core message: the challenge isn’t just adopting AI, but integrating it strategically to augment rather than replace human potential.

Bridging Logic and Learning: Decoding Temporal Logic Embeddings via Transformers

Thu, 10 Jul 2025 00:00:00 +0000

This work introduces a Transformer-based decoder that inverts embeddings of Signal Temporal Logic (STL) formulae. By constructing a small STL vocabulary, the model can generate valid formulae quickly, generalize across semantic structures, and simplify formulae while preserving their meaning. Our methodology is evaluated across varying formula complexity and applied to requirement mining tasks, performing optimization directly in the semantic space.

If you find overlap with your work or interests, I would be glad to connect and explore possible collaborations.

OverRef: Studying Over-Refusal in Large Language Models

Wed, 01 Jan 2025 00:00:00 +0000

Ongoing project on over-refusal in LLMs: studying when and why models refuse legitimate user queries, with benchmarking and dataset resources.