#Prompt engineering

Prompt engineering is the practice of crafting and structuring instructions given to a language model to obtain accurate and useful outputs. For example, chain-of-thought prompting techniques measurably improve GPT-4's performance on reasoning tasks.
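
As a minimal sketch of what chain-of-thought prompting changes in practice, the snippet below builds the same question with and without a reasoning cue. The helper name `build_prompt` and the cue wording are assumptions made for illustration; no model is called.

```python
def build_prompt(question: str, chain_of_thought: bool = False) -> str:
    """Return a plain prompt, or one that asks for intermediate reasoning."""
    prompt = f"Question: {question}\n"
    if chain_of_thought:
        # Hypothetical cue: request the reasoning before the final answer.
        prompt += "Let's think step by step, then state the final answer.\n"
    prompt += "Answer:"
    return prompt
```

Calling `build_prompt(q)` and `build_prompt(q, chain_of_thought=True)` on the same question gives two variants that can be compared side by side on a reasoning task.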

40 Articles

8 Sources

65 Avg. signal

arXiv cs.CL·Jun 18

As Easy as Rocket Science: Assessing the Ability of Large Language Models to Interpret Negation in Figurative Language

arXiv study assessing LLM ability to interpret negation in figurative language. Researchers annotate an existing dataset and evaluate multiple models. Finding: negation combined with figurativeness presents a particular challenge, with performance heavily dependent on prompt style.

Evals · Prompt engineering · Reasoning

arXiv cs.CL·Jun 18

BCL: Bayesian In-Context Learning Framework for Information Extraction

BCL is an optimization framework for information extraction using particle filtering and Bayesian updates to systematically refine label representations. It generalizes across sequence labeling and relation classification tasks, demonstrating consistent improvements over existing approaches across model scales.

Prompt engineering · Reasoning · Evals

arXiv cs.CL·Jun 18

SAGE: Stochastic Prompt Optimization via Agent-Guided Exploration

SAGE is a stochastic prompt optimization framework using multi-agent guided exploration. The paper compares three strategies: error-informed random search, a genetic algorithm, and SAGE with diagnostic code execution. Deployed on a mental-health chatbot, 8 cycles of noisy A/B tests compound into a statistically robust next-day retention gain.

Prompt engineering · AI Agents · Multi-agent

arXiv cs.CL·Jun 18

Want Better Synthetic Data? Steer It: Activation Steering for Low-Resource Language Generation

Activation steering improves synthetic data generation for low-resource languages. Two strategies tested: Language Steering (linguistic identity) and Quality Steering (well-formedness). Evaluation across 4 open-source LLMs, 11 languages, classification tasks. Early-layer steering increases diversity and downstream performance.

Prompt engineering · Fine-tuning · Benchmarks

arXiv cs.CL·Jun 18

Efficient Financial Language Understanding via Distillation with Synthetic Data

Distillation framework with synthetic data for financial sentiment analysis. Knowledge transfer from large instruction-tuned teacher to compact student models. Clustering-based seed selection generates synthetic examples via few-shot prompting. Compact model outperforms teacher on complex/noisy text with minimal supervision.

Fine-tuning · RAG · Prompt engineering

Simon Willison·Jun 17

Quoting Charity Majors

Charity Majors observes that in 2025, the economics of code production flipped: generating code became nearly free and instant instead of expensive and time-consuming. Lines of code shifted from being treasured and carefully curated to disposable and regenerable overnight.

Code generation · Prompt engineering

Reddit r/LocalLLaMA·Jun 17

Headless screenshot loops let a local 30B agent finish a raytraced FPS demo in pure C

A local Qwen 27B agent completed a raytraced FPS demo in pure C using headless screenshot loops for visual debugging. Adding headless mode with keyboard/mouse injection and frame capture transformed the approach: the model learned to automate recursive visual debugging loops independently.

Qwen · AI Agents · Code generation

arXiv cs.CL·Jun 17

PromptMN: Pseudo Prompting Language

PromptMN is a domain-specific language that structures natural prompts with %-prefixed typed directives (roles, goals, constraints, outputs). Tested on Claude Opus 4.8, Gemini 3.1 Pro, and GPT-5.5 without fine-tuning, it reduces context ambiguities in agent and software development workflows.

Prompt engineering · AI Agents · Tools

arXiv cs.CL·Jun 17

Are you speaking my languages? On spoken language adherence in multimodal LLMs

LLM-based ASR systems often misidentify output languages in multilingual contexts. Authors propose three mitigation strategies: zero-shot prompting, supervised fine-tuning, and Chain-of-Thought reasoning to improve language adherence while preserving code-switching flexibility and ASR performance.

Voice · Prompt engineering · Fine-tuning

arXiv cs.CL·Jun 17

Implicit vs. Explicit Prompting Strategies for LVLMs in Referential Communication

Two recent studies reach contradictory conclusions about LVLMs' ability to coordinate efficient referring expressions. This research controls for task differences and directly compares prompting styles. Models coordinate efficiently with explicit prompting but fail to infer communicative efficiency needs from implicit prompts.

Prompt engineering · Vision · Evals

arXiv cs.CL·Jun 17

Environment-Grounded Automated Prompt Optimization for LLM Game Agents

Automated prompt optimization framework for LLM agents in interactive environments. It decomposes the observation-to-action pipeline into descriptor and action-selection agents, then iteratively refines them via an LLM-driven evolutionary loop guided by environment returns. On BabyAI/BALROG, it improves success on PutNext from 0% to 72.5% without fine-tuning.

AI Agents · Prompt engineering · Reinforcement learning
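
The general shape of an evolutionary prompt loop guided by environment returns can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's method: `mutate` stands in for an LLM that proposes prompt rewrites, `episode_return` stands in for running the agent in the environment, and both names are hypothetical.

```python
from typing import Callable


def evolve_prompt(
    seed: str,
    mutate: Callable[[str], list[str]],
    episode_return: Callable[[str], float],
    generations: int = 5,
    keep: int = 2,
) -> str:
    """Keep the best-scoring prompts each generation and mutate them again."""
    population = [seed]
    for _ in range(generations):
        # Parents stay in the pool, so the best prompt is never lost.
        candidates = set(population)
        for prompt in population:
            candidates.update(mutate(prompt))
        # Rank by return (higher first); break ties by text for determinism.
        ranked = sorted(candidates, key=lambda p: (-episode_return(p), p))
        population = ranked[:keep]
    return population[0]
```

In a real setup the return signal would be noisy, so each candidate would likely need several episodes before ranking.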

arXiv cs.AI·Jun 17

LLM-as-Judge in Education: A Curriculum-Grounded Marking Pipeline

Curriculum-grounded automated marking pipeline using LLMs to assess exam responses. Grounds model outputs in official curriculum artefacts (syllabus, performance descriptors, marking guidelines). Delivers marking outcomes comparable to human tutors with improved traceability to authorised standards.

Evals · Prompt engineering · Reasoning

arXiv cs.CL·Jun 17

MemSlides: A Hierarchical Memory Driven Agent Framework for Personalized Slide Generation with Multi-turn Local Revision

MemSlides introduces a hierarchical memory framework for personalized presentation agents. It separates long-term memory (user profiles, tool experience) from working memory (active preferences), enabling multi-turn local revisions without full deck regeneration.

AI Agents · Prompt engineering · Tools

arXiv cs.CL·Jun 17

Prompt Perturbation for Reliable LLM Evaluation over Comparison Graphs

Method to evaluate LLMs via pairwise comparisons by resolving intransitivity (cycles A≻B≻C≻A). Prompt perturbation framework generates prompt variants, identifies structural inconsistencies in comparison graphs, then applies filtered ranking methods to stabilize leaderboards.

Evals · Prompt engineering · Benchmarks
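
To make the intransitivity problem concrete, here is a small sketch that finds 3-cycles of the form A≻B≻C≻A in a set of pairwise outcomes. It covers only the cycle-detection idea; the perturbation and filtered-ranking steps from the paper are not reproduced, and the function name `find_cycles` is an assumption.

```python
from itertools import permutations


def find_cycles(wins: set[tuple[str, str]]) -> list[tuple[str, str, str]]:
    """Return each intransitive triad (a, b, c) with a>b, b>c, c>a, once."""
    models = sorted({m for pair in wins for m in pair})
    cycles = []
    for a, b, c in permutations(models, 3):
        # Report only the rotation that starts with the smallest name,
        # so every 3-cycle appears exactly once.
        if a < b and a < c and (a, b) in wins and (b, c) in wins and (c, a) in wins:
            cycles.append((a, b, c))
    return cycles
```

Here `wins` holds `(winner, loser)` pairs; an empty result means no triad in the comparison graph is cyclic.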

arXiv cs.CL·Jun 17

Bridging Functional Correctness and Runtime Efficiency Gaps in LLM-Based Code Translation

SwiftTrans, an LLM-based code translation framework, combines multi-perspective exploration (MpTranslator with parallel in-context learning) and difference-aware selection (DiffSelector) to improve both functional correctness and runtime efficiency. Evaluated on CodeNet, F2SBench, and SwiftBench.

Code generation · Prompt engineering · Benchmarks

arXiv cs.AI·Jun 17

Brick-DICL: Dynamic In-Context Learning for Automated Brick Schema Classification

Brick-DICL introduces a two-stage dynamic in-context learning framework for automated Brick schema classification of BMS points (936 classes). Combines metadata-RAG and class-RAG to enhance LLM domain knowledge, with multi-LLM filtering to reduce manual verification effort.

RAG · Prompt engineering · Reasoning

Reddit r/LocalLLaMA·Jun 17

Elias in the Lighthouse, Again? Diagnosing Low Diversity in LLM Stories

Analysis of low narrative diversity in LLM-generated stories. The author examines why models produce repetitive tales with similar characters and structures despite varied prompts.

Llama · Prompt engineering · Evals

Reddit r/LocalLLaMA·Jun 16

Gemma 12b - Reasoning hardening instructions

A user shares a system instruction to improve reasoning in Gemma 12b QAT. The technique aims to reduce cognitive bias and adapt reasoning depth to context. It works well on trick questions but partially fails on certain problems depending on framing.

Gemini · Prompt engineering · Reasoning

arXiv cs.AI·Jun 16

ChatPlanner: A Large Language Model Framework for Personalized Public Transit Routing

ChatPlanner is a framework using fine-tuned LLMs with RAG to extract user preferences from natural language and integrate them into public transit routing optimization. Evaluated on 8 personas and 5 contexts, the system combines fine-tuning (output structure) and RAG (query-specific context) to identify solutions overlooked by existing planners.

RAG · Fine-tuning · Prompt engineering

arXiv cs.CL·Jun 16

SHARD: Safe and Helpful Alignment via Self-Reframing Distillation

SHARD is a self-reframing distillation method to improve safe-helpfulness balance in LLMs. It rewrites sensitive prompts using philosophical guidelines to surface benign intent, reframes responses into safer and more helpful versions, then fine-tunes the model on self-reframed responses. Tested on DNA and LINGUASAFE, SHARD improves helpfulness while preserving safety.

Fine-tuning · AI safety · Alignment

arXiv cs.CL·Jun 16

Few-Shot Biomedical Relation Extraction with Large Language Models: A Viable Alternative to Supervised Learning?

Comparative study of few-shot biomedical relation extraction with LLMs vs supervised learning on BioREDirect. It contrasts pairwise classification with joint generation. Few-shot prompting trails the supervised baseline on micro-F1 (0.44 vs 0.56) but leads on macro-F1 (0.45 vs 0.38), reflecting stronger LLM performance on rare relations.

Prompt engineering · Benchmarks · RAG
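
A short worked example shows how one system can win on micro-F1 while another wins on macro-F1. Micro-F1 pools counts across classes, so frequent relations dominate; macro-F1 averages per-class scores, so rare relations weigh equally. The counts below are invented for illustration and are not taken from the paper.

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(counts: dict[str, tuple[int, int, int]]) -> tuple[float, float]:
    """Return (micro-F1, macro-F1) from per-class (tp, fp, fn) counts."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    micro = f1(tp, fp, fn)                                     # pooled counts
    macro = sum(f1(*c) for c in counts.values()) / len(counts)  # mean of class F1s
    return micro, macro


# Hypothetical counts: system X ignores the rare class, system Y handles it.
system_x = {"frequent": (90, 10, 10), "rare": (0, 0, 10)}
system_y = {"frequent": (70, 30, 30), "rare": (8, 2, 2)}
micro_x, macro_x = micro_macro_f1(system_x)
micro_y, macro_y = micro_macro_f1(system_y)
```

With these counts, system X scores higher on micro-F1 and system Y scores higher on macro-F1, the same pattern of reversal the summary reports.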

arXiv cs.AI·Jun 16

APEX: Adaptive Principle EXtraction A Three-Layer Self-Evolution Framework for Production AI Agents

APEX is a self-improvement framework for production AI agents that co-evolves three dimensions: prompt harness (L1), behavioural principles (L2), and workflow topology (L3). Tested on Joe, an NVIDIA Nemotron super-agent, APEX achieves a Health Score of 0.570 (+90% vs baseline) and distils 6 reusable principles using only 4 LLM calls.

AI Agents · Prompt engineering · Reinforcement learning

arXiv cs.CL·Jun 16

Stop When Further Reasoning Won't Help: Attention-State Adaptive Generation in Reasoning Models

ASAG, a training-free method analyzing attention distributions, detects overthinking in reasoning models and adaptively stops generation. Tested on DeepSeek-R1-Distill and Qwen3, it improves accuracy by 3.2% while reducing generated tokens by 40% on Qwen3-8B.

Reasoning · DeepSeek · Qwen

arXiv cs.CL·Jun 16

Encode Errors: Representational Retrieval of In-Context Demonstrations for Multilingual Grammatical Error Correction

Retrieval method for in-context demonstrations using Grammatical Error Representations (GER) for multilingual grammatical error correction. On 8B open-source models, results match GPT-4o-mini and Deepseek2.5. For low-resource languages, F₀.₅ scores improve up to 1.20× over baseline.

RAG · Prompt engineering · Benchmarks

Reddit r/LocalLLaMA·Jun 15

An agent that plans with a frontier model but runs most of tokens locally (built it for my own dual-3090 rig)

Personal hybrid agent tool: frontier model planning (Codex) with local execution using Qwen 3.6 27B on dual RTX 3090. 3-tier architecture (Planner/Local/Senior optional) to minimize frontier costs while retaining reasoning capabilities. Deterministic task validation.

AI Agents · Qwen · Code generation
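
One way to read the tiered design is as "try locally, validate deterministically, escalate only on failure". The sketch below illustrates that reading; the post does not describe its escalation logic, so the function `run_task`, its parameters, and the retry count are all assumptions, and the model calls are plain callables.

```python
from typing import Callable, Optional


def run_task(
    task: str,
    local: Callable[[str], str],
    validate: Callable[[str], bool],
    senior: Optional[Callable[[str], str]] = None,
    max_local_tries: int = 2,
) -> tuple[str, str]:
    """Return (tier, result); use the local model first, escalate if it fails."""
    for _ in range(max_local_tries):
        result = local(task)
        if validate(result):  # deterministic check, e.g. tests or a schema
            return "local", result
    if senior is not None:
        result = senior(task)
        if validate(result):
            return "senior", result
    raise RuntimeError(f"no tier produced a valid result for: {task}")
```

Because `validate` is deterministic, the frontier model is only paid for when the local output demonstrably fails the check.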

arXiv cs.CL·Jun 15

Right or Wrong, Models Comply: Directional Blindness in LLM Moral Judgment

Study across 9 models and 972,000 responses shows LLMs comply with harmful nudges on moral judgments (A=1.04) at nearly identical rates to beneficial ones, unlike factual questions (A=1.58). Chain-of-thought amplifies bidirectional compliance; identity-based prompting suppresses both equally.

Alignment · AI safety · Evals

arXiv cs.CL·Jun 15

Persuasion Index: A Theory-Guided Framework for Persuasion Analysis

Persuasion Index (PI) is a taxonomy of 15 dimensions grounded in persuasion theories from psychology and communication. Implementation with 55 sub-features built from lexicons and rule-based detectors. Evaluation on 4 public datasets shows PI provides a shared feature space for interpreting rhetorical patterns. Lightweight linear models with interpretability. Open-source package and web interface released.

Papers · AI safety · Prompt engineering

arXiv cs.AI·Jun 15

HarnessX: A Composable, Adaptive, and Evolvable Agent Harness Foundry

HarnessX is a foundry for composable and adaptive AI agent harnesses. It uses AEGIS, a trace-driven multi-agent evolution engine, to optimize prompts, tools, and control flow. Across 5 benchmarks (ALFWorld, GAIA, WebShop, tau³-Bench, SWE-bench), HarnessX achieves +14.5% average gain (up to +44%), without model scaling.

AI Agents · Multi-agent · Prompt engineering

arXiv cs.AI·Jun 15

Communication Policy Evolution for Proactive LLM Agents

Formalized study of communication policies for autonomous LLM agents. Comparison of text-based vs UI-based strategies across multiple environments and models. Proposes Communication Policy Evolution (CPE), a self-evolution framework refining policies through rollout and prompt-level evolution, without model modification.

AI Agents · Prompt engineering · Papers

arXiv cs.LG·Jun 15

Graph-based Target Back-Propagation for Context Adaptation in Multi-LLM Agentic Systems

GTBP (Graph-based Target Back-Propagation) is a context adaptation framework for multi-LLM agentic systems. It back-propagates local targets through a directed acyclic graph workflow and updates prompts stage-wise. Theoretically convergent, outperforms baselines across 3 benchmarks.

AI Agents · Multi-agent · Prompt engineering

arXiv cs.CL·Jun 15

QIAS 2026: Overview of the Shared Task on Islamic Inheritance Reasoning

QIAS 2026 is a shared task evaluating LLMs' ability to reason about Islamic inheritance. Based on MAWARITH (12,500 annotated Arabic cases), it requires full calculation: heir identification and share assignment. 16 teams tested prompting, RAG, and fine-tuning. Results show precise legal interpretation and structured numerical reasoning remain highly challenging.

Benchmarks · Reasoning · RAG

arXiv cs.AI·Jun 15

YeasierAgent: Agentic Social Sandbox as a Canvas for Intent-Driven Creation of Platform-Agnostic Symbiotic Agent-Native Applications

YeasierAgent introduces an application-building paradigm based on symbiotic agents, narrative worlds, and scene-aware interaction. The system unifies automated generation, user-created worlds, and spatial multi-agent collaboration to enable cross-platform agent-native applications without reliance on fixed graphical layouts.

AI Agents · Multi-agent · Prompt engineering

Reddit r/LocalLLaMA·Jun 15

Do long agent sessions get “context rot” for you too?

User reports long coding-agent sessions suffer from "context rot": accumulation of failed debugging attempts, stale assumptions, and noise that degrades model reasoning. Proposes separating durable memory from active context rather than simply increasing context size.

AI Agents · RAG · Prompt engineering
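
The proposed split between durable memory and active context can be sketched as a small data structure. This is an illustrative reading of the post, not the author's implementation; the class `AgentContext` and its method names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AgentContext:
    durable: list[str] = field(default_factory=list)  # facts kept for the whole session
    active: list[str] = field(default_factory=list)   # scratch notes for the current attempt

    def note(self, text: str) -> None:
        """Record a transient observation (may turn out to be noise)."""
        self.active.append(text)

    def promote(self, text: str) -> None:
        """Store a verified fact so it survives attempt resets."""
        if text not in self.durable:
            self.durable.append(text)

    def reset_attempt(self) -> None:
        """Drop failed attempts and stale assumptions, keep durable facts."""
        self.active.clear()

    def build_prompt(self, task: str) -> str:
        lines = ["Known facts:", *self.durable, "Current notes:", *self.active, f"Task: {task}"]
        return "\n".join(lines)
```

Calling `reset_attempt()` after a failed debugging path keeps the next prompt small without losing what was actually learned.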

Hacker News (AI)·Jun 14

AI is code – and can't be prompted into being smarter

An article arguing that AI is fundamentally code and cannot be made smarter through prompting alone. Challenges the notion that better instructions can overcome the architectural limitations of models.

Prompt engineering · Reasoning

Reddit r/LocalLLaMA·Jun 14

Can we stop dunking on DiffusionGemma and hack it instead?

DiffusionGemma suffers from hallucinations in naive inference. A user compiles methods (entropy-bounded sampler, canvas cap, thinking mode) to improve quality with 2–3× speedups. Three tiers of solutions: drop-in configs, orchestration wrappers, and custom decoders.

Open source · Code generation · Reasoning

Reddit r/LocalLLaMA·Jun 14

Codebase getting larger - Qwen3.6-27B starting to compound issues - how to work smartly with this model?

Developer using Qwen3.6-27B via llama.cpp encounters recurring bugs in Python codebase despite 128K context window. Testing strategies: full project reads vs focused function analysis, KV quantization disabled. Seeking approaches to minimize model errors.

Qwen · Code generation · Prompt engineering

The Decoder·Jun 13

Microsoft's SkillOpt boosts GPT-5.5 by using nothing but a trained Markdown file

Microsoft and three Chinese universities developed SkillOpt, a method optimizing instruction documents for AI agents using classical training principles. A simple Markdown file boosts GPT-5.5 by ~23 points on procedural tasks and transfers across models (Codex, Claude Code).

GPT · Claude Code · Prompt engineering

Reddit r/LocalLLaMA·Jun 12

Use context profiler to optimize your LLM calls and reduce token use

ContextSpy is an open-source profiling tool that analyzes context usage in LLM applications. Operating as a local proxy, it records requests and breaks down token allocation (system prompt, tool definitions, conversation history) to identify optimization opportunities, similar to CPU/memory profilers.

Tools · AI Agents · Open source
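
The kind of breakdown described can be approximated in a few lines. The sketch below is not ContextSpy's API: it assumes a chat-style request dict with `messages` and `tools` keys, and it counts whitespace-separated words as a crude stand-in for a real tokenizer.

```python
import json


def rough_tokens(text: str) -> int:
    # Crude approximation: whitespace-separated words, not model tokens.
    return len(text.split())


def breakdown(request: dict) -> dict[str, int]:
    """Split a request's approximate token count into three buckets."""
    totals = {"system_prompt": 0, "tool_definitions": 0, "conversation_history": 0}
    for message in request.get("messages", []):
        bucket = "system_prompt" if message.get("role") == "system" else "conversation_history"
        totals[bucket] += rough_tokens(str(message.get("content", "")))
    for tool in request.get("tools", []):
        # Tool schemas are sent as JSON, so measure their serialized form.
        totals["tool_definitions"] += rough_tokens(json.dumps(tool))
    return totals
```

Even with such a rough count, a large `tool_definitions` bucket relative to `conversation_history` is a hint that unused tools are inflating every call.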

OpenAI Blog·Jun 12

New OpenAI Academy courses for the next era of work

OpenAI launches three Academy courses to build practical AI skills, create repeatable workflows, and apply agents in everyday work.

OpenAI · AI Agents · Prompt engineering

arXiv cs.AI·Jun 12

TrajGenAgent: A Hierarchical LLM Agent for Human Mobility Trajectory Generation

TrajGenAgent is a hierarchical LLM-agent framework for realistic human mobility trajectory generation without model fine-tuning. An orchestrator LLM synthesizes activity chains via in-context learning, then a deterministic workflow grounds them using personalized POI retrieval, distance-aware location selection, and LLM-based duration estimation. Evaluation via anomaly-detection framework on benchmark datasets.

AI Agents · Prompt engineering · Reasoning
