
Performance Enhancement through Instruction Tuning Based on User Feedback Data

12 min read

Jaechan Lee, POSTECH
Yura Shin, AI Specialist

This work is a collaborative research effort with Minjoon Son (advised by Prof. Youngmyung Ko) as part of the "Campus: Beyond Safety to Intelligence – Postech Living Lab Project with EVA".


🎯 Introduction: Shifting Feedback from 'Retrospective Correction' to 'Cognitive Enhancement'

When EVA makes judgments based on images, operators often provide specific feedback like: "This is indeed a safety vest. Why did it get confused?" or "Shouldn't there be an alert here?" This feedback contains not just the right or wrong answer, but also the human reasoning and context behind the judgment.

Previously, EVA utilized this feedback by storing it in a separate Vector DB and using it to adjust the Alert status when similar situations occurred. While this approach offered the advantage of quick application, it had a structural limitation: it did not improve the model's intrinsic reasoning capability and merely retrospectively filtered errors.

To fundamentally address this issue, we completely changed our approach. We reconstructed user feedback not as simple error reports, but as Instruction Data that the model can directly use in its inference process to strengthen its Visual Reasoning capability.

This article will focus on how VLM-based Instruction Tuning utilizing user feedback data overcomes the limitations of the previous Vector DB-centric approach and improves the model's visual reasoning performance.




1. Structural Challenges and the Need for Improvement in the Existing Vector DB-Centric Approach

EVA has been using a method where False Positive feedback is stored as vectors, and new images are corrected by searching for similar cases. While this method was quick and simple, it presented the following structural challenges in enhancing the model's intelligent judgment ability:

  • (1) Case-Based Dependence: The Vector DB relies solely on previously stored cases, making it difficult to generalize and respond when new types of complex cases (Hard Cases) emerge during system operation.

  • (2) Fundamental Limitation in Model Reasoning: Since the correction logic only operates at the filtering stage, the model itself does not learn error patterns, leaving open the possibility of repeating the same visual confusion.

  • (3) Insufficient Understanding of Complex Visual Context: The model lacks the ability to autonomously understand and integrate subtle site-specific visual variables into its judgment, such as lighting changes, color differences, or the presence/absence of reflective stripes.

In summary, the conventional approach faithfully served its role as a "tool for retrospectively correcting results," but its structure was unable to help the model learn why it misjudged and how to judge correctly, preventing it from making consistent judgments independently.
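For concreteness, the previous correction loop can be pictured roughly as in the sketch below. This is a minimal illustration, not EVA's actual implementation: the embedding pipeline, similarity threshold, and stored-case format are all assumptions.

```python
import numpy as np

def correct_alert(query_emb: np.ndarray,
                  fp_case_embs: np.ndarray,
                  raw_alert: bool,
                  threshold: float = 0.85) -> bool:
    """Suppress an alert when the scene closely matches a stored False Positive case.

    query_emb:    embedding of the current image (shape: [dim])
    fp_case_embs: embeddings of past FP cases retrieved from the Vector DB (shape: [n, dim])
    """
    if not raw_alert or len(fp_case_embs) == 0:
        return raw_alert
    # Cosine similarity between the current scene and every stored FP case.
    sims = fp_case_embs @ query_emb / (
        np.linalg.norm(fp_case_embs, axis=1) * np.linalg.norm(query_emb) + 1e-8
    )
    # A very similar past FP suggests the new alert is probably also a false positive.
    return False if float(sims.max()) >= threshold else raw_alert
```

The key point is that this correction happens entirely outside the model: the VLM's own judgment is never updated, which is exactly the structural limitation described above.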




2. New Direction: VLM-based Instruction Tuning and Project Goals

While the previous Vector DB method focused on retrospectively correcting errors, the new approach centers on enabling the model to understand rules and reach consistent conclusions independently. The chosen solution for this is VLM (Vision-Language Model) based Instruction Tuning.

VLM has a structure that understands both images and text. By training it with Instructions (directives) and the correct answers, the model gains context-aware visual reasoning capability beyond simple detection. It learns not only "what to see" but also "how to judge."

This project was designed to validate these characteristics in a real industrial environment, with the core objectives as follows:

  • To reconstruct False Positive/False Negative feedback into a QA Instruction dataset.
  • To perform Instruction Tuning on the EVA-based VLM using this data.
  • To quantitatively verify how much the actual reasoning capability of the tuned model improves compared to the baseline.

This project was a process of demonstrating the possibility of structural improvement—allowing the model to understand rules and make autonomous judgments—going beyond mere performance enhancement.




3. 💡 Methodology

3.1. Phase 1: Baseline Establishment and Failure Collection

In the first phase of this project, we set the VLM currently used in EVA, Qwen2.5-VL-32B-Instruct, as the Baseline model. We used the Kaggle PPE Dataset to collect a total of 152 instances of Hard Cases (False Positive/False Negative).

  • False Positive (FP): Cases where an Alert occurred despite safety gear being worn.
  • False Negative (FN): Cases where an Alert did not occur even though safety gear was not worn.

These failure cases were crucial evidence, revealing the scenes where the model actually experienced visual confusion, going beyond simple accuracy metrics.

For the collected Hard Cases, human annotators directly described the image situation and supplemented the context for why the model failed. This process was aimed at systematically securing the visual patterns and judgment points that the model struggled with, moving beyond simple error checking.


Examples of Feedback Data (False Positive / False Negative)

Scenario: "If there is any worker not wearing either a safety helmet or a safety vest, please trigger an alarm (Alert)."

Failure Type: False Positive
Feedback: All workers in the image are wearing safety vests and helmets. Four out of five are wearing black helmets, and one is wearing a white helmet.

Failure Type: False Negative
Feedback: There are four workers in the image, and the third worker from the left is wearing a safety vest but not a safety helmet.
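In practice, each feedback instance like the two above can be stored as a small structured record before it is turned into training data. The schema below is an illustrative assumption rather than EVA's actual storage format.

```python
from dataclasses import dataclass

@dataclass
class FeedbackRecord:
    """One operator feedback instance for a Hard Case (illustrative schema)."""
    image_path: str    # frame on which the model made its judgment
    scenario: str      # the alert rule the model was asked to enforce
    failure_type: str  # "false_positive" or "false_negative"
    model_output: str  # what the model actually concluded (e.g., "Alert: True")
    feedback: str      # operator's description of the real scene

example = FeedbackRecord(
    image_path="hard_cases/frame_0123.jpg",  # hypothetical path
    scenario="If there is any worker not wearing either a safety helmet or a "
             "safety vest, please trigger an alarm (Alert).",
    failure_type="false_positive",
    model_output="Alert: True",
    feedback="All workers in the image are wearing safety vests and helmets. "
             "Four out of five are wearing black helmets, and one is wearing a white helmet.",
)
```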


3.2. Phase 2: QA Instruction Dataset Construction

In Phase 2, we advanced the collected feedback from simple text into a QA Instruction dataset that could directly enhance the model's reasoning capability. This stage was critical for teaching the model not only "what it got wrong" but also "why it was wrong and how to judge correctly."


(1) Dataset Construction Procedure: Semi-Automated Construction

Referring to the semi-automated Instruction generation method used in Holmes-VAD (Zhang et al. 2024), we applied the following process:

  1.  Input the False Positive/False Negative images + human-written feedback into a powerful VLM. (We used Gemini-2.5-flash for this project).
  2.  Generate QA drafts according to the four predefined Question Types (A-D).
  3.  Researchers review the drafts, remove unnecessary explanations, and revise them to match the actual judgment rules.

This process converted simple feedback into a structured Instruction format, resulting in a total of 518 QA data points from the initial 152 feedback instances.
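The generation step of this pipeline can be sketched as follows. To avoid tying the example to a specific SDK, the Gemini-2.5-flash request is abstracted behind a `call_vlm(prompt, image_path)` placeholder, and the prompt wording is illustrative; only the four question types and the researcher review step mirror the procedure above.

```python
QUESTION_TYPES = {
    "A": "Fact Verification",
    "B": "Comparison & Contrast Reasoning",
    "C": "Counterfactual / Hypothetical Reasoning",
    "D": "Failure Analysis & Self-Diagnosis",
}

def build_generation_prompt(record, q_type: str) -> str:
    """Compose the draft-generation prompt for the helper VLM (wording is illustrative)."""
    return (
        "You are helping build a QA instruction dataset for PPE monitoring.\n"
        f"Scenario: {record.scenario}\n"
        f"Failure type: {record.failure_type}\n"
        f"Operator feedback: {record.feedback}\n\n"
        f"Write one '{QUESTION_TYPES[q_type]}' question about the attached image "
        "and its correct answer, grounded only in the feedback above."
    )

def generate_qa_drafts(record, call_vlm) -> list:
    """call_vlm(prompt, image_path) -> str stands in for the Gemini-2.5-flash request."""
    drafts = []
    for q_type in QUESTION_TYPES:
        draft = call_vlm(build_generation_prompt(record, q_type), record.image_path)
        # Drafts are still reviewed and edited by researchers before entering the dataset.
        drafts.append({"type": q_type, "draft": draft, "image": record.image_path})
    return drafts
```

Here `record` is assumed to follow the FeedbackRecord sketch from Section 3.1.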


(2) Four Question Types (A–D)

The collected feedback was converted into the following four types. Type A focuses on fact verification, while Types B, C, and D are all reasoning enhancement types. Each type was designed to supplement the judgment factors the model had previously missed.

A. Fact Verification

The most direct correction type, which fixes the part the model incorrectly judged. It is effective for baseline correction by clearly identifying the core of the issue.

Question: "If there is any worker not wearing either a safety helmet or a safety vest, please trigger an alarm (Alert)."
Answer: "The worker in the image is wearing a safety helmet but not a safety vest. Therefore, an alert should be triggered."


B. Comparison & Contrast Reasoning

A type designed to teach the model to distinguish between similar-looking but actually different visual patterns. Used to explain subtle differences, such as safety vest vs. ordinary clothing, or the presence/absence of reflective tape.

Question: "Distinguish between the person complying with safety regulations and the person who is not, and explain the differences."
Answer: "The worker in the center is wearing a black safety helmet but no safety vest. This worker is wearing only an orange T-shirt, which, unlike the safety vests of the other workers, lacks reflective tape and does not have a visible seam between the torso and sleeves."


C. Counterfactual / Hypothetical Reasoning

A type that helps the model gain a deeper understanding of visual evidence through condition-change-based reasoning, such as: "What would need to change for this person to be judged as compliant?"

Question: "If this person were to be judged as 'wearing a safety vest,' what visual elements in the current image would need to be changed?"
Answer: "To judge this person as 'wearing a safety vest,' first, the garment should become sleeveless. Also, the color of the garment should change to a bright color like fluorescent yellow, and silver reflective stripes should be added to the shoulder or waist lines."


D. Failure Analysis & Self-Diagnosis

A type that teaches the model to self-identify and explain errors like "reasoning is correct but the final conclusion is wrong" (reasoning-action inconsistency). It plays a crucial role in strengthening the model's consistency.

Question: "Despite correctly inferring that the worker was 'not wearing a safety vest,' the model ultimately concluded (Alert: False). What is the problem with this inference, and what is the correct judgment?"
Answer: "Although the model clearly explained the 'non-compliance' due to 'no safety vest' in the text, it took the contradictory final action of (Alert: False). This is a critical 'reasoning-action inconsistency' error. The correct output should be (Alert: True)."


3.3. Phase 3: VLM Fine-Tuning

Phase 3 involved actually tuning the VLM model used in EVA utilizing the QA Instruction dataset constructed earlier. This stage focused on enabling the model to understand rules more accurately and make consistent judgments.

(1) Target Model and Environment
Qwen2.5-VL-32B-Instruct was selected as the experimental target because it is the model that can be deployed in EVA, allowing us to directly verify the actual service improvement effect. Considering the constraints of the training environment, we applied Quantization and QLoRA to reduce the model's memory usage.

(2) Tuning Technique: LoRA (Low-Rank Adaptation)
Instead of retraining all of the model's parameters, this method adds small sets of additional low-rank weights (LoRA modules) and trains only those.

The massive weights of the original model are kept intact while only a small set of weights needed for the specific task are fine-tuned. This technique significantly reduces training costs while still achieving performance improvement.
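A minimal QLoRA-style setup might look like the sketch below, assuming a recent transformers release with Qwen2.5-VL support plus the peft and bitsandbytes libraries. The rank, alpha, and target modules are illustrative choices, not the values used in this project.

```python
import torch
from transformers import (AutoProcessor, BitsAndBytesConfig,
                          Qwen2_5_VLForConditionalGeneration)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

MODEL_ID = "Qwen/Qwen2.5-VL-32B-Instruct"

# 4-bit quantization keeps the 32B base model within limited GPU memory.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# LoRA adapters on the attention projections; the frozen base weights stay untouched.
lora_config = LoraConfig(
    r=16,                      # illustrative rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA weights are trainable
```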

(3) Training Method: Supervised Fine-Tuning
In Fine-tuning, we used both the failure case image and the QA Instruction as input.

  • Input: Failed detection image & corresponding QA Instruction for that image
  • Objective: Train the model to accurately generate the correct answer text token by token, given the image and the question.

In this stage, supervised learning was applied to enable the model to combine visual information with rule-based queries to generate accurate judgments and explanations.
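One common way to implement this objective is to mask the prompt portion of the labels so that the cross-entropy loss is computed only on the answer tokens. The sketch below shows the idea in isolation; a real collator also has to handle image tokens and padding.

```python
IGNORE_INDEX = -100  # tokens labeled -100 are excluded from the cross-entropy loss

def build_labels(input_ids: list, prompt_length: int) -> list:
    """Mask the image+question prompt so loss is computed only on the answer tokens.

    input_ids:     full token sequence (prompt followed by the ground-truth answer)
    prompt_length: number of tokens belonging to the prompt part
    """
    labels = list(input_ids)
    for i in range(min(prompt_length, len(labels))):
        labels[i] = IGNORE_INDEX
    return labels

# Example: a 7-token sequence whose first 4 tokens form the prompt.
ids = [101, 5, 9, 42, 300, 301, 302]
print(build_labels(ids, prompt_length=4))  # [-100, -100, -100, -100, 300, 301, 302]
```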




4. 📈 Key Insights and Performance Benchmarks

4.1. Defining Comparison Models

In this experiment, we applied different scopes of data to a single baseline model (Qwen2.5-VL-32B-Instruct) to compare performance differences.

| Model | Description |
| --- | --- |
| Baseline | The original model without any additional training. |
| Model A | Tuned using only Fact Verification questions (A). → Focuses on simple judgments, e.g., "Is a helmet worn?" |
| Model ABCD | Tuned with diverse questions, including Comparison/Contrast (B), Hypothetical (C), and Failure Analysis (D), in addition to Fact Verification (A). → Learns why it makes a certain judgment and when that judgment changes. |

Model A is a simple fact-centric model, while Model ABCD is a rich reasoning and context-based model.


4.2. Key Insights Discovered

✔ RQ1. Does Feedback-based Instruction Tuning Improve Real-world Performance?

The results are clear. Both Model A and Model ABCD recorded higher performance than the Baseline. We confirmed that organizing feedback data into a Q&A format and using it for training significantly improves the model's PPE judgment capability.


✔ RQ2. Are Simple Fact Questions Sufficient?

Simple fact questions alone have limitations.

  • Model A (trained with only fact questions): Detected violations reliably but tended to over-trigger alerts on compliant scenes, resulting in 88 False Positives.
  • Model ABCD (including reasoning questions): Learned various inference patterns, such as comparison/contrast, evidence explanation, and hypothetical scenarios, which significantly reduced False Positives to 17 instances.

As shown, training the model with the reasoning behind the judgment is required for it to be more stable and reliable than simply training on right/wrong answers.


4.3. Performance Benchmarks

Evaluation was performed on a total of 1,000 test images (500 compliant, 500 non-compliant).

📊 Quantitative Performance Summary

| Model | Accuracy | Precision | Recall | F1-score |
| --- | --- | --- | --- | --- |
| Baseline | 0.89 | 0.87 | 0.91 | 0.69 |
| Model A | 0.91 | 0.85 | 0.99 | 0.91 |
| Model ABCD | 0.94 | 0.96 | 0.91 | 0.94 |

Model ABCD recorded the best overall metrics.
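For reference, these metrics can be recomputed from raw predictions with scikit-learn, treating a triggered alert (non-compliant scene) as the positive class, consistent with the FP/FN definitions in Section 3.1. This is a generic evaluation sketch, not the project's actual evaluation script.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate(y_true: list, y_pred: list) -> dict:
    """y_true / y_pred: 1 = alert (non-compliant scene), 0 = no alert (compliant scene)."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "false_positives": sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1),
    }
```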


🚫 False Positive Reduction Effect

| Category | Baseline | Model A | Model ABCD |
| --- | --- | --- | --- |
| False Positive Instances | 69 | 88 | 17 |

While Model A frequently misjudged compliant scenes as violations, Model ABCD corrected most of these errors through diverse reasoning training. This confirms that training with reasoning-based questions is effective in increasing the model's judgment reliability.




5. Conclusion

In this project, we applied feedback-based VLM Instruction Tuning to solve the False Positive/False Negative issues occurring in EVA and to improve the model's complex condition judgment capability.

The key conclusions drawn from the experiments and analysis are as follows:

  • Processing and training on feedback data in Instruction format improves model performance.
        Feedback converted into QA Instructions improves the model's intrinsic reasoning capability, unlike simple filtering. F1-score and Precision both saw a noticeable increase compared to the Baseline.

  • Training that includes reasoning questions yields greater effectiveness than simple fact verification.
        The model trained on questions involving comparison/contrast, evidence explanation, and hypothetical scenarios (Model ABCD) significantly reduced False Positives, enabling highly reliable judgment in real industrial environments.

  • Enhanced Model Stability through Hard Case Learning
        While the previous Vector DB approach relied on case-based correction, Instruction Tuning improved the model's ability to cope with new types of visual ambiguities actively.

In summary, the approach of VLM Instruction Tuning based on user feedback data proves effective in simultaneously enhancing the model's reasoning capability and judgment reliability in real industrial settings.

Crucially, it provides significant insights by confirming the effectiveness of training that includes "Why" and "What-if" thought processes, moving beyond simple correct answer learning.




References

Zhang, H., et al. (2024). Holmes-VAD: Towards Unbiased and Explainable Video Anomaly Detection via Multi-modal LLM. arXiv preprint.