Abstract: Objectives: With advances in computer vision technology, accurately identifying and segmenting the components of food images has become essential for food nutrition analysis and healthier diet management. However, most existing image segmentation models rely solely on a single image input and often fail to capture the subtle distinguishing features of food images with minimal visual differences, which limits segmentation accuracy. This paper addressed the limitations of single-modality approaches by incorporating textual information to provide richer contextual cues for the model and by leveraging self-distillation to guide the model in segmenting food images effectively. Methods: This paper proposed a multi-modal self-distillation segmentation model guided by ingredient information to improve food image segmentation. The model leveraged the Contrastive Language-Image Pre-training (CLIP) model to capture ingredient information and fused it with image knowledge. By exploiting the strengths of diffusion models in dense prediction, the model achieved accurate segmentation of food images. Results: When evaluated on the FoodSeg103 benchmark dataset, the model achieved an mIoU of 47.93%, surpassing the current best-performing FoodSAM model by 1.51%. On the UEC-FoodPIX Complete benchmark dataset, the mIoU reached 75.13%, outperforming FoodSAM by 8.99%. Conclusions: The proposed multi-modal self-distillation network demonstrated strong performance in food image segmentation, showing that ingredient information can effectively guide segmentation. This approach significantly improves segmentation accuracy and presents a promising solution for food image analysis.
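To make the multi-modal idea concrete, the sketch below shows one way ingredient text embeddings from CLIP could be fused with dense image features to guide per-pixel classification. It is a minimal illustration under stated assumptions, not the paper's actual architecture: the frozen CLIP text encoder, the simple CNN backbone, the cross-attention fusion, the class count (e.g., 103 FoodSeg103 ingredient classes plus background), and the omission of the diffusion decoder and self-distillation branch are all hypothetical choices for clarity.

```python
# Minimal sketch of ingredient-guided segmentation (illustrative, not the paper's model).
import torch
import torch.nn as nn
from transformers import CLIPModel, CLIPTokenizer

class IngredientGuidedSegmenter(nn.Module):
    def __init__(self, clip_name="openai/clip-vit-base-patch32", num_classes=104, embed_dim=512):
        super().__init__()
        self.tokenizer = CLIPTokenizer.from_pretrained(clip_name)
        self.clip = CLIPModel.from_pretrained(clip_name)
        for p in self.clip.parameters():        # keep CLIP frozen; only fusion/decoder would train
            p.requires_grad = False
        # Stand-in CNN backbone producing dense features in the CLIP embedding dimension.
        self.image_backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=4, padding=1), nn.ReLU(),
            nn.Conv2d(64, embed_dim, 3, stride=4, padding=1), nn.ReLU(),
        )
        # Cross-attention: image patches attend to the pooled ingredient text embedding.
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads=8, batch_first=True)
        self.decoder = nn.Conv2d(embed_dim, num_classes, kernel_size=1)

    def forward(self, images, ingredient_texts):
        # images: (B, 3, H, W); ingredient_texts: list of B strings, e.g. "rice, egg, broccoli"
        feats = self.image_backbone(images)                      # (B, C, h, w)
        B, C, h, w = feats.shape
        tokens = self.tokenizer(ingredient_texts, padding=True, return_tensors="pt")
        text_emb = self.clip.get_text_features(**tokens)         # (B, C) ingredient embedding
        text_emb = text_emb.unsqueeze(1)                         # (B, 1, C) as key/value
        img_seq = feats.flatten(2).transpose(1, 2)               # (B, h*w, C)
        fused, _ = self.cross_attn(img_seq, text_emb, text_emb)  # inject ingredient context
        fused = (img_seq + fused).transpose(1, 2).reshape(B, C, h, w)
        logits = self.decoder(fused)                             # (B, num_classes, h, w)
        return nn.functional.interpolate(logits, size=images.shape[-2:], mode="bilinear")

model = IngredientGuidedSegmenter()
out = model(torch.randn(2, 3, 224, 224), ["rice, egg, broccoli", "noodles, beef, pepper"])
print(out.shape)  # torch.Size([2, 104, 224, 224])
```

In this reading, the ingredient text acts as a conditioning signal that biases the dense features toward the listed components; the paper's full method additionally uses a diffusion-based dense predictor and self-distillation, which are not reproduced here.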