The Bag of Words (BoW) image representation has become popular in computer vision and content-based image retrieval, and the SIFT descriptor has been widely adopted for its good invariance properties. However, most BoW systems are built on a fixed image database, and all descriptors are trained together to produce the visual words. This strategy ignores the differences between SIFT descriptors at different scales and loses the relationship between visual words and their corresponding objects. In this paper, we propose a multi-scale visual word based object search system, in which a customized learning model enables image retrieval over an expandable database. Different visual word sets are trained from the learning model; each set contains a certain number of SURF (Speeded-Up Robust Features) descriptors from images of a specific object. Based on these visual word sets, we propose a multi-level index that uses visual words of a different scale at each level. SURF descriptors with larger scales describe the general appearance of an object, while descriptors with smaller scales represent its details. The multi-scale index imitates the human visual process of recognizing images at different scales. The learning model feeds back images containing similar objects from previous users' results and expands the image list automatically. Extensive experiments and comparisons with other image search engines show that our system gives promising performance.
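The multi-level index described above can be sketched minimally as follows. This is an illustrative sketch, not the paper's implementation: it assumes descriptors arrive as (scale, feature-vector) pairs, and the two-level split and the scale threshold are hypothetical choices for demonstration.

```python
# Minimal sketch of a scale-partitioned descriptor index: descriptors
# with larger scales go to a coarse level (general object appearance),
# smaller scales go to a fine level (details). The threshold value and
# the two-level layout are illustrative assumptions, not from the paper.

def build_multiscale_index(descriptors, scale_threshold=4.0):
    """Partition (scale, vector) pairs into coarse and fine levels."""
    index = {"coarse": [], "fine": []}
    for scale, vec in descriptors:
        level = "coarse" if scale >= scale_threshold else "fine"
        index[level].append(vec)
    return index

# Hypothetical SURF-like descriptors as (scale, vector) pairs.
descs = [(6.2, [0.1, 0.9]), (1.5, [0.4, 0.2]), (5.0, [0.7, 0.3])]
idx = build_multiscale_index(descs)
print(len(idx["coarse"]), len(idx["fine"]))  # → 2 1
```

At query time, a system built this way would match coarse-level words first to narrow the candidate objects, then match fine-level words to refine the ranking, mirroring the coarse-to-fine recognition process the abstract describes.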