Friday, March 1, 2019
Advanced Data Structure Project
CSCI4117 Advanced Data Structure Project Proposal
Yejia Tong / B00537881
2012.11.5

1. Title of Project

Succinct data structure in top-k documents retrieval

2. Objective of Research

The main aim of this project is to discuss how to efficiently find the k documents where a given pattern occurs most frequently. While the problem has been discussed in many papers and solved in various ways, our research looks for novel algorithms and (succinct) data structures among recent related materials, to find the one that dominates almost all of the space/time tradeoff.

3. Background/History of the Study

Before we begin our search for such a succinct data structure, there are a number of fundamental works underlying our approach. Among the many ideas in classic information retrieval, two are central: the inverted index and term frequency (Angelos, Giannis, Epimeneidis, Euripides, & Evangelos, 2005). The inverted index, also referred to as a postings file, is an index data structure storing a mapping from content, such as words, to its locations in a document or set of documents. It is the most widely used data structure in the information retrieval domain, deployed on a massive scale in, for example, search engines. Term frequency is a measure of how often a term is found in a collection of documents.

However, these ideas are efficient only under restrictive assumptions: the text must be easily tokenized into words, there must not be too many different words, and queries must be whole words or phrases. These assumptions cause considerable difficulty for document retrieval across various languages. At the same time, one of the attractive properties of an inverted file is that it is easily compressible while still supporting fast queries. In practice, an inverted file occupies space close to that of a compressed document collection.
(Niko & Veli, 2007) In further development, efficient data structures such as suffix arrays and suffix trees (full-text indexes) were found to provide space/time efficiency good enough to compete with inverted files. Recently, several compressed full-text indexes have been proposed and shown to be effective in practice as well.

A generalized suffix tree is a suffix tree for a set of strings. Given the set of strings D = S(1), S(2), ..., S(d) of total length n, it is a Patricia tree containing all n suffixes of the strings. It can be built in O(n) time and space, and can be used to find all k occurrences of a string P of length m in O(m + k) time (Bieganski, 1994).

Then we come close to our original motivation: document retrieval. Matias et al. gave the first efficient solution to the document listing problem. With O(n) time preprocessing of a collection D of documents d(1), d(2), ..., d(k) of total length Sum d(i) = n, they could answer the document listing query on a pattern P of length m in O(m log k + |output|) time (Y., S., S., & J., 1998). The algorithm uses a generalized suffix tree augmented with extra edges, making it a directed acyclic graph. However, it requires O(n log n) bits, which is significantly more than the collection size. Later, Niko V. and Veli M. presented an alternative space-efficient variant of Muthukrishnan's structure that takes fewer bits while retaining optimal query time (Niko & Veli, 2007). Building on this background, we finally move forward to our main topic: succinct data structures in top-k document retrieval.

4. Research to the Study

According to the background above, the suffix tree is used to minimize space consumption. In the suffix tree document model, a document is considered as a string consisting of words, not characters. While constructing the suffix tree, each suffix of a document is compared to all suffixes that already exist in the tree to find a position for inserting it. Hon W. K., Shah R. and Wu S. B.
introduced the first efficient solution for top-k document retrieval (Hon, Shah, & Wu, 2009). To filter out the many noisy factors in a large collection, the algorithm adds a minimum term frequency as one of the parameters for a highly relevant pattern P (Hon, Shah, & Wu, 2009). Furthermore, they also developed the f-mine problem for high relevancy, in which only documents that have more than f occurrences of the pattern need to be retrieved. The notion of relevance here is simply the term frequency.

In a later study, Efficient Index for Retrieving Top-k Most Frequent Documents, Hon W. K., Shah R. and Wu S. B. derived their solution from that of a related problem by Muthukrishnan (Y., S., S., & J., 1998), with bounds on both query time and space. The approach is based on a new use of the suffix tree called the induced generalized suffix tree (IGST) (Hon, Shah, & Wu, 2009). The practicality of the proposed index is validated by experimental results.

5. Future Works

Since all the fundamental works are settled, our future analysis of succinct data structures in top-k document retrieval is mainly based on the most recent accomplishment by Gonzalo N. and Daniel V. (Gonzalo & Daniel, 2012), a new top-k algorithm dominating almost all of the space/time tradeoff.

6. References

Angelos, H., Giannis, V., Epimeneidis, V., Euripides, P. G., & Evangelos, M. (2005). Information Retrieval by Semantic Similarity. Dalhousie University, Faculty of Computer Science. Halifax.
Bieganski, P. (1994). Generalized suffix trees for biological sequence data: applications and implementation. Minnesota University, Dept. of Comput. Sci. Minneapolis.
Gonzalo, N., & Daniel, V. (2012). Space-Efficient Top-k Document Retrieval. Univ. of Chile, Dept. of Computer Science. Valdivia.
Hon, W. K., Shah, R., & Wu, S. B. (2009). Efficient Index for Retrieving Top-k Most Frequent Documents. Springer, Heidelberg.
Niko, V., & Veli, M. (2007). Space-efficient Algorithms for Document Retrieval. University of Helsinki, Department of Computer Science. Finland.
Y., M., S., M., S., C. S., & J., Z. (1998). Augmenting suffix trees with applications. 6th Annual European Symposium on Algorithms (ESA 1998) (pp. 67-78). Springer-Verlag.
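
As an illustration of the inverted index and term frequency ideas from the Background section, here is a minimal Python sketch. It is not taken from any of the cited papers, and all function names and sample documents are hypothetical. It contrasts a whole-word inverted index query with a brute-force scan that accepts arbitrary patterns, which is the kind of baseline that the suffix-tree indexes discussed above aim to improve on. The sketch assumes whitespace tokenization and a non-empty pattern.

```python
import heapq
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """Map each word to a Counter of {doc_id: term frequency}."""
    index = defaultdict(Counter)
    for doc_id, text in enumerate(docs):
        # Assumption: text is easily tokenized by splitting on whitespace.
        for word in text.lower().split():
            index[word][doc_id] += 1
    return index


def top_k_by_word(index, word, k):
    """Return up to k (doc_id, tf) pairs for a whole-word query, highest tf first."""
    postings = index.get(word.lower(), {})
    # Order by descending frequency, breaking ties by smaller doc_id.
    return heapq.nsmallest(k, postings.items(), key=lambda item: (-item[1], item[0]))


def count_occurrences(text, pattern):
    """Count (possibly overlapping) occurrences of pattern in text."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def top_k_by_pattern(docs, pattern, k):
    """Brute-force baseline: scan every document and keep the k best matches."""
    scored = []
    for doc_id, text in enumerate(docs):
        occurrences = count_occurrences(text, pattern)
        if occurrences > 0:
            scored.append((doc_id, occurrences))
    return heapq.nsmallest(k, scored, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    docs = ["banana bandana", "ban the banana banana", "apple"]
    index = build_inverted_index(docs)
    print(top_k_by_word(index, "banana", 2))   # [(1, 2), (0, 1)]
    print(top_k_by_word(index, "ban", 5))      # [(1, 1)]  whole words only
    print(top_k_by_pattern(docs, "ban", 5))    # [(1, 3), (0, 2)]  any substring
```

The query for "ban" shows the whole-word restriction mentioned in the Background section: the inverted index sees it only as a standalone word, while the scan also counts it inside "banana" and "bandana". The price is that every document is read on every query.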