Design of compression and indexing techniques for a clustered web repository

Posted By: Anil Kumar | 22-Sep-2017

Problem domain:

The web crawler is a major component of the web. Data on the web is abundant and largely unstructured. The web grows every moment, with large updates made at every instant. Many web pages become obsolete, yet no notification of this ever reaches the repository. The central design issue is therefore to organize the repository so that users can retrieve web pages in a reasonable amount of time. Two examples illustrate the problem: the physical storage may have no space left for a newly downloaded document, and a large volume of data in storage leads to greater searching complexity.

 

Description of the various modules:

The most important component of the architecture, the Coordinator Module, receives a document from the crawler as a parameter and performs the following steps (a runnable sketch follows the algorithm below):
Identify the document's domain and format
Decide the memory block
Extract URLs
Add the URLs to the seed set of URLs
Compress the document
Update the index
Store the compressed document

 

Algorithm of Coordinator module:

Coordinator_module (domain, document)

Steps: 
1. Identify the document format, such as doc, html, or pdf;
2. Based on the domain of the downloaded document, decide the memory block and search only its domain-specific cluster (memory block) to check whether the document has already been downloaded;
3. If (document is a fresh one)
{
  3.1 Extract the links or references to other sites from the document;
  3.2 Compress the document using the Huffman algorithm;
  3.3 Update the index;
  3.4 Check the status of the chosen cluster; if space must be created for the newly downloaded document, remove lower-ranked documents;
  3.5 Store the compressed document in the repository;
}
4. else
{
  4.1 Call ranking_module (cluster, document);
  4.2 Convert the URL links into their absolute URL equivalents;
  4.3 Add the URLs to the set of seed URLs;
}
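To make the control flow concrete, here is a minimal Python sketch of the coordinator module. The Cluster class, the eviction policy, and the simplified ranking update are all assumptions made for illustration, not the actual implementation; compression is stubbed out with zlib here, while the Huffman step itself is sketched in the conclusion below.

import re
import zlib
from urllib.parse import urljoin

HREF_RE = re.compile(r'href=["\'](.*?)["\']', re.IGNORECASE)

class Cluster:
    """One domain-specific memory block with its own storage, ranks, and index."""
    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.storage = {}   # url -> compressed document
        self.rank = {}      # url -> rank score
        self.index = {}     # term -> set of urls containing that term

    def __contains__(self, url):
        return url in self.storage

    def is_full(self):
        return len(self.storage) >= self.capacity

    def evict_lowest_ranked(self):
        victim = min(self.rank, key=self.rank.get)   # step 3.4: drop less ranked document
        del self.storage[victim]
        del self.rank[victim]

def coordinator_module(domain, url, text, clusters, seed_urls):
    cluster = clusters.setdefault(domain, Cluster())      # step 2: decide the memory block
    if url not in cluster:                                # step 3: fresh document
        links = HREF_RE.findall(text)                     # 3.1 extract references to other sites
        compressed = zlib.compress(text.encode())         # 3.2 stand-in for Huffman coding
        for term in set(text.lower().split()):            # 3.3 update the index
            cluster.index.setdefault(term, set()).add(url)
        if cluster.is_full():                             # 3.4 create space if needed
            cluster.evict_lowest_ranked()
        cluster.storage[url] = compressed                 # 3.5 store the compressed document
        cluster.rank[url] = 1.0
    else:                                                 # step 4: already in the repository
        cluster.rank[url] += 1.0                          # 4.1 stand-in for ranking_module
        links = HREF_RE.findall(text)
        seed_urls.update(urljoin(url, l) for l in links)  # 4.2-4.3 absolutize, add to seeds

clusters, seeds = {}, set()
coordinator_module('html', 'http://example.com/a', '<a href="/b">web page</a>', clusters, seeds)
coordinator_module('html', 'http://example.com/a', '<a href="/b">web page</a>', clusters, seeds)
print(sorted(seeds))   # second call takes the else branch: ['http://example.com/b']

As in the algorithm above, fresh documents go through extraction, compression, indexing, and storage, while already-downloaded documents trigger re-ranking and seed-set growth.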
 

Conclusion:

The proposed work presented a clustered web repository together with meta-data management. The clustered web base uses the repository and its working modules to distribute data among domain-specific clusters, which together compose the clustered web repository. Distributing web pages across domain-specific blocks reduces searching complexity, since a query only has to scan the cluster it maps to.
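A small sketch of how cluster-aware lookups cut searching cost is given below; the term -> cluster -> postings layout is an assumption made for illustration, not the actual indexer structure.

from collections import defaultdict

class ClusteredInvertedIndex:
    """Inverted index keyed as term -> cluster -> list of (url, term frequency)."""
    def __init__(self):
        self.postings = defaultdict(lambda: defaultdict(list))

    def add_document(self, cluster, url, text):
        counts = defaultdict(int)
        for term in text.lower().split():
            counts[term] += 1
        for term, tf in counts.items():
            self.postings[term][cluster].append((url, tf))

    def search(self, term, cluster=None):
        """Restricting the lookup to one cluster is what reduces search cost."""
        by_cluster = self.postings.get(term, {})
        if cluster is not None:
            return by_cluster.get(cluster, [])
        return [p for plist in by_cluster.values() for p in plist]

index = ClusteredInvertedIndex()
index.add_document('html', 'http://example.com/a', 'web crawler design')
index.add_document('pdf', 'http://example.com/b.pdf', 'web repository design')
print(index.search('web', cluster='html'))   # only the html cluster is scanned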
The work also accounts for how often a keyword appears across documents: the keyword occurring in the largest number of documents is mapped to the shortest Huffman code. This mechanism reduces the size of each document, after which the index is updated. The indexer's data structure speeds up the retrieval of matching results from the inverted index together with the cluster information, which helps process user queries quickly and return more relevant results.
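As a rough illustration of the Huffman property this conclusion relies on, namely that the most frequent keyword receives the shortest code, here is a self-contained sketch; the keyword frequencies are invented for the example.

import heapq
from itertools import count

def huffman_codes(freqs):
    """Build a prefix-free code; higher frequency yields a shorter code."""
    tiebreak = count()   # breaks ties so tuples never compare the dicts
    heap = [(f, next(tiebreak), {sym: ''}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)   # two lowest-frequency subtrees
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: '0' + code for s, code in c1.items()}
        merged.update({s: '1' + code for s, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]

freqs = {'web': 50, 'crawler': 20, 'repository': 15, 'huffman': 10, 'cluster': 5}
for word, code in sorted(huffman_codes(freqs).items(), key=lambda kv: len(kv[1])):
    print(f'{word:12} freq={freqs[word]:3}  code={code}')

Running this shows 'web' (the most frequent keyword) getting a one-bit code while 'cluster' (the rarest) gets four bits, which is exactly the size reduction the mechanism exploits.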

About Author

Anil Kumar

Anil is a Web Developer who specializes in creating dynamic and beautiful web projects and has good experience working in distributed teams.
