Crawling the entire web.

If this doesn’t belong here, can you please point me to the right sub?

I didn’t know where to post this, but I know this is not an easy task to execute.

I only need to look at the source code, not have the page rendered in a web browser.

I will not crawl images, PDF files, or anything else beyond what is in the page source, and I will store that source compressed using a different kind of compression.
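To make the fetch-and-store idea concrete, here is a minimal Python sketch. It uses urllib for the raw HTTP fetch and gzip as a stand-in for whatever custom compression ends up being used; the URL handling, user-agent string, and function names are illustrative assumptions, not a real crawler:

```python
import gzip
import urllib.request

def fetch_source(url: str) -> bytes:
    """Fetch the raw HTML source only -- no rendering, no images, no PDFs."""
    req = urllib.request.Request(url, headers={"User-Agent": "my-crawler/0.1"})  # hypothetical UA
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.read()

def store_compressed(html: bytes) -> bytes:
    """Compress the page source before writing it to disk.
    gzip stands in here for the 'different kind of compression' mentioned above."""
    return gzip.compress(html)

# Round-trip check on a toy page, without touching the network:
page = b"<html><body>example page source</body></html>"
blob = store_compressed(page)
assert gzip.decompress(blob) == page
```

A real crawler would also need politeness delays, robots.txt handling, and deduplication, but the core loop is just fetch bytes, compress, write.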

I have built my own fully functioning web browser for iOS (including cloud sync), so I have some experience in the area.

Assume that cost isn’t an issue. I can purchase 30 petabytes of storage, 10 medium-spec PCs, and the fastest connection available from OTE (the national internet provider of Greece).

My questions are:

  • What are the minimum requirements for this (other than storage)?
  • Roughly, what is the time frame for completion? (Ignore pages that need more frequent updates and any other scheduling delays; assume a straightforward single pass.)
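For the time-frame question, a back-of-envelope calculation makes the main bottleneck visible: even if only ~1 KB per page is stored, the full source of every page still has to be downloaded. The numbers below (average raw page size, link speed) are assumptions, not measurements; plug in your own:

```python
PAGES = 30e12           # index size assumed in the post
AVG_PAGE_BYTES = 100e3  # assumed average raw HTML size per page
LINK_BITS_PER_S = 10e9  # assumed 10 Gbit/s downlink, saturated 24/7

total_bits = PAGES * AVG_PAGE_BYTES * 8
seconds = total_bits / LINK_BITS_PER_S
years = seconds / (365 * 24 * 3600)
print(f"~{years:.0f} years at full line rate")
```

Under those assumptions the single pass alone takes on the order of 75 years on one link, which is why large-scale crawlers distribute fetching across many machines and many network connections in parallel.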

I am experimenting with a miraculous new format that can compress any website into a 1 KB text file containing all of the meaning of that page, but this is not the reason I chose to do this.

Google indexes about 30 trillion pages; at 1 KiB per page that is 30*10^12 KiB ≈ 27.3 PiB (about 30.7 PB decimal), call it ~28 petabytes of data.
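The arithmetic checks out within rounding; a quick sanity check showing both decimal petabytes and binary pebibytes:

```python
pages = 30 * 10**12
total_bytes = pages * 1024    # 1 KiB per page
print(total_bytes / 1000**5)  # decimal petabytes, ~30.7
print(total_bytes / 1024**5)  # binary pebibytes, ~27.3
```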

submitted by /u/ashesnroses
