Here is a non-Yahoo! related post that I responded to on Astahost. Considering it is an interesting answer and it took some thought and time to create, I am also posting it here. An Astahost member, FireFoxRules, asks the following question (
http://www.astahost.com/Downloading-Int ... 21292.html):
Quote:
I’m wondering if it is possible to save a copy of everything on the Internet. Ignoring ISP data transfer limitations (max GB per month), I have a download speed of approximately 4 Mbps.
The Internet isn’t limited to web pages though, it includes everything that is public accessible (not password-protected) which includes all music, videos, pictures, software, etc. Furthermore, I am not limiting it to HTTP servers as torrents, files on FTP servers and anything on peer-to-peer networks (Gnutella/LimeWire) will count as well.
Saving everything at its current state (ignoring changes to the live version after it is saved), how long will this take? What if I upgrade my Internet connection, or theoretically use all the bandwidth of (for example) educational institutions (universities), ISPs (Shaw, Comcast, etc) and large corporations (Microsoft, Google, etc).
I am not talking about indexing content, I mean saving the actual file. Every web page would be considered one file, and pictures, JavaScript, CSS, etc would be their own files.
My response:
Interesting question. I am actually surprised that you, FireFoxRules, asked it, as it sounds like a crazy idea that I would expect from a newb. At any rate, it did get me thinking, so I will propose an answer.
Assumptions
• You have an insane Internet backbone connection with guaranteed reliability and speed. I will assume that you have a 100 Mb/sec connection, which is usually only available to ISP-level organizations.
• You have an appropriately sized upstream connection to do all the requesting.
• You actually get the bandwidth you paid for. I personally have a “10 Mb/sec down and 1 Mb/sec up” consumer cable connection. I have never seen anything close to these numbers in real life. The closest I have seen is 2 Mb/sec down (downloading ISOs from Microsoft MSDN) and there is a hard limit of around 115 kb/sec up that I constantly hit. A more typical download speed is around 500 kb/sec for regular web browsing.
• We will ignore all network structure and latency issues and assume you have a direct connection to your target with no hops in between.
o The nature of TCP/IP will limit you to around 80% of your bandwidth under ideal operation. When you have only two computers on a network (the ideal case), you will still never get 100% of the bandwidth because of TCP header overhead, IP header overhead, other traffic such as ARP requests, and IP timing issues. A typical network usually sees only 45-50% of its bandwidth because of collisions. A stressed-out network may only get 10%.
o There is latency between your request and the data.
Machine and router hardware delays. Usually microseconds.
Every hop adds delay. Usually milliseconds.
Server response time. Usually small compared to everything else but could become an issue. Ranges from milliseconds (typical) to minutes.
o In total you should expect to take at least 50% off your promised bandwidth in an ideal case. This brings our 100 Mb/sec connection down to more like 50 Mb/sec; but as stated earlier, we are ignoring this.
o Internet speed is based on more than your connection speed. The bandwidth of the server is also very important. You may have sufficient bandwidth but if you request from a server that is slower than your connection, you are stuck with their speed. I find that a typical website will only transfer up to 50 kb/sec so you will have to download from many different servers at the same time to fill your 100 Mb/sec pipe.
• You have enough computing power. At 100 Mb/sec you are starting to get into the range of IDE hard drive data transfer rates. You will also want to have several threads going at the same time to maximize bandwidth utilization: you want to download a different web page while you are waiting on the request for another page. Better yet, you want to keep your bandwidth pipe full even if you hit a slow server or a timeout, which can be up to 2 minutes. I would guess that you would need 150-300 threads or requests going at the same time to meet this demand. A single computer likely will not be able to do this alone, so you would end up with at least 5-10 servers on your end to pull this off. This of course breaks the ideal case of no network congestion or collisions as described earlier.
• You have enough storage space. A quick search shows that YouTube alone has around 7.7 petabytes of content (http://beerpla.net/2008/08/14/how-to-fi ... n-youtube/ ). Newegg is showing 1 TB hard drives for around $90. With the needed hardware and controllers, you are looking at around $100/TB. At this rate you will need 7,700 1 TB hard drives, which would cost you around $770,000. A related article on BackBlaze (http://blog.backblaze.com/2009/09/01/pe ... d-storage/) shows you how to build your own 67 TB 4U rack server for $7,867 including drives and rack hardware. At the BackBlaze rate, 7.7 PB will cost $904,118, or almost 1 million dollars.
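The storage arithmetic in the last bullet can be checked with a short Python sketch. The function names and the whole_pods option are my own additions for illustration; the prices and capacities are the ones quoted above, and decimal units (1 PB = 1,000 TB) are assumed.

```python
import math

# 7.7 PB of YouTube content, expressed in decimal terabytes (figure from the text).
YOUTUBE_TB = 7700

def cost_with_single_drives(total_tb, usd_per_tb=100):
    """Cost at the rough $100/TB rate (drive plus supporting hardware)."""
    return total_tb * usd_per_tb

def cost_with_pods(total_tb, pod_tb=67, pod_usd=7867, whole_pods=False):
    """Cost using 67 TB storage pods at $7,867 each.

    whole_pods=True rounds up to complete pods (a hypothetical refinement,
    since you cannot buy a fraction of a server).
    """
    pods = total_tb / pod_tb
    if whole_pods:
        pods = math.ceil(pods)
    return pods * pod_usd

print(cost_with_single_drives(YOUTUBE_TB))            # 770000
print(round(cost_with_pods(YOUTUBE_TB)))              # 904118
print(cost_with_pods(YOUTUBE_TB, whole_pods=True))    # 904705 (115 pods)
```

Rounding up to whole pods barely changes the total, so "almost 1 million dollars" holds either way.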
Gotchas
• Connection speeds are measured in BITS and not BYTES. There are 8 bits to a byte so this means that you need to divide your connection speed by 8 right off the top. This will make our 100 Mbit/sec connection a 12.5 Mbyte/sec connection. With typical network delays, this would become 6.25 Mbyte/sec.
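As a quick sanity check on the bits-versus-bytes conversion, here is a minimal Python sketch. The function name and the efficiency parameter are hypothetical; the 0.5 value stands in for the "typical network delays" estimate and is not a measured figure.

```python
def usable_megabytes_per_sec(link_mbit_per_sec, efficiency=1.0):
    """Convert an advertised link speed in megabits/sec to megabytes/sec.

    efficiency is the fraction of the link you actually get
    (1.0 = ideal, 0.5 = the rough real-world estimate from the text).
    """
    return link_mbit_per_sec / 8 * efficiency

print(usable_megabytes_per_sec(100))       # 12.5
print(usable_megabytes_per_sec(100, 0.5))  # 6.25
```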
Now let’s do some calculations (whips out trusty TI-89 calculator).
12.5 Mbyte/sec*60 seconds = 750 MB/min
750 MB/min* 60 mins = 45 GB/hour
45 GB/hour *24 hours = 1080 GB/day or ~1 TB/day (1.08e12)
With the YouTube example above of 7.7 petabytes (1 petabyte = 1e15 bytes)…
7.7e15 Bytes/1.08e12 Bytes/day=7129.63 days
7129.63 days/365 days/year = 19.5332 years
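For anyone who wants to try other connection speeds, the same calculation can be written as a small Python sketch. The function name and the efficiency parameter are my own; it assumes decimal units (1 Mbit = 1e6 bits) and a perfectly saturated link unless you lower efficiency.

```python
SECONDS_PER_DAY = 60 * 60 * 24

def days_to_download(total_bytes, link_mbit_per_sec, efficiency=1.0):
    """Days needed to pull total_bytes over a link of the given speed."""
    bytes_per_sec = link_mbit_per_sec * 1e6 / 8 * efficiency
    return total_bytes / (bytes_per_sec * SECONDS_PER_DAY)

youtube_bytes = 7.7e15  # 7.7 PB, the YouTube estimate used above

for label, mbit, eff in [
    ("100 Mb/sec, ideal", 100, 1.0),
    ("100 Mb/sec, 50% efficiency", 100, 0.5),
    ("4 Mb/sec (speed from the original question), ideal", 4, 1.0),
]:
    days = days_to_download(youtube_bytes, mbit, eff)
    print(f"{label}: {days:,.2f} days, {days / 365:.2f} years")
```

The first case reproduces the roughly 19.5 years worked out by hand; halving the efficiency doubles the time, and at the 4 Mb/sec from the original question the YouTube content alone would take several centuries.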
Just downloading the YouTube database with an insane Internet connection will take you almost 20 years and almost 1 million dollars in hard drive storage alone. Hope this answers your question.