shaolin_Z
Hei Hu Quan

Registered: Nov 2004
Location: Austin, Texas, USA: TXTA #102
Something I thought you guys might find interesting.
www.scroogle.org

Why we scrape:
| quote: |
Scraping and ad-stripping Google's results
If done in the public interest and not for profit, it's legal.
What's more, Google can't block you if they can't find you.
Public Information Research, Inc., the nonprofit public charity behind www.google-watch.org and www.scroogle.org, has been running a Google proxy for more than two years. On January 3, 2005 we released the source code for our proxy. Our review of the legal situation has convinced us that we are covered by "fair use" under the Copyright Act.
This step that we have taken has implications for all search engines. These engines crawl the public web without asking permission, cache and reproduce the content without asking permission, and then use this information as a carrier for ads that generate private profit. We are convinced that if citizens scrape Google, strip the ads, and make the scraped results available as a nonprofit public service, this is legal. This is especially the case if there are public policy concerns behind the scraping.
Google Watch has been the most prominent critic of Google's outrageous privacy policies for more than two years. This is why we started the proxy, and it's why we continue the proxy. We invite Google to serve us with a cease and desist letter as a first step toward resolving this issue. So far, we have yet to hear from Google's lawyers. By releasing the source code for our proxy, we're trying to escalate the issue.
If it can be established that what we're doing is legal -- or at least sufficiently legal so that Google is not eager to challenge us -- then this will begin to restore a public-interest balance to the web that has been declining ever since big money got behind the dot-coms.
There is the additional problem of whether anyone who scrapes Google can avoid getting blocked by Google. We experienced this when Google blocked Scroogle in December, 2003. We moved to a different server and continued as before, because Google could no longer find us. In our opinion, it's legal for Google to block whomever they want, even while it's also legal for us to scrape them if we can.
If the scraping is done properly, it is not worth Google's trouble to find you. Our source code separates the "fetch" portion of the program, which is done by curl or wget, from the searcher interface and parsing of the fetched results. If the fetching is done by a server on a different Class C address from the website that shows the scraped results, there is little that Google can do to find the IP address that is responsible for the actual fetch.
| quote: | A Google block requires a John Doe server
Google uses a couple dozen data centers with dedicated IP addresses. A number of these are located outside the U.S. Once these addresses are discovered (search for "google data centers"), it is trivial to maintain the list. The addresses will change over time, but they won't change that quickly.
If a scraper is coming into Google from an address that is outside the local IP block where his public interface operates, we believe that Google is currently ill-equipped to discover him. Yahoo, by contrast, appears to have a more centralized system, and is able to throttle excessive activity from a single IP. We saw only two IP addresses for Yahoo when our Yahoo scraper was active. About two percent of our fetches were throttled. Google, with a more distributed system, makes it easy for scrapers to distribute their fetches across most of Google's data centers.
Setting up a John Doe fetch is quite easy. All you need are CGI privileges on Mr. Doe's server. It's easiest to just share someone's account. Dedicated IP hosting is best for this. There is no need for DNS name service from Mr. Doe, and no lookup delays.
When you get a search request, instead of forking to one of Google's IP addresses, you fork to Mr. Doe's CGI program. This program on Mr. Doe's site is a subset of the source code already available. Mr. Doe does the fetch from the list of Google IP addresses, and then immediately spits out that same file back to you, and deletes the file. It all happens without dropping the connection between your scraper and Mr. Doe. You parse this file on your public site as if it arrived directly from Google. There could easily be more than one Mr. Doe. Evil hackers could even use a network of zombie PCs.
What would Google need to find Mr. Doe? This is guesswork, but it seems that Google would need software at all of their data centers that can be switched in or out in real time. This software would scan incoming search terms. If there's a match with a secret term sent out on your proxy by some Google undercover cop using your interface, then the software would report back that this term was logged at such-and-such data center, from such-and-such IP address. Now Google knows whom to block. They do have an IP blocking capability across all data centers, but we suspect that they don't yet have this sort of search-term interception and reporting capability. The reason the software would have to be switchable is because this scanning is CPU-intensive for Google, and it only needs to run on rare occasions.
If Google blocks us, we plan to take our Yahoo scraper out of retirement within 24 hours as a substitute for Google's results, and think about what we should do next. Yahoo's bloated interface requires four times more bytes per fetch than Google's www.google.com/ie interface, and this would be a sad day for us. |
The worst-case scenario we can think of would involve a two-pronged attack by Google. The first prong would be a legal effort by Google to stop us. We welcome this, and believe that we can prevail even though our market cap at PIR is somewhat less than Google's $50 billion. The second prong would be to block us once again. Currently our proxy is doing the Google fetch from the same Class C that our domains are on. This is an invitation for a block; it would take Google about 20 minutes to identify our fetcher's IP address.
The larger issue here is that the commercialization of the web became possible only because tens of thousands of noncommercial sites made the web interesting in the first place. All search engines should make a stable, bare-bones, ad-free, easy-to-scrape version of their results available for those who want to set up nonprofit repeaters. Even if it cuts into their ad profits slightly, there's no easier way to give back some of what they stole from us.
|
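For anyone unsure what the article means by a "different Class C address": in practice that usually means the two servers do not share the same /24 block, i.e. their first three octets differ. Here is a rough sketch of that check using Python's standard ipaddress module. The function name and the example addresses (taken from the reserved documentation ranges) are my own and do not come from the Scroogle source.

```python
import ipaddress


def same_class_c(addr_a: str, addr_b: str) -> bool:
    """Return True if both IPv4 addresses fall in the same /24 block.

    Assumption: "Class C" is read here as a /24-sized block, which is
    how the term is commonly used informally.
    """
    # strict=False lets us pass a host address and get its enclosing /24
    block = ipaddress.ip_network(f"{addr_a}/24", strict=False)
    return ipaddress.ip_address(addr_b) in block


# Hypothetical addresses from the reserved documentation ranges
print(same_class_c("192.0.2.10", "192.0.2.200"))   # same /24
print(same_class_c("192.0.2.10", "198.51.100.7"))  # different /24
```

So when the article says the fetch currently happens "from the same Class C that our domains are on", the first case is the situation it is describing.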
___________________
"The Greatest enemy of knowledge is not ignorance, it is the illusion of knowledge." -Stephen Hawking
"First they came for the communists, and I did not speak out— because I was not a communist;
Then they came for the socialists, and I did not speak out— because I was not a socialist;
Then they came for the trade unionists, and I did not speak out— because I was not a trade unionist;
Then they came for the Jews, and I did not speak out— because I was not a Jew;
Then they came for me— and there was no one left to speak out for me." -Martin Niemöller
Nov-12-2005 08:58
shaolin_Z
Hei Hu Quan

Registered: Nov 2004
Location: Austin, Texas, USA: TXTA #102
| quote: | Originally posted by St_Andrew
Congrats man |
hehe... thanks man. Once I have MPlayer and all my codecs installed, I'll be set. I don't think I'm going to be using Windows much anymore except for a few apps that aren't out there for Linux (e.g. Traktor DJ Studio). But I made the mistake of only making a 4.5 gig partition on my HD for Linux. I guess I'll use PartitionMagic to resize it and make it bigger. I haven't installed a BitTorrent client since I don't have much space (check out djmixes2k.com, it's a torrent site and completely legit, they have tons of great sets). So which client do you guys use/suggest? Another thing, I'm very particular about properly tagging and renaming the mp3 files I get from there. I use Musicmatch in Windows since it's got great ID3 tagging and renaming features for any number of files at once. Does anyone know of something similar?
Nov-12-2005 15:13