On mega-scanning
Jul. 13th, 2007 02:40 am
Some interesting details:
“Previously, when people have done scanning, they always were constrained by their budget and their scale,” Clancy told me. “They had to spend all this time figuring out which were the perfect ten thousand books, so they spent as much time in selection as in scanning. All the technology out there developed solutions for what I’ll call low-rate scanning. There was no need for a company to build a machine that could scan thirty million books. Doing this project just using commercial, off-the-shelf technology was not feasible. So we had to build it ourselves.”
Google will not discuss its proprietary scanning technology, but, rather than investing in page-turning equipment, the company employs people to operate the machines, I was told by someone familiar with the process. “Automatic page-turners are optimized for a normal book, but there is no such thing as a normal book,” Clancy said. “There is a great deal of variability over books in a library, in terms of size or dust or brittle pages.” (To needle Google, several blogs have posted images from the books site that include the scanners’ fingers.) Google will not reveal how much it is spending on the books project. In 2005, Microsoft announced that it would spend two and a half million dollars to scan a hundred thousand out-of-copyright books in the collection of the British Library. At this rate, scanning thirty-two million books—the number in WorldCat’s database—would cost Google eight hundred million dollars, a major but hardly extravagant expenditure for a multibillion-dollar corporation.
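The eight-hundred-million figure is a straight extrapolation from Microsoft's per-book rate, which is easy to check. A minimal sketch using only the numbers quoted above; the variable names are mine, and it assumes the cost per book stays flat at any scale, which the article does not claim:

```python
# Figures as quoted in the article.
microsoft_budget_usd = 2_500_000   # announced British Library budget
microsoft_books = 100_000          # out-of-copyright books to scan
worldcat_books = 32_000_000        # books in WorldCat's database

# Assumption: the per-book cost does not change with scale.
cost_per_book = microsoft_budget_usd / microsoft_books
estimated_total = cost_per_book * worldcat_books

print(f"${cost_per_book:.2f} per book")
print(f"${estimated_total:,.0f} for all of WorldCat")
```

That works out to twenty-five dollars a book, which matches the article's total.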
Copying all those pages presents many difficulties, but writing software to make the books useful to searchers is even harder. “The scanning technology is boring,” Clancy said. “The real challenge is to get somebody something that they are actually interested in, inside a book. Web sites are part of a network, and that’s a significant part of how we rank sites in our search—how much other sites refer to the others.” But, he added, “Books are not part of a network. There is a huge research challenge, to understand the relationship between books.”
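To make Clancy's point concrete, here is a deliberately simplified sketch that ranks sites by how many other sites link to them. This is not Google's actual ranking algorithm, only a plain inbound-link count; the site names, the book titles, and the `inbound_counts` helper are all hypothetical. Run on a pile of books that do not link to one another, the same signal gives every item a zero:

```python
from collections import Counter

# Hypothetical link graph: each site maps to the sites it links to.
links = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example"],
    "d.example": ["c.example"],
}

# Hypothetical scanned books: no links between them at all.
books = {"Book A": [], "Book B": [], "Book C": []}


def inbound_counts(graph):
    """Count, for each node, how many distinct other nodes link to it."""
    counts = Counter({node: 0 for node in graph})
    for source, targets in graph.items():
        for target in set(targets):
            if target != source:  # ignore self-links
                counts[target] += 1
    return counts


# Sort by inbound links, highest first; ties broken by name.
ranking = sorted(inbound_counts(links).items(), key=lambda kv: (-kv[1], kv[0]))
book_ranking = sorted(inbound_counts(books).items(), key=lambda kv: (-kv[1], kv[0]))

print(ranking)
print(book_ranking)
```

With sites, the most-referenced one floats to the top; with books, there is nothing to tell them apart, which is the research challenge the quote describes.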