More and more I'm frustrated with the cyber-infrastructure of climate science. It seems to be on the verge of crisis, in a Kuhnian sense. Everyone has individual solutions for how to do large computations, manage very large data sets, and collaborate between institutions. For example, due to limited resources, I just had to move some simulation output from a remote server to a local, external hard drive. One simulation (not a big one) generated some 50GB of output that I don't really want to throw away. Retrieving this data took hours, and then several more hours to send it from the remote site to my desk. It's crazy, inefficient, and isolating.
There needs to be a better way. We need to harness the power inherent in "cloud computing" and the latest simple, intuitive web interfaces for accessing remote data (e.g., MobileMe, Google applications, etc.) and apply them to scientific computation, data storage, and data analysis.
We have seen small steps in these directions from projects like SETI@home and climateprediction.net, among others. I have also just read an article from Nature [LINK] saying that Amazon (see update below) and Google have both started down these roads, as has the NSF with something called DataNet. However, as the article notes, there are serious challenges, not just in terms of technology but also dealing with access, cost, and fairness. These can be touchy issues, especially in fields where the rate of work can vary greatly among different research groups.
I'll also just complain that, even apart from sharing and storing data, the ever-growing size of data sets in Earth Sciences, and particularly in the climate sciences, demands new tools for analyzing and visualizing the data. I've seen some projects that seek to deal with the emerging issues, but the progress of these new tools seems to be lagging significantly behind the growing data sets. As a concrete example, take the analysis of output from NICAM, a global cloud-system resolving model. This is a model that has grid points every 7km over the entire surface of the earth. Many of the variables are defined on vertical levels, say about 50 of them. It is conceivable that you'd be interested in examining global fields every hour for several years. On a typical desktop, loading a single 3-dimensional field for ONE hour would require all (or more) of the available memory, making even simple operations on the field slow and serious number crunching basically impossible. This isn't going to be a special case for long, either. A new generation of cutting-edge models will have similar resolution, and as they start producing actual simulations (i.e., ones from which scientific results are desired), analysis tools need to be available to do the job. Right now, I don't have any such tools. Those that do exist need to be made available and usable, and soon.
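To put rough numbers on that NICAM example, here is a back-of-envelope sketch in Python. It is only an estimate under stated assumptions: it treats the grid as uniform 7km x 7km cells covering a spherical earth (the real model grid is not laid out this way), it assumes 4-byte single-precision values, and the function name `field_size_gb` is made up for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed value)


def field_size_gb(grid_spacing_km, n_levels, bytes_per_value=4):
    """Estimate the size in GB of one 3-D global field at one time step.

    Assumes uniform square cells of side grid_spacing_km covering a
    sphere, which only approximates a real model grid.
    """
    surface_area_km2 = 4.0 * math.pi * EARTH_RADIUS_KM ** 2
    n_columns = surface_area_km2 / grid_spacing_km ** 2
    n_values = n_columns * n_levels
    return n_values * bytes_per_value / 1e9


# One variable, one hourly snapshot: 7km spacing, about 50 levels.
one_hour_gb = field_size_gb(7.0, 50)

# The same variable saved every hour for a year.
one_year_tb = one_hour_gb * 24 * 365 / 1000.0

print(f"one field, one hour: about {one_hour_gb:.1f} GB")
print(f"one field, hourly for a year: about {one_year_tb:.0f} TB")
```

Under these assumptions a single variable at a single hour comes to roughly 2 GB, and a year of hourly output for that one variable runs to well over ten terabytes. Double precision would double both figures.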
UPDATE:
I have been looking into these vague notions a bit more. Amazon has a side company called Amazon Web Services that sells cloud computing (computation, storage, database query, etc). The service seems to leverage the fact that Amazon has a ton of computational power and storage just sitting around, so they try to sell their downtime to companies that need more cyberinfrastructure than they can afford. It's a pay-as-you-go system, and you only pay for the compute power/time that you actually use. It seems very interesting. Of course, the problem is transferring this kind of system to the science community. It would be nice, for example, if the same kind of system were available from an NSF computing center, and you could access data interactively using a web browser, or submit large simulations from a web browser that then run in the cloud with results going to the online storage facility. The catch is that "science" doesn't have a giant existing distributed computing environment with plenty of downtime, and there's not a lot of incentive to set one up (i.e., the NSF isn't that altruistic). These are just thoughts to chew on.
Your point about cloud computing is, in my opinion, exactly THE point. Putting old computing paradigms into a new technology will not help. Instead, we have to find new ways of utilizing new technologies. As Google software engineer Christophe Bisciglia says: "Tell me, what would you do if you had 1,000 times more data?"
Roland
The problem is that we DO have 1000 times more data and we have no idea what to do with it.
I agree that we need a paradigm shift, and I think that more and more we are seeing the possibilities with things like Amazon's web services and Google Apps. These have to be scaled up in terms of compute power for science applications, which is probably the easy part. The hard part is creating a "Google Apps" for science, including submitting and running distributed computations, storing and retrieving data, and a set of tools for distributed data analysis.