
TechTalks Transcript

Managing Massive Data Storage: What Are the Ways?

May 2, 2002

Judith Boettcher [JB]
Mark Bruhn [MB]
Anurag Shankar [AS]

JB: Welcome to the CREN Tech Talk series for spring of 2002 and to this session on Managing Massive Data Storage: What Are the Ways? You are here because it's time to discuss the core technologies for your future campus. This is Judith Boettcher, your CREN host for today, and our session is coming to you with the support of the CREN member institutions. I'd like to welcome a special technology anchor for today's program, Mark Bruhn of Indiana University. Many of you will recognize Mark from different times he has co-hosted the Tech Talks and also when he's been an expert. Mark is an expert in "all things security" as well as general IT infrastructure issues. Welcome, Mark, in your role as Howard. Glad to have you!

MB: I'm glad to be Howard!

JB: Good!

MB: Just to introduce the topic today, if you think about this, it used to be that computer processors were so slow that it took them a long time to wade through large databases. Now, processors are becoming so unbelievably fast and with computer grids being developed here and there, billions of calculations can be done in seconds. So now researchers have the compute power to deal with astronomical numbers of instructions and can generate billions of bits of data. For example, astronomers can collect a terabyte (that's a million megabytes of data) per day. Obviously, they need a place to store that data before, during and after their processing. Reagan Moore, who is the Associate Director for Enabling Technologies of the San Diego Supercomputer Center, put it well in a past issue of the SDSC magazine, which was formerly known as Gather/Scatter and is now known, I think, as Envision. He said, "A collaborative effort is needed to develop the technology because we're crossing boundaries to get an entirely new type of system. Today's technology has mastered the computation-intensive paradigm. Data-intensive problems are the next step." He goes on to say, "We've asked researchers what they would do if they had access to these capabilities. They always have innovative projects that currently cannot be handled, from statistical analyses of molecular modeling" (I'm not a scientist, by the way) "trajectories to spatial analyses of remote sensing data." And, in the "oh, by the way" category, there are some students and others on our campuses that just need a place to put that assignment or that document so it's accessible whenever and from wherever they are and they don't have to carry around a bunch of diskettes. So having said that, I'd like to remind everyone listening that you can ask questions directly of our expert during the webcast via e-mail to expert@cren.net. And of course, you can pick up archived sessions at the CREN website, www.cren.net.
The transcripts and archives from these webcast events are generally up by Thursday of the following week. If you're anxious, sometimes they're up there a bit sooner, but I wouldn't plan on it. The web event page has a lot of relevant links on these events and resources so be sure to check them out. In fact, there's one animation of a silo being built that's actually very interesting to take a look at. So Judith, I think I've managed to mention something related to at least two of the hats worn by our expert today, big data stores and astronomy.

JB: Well, thanks very much, Mark, and I'd like, as long as we're mentioning those good links on our web event site, I think that we'll also mention, be sure not to miss a very recent paper from our expert today called "Building a Massive Distributed Storage Infrastructure at Indiana University." So after they hear us talk about it today, they can go and actually read about it as well. And with that, I'd like to segue into our introduction of our expert and welcome Anurag Shankar from Indiana University. Anurag is the Manager of Distributed Storage Services in the University IT Services organization at Indiana University. He is also, as Mark has mentioned, an assistant professor in the department of Astronomy. Welcome, Anurag, really glad to have you here.

AS: Thanks for having me!

JB: Great! We've got a bunch of questions, as we always do, and we never get through them all. So I think we might as well go ahead and get started. Do you want, Anurag, to talk a little bit about what is massive data storage and how big is big in this area?

AS: I think Mark has set the stage here quite nicely. Massive data storage meant probably gigabytes, maybe ten, 20 years ago. Basically, "massive" means lots of data. The definition of that changes. At this point, big would be maybe 50 terabytes: if your institution has a data storage system that can store that kind of data, then you have a massive data storage system. And I mean a single instance of a massive data storage system here.

JB: Just so that folks can remember what a terabyte is, Anurag, you want to tell us how many megabytes one terabyte is?

AS: Yes, indeed. People today are probably a lot more familiar with gigabytes. A gigabyte is a thousand megabytes and a terabyte is a thousand gigabytes, which is a million megabytes. And the common sizes that the largest data repositories in the world are reaching are petabytes. A petabyte is a thousand terabytes or basically a billion megabytes.
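
The unit ladder described above can be written out as a small Python sketch. It assumes the decimal (powers of one thousand) convention used in the conversation; the function name to_megabytes is only an illustrative choice.

```python
# Decimal storage units as described in the discussion:
# 1 GB = 1,000 MB, 1 TB = 1,000 GB, 1 PB = 1,000 TB.
MB_PER_UNIT = {
    "MB": 1,
    "GB": 1_000,
    "TB": 1_000_000,
    "PB": 1_000_000_000,
}


def to_megabytes(amount, unit):
    """Convert an amount in the given unit to megabytes."""
    return amount * MB_PER_UNIT[unit]


if __name__ == "__main__":
    print(to_megabytes(1, "TB"))   # megabytes in one terabyte
    print(to_megabytes(1, "PB"))   # megabytes in one petabyte
    # Fraction of a petabyte represented by a 70 terabyte archive.
    print(to_megabytes(70, "TB") / to_megabytes(1, "PB"))
```

On this scale, an archive of 70 terabytes is still well under a tenth of a petabyte.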

JB: Okay, and maybe you can talk about how many, do you have petabytes now at Indiana University that you're storing and manipulating?

AS: No, we are a relatively young massive data storage site. We started storing data in our massive data storage system about two, two and a half years ago. We are now at about 70 terabytes, and according to one of the definitions that I see here on a page that we will ultimately put on the Tech Talk web page, a 50 terabyte site is considered a large mass store system. By that reckoning, Indiana University is today a massive data storage site.

MB: So Anurag, what are the benefits? Who uses these things? Talk a little bit about the applications.

AS: Right. Massive data storage has been pretty much limited until recently, I would say, at least on academic campuses to supercomputer centers and to high end users. What's been the case typically is that each campus, for example, has a small community of high end users who have accounts on a supercomputer center which may not be at their institution but, let's say, at San Diego or University of Illinois, things like that. And they log into those supercomputers and have access to a massive data storage system. The assumption here is that on these supercomputers you are calculating lots and lots of instructions and creating lots and lots of data and therefore you need to store them locally. This has been changing, particularly over the 90s, and today it would not be unfair to say that almost anyone on a university campus has the need for a massive data storage system. Typical file sizes (in other words, files stored by a random Joe User on campus, for example) used to be hundreds of kilobytes, perhaps five to ten years ago. Today, we all know they're about five megabytes because that's the size of an MP3.

JB: Okay, good!

AS: And this number is on the increase as more and more multimedia type applications become common and people start sending multimedia e-mail, things of that nature. So the intended audience, intended target for massive data storage today on a university campus is certainly the high end crowd and that's been the case for a long time. But also, for example, users who are storing administrative data: your paychecks, student records, scanned documents of interest to the university, human resources data. Lots of campuses have medical centers, so there's medical data, radiograms, patient records, MRI, things of that nature. There's also a big push in the area of digital libraries or large digitized collections of images, audio, video, etc. And also general user data. Just to give an example, I asked our e-mail folks here at Indiana University how much data would we have to archive if we were to archive every single e-mail that goes through our systems. And it turns out that number is in terabytes per year. So this is a long answer to your question, Mark, but at this point we have a very large contingent of users using it. At Indiana University, we have a large genomics initiative and these people are beginning to use massive data storage. Digital libraries are already using it. We have a very large music library project called the Variations Project where they have digitized ten terabytes of music and they serve it in the Music Library online. And we have physicists who are talking about generating petabytes every year starting in about two or three years, when some of the large experiments go online in Europe, etc.

MB: That would mean that entire library collections could now or in the near future be digitized and stored?

AS: This is correct. While that is not the intent of digital library programs, that is entirely feasible. For example, if you were to store the entire printed collection of the US Library of Congress, that's about ten terabytes, which is a fairly reasonable size, given that the massive data storage sites are currently capable of storing hundreds of terabytes. For example, at IU, we have a 500 terabyte massive data storage system.

JB: And you said earlier that you were at about 70 terabytes right now. Is that right?

AS: This is correct.

JB: Okay.

MB: Well, compare that, Anurag, with for example, the holdings of the library here on campus which is actually fairly large.

AS: I doubt that it's as large as the Library of Congress but if you were to take the library at Indiana University, for example, we may get up to almost a terabyte. Just to give you another sense of scale, if you were to take all the information available on the web today, that's supposed to be about 8 petabytes or 8,000 terabytes. And if you were to take all of the printed material in the world, that's only 200 petabytes.

MB: Only 200 petabytes.

JB: And that's today.

AS: Today.

MB: I was also interested in the bit you mentioned, health records, MRIs and x-rays and those sorts of things. Do you see that happening now, where these things are becoming digitized more and more so that they can be shared more easily between healthcare providers?

AS: That's correct. We are having discussions with a number of hospitals in the Indianapolis area where, for example, in their MRI and radiology labs, they are able to take data at rates which far exceed their ability to store them at this point and they call these different modalities. They're not able to use all of the modalities. Also, and this may be state or federal law, I'm not entirely certain, by law they are required to store the data for five years and in order to contain all this data on their massive data storage system, which is their own at this point, they have to keep these modalities low. Now, they want to talk to us to see if they can archive a lot more data on our massive data storage system. So this is a general trend and I see more and more of this happening, as there is more and more of collusion between academic massive data storage sites perhaps and the academic hospitals. Maybe that's where it'll happen first and then it'll go on from there.

MB: You talked a bit about some impediments to establishing these large data stores in the past. Can you kind of go a bit more through the history and the evolution of these things and maybe talk about some of the impediments now to implementing these large data stores?

AS: Sure, yeah. I would say the primary impediment in the past to establishing a massive data storage system was cost. The infrastructure, the hardware infrastructure as well as the software infrastructure, was fairly expensive because it was highly specialized and catered to fairly high end, high performance computing, high performance storage sites. You paid a premium for that. To use massive data storage, and we haven't talked about hierarchical storage management so far, so let me introduce that term. Most of the massive data storage sites today in the world use software which is called hierarchical storage management. And all that means is that data are stored on a fast medium and then they migrate seamlessly to a slower and therefore less costly medium. The impediments there, for example, in the area of HSMs are that when you store data on a slower medium, when you go access the data bits there is a delay because the slower medium is slower. Often it's tape, which means the tape has to be located and mounted, and that used to take perhaps several minutes back, let's say, in the late 80s when I was a graduate student in Urbana and at the National Center for Supercomputing Applications. They had a system called CFS where I was storing maybe a few gigabytes of data and I was very happy doing that because I couldn't do that anywhere else. So cost was perhaps the biggest issue in terms of basic infrastructure, hardware and also the human resources cost. These people were expensive, the people that maintained these systems. Some of the same issues are there. The cost has gone down substantially and it's going down even more, even over the past three or four years that I've been involved in building this massive data storage system.
We have expensive hardware today on which we run our massive data storage system and within the next two years, it's clear that we're going to be running this on cheaper, let's say Linux boxes with gigabit connectivity and things of that nature.

MB: So I'm going to set you up with this question because we've talked about this before, but, so what you have is a large backup system?

AS: This is a fairly common misconception about hierarchical storage management systems. Because they use tape, people are under the impression that these are backup systems. Let me just define what I consider to be a backup system. A backup system, in my opinion, is a system where you are making a backup, a snapshot of your disks on the tape and then you're making that tape inaccessible to yourself by putting it in a vault somewhere. So therefore, you cannot delete that data accidentally by going and doing a DELETE *.* on your disk. On an HSM system, a hierarchical system, all of your data are accessible in a near line fashion. What that means is that all of the directory information stays on disks forever. You can do a directory listing and you basically have the data bits sitting on storage media, for example, tape. But if you were to do a DELETE *.* by mistake, your data are gone. So this is not a backup system, though it can be used as such. It is possible to set up scripts, etc., so that when you write your data onto a massive data storage system you can take away, for example, the ability for you to delete them later and so forth. But it is not a backup system per se.

JB: That's one thing that I was confused about as we were talking and getting ready for today and that is, if you've got these really large massive storage systems, what do you do about backup? Or disaster recovery? How do you do that?

AS: That's a very relevant question. In the case of massive data storage systems, there is no such thing as backup typically.

JB: Okay.

AS: Which means if you have a petabyte archive which costs, I'm sure it's millions of dollars or maybe more than that, tens of millions of dollars, obviously to back it up means replicating that entire infrastructure and so that is simply not done. It's simply not cost feasible to do so. So what is done instead, for example in our setup and I'll say a few more things about this later on, usually you make multiple copies, particularly on tape, so when a user at Indiana University writes a piece of data onto our massive data storage system, two tape copies are made just to protect against accidental erasure for some reason, though I can't see how that will happen when both tapes are in a tape library. But let's say one tape goes bad for some reason. It will then retrieve data from the second tape. So data on a massive data storage system are typically not backed up, though there could be data that are so critical that you are literally backing them up to tape and removing them from a massive data storage system. There are also massive data storage systems where there's a concept of shelf tape, which means you write that data onto tape and then you remove the tape from the system. The problem is that the metadata (in other words, what I was calling the directory information) still resides on disk and so you could still do a DELETE *.* and that gets rid of the metadata and then it doesn't matter that you have the data sitting on shelf tape.

JB: Because you could never find it.

AS: Because you can't find it.

JB: Okay.
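
The two points made in this exchange, that a second tape copy covers a bad tape and that losing the metadata makes the data unfindable, can be illustrated with a toy Python model. This is only a sketch under stated assumptions: the class TinyArchive and its methods are hypothetical names invented for illustration and do not reflect how HPSS or any real HSM is implemented.

```python
class TinyArchive:
    """Toy model: a name catalog (metadata) plus two tape copies per file."""

    def __init__(self):
        self.catalog = {}   # file name -> block id (the "directory information")
        self.tape_a = {}    # block id -> data, first tape copy
        self.tape_b = {}    # block id -> data, second tape copy
        self._next_block = 0

    def write(self, name, data):
        """Store the data on both tapes and record its location in the catalog."""
        block = self._next_block
        self._next_block += 1
        self.tape_a[block] = data
        self.tape_b[block] = data
        self.catalog[name] = block

    def read(self, name):
        """Look up the block via metadata, then fall back to the second copy."""
        block = self.catalog[name]      # raises KeyError if the metadata is gone
        if block in self.tape_a:
            return self.tape_a[block]
        return self.tape_b[block]

    def delete(self, name):
        """Remove only the metadata; the bits stay on tape but become unfindable."""
        del self.catalog[name]


if __name__ == "__main__":
    archive = TinyArchive()
    archive.write("thesis.dat", b"results")
    archive.tape_a.clear()                  # simulate the first tape going bad
    print(archive.read("thesis.dat"))       # still readable from the second copy
    archive.delete("thesis.dat")            # an accidental delete
    print("thesis.dat" in archive.catalog)  # the name can no longer be resolved
```

In this model the bytes survive the delete on the second tape, yet nothing can locate them by name, which is the sense in which a second copy is not a backup.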

MB: We want to actually, you mentioned to Howard, we want to spend a little bit more time on security and integrity before, but I want to kind of finish what you were talking about before about the evolution of these things and ask you about what are the problems with establishing one of these large data stores today? You talked a bit about the cost and what have you from before, but what are the impediments today? Are there latency problems? Is it networking? What are the issues? Why doesn't everybody do it?

AS: There are always impediments when you do something at this scale. The cost has gone down but still it is not a cheap project to undertake. Much thought and planning must go into doing this sort of a thing, on a campus-wide scale particularly. Some of the impediments now, in other words, things that you must plan for, are things such as human resources cost. These things have fairly high, very steep learning curves. Once you have overcome that learning curve, you generally can use a very small number of FTEs, full time equivalent people, to manage these kinds of systems. The classic delay, for example, in retrieving a file from tape is still an impediment for a certain class of applications which must have data always on spinning disks, for example. Networks used to be a great impediment, but now with gigabit and even beyond gigabit networking it is possible to actually write data to or read data from these systems, if not always with a single gigabit connection. You can even multiplex it over multiple gigabit connections, so I would say networking is not a very large issue. In fact, at Indiana University, we are planning to move terabytes of data between our Bloomington and Indianapolis campuses which are 15 miles apart.

MB: And there's this concept of the Global Terabit Research Network also, the GTRN, that I was also involved in that would, when it comes to fruition, that amount of data would be transferable given regional speed constrictions, but that amount of data would be transferable between large research networks globally.

JB: So you're saying, let me see if I can get this straight, if you have a researcher on one of your campuses, at the Bloomington campus, wanting to access terabytes of data across that 15 mile link, that that's a reasonable thing to do and there's not a perceivable delay in access of that kind of data?

AS: This is correct, yeah. These are tests we are performing even as we speak and at this point, these sorts of things are now possible. And Mark mentioned the GTRN, the Global Terabit Research Network. The idea here is to actually do simulations and to move data over the network, typically. Let me just give you an example. There's a project called Atlas, a large physics project which is an experiment which will take place at CERN (which is the Center for European Research in Physics, Nuclear Research in Geneva) and this experiment will create petabytes of data, starting in 2006. They have experiments of that nature, maybe not generating that volume of data today, and the way they move data is by a system that is sometimes called FEDEXNET. Basically these are data written on tapes and shipped and there's also actually a concept of the bandwidth of the FEDEXNET. But that's because the global networks are simply not fast enough to move volumes of data of that size. So the GTRN will allow that to happen and as I've said, the tests we're doing at Indiana University, between two campuses 15 miles apart connected over basically a gigabit Wide Area Network, show that this will be possible in the future.

JB: So that was between the Bloomington campus and which other campus?

AS: The Bloomington campus and Indianapolis campus. They are connected today via a two gigabit pipe, two gigabit per second pipe and this will go up in the future.
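
To put the two gigabit per second pipe and the "bandwidth of the FEDEXNET" idea in numbers, here is a back-of-the-envelope Python sketch. It assumes decimal units (1 terabyte = 10^12 bytes), ignores protocol overhead unless an efficiency factor is supplied, and uses hypothetical shipment figures (ten terabytes of tape, 48 hours in transit) that are not from the discussion.

```python
def transfer_hours(terabytes, link_gbit_per_s, efficiency=1.0):
    """Hours needed to move the given volume over a network link.

    efficiency is the assumed fraction of the raw link rate actually achieved.
    """
    bits = terabytes * 1e12 * 8
    seconds = bits / (link_gbit_per_s * 1e9 * efficiency)
    return seconds / 3600


def shipping_gbit_per_s(terabytes, transit_hours):
    """Effective bandwidth of shipping tapes: volume divided by transit time."""
    bits = terabytes * 1e12 * 8
    return bits / (transit_hours * 3600) / 1e9


if __name__ == "__main__":
    # One terabyte over the two gigabit per second link mentioned above.
    print(round(transfer_hours(1, 2), 2), "hours")
    # Hypothetical shipment: 10 TB of tapes that take 48 hours to arrive.
    print(round(shipping_gbit_per_s(10, 48), 2), "Gbit/s effective")
```

At the full link rate a terabyte takes a little over an hour, so moving terabytes between the campuses is a matter of hours rather than weeks.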

JB: That sounds great. Listen, now is a really good time to remind folks out there to go ahead and send in your questions to expert@cren.net. Okay, so the experiments that you mentioned that are going on are with physics experiments, Anurag?

AS: The experiments I spoke about, where we are testing the feasibility of moving terabytes of information over a Wide Area Network, are actually entirely within the Information Technology services area, within my area which is massive data storage, but the application will be in areas such as physics when we show that this is feasible. It looks very good right now.

JB: So does it make sense, I think Mark and you were talking about impediments or barriers in the use of this kind of a system and you're basically saying that networking is no longer a problem. What is the biggest barrier or the biggest challenge that you're faced with right now?

AS: One of the challenges we faced, I would say, when we started with this was the recognition that most university campuses are going to need some sort of massive data storage within the coming five years, let's say. However, this massive data storage system has to be accessible not just to the high end users because the high end users are completely used to using command line, things like that. They are generally high end users, they can do that. We had to make this massive data storage system accessible to the masses so...

JB: And how did you do that, then?

AS: And I will say a few words about that. Basically, just to explain that, massive data storage systems have been pretty much in the realm of high performance computing and high performance storage. But as I just mentioned, now there's a need to make this accessible, particularly in an academic setting, to anyone on campus that can show that they have a valid need for storing large amounts of data. To do that, what has been the big impediment beyond cost, etc., is the lack of tools, for example, software tools that make it very easy for, let's say, a Windows 98 user or a Mac user on a campus to access these massive data storage systems. So we undertook this task. This was the biggest impediment when we started and we undertook the task of eliminating this impediment and I think we have been reasonably successful. And that is actually one of the unique contributions Indiana University has made.

MB: Anurag, you're managing a unit called Distributed Store Services, right? So describe a bit more about the suite of services that you provide. It's not just the massive data store, like you say, for the high end users. But how do you support other uses as you described?

AS: Right. The services that we provide are meant to cover the entire gamut of users on our campuses and I would like to remind you that we actually have eight campuses which are distributed around the state of Indiana: two large research campuses in Bloomington and Indianapolis and then six smaller campuses which are distributed all over the place. We provide a service which is literally known as Massive Data Storage Service. It's built upon a software infrastructure called HPSS, High Performance Storage System. We also provide a service called the Common File System, or CFS, and this is a service that is intended for the masses. The connection between the two is a file system interface which allows people to see both the massive data storage system and the CFS as a single entity, but the back ends of these two systems are completely separate. One is an HSM, Hierarchical Storage Management system with tapes and disk caches. The other one is purely a disk-based system, a fairly low cost system. But by using an infrastructure called DFS we're able to link these things together. We also have a service called AFS. It's a fairly well-known service to physicists. It's called the Andrew File System and it allows people worldwide to share data, for example, share information. And it's a worldwide file system. So these are the services we currently provide. It is possible for anyone on the campus, any of the campuses at Indiana University, to apply for an account on, for example, the massive data storage system or the Common File System or the Andrew File System and to use them from any desktop, be it Windows 98, Windows 2000, Windows XP or Macintoshes, Linux, UNIX, you name it.

JB: Let's follow that user just a little bit. So if a person like myself is on a Macintosh and I want to use this, what will I see on my desktop then and how easy will it be to use?

AS: Yeah, I knew you would ask me a Mac question! I'm not a Mac person, but let me see if I can answer this question.

JB: Oh, all right.

AS: On the Macs, I think there's a concept of network volumes. They connect to a network volume and so what you would do on a Mac, if it was you sitting, for example, in an office randomly somewhere on campus, is you would enter in there the address of our massive data storage service, which is conveniently something simple like mdss.iu.edu, and it would connect you to a volume that looks to you just like another network volume and you simply drag and drop and do all of those things.

JB: So it really acts just as an accessory hard drive.

AS: That's exactly right. Under Windows, for example, which I know a little bit better, it would be a drive D, for example, and it would be an unlimited drive D more or less where you can drag and drop as much stuff as you want into this drive D and you simply can retrieve that later on with that associated cost of perhaps 30 seconds because your data have migrated to tape. They first go to disk and they stay there for a time. If you haven't used them for a while, they're migrated to tape seamlessly without your knowledge. And when you go grab the bits back is when you see the delay.

JB: So how long generally do you keep things in the active hard drive storage before moving them to tape?

AS: Usually, it is not managed in that manner. The disk caches are shared resources so there are high water marks. As more and more people start using it, the stuff that's getting older basically gets written to tape and deleted from disk. So it really depends on usage rather than how old the data have to be before they're deleted. Because the idea here is that if there's space on disk, then the data should reside on disk indefinitely because then you can use this hierarchical system as if it was purely a disk based system.
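
The usage-driven policy described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the actual HPSS purge policy: the function name, the 90 percent high water mark and the 70 percent low water mark are all hypothetical, and the sketch assumes every file already has a copy on tape so that purging it from disk loses nothing.

```python
def purge_disk_cache(cache, capacity, high_water=0.9, low_water=0.7):
    """Purge the least recently used files once the cache passes the high water mark.

    cache maps file name -> (size, last_access_time) and is modified in place.
    Files are assumed to be on tape already. Returns the purged names, oldest first.
    """
    used = sum(size for size, _ in cache.values())
    if used <= high_water * capacity:
        return []                    # enough room: data stay on disk indefinitely
    purged = []
    # Visit files from least to most recently accessed.
    for name, (size, _) in sorted(cache.items(), key=lambda item: item[1][1]):
        if used <= low_water * capacity:
            break
        purged.append(name)
        used -= size
    for name in purged:
        del cache[name]
    return purged


if __name__ == "__main__":
    cache = {"old.dat": (50, 1), "mid.dat": (30, 2), "new.dat": (15, 3)}
    print(purge_disk_cache(cache, capacity=100))  # only the oldest file is purged
    print(sorted(cache))                          # the newer files stay on disk
```

Note that nothing is purged on the basis of age alone; a file leaves the disk cache only when the shared cache fills up.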

JB: Okay. So when we talked before about the fact that your system actually has the ability to accommodate 500 terabytes of data and yet you have, how much of that 500 terabytes, and tell me whether this is even a good question. How much of the 500 terabytes would be residing on tape vs. disk?

AS: That's an easy question to answer. We have exactly two terabytes of disk cache so in the worst case scenario, two terabytes of data would be on disk cache and 68 would be on tape.

MB: I'm intrigued by this notion of the service or the combination of services being used by some, I guess, widely-varying kinds of people. There's got to be some support challenges there, right? So you have to have people who can deal with a student who's using the store or a physics researcher or an astronomist [sic]. How do you deal with that kind of variety of users?

AS: This is a highly relevant question because ultimately, having a service is nice but if you cannot support it, it's more or less useless. What we have done here at Indiana University is to have dedicated people that provide support, especially high end support, and we have trained our general support staff that handle, for example, PC and Macintosh in any number of questions. There are probably tens to hundreds of such support people that operate on a large university campus. We have trained them to handle the simple user questions. We have had them in training and have explained to them how the system works and how it shouldn't be used, how it should be used. And they are the front line of defense against the common user type questions, especially frequently asked questions. Of course, the frequently asked questions are also on the web and in our knowledge base where people can easily go see them. At the high end, I have again high end people within my group. I have Ph.D. physicists for example whose job it is to provide dedicated support to very high end projects such as in physics, such as in chemistry or in genomics, etc. So yes, Mark, support is definitely a challenge and it's something we're continuously learning.

MB: Thanks. I have a note here also about the IU implementation. I think the quote is that it's "the first distributed architecture implementation of HPSS."

AS: This is correct, yes. Now we're getting into the nitty-gritty of the actual architecture here. Typical massive data storage architectures, again, include a number of servers connected to large tape libraries and to a large amount of disk. Typically, these installations are central to a campus and people access these resources over a network. What we have done at Indiana University is, using HPSS software, we have decided that it would be nice if users on both our campuses could store data and retrieve data locally, and by that I mean not having to go over the Wide Area Network. This is relevant because until recently, we did not have unlimited bandwidth between the two campuses so the idea of moving terabytes of data was simply impractical. So because we started doing this two years ago, we decided to put some hardware (let me say before that that our central mass store HPSS infrastructure is located in Bloomington) up at Indianapolis and connected the two together over the network as if they were a single instance of HPSS and set things up in such a way that when a user at Indianapolis writes data, the data are written to the local disk up in Indianapolis using their Local Area Network, therefore at high speeds. And then they're written to tape, again locally at Indianapolis. When the user comes back six months later and wants to access that data, he goes and perhaps on his Windows desktop wants to move this and drags it from the mass store folder to his desktop. When he does that, a metadata request goes to Bloomington saying, "Where are these data located?" and again, that's a very small amount of data. Metadata generally means by its very nature that it's a very small fraction of the actual data.
And that metadata request is served by the servers in the Bloomington instance of HPSS and then basically, which tells the [inaudible] user to the Indianapolis user that their data are actually located in Indianapolis and the transfer then occurs between the user and the Indianapolis infrastructure. This is of a lot of interest to a lot of people because clearly the unlimited bandwidth is not generally available out there and often people have limited bandwidth connections between their remote sites.
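
The request flow just described (a small metadata lookup served centrally, followed by a bulk transfer from wherever the bits actually live) can be mimicked with a toy Python model. This is a sketch only: the class DistributedStore, its methods and its log format are hypothetical names made up for illustration and are not HPSS interfaces.

```python
METADATA_SITE = "Bloomington"   # where the central metadata servers live


class DistributedStore:
    """Toy model of one logical store with a central catalog and per-site data."""

    def __init__(self):
        self.locations = {}     # central metadata: file name -> site holding the bits
        self.sites = {"Bloomington": {}, "Indianapolis": {}}
        self.log = []           # (kind, site, file name) for each step taken

    def write(self, user_site, name, data):
        """Store the bits at the user's own site and register them centrally."""
        self.sites[user_site][name] = data
        self.locations[name] = user_site
        self.log.append(("metadata-update", METADATA_SITE, name))

    def read(self, user_site, name):
        """Ask the central catalog where the bits are, then fetch them from there."""
        self.log.append(("metadata-lookup", METADATA_SITE, name))
        data_site = self.locations[name]
        self.log.append(("data-transfer", data_site, name))
        return self.sites[data_site][name]


if __name__ == "__main__":
    store = DistributedStore()
    store.write("Indianapolis", "scan.img", b"pixels")
    store.read("Indianapolis", "scan.img")
    for step in store.log:
        print(step)
```

Only the small lookup crosses the wide area link in this model; the bulk transfer stays on the Indianapolis side, which is the point of the distributed layout.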

MB: How is such a thing--how would such a thing, do you think, benefit something like disaster recovery processes? Or would it?

AS: Very good question! Clearly because we have--actually, if we had limited bandwidth then we couldn't really do disaster recovery very well, because in order to do disaster recovery you have to be able to move the data themselves. You have to move them over this Wide Area Network. So what we're doing now is, in the last year or so, we have established a network called ILITE which uses a fiber infrastructure that Indiana University owns, and over this high-speed network we are now able to use this remote or satellite site at Indianapolis to make a second copy of tape--and I mentioned this earlier--we make a second tape copy of all the data that are written down in Bloomington and vice versa, thereby making both sites disaster-proof. So in other words, one of our data centers, in Bloomington for example, could go up in smoke and we would have all of the metadata, because we also back it up on both campuses, and we would have the second copy of all of the data in Indianapolis to recover from.
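
The cross-site copy scheme described here can be illustrated with a small sketch. It assumes a deliberately simplified model in which each site is a dictionary holding its own tape contents, the other campus's second copies, and a metadata backup; the helper names `new_site`, `store`, and `recover_site` and the file names are hypothetical and are not part of HPSS.

```python
def new_site(name):
    """Simplified site model (an assumption for illustration only)."""
    return {"name": name, "tape": {}, "remote_copy": {}, "metadata_backup": {}}


def store(path, data, home, partner):
    home["tape"][path] = data            # primary tape copy at the writing campus
    partner["remote_copy"][path] = data  # second tape copy at the other campus
    for site in (home, partner):         # metadata is backed up on both campuses
        site["metadata_backup"][path] = home["name"]


def recover_site(lost_name, survivor):
    """Rebuild the lost site's data from the survivor's second copies and metadata."""
    return {
        path: survivor["remote_copy"][path]
        for path, owner in survivor["metadata_backup"].items()
        if owner == lost_name
    }


bloomington = new_site("Bloomington")
indianapolis = new_site("Indianapolis")

store("survey.dat", b"sky survey", bloomington, indianapolis)
store("scan.dat", b"medical scan", indianapolis, bloomington)

# Suppose the Bloomington data center is lost: recover its files from Indianapolis.
recovered = recover_site("Bloomington", indianapolis)
print(sorted(recovered))
```

The sketch only works because both the data copy and the metadata backup exist at the surviving site, which mirrors the two conditions named in the answer above.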

MB: Of course, you knew I was going to ask that, right, because disaster recovery planning is in my portfolio!

JB: Well, you know, that sounds like a really great system, so that you basically have taken that really high bandwidth network and used that as a part of the whole configuration for disaster recovery. That's neat! That's good. Let me remind our folks out there to send in questions, what questions you might have about how students are using the system and also other questions about mass storage. With that, if I could ask a question about students' use of this, you mentioned earlier, Anurag, that actually students themselves can ask for accounts on this massive storage system. What are students--are students doing that and what are they using it for?

AS: Actually, let me correct that. All faculty, staff and graduate students at Indiana University are eligible to request an account and they simply go to a web page, authenticate themselves and basically they get an account on the fly.

JB: Okay.

AS: Undergraduate students, simply because of the numbers that we have to deal with, can do so if they have a research project which is worthy of massive data storage. They will get an account if they can get a faculty member to sponsor them. Undergraduates at Indiana University also get a distributed storage account, but it's called a Common File System account. The reason why we don't allow the massive data storage to be accessible by undergraduates is because the data are stored on tape and tapes don't do a very good job of storing small files. They work best when they're streaming, and the choice that we made--in other words, HPSS--currently does not provide a way to handle small files on tape. This is something that we're working on with HPSS and ultimately we may have that, in which case we may become the first site that will provide gigabytes or terabytes of storage to their undergraduates as well.

JB: Well, you know, I know that your Variations project, your music project at Indiana is quite well known. Are students manipulating those files and using their Common File System services for those kinds of projects?

AS: That is correct, yes. Students currently get--this is undergraduate students--currently get 100 megabytes of storage, which doesn't sound like a lot. But for most students, in fact, about 80% of our students, that is quite enough. And they are able to store their multimedia files, etc., on the Common File System and it is also possible to access them via the web. One of our contributions is that we have made our massive data storage, or I should say our distributed storage systems, available to anyone from anywhere via the web.

MB: So let me ask this about student storage, then. Another question close to my heart! Use of these data stores to store copyrighted material, bootleg movies, MP3s. How do you handle those?

AS: I knew Mark would ask this question!

JB: I'm glad Mark asked that question!

AS: The answer is actually quite simple, at least in my case, and I'd like to defer to Mark after I make my statement. And that is that we don't. We simply do not do any kind of content filtering. People get a certain number of terabytes or megabytes or gigabytes of storage and they put whatever they want there. And I think Mark can talk about the legal aspects of filtering.

MB: Actually, I can, and what we don't do, as Anurag says, is any systematic review of that material. But as is the case in a lot of places, I think, now, we will react to complaints. So if somehow a copyright owner comes to find that some material is on this data store or actually anywhere within the university, then they can file an allegation, a complaint, and then we'll investigate at that point, but we don't systematically go through there and look for those kinds of things.

AS: I should add to that that for the massive data storage it's physically impossible to do that as well.

JB: Okay, so it's really not possible to search that way?

AS: Well, if you have so many terabytes of data, I mean, a single search would take quite a long time. And when you realize that most of those data are on tape, I wouldn't want to subject my system to bringing all of this data back from tape onto disk, thus rendering the system inaccessible to a legitimate user.

MB: Hmm! So we're getting close to the time but I really want to get this last question in if I can.

JB: Before you do that, let me just--our folks out there are pretty quiet today, so just let me remind them that the e-mail address to send questions is expert@cren.net. Okay, let's go with the question, Mark.

MB: So Anurag, when people mention massive data store, you hear this term SAN. What is that? Is that related to what you're doing? What is that as it relates to massive store?

AS: SAN stands for Storage Area Network, and the idea behind a storage area network is that in the old days, you had lots of servers in your data center and they all had their own directly attached storage, basically directly attached disks. And this number has been growing quite--almost exponentially, whereas the number of people to manage this environment is not growing fast enough. It's nearly flat or increasing very slowly. And so the question became how do you manage all of this growing storage? And one strategy, at least with disks alone, was this idea of a Storage Area Network, which says that we build a separate network, separate from the standard IP network that connects all these servers together, and you build it using an infrastructure that allows data to be transferred faster than over the IP network. So a Storage Area Network basically takes the disks that are attached to individual servers, moves them all into a common pool, and connects them to all of the servers at once using a Fibre Channel based infrastructure, which means you connect the storage to these machines via fiber. It is pretty much the building block of storage these days in big data centers. However, it does not address the needs of those users or the kinds of scales we're talking about. If we're talking about hundreds of terabytes to petabytes and maybe tens of petabytes, it is not cost effective to build a storage solution with a SAN.

JB: If you're a campus, what size of needs would a SAN support?

AS: I believe that the cost of disks has gone down substantially enough that if you had needs to store perhaps a few or tens of terabytes of data, you could build a disk-based SAN system. The cost in SANs comes from the fiber networking. The switches are somewhat expensive, 20 to 35 thousand dollars apiece, and usually you build a redundant infrastructure, which means that you have to have more than one switch, generally two switches, and all of the other infrastructure. But the idea is that by doing so you are eliminating the cost in human resources that are required to manage this large storage infrastructure.
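
As a rough worked example of the switch figures quoted above, the arithmetic for a redundant pair looks like this. The per-switch prices and the count of two come from the answer; the constant and function names are made up for illustration, and the total deliberately leaves out the disks, cabling, and "all of the other infrastructure" because no figures were given for those.

```python
# Per-switch price range quoted in the talk (2002 US dollars).
SWITCH_COST_LOW = 20_000
SWITCH_COST_HIGH = 35_000
REDUNDANT_SWITCHES = 2  # "generally two switches"


def switch_cost_range(count=REDUNDANT_SWITCHES):
    """Return the (low, high) cost of the switches alone."""
    return count * SWITCH_COST_LOW, count * SWITCH_COST_HIGH


low, high = switch_cost_range()
print(f"Two redundant switches: ${low:,} to ${high:,}")
```

So the switching layer alone would run roughly 40 to 70 thousand dollars before any storage is attached.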

JB: So if you were quite small and growing very slowly, that would be one solution to your storage needs?

AS: This is correct.

JB: Okay. Good. Mark, do you have another question ready here?

MB: Actually, [inaudible]. I got most of mine covered!

JB: Actually, you know, there was one I was trying to remember! What about, we've talked about the uses, that MP3 files are about five megabytes each. What about the faculty lectures or videoconferences, guest videoconferences, teleconferences? Is there much demand at Indiana for storing those kinds of events?

AS: This is an emerging area. In fact, I recently attended an IT seminar given by the videoconferencing group here within our IT services, and it is now possible to provide a service where an entire videoconference or teleconference can be archived, and this is still not excessive, perhaps in the hundreds of megabytes range. So a massive data storage system is actually a good place to archive. In fact, since you talked about video, I should mention that when you think of HSMs, the Hierarchical Storage Management systems, people don't think of them as systems which are suitable for storing video, for example, or streaming media. At Indiana University, we can say with great confidence that this is not exactly true. You can use an HSM by building finely tuned disk caches and large amounts of tape, obviously. For example, our television station on campus has 6,000 Betamax tapes which are falling apart, and so what they are going to do is digitize them, and this is probably a few terabytes. And they would [inaudible] them on our massive data storage, and because they can predictably recall a particular video, because they know that they're going to be broadcasting that on such and such a day, they can go and pre-fetch it from tape onto their own disks. There are also groups here that are storing streaming media on our massive data storage system and they simply make sure they have a cache big enough to keep the hot items hot. That way they're served from disks and they don't have to then worry about cache management, and it's simply a very large amount of storage accessible with a small cache in front of it.
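
The "small cache in front of a large tape store" idea, including pre-fetching an item before a scheduled broadcast, can be sketched as below. This is a generic least-recently-used cache written for illustration, not how HPSS or any particular HSM manages its disk cache; the class name `TapeBackedCache`, the capacity of two items, and the show names are assumptions.

```python
from collections import OrderedDict


class TapeBackedCache:
    """Small disk cache in front of a slow tape store (illustrative LRU policy)."""

    def __init__(self, tape, capacity):
        self.tape = tape           # name -> bytes; stands in for the tape library
        self.capacity = capacity   # number of items the disk cache can hold
        self.disk = OrderedDict()  # cached items, least recently used first
        self.tape_recalls = 0      # how many times tape had to be read

    def _load(self, name):
        self.tape_recalls += 1
        self.disk[name] = self.tape[name]
        while len(self.disk) > self.capacity:
            self.disk.popitem(last=False)  # evict the least recently used item

    def prefetch(self, name):
        """Stage an item onto disk ahead of a known broadcast date."""
        if name not in self.disk:
            self._load(name)

    def read(self, name):
        """Return (data, source), where source is 'disk' or 'tape'."""
        if name in self.disk:
            self.disk.move_to_end(name)  # mark as recently used
            return self.disk[name], "disk"
        self._load(name)
        return self.disk[name], "tape"


tape = {"show_a": b"A", "show_b": b"B", "show_c": b"C"}
cache = TapeBackedCache(tape, capacity=2)

cache.prefetch("show_a")           # staged before the scheduled broadcast
first = cache.read("show_a")[1]    # served from disk, no tape wait
second = cache.read("show_b")[1]   # not staged, so it comes from tape
print(first, second, cache.tape_recalls)
```

An item that is pre-fetched or stays hot is served from disk; anything else costs a tape recall, which is why sizing the cache to the hot set matters.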

MB: We do have a question from a listener if you want to try to get that in here.

JB: Yes, I think that would be good. Go ahead.

MB: This comes from Don Ingalls, who's the Assistant Director of Systems Management at the American Museum of Natural History in New York. His question is, "Did you use SAN in a box technology or open system technology to build your system?"

AS: We do not use SANs. That is something we will be looking at in the next incarnation of our mass store system. Essentially we have multiple servers that all see their own disks. In the next incarnation, we will attach a SAN to it and it could be SAN in a box. We haven't looked into that. The technology we're using, HPSS, could be called open in the sense that it is a collaboration. It's a consortium of labs, national labs such as Los Alamos, Livermore and Sandia Labs, NASA and some academic institutions. And we get source code by becoming a member of this consortium and we can make changes, etc., so we control our own destiny. It is not open in the sense that you can download it for free.

JB: All right, well, very good. Listen, one of the questions I mentioned that we were going to ask is what about a campus with, let's just say, medium storage requirements? Do you have suggestions for that campus or a campus that might want to get into massive storage?

AS: Yes, I believe the HSMs, the Hierarchical Storage Management systems, are going to re-emerge in the next decade and the reason is very simple. The late '80s saw the emergence of the PC as the preferable platform for people on campuses and with that, an interesting phenomenon occurred, and that was that storage became decentralized. It used to be all on the mainframe or a central machine. But then in the '90s, people could put a lot of data on their personal machines and they didn't have to go to a central system. However, the scale has now grown to where we know that model no longer holds. Also, the discipline required in ensuring the integrity and protection of data long term became quickly very obvious. A lot of people realized that they could trash a 20 megabyte hard disk in 1990, and they could do that easily, and a 100 megabyte hard disk in the year 2000. Of course, they never did backups. So these kinds of issues are bringing storage back to a central computing facility, and I believe that it is in the interest of any academic campus, because the costs have gone down substantially, to actually consider an HSM, a Hierarchical Storage Management system. There are a number of choices out there. They can use it because these systems are highly scalable and highly distributed. If you have a multi-campus setting you can use those in the ways that I described. And if you want to do something really in the short term with disks alone, I think SANs are a good choice.

JB: Okay, very good. Mark, what about you? Do you have a final question or comment for either Anurag or just for the audience?

MB: Well, actually, the last thing that I'd like to cover, I think, is viruses.

JB: That's right, we haven't talked about that, have we?

MB: Talking about this huge amount of space, and people are storing things in there, and we talked a bit about no systematic review, but do you do anything--do you scan for viruses in this space or do you presume that the user is going to take care of that? In this last week, we've seen a lot of problems with Klez, for example, and it would seem like you could store a lot of virus code in such a storage system.

AS: This is true. For the reasons I mentioned earlier, no, we don't do any kind of checking of any of the data that go in our system, simply because we can't. They're just data bits to us. You were also going to ask me a question about security. Security becomes a big issue for a lot of these users, such as medical users. They need to have encryption, they need to make sure the data get there securely, they are stored securely, etc., etc. And we actually are currently handling these kinds of issues. HPSS itself is based on a technology called DCE, which has Kerberos 5 inside of it for authentication, and it's built on secure technologies. And I hope that any system that you choose has adequate security.

MB: Obviously I'm glad you did that because I just skipped right over that in my notes here!

JB: Right. Does that mean, though, that you have to--if I'm a campus and I want to establish and use an HSM, that I also have to be using Kerberos or something equivalent to that?

AS: I don't think that's a requirement. What I'm saying is that it certainly doesn't have to be Kerberos. It could be something--could be Windows. In fact, Windows 2000 has Kerberos 5 as well inside of it. So it has to be a technology that is bulletproof, it's secure, it's known to be secure, and that you're able to provide support to those users who need to have their data stored in an encrypted fashion, etc., etc. So there are a number of issues that arise in the area of security and massive data storage systems.

MB: This is not and can't be, based on what you're saying, an encrypting file system.

AS: No, this is not an encrypting file system. These are all issues the user must handle himself or herself.

JB: So these files, as you say, they're just bits to you.

AS: We are just there. We provide a bit bucket and the person who is putting the bits in has to do most of the management themselves. We provide help with that but we don't do that. The software doesn't provide that.

JB: Okay, well, with that, I think I'd like to remind our listeners once more not to forget to download your paper on building a massive distributed storage system at Indiana and also to check out some of the other pictures and graphics on the site, including those sites we didn't talk about, how the retrieval of tapes can be done by robots rather than real people. So let's invite folks to be sure to explore those. With that, it's time for our closing notes and I'd like to thank everyone so much for being with us here today. Please join us again in two weeks, on May 16th, for a session on Mobile Computing on Campus with David Brown and Jay Dominick at Wake Forest University. Howard will be back and rejoin us as technology anchor. Many thanks to our CREN member institutions for their support for today's Tech Talk and many thanks also to our Tech Talk expert, Anurag Shankar. And Mark, thanks so much for standing in--or sitting in--for technology anchor Howard; and to Terry Calhoun, our Tech Talk web guru; to Jason Russell, Gayle Terkeurst and the support team at Merit; to Susie Berneis, who is our audio file transcriber; and finally, a thanks to all of you for being here. You were here because it's time. Bye, Anurag. Bye, Mark. See you all on May 16th.

END OF WEBCAST