Background:
Wilburt Labio is a Ph.D. student in the Computer Science department studying the maintenance of Data Warehouses with the InfoLab group.
Ph.D. Admissions
One of the biggest questions that undergraduate students in CS face is whether to go into industry or to stay in school and complete a Ph.D. For Wilburt Labio, the choice to get a Ph.D. came from a desire for opportunity and flexibility. Even though you sacrifice money by staying in school, you gain flexibility in job choice. Once you go into industry, you are quite likely to stay on that track and never return, whereas if you get your Ph.D. first, you still have the option of going into industry, but you can also work in a research lab or teach in academia. If you want to start your own business, the advanced degree may help with your credibility, making financing that much easier. Once you get a job as a researcher or a teacher, you have the freedom to explore problems that you find interesting for their own sake, rather than working frantically on a project that has to reach market by some deadline in order to make money.
Getting into a Ph.D. program requires good grades and good GRE scores, much like getting into college. More important, though, are the letters of recommendation, which show that you have distinguished yourself in class and have what it takes to do self-directed research. It is important to get involved in research as soon as you can in order to prove that you can handle it, and if you need more time to pursue research opportunities, you might consider getting your MS first.
Ph.D. Life
Once admitted, expect to take about five years to complete the program. This entails passing Quals and Comprehensive exams and taking on TA or RA jobs. All of this is easy, however, compared to the dissertation. Doing research is hard and requires a lot of resourcefulness, self-motivation, perseverance, and hard work. Expect to spend about 40% of your time coding, 40% researching, and 5% each on teaching, classes, Quals/Comps, and other things. The bulk of your time, then, goes to your research project.
As for his own life as a Ph.D. student, Wilburt works with the databases group, which he chose because the projects are interesting and deal with data, which is everywhere. This opens up many different positions after he finishes the program, since many companies today are trying to find a way to optimize how they use their data. He also enjoys working with the talented professors, industry workers, and researchers worldwide who also study databases.
Data Warehousing
Commercial databases today are typically optimized for what is called Online Transaction Processing, or OLTP. An OLTP system typically handles a large number of relatively simple queries. For example, a query might update an account by a certain amount of money or find out when the last order from a certain supplier was placed.
With all the data that companies are gathering, though, there is a growing trend within management to use this data to help make business decisions; that is, not just to look up and update simple pieces of data, but to integrate many components of that data and present them in a form that allows you to make a "big picture" decision. These are analytical queries, which are relatively few compared to OLTP queries but are much more difficult. For example, an analytical query could ask, "For each month and product, which advertisements had the most significant effect on sales?"
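To make the contrast concrete, here is a minimal sketch in Python using the standard sqlite3 module. The tables, columns, and sample rows are all invented for illustration, and "most significant effect on sales" is simplified to "highest total sales attributed to the ad"; a real analysis would need a more careful measure.

```python
import sqlite3


def make_demo_db():
    """Build an in-memory database filled with invented sample data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL);
        CREATE TABLE sales (month TEXT, product TEXT, ad TEXT, amount REAL);
        INSERT INTO accounts VALUES (1, 100.0), (2, 50.0);
        INSERT INTO sales VALUES
            ('2000-01', 'widget', 'tv',    300.0),
            ('2000-01', 'widget', 'radio', 120.0),
            ('2000-01', 'gadget', 'radio', 200.0),
            ('2000-02', 'widget', 'tv',    150.0),
            ('2000-02', 'widget', 'radio', 250.0),
            ('2000-02', 'widget', 'radio', 150.0);
        """
    )
    return conn


def deposit(conn, account_id, amount):
    """OLTP-style query: touches a single row, looked up by its key."""
    conn.execute(
        "UPDATE accounts SET balance = balance + ? WHERE id = ?",
        (amount, account_id),
    )
    row = conn.execute(
        "SELECT balance FROM accounts WHERE id = ?", (account_id,)
    ).fetchone()
    return row[0]


def top_ad_per_month_product(conn):
    """Analytical-style query: scans and aggregates the whole sales table."""
    rows = conn.execute(
        "SELECT month, product, ad, SUM(amount) "
        "FROM sales GROUP BY month, product, ad"
    ).fetchall()
    best = {}
    for month, product, ad, total in rows:
        key = (month, product)
        # Keep the ad with the largest total for each (month, product) pair.
        if key not in best or total > best[key][1]:
            best[key] = (ad, total)
    return best


if __name__ == "__main__":
    db = make_demo_db()
    print(deposit(db, 1, 25.0))
    print(top_ad_per_month_product(db))
```

The deposit touches one row, while the analytical query has to read every sales row before it can answer. That difference in access pattern is why a database tuned for one kind of query tends to serve the other poorly.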
While you could hire a business analyst to go through all the databases and integrate this information manually, it would be ideal if we could optimize databases to handle these queries automatically, in order to take advantage of the computerized data and processing power that already exist. Data warehousing allows us to do precisely this.
Data warehousing takes many disparate databases within an organization and integrates them, cleaning them up so that they are all consistent and can be examined together. This puts all the data into one giant data warehouse, a hierarchy of warehouse tables and views, which an analyst can go to as a single source for all the data they may be looking for.
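As a rough illustration of the integrate-and-clean step, the Python sketch below merges two hypothetical source databases that record the same kind of fact in different shapes. The source names, field names, and cleaning rules are assumptions made up for this example, not a description of any particular warehouse product.

```python
# Two hypothetical source systems that describe orders differently.
ORDERS_DB = [
    {"cust_name": "  ACME CORP ", "amount_cents": 2500},
    {"cust_name": "globex", "amount_cents": 1000},
]
BILLING_DB = [
    {"customer": "Acme Corp", "total_usd": 40.0},
]


def clean_name(name):
    """Bring customer names into one consistent form."""
    return name.strip().title()


def from_orders(record):
    """Convert an orders_db record to the shared warehouse schema."""
    return {
        "customer": clean_name(record["cust_name"]),
        "amount_usd": record["amount_cents"] / 100,
        "source": "orders_db",
    }


def from_billing(record):
    """Convert a billing_db record to the shared warehouse schema."""
    return {
        "customer": clean_name(record["customer"]),
        "amount_usd": float(record["total_usd"]),
        "source": "billing_db",
    }


def build_warehouse(orders, billing):
    """Load both sources into a single, consistent warehouse table."""
    return [from_orders(r) for r in orders] + [from_billing(r) for r in billing]


def total_by_customer(warehouse):
    """A simple warehouse view: total spending per customer across sources."""
    totals = {}
    for row in warehouse:
        totals[row["customer"]] = totals.get(row["customer"], 0.0) + row["amount_usd"]
    return totals


if __name__ == "__main__":
    warehouse = build_warehouse(ORDERS_DB, BILLING_DB)
    print(total_by_customer(warehouse))
```

Once the names and units agree, a question that spans both sources becomes a single pass over one table, which is the "single source" benefit described above.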
This solves many of the problems of accessing multiple databases, but other problems remain. It is still hard to answer analytical queries efficiently, and advanced algorithms are needed for this. It is also hard to keep such a data warehouse up to date: updating such a large collection of data is a big task, and it can typically be done only every so often, and only within a limited time window so as not to interrupt other work. So figuring out how to apply all the required updates within an eight-hour window, for example, can be problematic if there is a large quantity of data to be updated.
Wilburt's research is on this maintenance problem: keeping things up to date. He looks at managing the storage requirements for a data warehouse, keeping track of what has changed in order to optimize updates, repairing the database after failed transactions, and doing it all efficiently. That is to say, how can you keep a data warehouse working as well as possible with as little overhead as possible?
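The idea of keeping track of what has changed can be sketched with a generic incremental-maintenance example in Python. This is not Wilburt's algorithm, only a textbook-style illustration: a summary view of total sales per product is updated from a log of changes instead of being rebuilt from the full base table. All names and data here are hypothetical.

```python
def recompute_view(base_rows):
    """Full recomputation: scan every base row. View maps product -> (total, count)."""
    view = {}
    for product, amount in base_rows:
        total, count = view.get(product, (0.0, 0))
        view[product] = (total + amount, count + 1)
    return view


def apply_deltas(view, deltas):
    """Incremental maintenance: touch only the products named in the change log.

    deltas is a list of (op, product, amount) with op 'insert' or 'delete'.
    A row count is kept per product so a group can be dropped from the view
    exactly when its last base row is deleted.
    """
    updated = dict(view)
    for op, product, amount in deltas:
        if op not in ("insert", "delete"):
            raise ValueError(f"unknown operation: {op}")
        sign = 1 if op == "insert" else -1
        total, count = updated.get(product, (0.0, 0))
        total, count = total + sign * amount, count + sign
        if count == 0:
            updated.pop(product, None)
        else:
            updated[product] = (total, count)
    return updated


if __name__ == "__main__":
    base = [("widget", 10.0), ("widget", 5.0), ("gadget", 7.0)]
    view = recompute_view(base)
    changes = [("insert", "gizmo", 4.0), ("delete", "widget", 5.0)]
    print(apply_deltas(view, changes))
```

The cost of apply_deltas depends on the number of changes, not on the size of the warehouse, which matters when the update has to fit inside a limited maintenance window.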