Originally published 20 April 2009
This article is the second in a series about multidimensional modeling in SAP BW. I recommend you read Part 1, “Multidimensional Modeling in SAP BW – Art or Science?” before beginning this article.
As a seasoned data warehousing professional I have always wondered why data modeling is considered an art that requires a thorough understanding of the data and a lot of modeling experience. Why not analyze the data scientifically and make accurate predictions on data modeling outcomes?
When compared to a fact table, dimensions ideally have a small cardinality. How can we achieve this same effect and simultaneously limit the number of dimensions?
Within SAP BW it is perfectly possible to analyze the data in a DataStore Object (DSO) or InfoCube with some basic SQL statements. Before initiating the multidimensional modeling process, wouldn’t it be useful to know some basic scientific facts about the data?
This information is a critical input for the dimensional model design process. In SAP BW, the number of dimensions is limited to 16, of which three are defined by SAP (Data Package, Time and Unit/Currency). Keeping in mind that one of the most critical guidelines from SAP about multidimensional modeling in BW is: “The dimension/master (SID) tables should be relatively small with respect to the number of rows in comparison to the fact table (factor 1:10/20),” and that there are only 13 dimensions left to be freely defined, it becomes very clear that the above information plays a crucial role in the definition of an optimal dimensional model where both size and number of dimensions are minimized.
If the ratio of distinct values of a characteristic to the total number of records is higher than 1:5 (question 2:question 1), this characteristic is an excellent candidate for a line-item dimension. This special type of dimension can only contain one characteristic. The SID table is then used directly as the dimension table. This reduces the number of table joins. A typical example would be a document number which could have as many distinct values as the number of records in the data set.
The answer to the question in bullet 3 tells us whether or not two characteristics can be assigned to the same dimension. If there is a 1:N relation between two characteristics, they should be assigned to the same dimension, if there is a N:M relation between two characteristics, it could cause a dimension to degenerate when assigned to the same dimension.
If your data set contains N distinct values of characteristic A and M distinct values of characteristic B, the possible range of distinct combinations of both characteristics in the data set will be from the maximum of N and M to the product of N and M. If the actual distinct number of combinations of both characteristics is closer to the maximum of N and M, then there is a strong relation (1:N) between the characteristics, and they should be assigned to the same dimension. If the actual distinct number of combinations of both characteristics is closer to the product of N and M, then there is a weak relation (N:M) between the characteristics, and they should not be assigned to the same dimension.
If your data set contains 100 distinct values of characteristic A and 1,000 distinct values of characteristic B, the possible range of distinct combinations of both characteristics in the data set will be from 1,000 to 100,000.
If the actual number of distinct combinations of both characteristics is close to 1,000, then there is a good relation between both characteristics, and they should be assigned to the same dimension. An example could be a sales data set with 100 customer groups and 1,000 customers. If every customer always belongs to one and the same customer group, then there will be only 1,000 distinct combinations of customer and customer group in the data set, and both characteristics should be assigned to the same dimension.
If the actual number of distinct combinations of both characteristics is close to 100,000, then there is a weak relation between both characteristics, and they should be assigned to different dimensions. An example could be a sales data set with 100 customers and 1,000 products. If every customer buys all of your products, then there will be 100,000 distinct combinations of customer and product in the data set, and both characteristics should be assigned to different dimensions.
The Dimensional Modeling Optimizer (DMO), an in-house developed BSP application, automates these and even more calculations. The tool analyzes a data set in an existing DSO or InfoCube and generates recommendations for an optimal dimensional model.
Figure 1: Data analysis in DSO
The tool provides several overviews of information about a data set: number of records in the data set, number of distinct values of characteristics, whether or not they are good candidates for a line-item dimension, information about compounding (compounded InfoObjects should be assigned to the same dimension if possible) and the actual number of distinct combinations of all sets of characteristics (and thus their compatibility).
Figure 2: Information on data sets
Using this information the designer can assign characteristics to dimensions in a virtual design of the dimensional model of an InfoCube. The DMO will then calculate the exact expected dimensional model (number of records in fact and dimension tables) within seconds. It eliminates the need to actually build the InfoCube and load data into it. Once the virtual dimensional model is optimized, it only takes a mouseclick to actually convert the virtual optimal model into a real InfoCube.
Figure 3: Modeling for InfoCube
The advantages of an optimal dimensional model are huge. The DMO facilitates the scientific approach to dimensional modeling and typical results are:
Recent articles by Thierry Helderweirt
Comments
Want to post a comment? Login or become a member today!
Be the first to comment!