What You Possibly Don’t Know About Columnar Storage
Columnar storage is a familiar data storage technique used by many data warehousing products because it is highly effective in a wide range of computing scenarios. Within the industry, the technique is practically a synonym for high performance.
But is columnar storage a perfect strategy? A quick web search shows that criticisms of it center on data modification. There is little discussion of its application to read-only data analysis and computing, which is what this article addresses.
The technique has little significance for in-memory computing
The idea behind columnar storage is simple. Hard disks perform poorly at feeding data when head jumps are frequent, so under row-based storage a full scan is performed to retrieve data. Since one computation probably involves only a few columns, much of the retrieval effort is wasted. Columnar storage retrieves only the useful columns, reducing data access significantly. In big data computing particularly, disk scanning is time-consuming, and a large reduction in data access greatly speeds up the overall process. Moreover, a single column is likely to contain many identical values, so columnar data is easier to compress, further reducing data size and enhancing performance.
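The I/O saving can be illustrated with toy arithmetic. The table shape, field width, and query below are illustrative assumptions, not figures from any particular product:

```python
# Toy comparison of bytes read for a query touching 2 of 20 columns,
# under row-based versus columnar layouts.

ROWS = 1_000_000        # records in the table
COLS = 20               # columns per record
FIELD_BYTES = 8         # assume fixed 8-byte fields for simplicity
QUERY_COLS = 2          # columns the query actually needs

# Row-based storage must scan every whole record.
row_based_read = ROWS * COLS * FIELD_BYTES

# Columnar storage reads only the needed columns.
columnar_read = ROWS * QUERY_COLS * FIELD_BYTES

print(row_based_read // columnar_read)  # -> 10 (a 10x reduction in I/O)
```

With compression on top of this, the effective reduction is usually even larger, which is why the technique shines for wide tables queried on few columns.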
Now we can see that columnar storage increases performance because it reduces hard disk access; it does not reduce the amount of computation. If data is already loaded into memory, the technique is unnecessary. Structured data processing is generally row-based, and storing in-memory data column-wise complicates the construction of records, which hinders performance. Except for specialized vector computing (which is column-based and common in data mining), the columnar strategy is not suitable for relational-style in-memory computing (or for use with in-memory databases).
The technique exacerbates discontinuous hard disk access
Columnar storage stores data column by column, so accessing multiple columns at once for a computation becomes random and discontinuous. The more columns accessed, the more random the access. Random access seriously hurts HDD performance; when many columns are involved, or the table has few columns in total, performance can even fall below that of row-based storage, whose access is continuous, because the retrieval process abounds with head jumps. Concurrent tasks, and the parallel processing discussed below, further exacerbate the problem. Head jumps occur with row-based concurrent computing too, but far less frequently: with columnar storage, one concurrent task generates as many concurrent access requests as there are involved columns, whereas with row-based storage each task generates only one.
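The multiplication of access requests can be made concrete. In the sketch below, the number of concurrent sequential read positions on the disk is used as a rough proxy for seek pressure; the function and numbers are illustrative assumptions:

```python
# Each columnar task opens one read stream per accessed column, while a
# row-based task opens a single stream. The head must jump among all
# active streams, so the stream count approximates seek pressure.

def seek_streams(tasks: int, cols_per_query: int, columnar: bool) -> int:
    """Number of concurrent sequential read positions on the disk."""
    return tasks * cols_per_query if columnar else tasks

# 8 concurrent queries, each touching 5 columns:
print(seek_streams(8, 5, columnar=True))   # -> 40 positions to jump among
print(seek_streams(8, 5, columnar=False))  # -> 8
```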
A solution is to enlarge the buffer that holds retrieved data, reducing the proportion of time the disk spends seeking. But setting up a buffer for each column consumes a large amount of memory when many columns are involved. An alternative is to add more hard disks and distribute the columns across them. The trouble is that columnar storage is generally applied in scenarios with a large number of columns, usually far more than the number of hard disks a machine can hold, so random accesses to the same disk will still conflict.
Columnar storage is friendlier to SSDs, which have no seek time problem.
The technique makes indexes inefficient
The commonly used indexing technique finds desired records in a large data set by key values. The nature of indexing is sorting, as mentioned in a previous article: an index records the ordered key values and the positions of their corresponding records in the original table. With row-based storage, the position of a record can be represented by a single number. With columnar storage, each column of a record has its own position, and in principle all of them must be recorded. The index then becomes almost as large as the original table, making access far more complicated and consuming too much space; it is no better than simply copying the original table and sorting the copy.
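The size blow-up can be estimated with toy numbers (assumptions for illustration, not measurements):

```python
# Rough index-size estimate: with row-based storage an index entry holds
# one key and one record position; with naive per-column indexing under
# columnar storage it must hold a position for every column.

KEY_BYTES = 8
POS_BYTES = 8
ROWS = 1_000_000
COLS = 20

row_index = ROWS * (KEY_BYTES + POS_BYTES)               # 16 MB
naive_col_index = ROWS * (KEY_BYTES + COLS * POS_BYTES)  # 168 MB

print(naive_col_index / row_index)  # -> 10.5
```

At this ratio the index is of the same order of magnitude as the table itself, which is why the naive approach is impractical.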
Can we store the position of only one column per record and calculate the positions of the rest by multiplication? Field values of certain data types, such as strings, have variable lengths, and even fixed-length types such as integer and date become variable-length under the compression commonly used with columnar storage. If all field values are stored at fixed lengths, the index becomes simple and access speeds up, but the data volume grows, making traversal, the very operation columnar storage is mainly used for, less cost-effective.
A practical method is to divide the original table into a number of blocks. Within each block, data is stored in a column-based format. The index is created over the blocks and quickly locates which block holds the desired data; only that block then needs to be scanned.
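A minimal sketch of this block-level indexing, assuming the table is sorted by the index key: the index keeps only each block's first key, a binary search locates the candidate block, and only that block is scanned. The block contents here are hypothetical:

```python
import bisect

# Hypothetical blocks: each holds its records' key values, sorted overall.
blocks = [[1, 3, 5], [7, 9, 11], [13, 15, 17]]
block_first_keys = [b[0] for b in blocks]  # the block-level index

def find(key):
    # Locate the last block whose first key <= key, then scan inside it.
    i = bisect.bisect_right(block_first_keys, key) - 1
    if i < 0:
        return None
    return key if key in blocks[i] else None  # in-block scan

print(find(9))  # -> 9
print(find(8))  # -> None
```

The index holds one entry per block rather than one per record, so it stays small; the price is the linear in-block scan.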
This style of indexing performs worse because of the additional in-block scan. If the original table is ordered by the index key (usually the table's primary key), it is easy to locate the blocks, most of the time only one, holding the desired data, and the performance loss is acceptable; the typical scenario is finding records by a single given key value. If the table is not ordered by the index key, the index is useless: the desired data may fall in nearly every block, and performance approaches that of a full table scan.
The technique complicates segment-based parallel processing
Multithreaded parallel processing is a must for making full use of a multi-core CPU, and parallel processing requires that the data be segmented.
There are two basic requirements for data segmentation: the amount of data in each segment should be nearly equal (so the workload allocated to each thread is almost equal), and the segmentation should be flexible (because the number of threads cannot be predicted). Dividing data into segments is easier with row-based storage. We can place a marker at the end of each record (or of every N records), split the data into K segments by byte count, and then find the first ending marker within each segment to use as its actual starting point. This is unfeasible for columnar storage because field values have variable lengths: a segmenting point within one column may not fall in the same record as the corresponding point within another column, producing misaligned data.
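The row-based scheme described above can be sketched as follows, using a newline byte as the record-end marker (an assumption for illustration; real formats use their own markers):

```python
def segment_offsets(data: bytes, k: int) -> list:
    """Split data into up to k segments whose starts align with record
    boundaries: take an even byte split, then advance each split point
    past the next record-end marker."""
    size = len(data)
    offsets = [0]
    for i in range(1, k):
        guess = size * i // k
        if data[guess - 1:guess] != b"\n":  # not already on a boundary:
            nl = data.find(b"\n", guess)    # advance past the next marker
            if nl == -1:
                break
            guess = nl + 1
        if offsets[-1] < guess < size:
            offsets.append(guess)
    return offsets

records = b"row1\nrow2\nrow3\nrow4\nrow5\nrow6\n"
print(segment_offsets(records, 3))  # -> [0, 10, 20]
```

Each thread then scans from its offset to the next one. No such marker-based realignment is possible when every column would need its own, mutually consistent split point.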
Block-based segmentation solves columnar storage's segmentation problem: the block is the unit of segmentation, and data within a block is not further divided for parallel processing. But this creates a dilemma. On one hand, the number of blocks should be large enough to permit flexible segmentation (you cannot get 10 segments from only 5 blocks); since a modern computer generally has many CPU cores, nearly 100 blocks are needed to achieve a flexible, balanced split. On the other hand, too many blocks means the data of one column is physically scattered across many discontinuous blocks, which complicates the traversal code and forces the retrieval of some useless data between blocks; on an HDD there is also the seek time problem. The more blocks the data is divided into, the more serious these problems become.
Only when the space a column occupies within a block is much larger than the retrieval buffer does the proportion of time spent on useless data and seeks become relatively small. This requires a sufficiently large number of records per block; in other words, a large data volume is a precondition for effective parallel processing over columnar data. With an HDD (including a disk array containing a large group of HDDs), a table on a single machine should normally hold at least one billion records, with the data volume exceeding 100GB; otherwise parallel processing brings no noticeable speedup. This is exactly the problem faced by multidimensional analysis, the scenario for which columnar storage is particularly well suited. Besides, since the size of each block is predetermined, neighboring blocks cannot be physically merged as data is continuously appended. As the number of blocks grows, managing them becomes a challenge, demanding a scalable, dedicated space to store the block index.
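Why the block count limits flexibility can be shown with a toy allocator (illustrative only) that deals out blocks among threads round-robin:

```python
def assign_blocks(num_blocks: int, num_threads: int) -> list:
    """Round-robin block assignment; returns blocks-per-thread counts."""
    counts = [0] * num_threads
    for b in range(num_blocks):
        counts[b % num_threads] += 1
    return counts

print(assign_blocks(5, 10))    # only 5 threads get work; 5 sit idle
print(assign_blocks(100, 10))  # -> [10]*10, an even split
```

With only a handful of blocks, some threads inevitably idle; with roughly 100 blocks, any plausible thread count gets a near-even share, which is the balance the text describes.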
These problems, however, do not negate the great advantage of columnar storage in read-only computing scenarios; what must be avoided is misuse of the technique. For data warehousing products, it is appropriate to let the system administrator, or the user, decide whether columnar storage is needed, how to divide columnar data into blocks, which part of the data should be stored in a columnar format, and which part should be kept in both row-based and columnar formats to achieve higher performance through data redundancy.