Since Databend will compact the table blocks periodically, we can do extracting the virtual column as an additional operation, if the column data type is JSON. The operation of extracting virtual columns is performed asynchronous, so that it will not affect the performance of the data insertion. We can use this statistical information to determine which key paths need to generate virtual columns. "FP" stands for frequent pattern, usually used to calculate item frequencies and identify frequent items.Įvery time the user queries one key path of the JSON data, it will be recorded by the FPGrowth algorithm, and finally a tree-like statistical information is generated. We use the FPGrowth algorithm to count the access frequency of the JSON data key paths. In order to know which key paths are frequently queried, We should only generate virtual columns for frequently queried JSON key paths. Collect frequently accessed JSON key paths Įxtracting and storing virtual columns requires extra parsing processes and storage resources, so it is not appropriate to generate virtual columns for all key paths. Taking advantage of this feature, we can detect and extract frequently queried key path of JSON as virtual column to speed up query processing. However, in practice use, JSON data is often generated by machine and has a fairly rigid and predictable structure. The schema of JSON can be changed arbitrarily. Reference-level explanation Virtual Column Use JSONB instead of JSON as the underlying binary encoding format for Semi-Structured data types, which will optimize the parsing speed and benefit for key path access.Automatically detects frequently queried key paths of JSON columns and extracts them as virtual columns to achieving similar query performance as other columns.This RFC introduced the following two proposals: In order to make the query performance of Semi-Structured data as fast as other data types. Query a single key path requires reading and parsing the whole JSON data, which takes more parsing time and takes up more memory.JSON must be parsed every time used, and text parsing is very slow, especially for large complex object data.It has the following performance problems: Motivation Ĭurrently, Databend Semi-Structured data is stored as the raw text JSON format and use serde_json to do serialization and deserialization. This RFC describes the design of JSON performance optimization. Tracking Issue: datafuselabs/databend#6994.
0 Comments
Leave a Reply. |