LotusSQL: SQL Engine for High-Performance Big Data Systems

In recent years, Apache Spark has become the de facto standard for big data processing. SparkSQL is a module offering support for relational analysis on Spark with Structured Query Language (SQL). SparkSQL provides convenient data processing interfaces. Despite its efficient optimizer, SparkSQL stil...

Bibliographic Details
Main Authors: Xiaohan Li, Bowen Yu, Guanyu Feng, Haojie Wang, Wenguang Chen
Format: Article
Language: English
Published: Tsinghua University Press 2021-12-01
Series: Big Data Mining and Analytics
Online Access: https://www.sciopen.com/article/10.26599/BDMA.2021.9020009
collection DOAJ
description In recent years, Apache Spark has become the de facto standard for big data processing. SparkSQL is a module offering support for relational analysis on Spark with Structured Query Language (SQL). SparkSQL provides convenient data processing interfaces. Despite its efficient optimizer, SparkSQL still suffers from the inefficiency of Spark resulting from Java virtual machine and the unnecessary data serialization and deserialization. Adopting native languages such as C++ could help to avoid such bottlenecks. Benefiting from a bare-metal runtime environment and template usage, systems with C++ interfaces usually achieve superior performance. However, the complexity of native languages also increases the required programming and debugging efforts. In this work, we present LotusSQL, an engine to provide SQL support for dataset abstraction on a native backend Lotus. We employ a convenient SQL processing framework to deal with frontend jobs. Advanced query optimization technologies are added to improve the quality of execution plans. Above the storage design and user interface of the compute engine, LotusSQL implements a set of structured dataset operations with high efficiency and integrates them with the frontend. Evaluation results show that LotusSQL achieves a speedup of up to 9× in certain queries and outperforms Spark SQL in a standard query benchmark by more than 2× on average.
institution Kabale University
issn 2096-0654
affiliation Department of Computer Science and Technology, Tsinghua University, China (all five authors)
volume 4
issue 4
pages 252-265
doi 10.26599/BDMA.2021.9020009
topic big data
c++
structured query language (sql)
query optimization