Alexander Nnakwue Software Engineer. React, Node.js, Python, and other developer tools and libraries.

Exploring SQL and Elasticsearch with Open Distro



Introduction

Open Distro is an open-source, enterprise-grade Elasticsearch distribution with a lot of advanced features, including a robust security option, alerting, integrated event monitoring, performance analysis, and more. It also offers an alternative for developers already familiar with SQL to quickly hit the ground running with writing Elasticsearch queries.

Open Distro is an Apache 2.0-licensed extension of the open-source version of Elasticsearch (ES) and Kibana. It is currently developed and maintained by AWS as part of its open source effort. This became necessary due to the continued need to sustain new and advanced ES features as truly open-source software.

Open Distro allows you to interact with a cluster using SQL and to perform operations for which you would normally have to use the Elasticsearch domain-specific language (DSL). This feature was initially based on the elasticsearch-sql plugin, which is now being gradually phased out or deprecated.

In this tutorial, we are going to focus on how to query Elasticsearch data using SQL with the help of this open-source tool. We are going to cover:

  • Installation and getting started with Open Distro
  • Using the REST API to ingest data into our ES cluster
  • SQL with Elasticsearch
    • Endpoints and operations
    • Features and supported protocols
    • Exploring additional plugin settings and options

Prerequisites

To easily follow along with this tutorial, it is advisable to have a basic knowledge of Elasticsearch and writing basic ES queries with the DSL. This is important so that we have a reference point when we write similar queries with SQL.

However, if you are already familiar with basic SQL commands, then this should not be too difficult to follow. For a primer on working with Elasticsearch, check out this awesome tutorial on Understanding Elasticsearch query builder in Node.js.

Since we are interested in Open Distro and its SQL feature, we are going to install the Open Distro plugin as a standalone plugin on our machine, the same way we can install and use any other ES plugin on a compatible cluster.

Note: Open Distro for ES and Kibana is also available as a Docker image, among other options. More details in the next section.

First steps: Getting started

In order to get started with Open Distro for Elasticsearch, let’s begin by installing and configuring it for development. It’s important to note, though, that there are several available options for installing and getting started with Open Distro.

As we mentioned earlier, we are going to be installing the standalone plugin and developing on our machine. Note that if we already use AWS and intend to have this feature, we can make use of the cloud hosted version offered as a service.

Note: For more information on other available options for installing and configuring this tool, we can check this section on installation and configuration in the documentation.

Let’s make sure we have a compatible ES version on our machine. Details about plugin compatibility with ES versions can be found in the documentation.


Here, we will be using Elasticsearch version 7.1.1 and Open Distro plugin version 1.1.0.0 for the SQL feature, since we are only interested in this for now. To install the plugin, we can run the following command on our terminal/command prompt:

sudo bin/elasticsearch-plugin install https://d3g5vo6xdbdb9a.cloudfront.net/downloads/elasticsearch-plugins/opendistro-sql/opendistro_sql-1.1.0.0.zip

Before running this command, we should ensure that we have Elasticsearch version 7.1.1 installed on our machine. For macOS, we can run the following command to download the gzipped folder:

curl -L -O https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.1.1-darwin-x86_64.tar.gz

After the download is complete, we can then run this command to untar the gzipped folder:

tar -xzvf elasticsearch-7.1.1-darwin-x86_64.tar.gz

Then we can navigate into the folder to test our installation. We can run the commands below:

cd elasticsearch-7.1.1   # navigate into the downloaded Elasticsearch distribution
./bin/elasticsearch      # test the installation by starting it up
ls                       # list all files and folders inside the distribution

The output after the plugin installation should be similar to the following:

-> Downloading https://d3g5vo6xdbdb9a.cloudfront.net/downloads/elasticsearch-plugins/opendistro-sql/opendistro_sql-1.1.0.0.zip
[=================================================] 100%   
-> Installed opendistro_sql

Note: We can confirm our plugin installation is successful by running this command in the ES folder:

sudo bin/elasticsearch-plugin list

Now that we are done setting up opendistro_sql for development, let’s look at how we can ingest data into our ES cluster with a simple Node.js script. After that, we can start writing our queries with SQL statements.

Using the REST API to ingest data into our ES cluster

From our earlier tutorial on working with Elasticsearch, we can take a look at the section on how to ingest data into our ES cluster with Node.js. For reference, here is a similar script that does the same thing:

'use strict'

const fs = require('fs'); // Node.js file system module
require('dotenv').config(); // for environment variables
const { Client } = require('@elastic/elasticsearch'); // ES client
const client = new Client({ node: 'http://localhost:9200' }); // ES client connection

// import JSON from file to ingest into our cluster
const data = JSON.parse(fs.readFileSync(__dirname + '/demo.json'));

const index = process.env.ES_INDEX; // ES index

const writeDataToES = async (index, data) => {
    for (let i = 0; i < data.length; i++) {
      try {
        const res = await client.create({
          refresh: true,
          index: index,
          id: i,
          body: data[i]
        });
        console.log("Successfully imported data", res);
      } catch (error) {
        console.error("Failed to import data to ES", error);
        return;
      }
    }
};

writeDataToES(index, data);

// to run the script, create a new index.js file, copy the above code snippet,
// then run <node index.js> from the terminal

Note that we can give the ES index any name we choose. Also, make sure all the imported dependencies are installed so the script runs successfully.

After running the script, we can confirm that the data has been successfully written to the cluster by visiting this URL in a browser (or fetching it with curl):

http://127.0.0.1:9200/index_name/_search?pretty

Note: In this tutorial, our queries will be run against this same JSON data, available here.

SQL with Elasticsearch

Endpoints and operations

As reviewed earlier, Open Distro for Elasticsearch SQL lets us write queries in SQL rather than the DSL. In order to use this feature, all we need to do is send our requests to the http://<host:port>/_opendistro/_sql URI or endpoint.

While all calls to this endpoint support both request parameters and a request body, the latter is usually the recommended approach for making calls to this URI.

Note: POST requests are the recommended approach because there are no limitations on content length. Also, other parameters can be passed to the plugin when making a call for extra functionality, such as prepared statements.
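To make the note above concrete, here is a minimal sketch of how such a POST body might be assembled before handing it to curl or an HTTP client. The parameters field mirrors the prepared-statement support mentioned above, but its exact shape is an assumption, so verify it against the plugin documentation for your version:

```javascript
// Sketch: assemble the JSON body for a POST call to the SQL endpoint.
// The "parameters" field below illustrates prepared statements; its
// exact field names are an assumption -- check your plugin's docs.
function buildSqlBody(query, parameters) {
  const body = { query };
  if (parameters) body.parameters = parameters;
  return JSON.stringify(body);
}

const body = buildSqlBody(
  'SELECT * FROM cars WHERE Cylinders = ? LIMIT ?',
  [{ type: 'integer', value: 4 }, { type: 'integer', value: 5 }]
);
console.log(body);
```

The resulting string can then be passed directly as the -d payload of the curl commands used throughout this tutorial.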

Additionally, the explain endpoint is often used for query translation and troubleshooting. The endpoints include:

  • GET – here, we can send a regular HTTP GET request with our query embedded in a URL parameter. Note that the query must be URL-encoded, as shown below:
    curl -H 'Content-Type: application/json' -X GET 'localhost:9200/_opendistro/_sql?sql=SELECT%20*%20FROM%20cars'
  • POST – we can also send an HTTP POST request with our query in the request body, as shown below:
    curl -H 'Content-Type: application/json' -X POST localhost:9200/_opendistro/_sql -d '{"query" : "SELECT * FROM cars"}'
  • explain – to translate a query into the ES DSL, we can make use of the explain endpoint. The output is the DSL in JSON format.

Note: We can also copy the translated query and paste it into the console to run it against our ES cluster. Let’s see an example below:

curl -H 'Content-Type: application/json' -X POST localhost:9200/_opendistro/_sql/_explain -d '{"query" : "SELECT Origin,Horsepower FROM cars WHERE Cylinders > 4 LIMIT 5"}'

The result:

{
        "from": 0,
        "size": 5,
        "query": {
                "bool": {
                        "filter": [{
                                "bool": {
                                        "must": [{
                                                "range": {
                                                        "Cylinders": {
                                                          "from": 4,
                                                          "to": null,
                                                          "include_lower": false,
                                                          "include_upper": true,
                                                          "boost": 1.0
                                                    }
                                              }
                                        }],
                                        "adjust_pure_negative": true,
                                        "boost": 1.0
                                }
                        }],
                        "adjust_pure_negative": true,
                        "boost": 1.0
                }
        },
        "_source": {
                "includes": ["Origin", "Horsepower"],
                "excludes": []
        }
}

As an aside, note that the features or commands supported by native SQL queries are also supported by Open Distro for SQL. They include SELECT, DELETE, WHERE, ORDER BY, GROUP BY, INNER JOIN, SHOW, and so on.

For example, using the WHERE conditional statement is as easy as this:

curl -H 'Content-Type: application/json' -X POST localhost:9200/_opendistro/_sql -d '{"query": "SELECT * FROM cars WHERE Acceleration >= 20 LIMIT 5"}'

Note: The WHERE clause filters our data to include only rows where the given condition is true.

Here’s the result of the query above:

{
        "took": 3,
        "timed_out": false,
        "_shards": {
                "total": 1,
                "successful": 1,
                "skipped": 0,
                "failed": 0
        },
        "hits": {
                "total": {
                        "value": 8,
                        "relation": "eq"
                },
                "max_score": 0.0,
                "hits": [{
                        "_index": "cars",
                        "_type": "_doc",
                        "_id": "39",
                        "_score": 0.0,
                        "_source": {
                                "Name": "volkswagen super beetle 117",
                                "Miles_per_Gallon": null,
                                "Cylinders": 4,
                                "Displacement": 97,
                                "Horsepower": 48,
                                "Weight_in_lbs": 1978,
                                "Acceleration": 20,
                                "Year": "1971-01-01",
                                "Origin": "Europe"
                        }
                }, {
                        "_index": "cars",
                        "_type": "_doc",
                        "_id": "382",
                        "_score": 0.0,
                        "_source": {
                                "Name": "amc concord dl",
                                "Miles_per_Gallon": 23,
                                "Cylinders": 4,
                                "Displacement": 151,
                                "Horsepower": null,
                                "Weight_in_lbs": 3035,
                                "Acceleration": 20.5,
                                "Year": "1982-01-01",
                                "Origin": "USA"
                        }
                }, {
                        "_index": "cars",
                        "_type": "_doc",
                        "_id": "63",
                        "_score": 0.0,
                        "_source": {
                                "Name": "plymouth cricket",
                                "Miles_per_Gallon": 26,
                                "Cylinders": 4,
                                "Displacement": 91,
                                "Horsepower": 70,
                                "Weight_in_lbs": 1955,
                                "Acceleration": 20.5,
                                "Year": "1971-01-01",
                                "Origin": "USA"
                        }
                }, {
                        "_index": "cars",
                        "_type": "_doc",
                        "_id": "359",
                        "_score": 0.0,
                        "_source": {
                                "Name": "ford escort 2h",
                                "Miles_per_Gallon": 29.9,
                                "Cylinders": 4,
                                "Displacement": 98,
                                "Horsepower": 65,
                                "Weight_in_lbs": 2380,
                                "Acceleration": 20.7,
                                "Year": "1982-01-01",
                                "Origin": "USA"
                        }
                }, {
                        "_index": "cars",
                        "_type": "_doc",
                        "_id": "25",
                        "_score": 0.0,
                        "_source": {
                                "Name": "volkswagen 1131 deluxe sedan",
                                "Miles_per_Gallon": 26,
                                "Cylinders": 4,
                                "Displacement": 97,
                                "Horsepower": 46,
                                "Weight_in_lbs": 1835,
                                "Acceleration": 20.5,
                                "Year": "1970-01-01",
                                "Origin": "Europe"
                        }
                }]
        }
}

Note: In this tutorial, we make use of the curl command-line tool for all our requests. We can install curl on macOS with Homebrew by running this command: brew install curl.

Just as with the DSL, we can also make use of the SQL feature for aggregations, queries, joins, SHOW commands, and more.

Querying our cluster

Statements

  • WHERE – this clause filters our dataset to return only rows where a given conditional statement is true. A sample query and result for the WHERE clause can be found here on the GitHub repo.
  • ORDER BY – this clause allows us to sort our data by a particular column (in this case, we are using the _id column) numerically or alphabetically. Here, we are doing so numerically and in ascending order. A sample query and result for the ORDER BY clause can be found here on the GitHub repo.
  • GROUP BY – this clause is used with aggregate functions, e.g., the range function, so as to group data with similar attributes into a single bucket. A sample query and result for the GROUP BY clause can be found here on the GitHub repo.
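As a quick local illustration of what these three clauses do, here is a sketch on made-up rows shaped like the cars documents used in this tutorial (the values are invented for the example, not the repo samples):

```javascript
// Made-up rows shaped like the cars documents ingested earlier.
const rows = [
  { Name: 'car a', Cylinders: 4, Origin: 'USA' },
  { Name: 'car b', Cylinders: 8, Origin: 'USA' },
  { Name: 'car c', Cylinders: 4, Origin: 'Europe' },
];

// WHERE Cylinders = 4 -- keep only rows where the condition is true
const where = rows.filter(r => r.Cylinders === 4);

// ORDER BY Name ASC -- sort by a column, here alphabetically ascending
const ordered = [...rows].sort((a, b) => a.Name.localeCompare(b.Name));

// GROUP BY Origin -- bucket rows that share the same attribute value
const grouped = rows.reduce((buckets, r) => {
  (buckets[r.Origin] = buckets[r.Origin] || []).push(r);
  return buckets;
}, {});

console.log(where.length, ordered[0].Name, Object.keys(grouped));
```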

Aggregations

  • COUNT() – as we can see from the sample query on GitHub, the COUNT() function takes the name of a column — in this case, Miles_per_Gallon — as an argument and counts the number of rows where it is not NULL.
  • MIN() – the MIN() function takes the name of a column as an argument — Horsepower, in our sample query — and returns the smallest value in that column.
  • SHOW – a command (rather than an aggregate function) that returns the indices and mappings matching an index name pattern. Note that we can use either * or % as wildcards. A sample query and result for the SHOW command can be found here on GitHub.
  • AVG() – this is an aggregate function that returns the average value for a numeric column. In our sample query, that is Weight_in_lbs.
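To see what the numeric aggregates compute, here is a local sketch on made-up rows shaped like the cars documents (the values are invented for illustration):

```javascript
// Made-up rows shaped like the cars documents. Note that COUNT()
// skips NULL values, which the filter below mirrors.
const rows = [
  { Horsepower: 48, Weight_in_lbs: 1978, Miles_per_Gallon: null },
  { Horsepower: 70, Weight_in_lbs: 1955, Miles_per_Gallon: 26 },
  { Horsepower: 65, Weight_in_lbs: 2380, Miles_per_Gallon: 29.9 },
];

// COUNT(Miles_per_Gallon) -- rows where the column is not NULL
const count = rows.filter(r => r.Miles_per_Gallon != null).length;

// MIN(Horsepower) -- smallest value in the column
const min = Math.min(...rows.map(r => r.Horsepower));

// AVG(Weight_in_lbs) -- arithmetic mean of the column
const avg = rows.reduce((sum, r) => sum + r.Weight_in_lbs, 0) / rows.length;

console.log(count, min, avg);
```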

Functions

  • DATE_FORMAT() – the DATE_FORMAT() function takes a timestamp column and the format we intend to apply as arguments, and returns the formatted values for that column. A sample query and result for the DATE_FORMAT function can be found in the GitHub repo here.
  • FLOOR() – this function rounds the value of a specified numeric column down to the nearest whole number. Note that it can be combined with other functions. In our sample query and result, we combine it with AS.
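A quick local sanity check of both behaviors (the JavaScript equivalents here are analogies for illustration, not the plugin’s implementation):

```javascript
// FLOOR() rounds a numeric value down, which Math.floor mirrors:
const floored = Math.floor(20.7);

// DATE_FORMAT(Year, '%Y') would return just the year portion of a
// timestamp column; locally, extracting it from a cars-style date:
const year = new Date('1971-01-01').getUTCFullYear();

console.log(floored, year); // 20 1971
```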

Conditions

  • BETWEEN – this is used to filter the data to a certain range. Note that the bounds can be numbers, text, or dates. In our sample query, we are making use of numbers.
  • COUNT(DISTINCT … ) – this is used for selecting only unique values of a column. A sample query and result for this command can be found in the GitHub repo.
  • IN – this clause returns results based on multiple values in a common column. A sample query and result for the IN clause can be found in the GitHub repo here.
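The three conditions can be illustrated locally on made-up column values (invented for the example, not the repo samples):

```javascript
// Made-up column values to show what each condition selects.
const cylinders = [3, 4, 5, 6, 8];
const origins = ['USA', 'Europe', 'Japan', 'USA'];

// WHERE Cylinders BETWEEN 4 AND 6 -- inclusive on both ends
const between = cylinders.filter(c => c >= 4 && c <= 6);

// COUNT(DISTINCT Origin) -- only unique values of the column
const distinctCount = new Set(origins).size;

// WHERE Origin IN ('USA', 'Europe') -- matches any of the listed values
const matched = origins.filter(o => ['USA', 'Europe'].includes(o));

console.log(between, distinctCount, matched.length);
```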

Features and supported protocols

On the protocol side, the SQL plugin provides multiple response formats for different purposes, while the request format is the same for all. Among them, the JDBC format is widely used because it provides schema information and additional functionality, such as pagination. Besides the JDBC driver, various clients can benefit from the detailed and well-formatted response.
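The response format is typically selected via a format URL parameter on the SQL endpoint. The csv value below is an assumption, so check the protocol section of the plugin docs for the formats your version supports. A tiny helper to build such a URL:

```javascript
// Sketch: build the SQL endpoint URL with an optional response format.
// The "format" values (e.g. csv) are assumptions to verify against
// your plugin version's protocol documentation.
function sqlUrl(host, format) {
  const base = `http://${host}/_opendistro/_sql`;
  return format ? `${base}?format=${encodeURIComponent(format)}` : base;
}

console.log(sqlUrl('localhost:9200', 'csv'));
```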

For example, the body of an HTTP POST request can take a few other fields alongside our SQL query:

curl -H 'Content-Type: application/json' -X POST localhost:9200/_opendistro/_sql -d '{"query" : "SELECT Origin, Horsepower, Miles_per_Gallon FROM cars LIMIT 5","filter" : {"range" :{"Weight_in_lbs" : {"lt" : 4000}}}}'

The result:

{
        "took": 6,
        "timed_out": false,
        "_shards": {
                "total": 1,
                "successful": 1,
                "skipped": 0,
                "failed": 0
        },
        "hits": {
                "total": {
                        "value": 339,
                        "relation": "eq"
                },
                "max_score": 0.0,
                "hits": [{
                        "_index": "cars",
                        "_type": "_doc",
                        "_id": "182",
                        "_score": 0.0,
                        "_source": {
                                "Origin": "Europe",
                                "Horsepower": 70,
                                "Miles_per_Gallon": 29
                        }
                }, {
                        "_index": "cars",
                        "_type": "_doc",
                        "_id": "128",
                        "_score": 0.0,
                        "_source": {
                                "Origin": "USA",
                                "Horsepower": 150,
                                "Miles_per_Gallon": 15
                        }
                }, {
                        "_index": "cars",
                        "_type": "_doc",
                        "_id": "156",
                        "_score": 0.0,
                        "_source": {
                                "Origin": "Japan",
                                "Horsepower": 97,
                                "Miles_per_Gallon": 24
                        }
                }, {
                        "_index": "cars",
                        "_type": "_doc",
                        "_id": "170",
                        "_score": 0.0,
                        "_source": {
                                "Origin": "USA",
                                "Horsepower": 95,
                                "Miles_per_Gallon": 18
                        }
                }, {
                        "_index": "cars",
                        "_type": "_doc",
                        "_id": "57",
                        "_score": 0.0,
                        "_source": {
                                "Origin": "Europe",
                                "Horsepower": 90,
                                "Miles_per_Gallon": 28
                        }
                }]
        }
}

Note: We can make use of the filter field to add more conditions to the ES DSL directly. A sample query and result for this extra-field feature can be found in the GitHub repo here.

Exploring additional plugin settings and options

When Elasticsearch bootstraps, the SQL plugin registers a few settings in the ES cluster settings. Most of these settings can be changed dynamically, which gives us some control over the behavior of our installed plugin without touching our cluster configurations.

They include:

1.) opendistro.sql.enabled – we can enable or disable the SQL plugin; when disabled, it rejects all incoming requests. Note the following:

  • The default value is true
  • This setting is node scope
  • This setting can be updated dynamically

We can update the setting with a new value like this:

curl -H 'Content-Type: application/json' -X PUT localhost:9200/_cluster/settings -d '{"transient" : {"opendistro.sql.enabled" : false}}'

Here’s the result:

{
        "acknowledged": true,
        "persistent": {},
        "transient": {
                "opendistro": {
                        "sql": {
                                "enabled": "false"
                        }
                }
        }
}

2.) opendistro.sql.query.analysis.enabled – we can also enable or disable the query analyzer; disabling it bypasses strict syntactic and semantic analysis. Note the following:

  • The default value is true
  • This setting is node scope
  • This setting can be updated dynamically

We can update this setting with a new value like this:

curl -H 'Content-Type: application/json' -X PUT localhost:9200/_cluster/settings -d '{"transient" : {"opendistro.sql.query.analysis.enabled" : false}}'

Result:

{
        "acknowledged": true,
        "persistent": {},
        "transient": {
                "opendistro": {
                        "sql": {
                                "query": {
                                        "analysis": {
                                                "enabled": "false"
                                        }
                                }
                        }
                }
        }
}

Plugin monitoring

By making use of the /stats endpoint, we can collect metrics for our plugin within a particular time duration. Note that only node-level statistics are currently supported; in essence, we only get the metrics for the node we are able to access. Let’s see an example below:

curl -H 'Content-Type: application/json' -X GET localhost:9200/_opendistro/_sql/stats

Result:

{
        "failed_request_count_cb": 0,  
        "failed_request_count_cuserr": 0,
        "circuit_breaker": 0,
        "request_total": 12,
        "request_count": 0,
        "failed_request_count_syserr": 0
}

Conclusion

From our earlier tutorial, we have learned that Elasticsearch is a fully distributed search and analytics engine. It’s also schemaless since it offers support for both structured and unstructured data. With Open Distro, it is now possible for users who are already comfortable with SQL to interact with an Elasticsearch cluster and to perform similar operations as they would with the DSL.

In this tutorial, we have mainly covered the SQL feature, which affords an easy way to integrate with other SQL-compliant systems. But this is just the tip of the iceberg; to learn about other capabilities of Open Distro for ES not covered here, do not hesitate to check out the awesome documentation.

Finally, the repository for this tutorial is available here on GitHub.
