Open Distro is an open-source, enterprise-grade Elasticsearch distribution with many advanced features, including robust security, alerting, integrated event monitoring, performance analysis, and more. It also offers developers who are already familiar with SQL an alternative way to hit the ground running writing Elasticsearch queries.
Open Distro is an Apache 2.0-licensed extension of the open-source version of Elasticsearch (ES) and Kibana. It is currently developed and maintained by AWS as part of its open-source effort, born of the need to keep new and advanced ES features available as truly open-source software.
Open Distro allows you to interact with a cluster using SQL and to perform operations for which you would normally have to use the Elasticsearch domain-specific language (DSL). This feature was initially based on the elasticsearch-sql plugin, which is now deprecated and being gradually phased out.
In this tutorial, we are going to focus on how to query Elasticsearch data using SQL with the help of this open-source tool.
To easily follow along with this tutorial, it is advisable to have basic knowledge of Elasticsearch and of writing simple ES queries with the DSL. This is important so that we have a reference point when we write similar queries with SQL.
However, if you are already familiar with basic SQL commands, then this should not be too difficult to follow. For a primer on working with Elasticsearch, we can check out this awesome tutorial on Understanding Elasticsearch query builder in Node.js.
Since we are interested in Open Distro and its SQL feature, we are going to install the Open Distro SQL plugin as a standalone plugin on our machine, the same way we would install and use any other ES plugin on a compatible cluster.
Note: Open Distro for ES and Kibana is also available as a Docker image, among other installation options. More details in the next section.
In order to get started with Open Distro for Elasticsearch, let's begin by installing and configuring it for development. It's important to note, though, that there are a variety of options available for installing and getting started with Open Distro.
As we mentioned earlier, we are going to be installing the standalone plugin and developing on our machine. Note that if we already use AWS and want this feature, we can make use of the cloud-hosted version offered as a service.
Note: For more information on other available options for installing and configuring this tool, we can check this section on installation and configuration in the documentation.
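For instance, if we would rather use containers, the docs describe a Docker image that bundles the full distribution. A minimal single-node sketch might look like the below (the image name is as published on Docker Hub; the 1.1.0 tag is an assumption chosen to match the plugin version we use later):

# run a single-node Open Distro for Elasticsearch cluster in Docker
docker run -p 9200:9200 -p 9600:9600 -e "discovery.type=single-node" amazon/opendistro-for-elasticsearch:1.1.0

Note that this ships the whole distribution (security, alerting, SQL, and so on), whereas in this tutorial we install only the standalone SQL plugin.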
Let’s make sure we have a compatible ES version on our machine. Details about plugin compatibility with ES versions can be found in the documentation.
Here, we will be using Elasticsearch version 7.1.1 and Open Distro SQL plugin version 1.1.0.0, since the SQL feature is all we are interested in for now. To install the plugin, we can run the following command in our terminal/command prompt:
sudo bin/elasticsearch-plugin install https://d3g5vo6xdbdb9a.cloudfront.net/downloads/elasticsearch-plugins/opendistro-sql/opendistro_sql-1.1.0.0.zip
Before running this command, we should ensure that we have Elasticsearch version 7.1.1 installed on our machine. For macOS, we can run the following command to download the gzipped archive:
curl -L -O https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.1.1-darwin-x86_64.tar.gz
After the download is complete, we can run this command to extract the archive:
tar -xzvf elasticsearch-7.1.1-darwin-x86_64.tar.gz
Then we can navigate into the folder to test our installation. We can run the commands below:
cd elasticsearch-7.1.1   # navigate into the extracted Elasticsearch distribution
./bin/elasticsearch      # test the installation by starting Elasticsearch
ls                       # list all files and folders inside the distribution
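With Elasticsearch running, we can quickly confirm that the cluster is reachable and that we are on the expected version before installing the plugin. A minimal check (the version.number field in the response should read 7.1.1):

# ping the local cluster; the JSON response includes the version number
curl -X GET "http://localhost:9200/?pretty"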
The output after the plugin installation should be similar to the below:
-> Downloading https://d3g5vo6xdbdb9a.cloudfront.net/downloads/elasticsearch-plugins/opendistro-sql/opendistro_sql-1.1.0.0.zip
[=================================================] 100%
-> Installed opendistro_sql
Note: We can confirm our plugin installation is successful by running this command in the ES folder:
sudo bin/elasticsearch-plugin list
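If the installation succeeded, the listing should include the SQL plugin's name, along these lines:

opendistro_sql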
Now that we are done setting up open_distro_sql for development, let's look at how we can ingest data into our ES cluster with a simple Node.js script. After that, we can start using SQL statements to write our queries.
From our earlier tutorial on working with Elasticsearch, we can take a look at the section on how to ingest data into our ES cluster with Node.js. For reference, here is a similar script that does the same thing:
'use strict'

const fs = require('fs'); // Node.js file system module
require('dotenv').config(); // for environment variables
const { Client } = require('@elastic/elasticsearch'); // ES client
const client = new Client({ node: 'http://localhost:9200' }); // ES client connection

// import JSON from a file to ingest into our cluster
const data = JSON.parse(fs.readFileSync(__dirname + '/demo.json'));
const index = process.env.ES_INDEX; // ES index

const writeDataToES = async (index, data) => {
  for (let i = 0; i < data.length; i++) {
    try {
      const res = await client.create({
        refresh: true,
        index: index,
        id: i,
        body: data[i]
      });
      console.log("Successfully imported data", res);
    } catch (error) {
      console.error("Failed to import data to ES", error);
      return;
    }
  }
};

writeDataToES(index, data);

// to run the script, create a new index.js file and copy the above code snippet
// then run <node index.js> from the terminal
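Since the script reads the index name from the environment via dotenv, we also need a .env file next to index.js. A minimal sketch (cars is the index name assumed by the queries later in this tutorial):

# .env file, loaded by require('dotenv').config()
ES_INDEX=cars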
Note that we can give the ES index any name we choose. Also, make sure we have installed all the imported dependencies for the script to run successfully.
After running the script with the above command, we can confirm that the data has been successfully written to the cluster by visiting this URL in the browser or querying it with curl:
curl -X GET "http://127.0.0.1:9200/index_name/_search?pretty"
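To double-check how many documents made it in, we can also hit the standard _count API (a quick sketch, assuming the index is named cars):

# count the documents in the cars index
curl -X GET "http://127.0.0.1:9200/cars/_count?pretty"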
Note: In this tutorial, our queries will be run against this same JSON data, available here.
As reviewed earlier, Open Distro for Elasticsearch SQL lets us write queries in SQL rather than the DSL. In order to use this feature, all we need to do is send our requests to the http://<host:port>/_opendistro/_sql URI or endpoint.
While calls to this endpoint support sending the query either as a request parameter or in the request body, the latter is usually the recommended approach for making calls to this URI.
Note: POST requests are the recommended approach because there is no limitation on content length. Also, other parameters can be passed to the plugin when making a call for extra functionality, like prepared statements, for example.
Additionally, the explain endpoint is often used for query translation and troubleshooting. The endpoints include:
GET – here, we can send regular HTTP GET requests with our query embedded in a URL parameter (curl's --data-urlencode flag handles the URL encoding of the query for us). An example is shown below:
curl -H 'Content-Type: application/json' -G "localhost:9200/_opendistro/_sql" --data-urlencode "sql=SELECT * FROM cars"
POST – we can also send an HTTP POST request with our query in the request body, as shown below:
curl -H 'Content-Type: application/json' -X POST localhost:9200/_opendistro/_sql -d '{"query" : "SELECT * FROM cars"}'
explain – to translate a query, we can make use of the explain endpoint. The output is usually the ES DSL in JSON format.
Note: We can also copy and paste the output into the console to run it against our ES cluster. Let's see an example below:
curl -H 'Content-Type: application/json' -X POST localhost:9200/_opendistro/_sql/_explain -d '{"query" : "SELECT Origin,Horsepower FROM cars WHERE Cylinders > 4 LIMIT 5"}'
The result:
{ "from": 0, "size": 5, "query": { "bool": { "filter": [{ "bool": { "must": [{ "range": { "Cylinders": { "from": 4, "to": null, "include_lower": false, "include_upper": true, "boost": 1.0 } } }], "adjust_pure_negative": true, "boost": 1.0 } }], "adjust_pure_negative": true, "boost": 1.0 } }, "_source": { "includes": ["Origin", "Horsepower"], "excludes": [] } }
As an aside, note that the features or commands supported by native SQL queries are also supported by Open Distro for SQL. They include SQL Select, Delete, Where, Order By, Group By, Inner Join, Show, and so on.
For example, using the WHERE conditional statement is as easy as this:
curl -H 'Content-Type: application/json' -X POST localhost:9200/_opendistro/_sql -d '{"query": "SELECT * FROM cars WHERE Acceleration >= 20 LIMIT 5"}'
Note: The WHERE clause indicates we want to filter our data to include only rows where the specified or given conditions are true.
Here’s the result of the query above:
{ "took": 3, "timed_out": false, "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 }, "hits": { "total": { "value": 8, "relation": "eq" }, "max_score": 0.0, "hits": [{ "_index": "cars", "_type": "_doc", "_id": "39", "_score": 0.0, "_source": { "Name": "volkswagen super beetle 117", "Miles_per_Gallon": null, "Cylinders": 4, "Displacement": 97, "Horsepower": 48, "Weight_in_lbs": 1978, "Acceleration": 20, "Year": "1971-01-01", "Origin": "Europe" } }, { "_index": "cars", "_type": "_doc", "_id": "382", "_score": 0.0, "_source": { "Name": "amc concord dl", "Miles_per_Gallon": 23, "Cylinders": 4, "Displacement": 151, "Horsepower": null, "Weight_in_lbs": 3035, "Acceleration": 20.5, "Year": "1982-01-01", "Origin": "USA" } }, { "_index": "cars", "_type": "_doc", "_id": "63", "_score": 0.0, "_source": { "Name": "plymouth cricket", "Miles_per_Gallon": 26, "Cylinders": 4, "Displacement": 91, "Horsepower": 70, "Weight_in_lbs": 1955, "Acceleration": 20.5, "Year": "1971-01-01", "Origin": "USA" } }, { "_index": "cars", "_type": "_doc", "_id": "359", "_score": 0.0, "_source": { "Name": "ford escort 2h", "Miles_per_Gallon": 29.9, "Cylinders": 4, "Displacement": 98, "Horsepower": 65, "Weight_in_lbs": 2380, "Acceleration": 20.7, "Year": "1982-01-01", "Origin": "USA" } }, { "_index": "cars", "_type": "_doc", "_id": "25", "_score": 0.0, "_source": { "Name": "volkswagen 1131 deluxe sedan", "Miles_per_Gallon": 26, "Cylinders": 4, "Displacement": 97, "Horsepower": 46, "Weight_in_lbs": 1835, "Acceleration": 20.5, "Year": "1970-01-01", "Origin": "Europe" } }] } }
Note: In this tutorial, we make use of the curl command line to make all our requests. We can install curl on macOS with Homebrew by running brew install curl.
Just as with the DSL, we can also make use of the SQL feature for aggregations, queries, joins, show, and many more.
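For instance, a simple aggregation that groups the cars by their number of cylinders and counts each bucket could be written like this (a sketch in the same spirit as the samples in the GitHub repo):

# group rows by the Cylinders column and count the cars in each group
curl -H 'Content-Type: application/json' -X POST localhost:9200/_opendistro/_sql -d '{"query" : "SELECT Cylinders, COUNT(Name) FROM cars GROUP BY Cylinders"}'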
WHERE – this clause filters our dataset to return only rows where a given conditional statement is true. A sample query and result for the WHERE clause can be found here on the GitHub repo.

ORDER BY – this clause allows us to sort our data by a particular column (in this case, we are using the _id column) numerically or alphabetically. Here, we are doing so numerically and in ascending order. A sample query and result for the ORDER BY clause can be found here on the GitHub repo.

GROUP BY – this clause is used with aggregate functions, e.g., the range function, so as to gather rows with similar attributes into a single bucket (as in the sketch above). A sample query and result for the GROUP BY clause can be found here on the GitHub repo.

COUNT() – as we can see from the sample query on GitHub, the COUNT() function takes the name of a column (in this case, Miles_per_Gallon) as an argument and counts the number of rows where it is not NULL.

MIN() – the MIN() function takes the name of a column as an argument (Horsepower, in our sample query) and returns the smallest value in that column.

SHOW – a command that returns the indices and mappings that match the pattern of an index name. Note that we can use either * or % as wildcards. A sample query and result for the SHOW command can be found here on GitHub.

AVG() – this is an aggregate function that returns the average value for a numeric column. In our sample query, that is Weight_in_lbs.

DATE_FORMAT() – the DATE_FORMAT function takes a timestamp column as an argument, along with the format we intend to apply, and returns the formatted response for that column. A sample query and result for the DATE_FORMAT function can be found in the GitHub repo here.

FLOOR() – this function rounds a value in a specified numeric column down to the nearest whole number. Note that it can be combined with other aggregate functions. In our sample query and result, we combine it with AS.

BETWEEN – this is used to filter out data within a certain range. Note that the conditions can be numbers, text, or dates. In our sample query, we are making use of numbers.

COUNT(DISTINCT …) – this is used to select only unique values of a column. A sample query and result for this command can be found in the GitHub repo.

IN – this clause returns results based on multiple values in a common column. A sample query and result for the IN clause can be found in the GitHub repo here.

As for the protocol, the SQL plugin provides multiple response formats for different purposes, while the request format is the same for all. Among them, the JDBC format is widely used because it provides schema information and additional functionality, such as pagination. Besides the JDBC driver, various clients can benefit from the detailed and well-formatted response.
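To pick a response format, the plugin accepts a format parameter on the same endpoint. For example, requesting the JDBC-style response might look like this (the format=jdbc parameter is per the plugin's protocol documentation; treat the exact shape of the response as version-dependent):

# ask the SQL plugin for the JDBC response format (schema plus datarows)
curl -H 'Content-Type: application/json' -X POST "localhost:9200/_opendistro/_sql?format=jdbc" -d '{"query" : "SELECT Origin, Horsepower FROM cars LIMIT 5"}'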
For example, the body of an HTTP POST request can take a few other fields along with our SQL query:
curl -H 'Content-Type: application/json' -X POST localhost:9200/_opendistro/_sql -d '{"query" : "SELECT Origin, Horsepower, Miles_per_Gallon FROM cars LIMIT 5","filter" : {"range" :{"Weight_in_lbs" : {"lt" : 4000}}}}'
The result:
{ "took": 6, "timed_out": false, "_shards": { "total": 1, "successful": 1, "skipped": 0, "failed": 0 }, "hits": { "total": { "value": 339, "relation": "eq" }, "max_score": 0.0, "hits": [{ "_index": "cars", "_type": "_doc", "_id": "182", "_score": 0.0, "_source": { "Origin": "Europe", "Horsepower": 70, "Miles_per_Gallon": 29 } }, { "_index": "cars", "_type": "_doc", "_id": "128", "_score": 0.0, "_source": { "Origin": "USA", "Horsepower": 150, "Miles_per_Gallon": 15 } }, { "_index": "cars", "_type": "_doc", "_id": "156", "_score": 0.0, "_source": { "Origin": "Japan", "Horsepower": 97, "Miles_per_Gallon": 24 } }, { "_index": "cars", "_type": "_doc", "_id": "170", "_score": 0.0, "_source": { "Origin": "USA", "Horsepower": 95, "Miles_per_Gallon": 18 } }, { "_index": "cars", "_type": "_doc", "_id": "57", "_score": 0.0, "_source": { "Origin": "Europe", "Horsepower": 90, "Miles_per_Gallon": 28 } }] } }
Note: We can make use of the filter field to add more conditions to the ES DSL directly. A sample query and result for this extra field feature can be found in the GitHub repo here.
When Elasticsearch bootstraps, the SQL plugin registers a few settings in the ES cluster settings. Most of these settings can be changed dynamically, which gives us some control over the behavior of our installed plugin without the need to touch our cluster configurations.
They include:
1.) opendistro.sql.enabled – we can enable or disable the SQL plugin; when disabled, it rejects all incoming requests.
We can update the setting with a new value like this:
curl -H 'Content-Type: application/json' -X PUT localhost:9200/_cluster/settings -d '{"transient" : {"opendistro.sql.enabled" : false}}'
Here’s the result:
{ "acknowledged": true, "persistent": {}, "transient": { "opendistro": { "sql": { "enabled": "false" } } } }
2.) opendistro.sql.query.analysis.enabled – we can also enable or disable the query analyzer if we want to bypass strict syntactic and semantic analysis.
We can update this setting with a new value like this:
curl -H 'Content-Type: application/json' -X PUT localhost:9200/_cluster/settings -d '{"transient" : {"opendistro.sql.query.analysis.enabled" : false}}'
Result:
{ "acknowledged": true, "persistent": {}, "transient": { "opendistro": { "sql": { "query": { "analysis": { "enabled": "false" } } } } } }
By making use of the /stats endpoint, we can collect metrics for our plugin within a particular time duration. Note that only node-level statistics are currently supported; in essence, we only get the metrics for the node we can access. Let's see an example below:
curl -H 'Content-Type: application/json' -X GET localhost:9200/_opendistro/_sql/stats
Result:
{ "failed_request_count_cb": 0, "failed_request_count_cuserr": 0, "circuit_breaker": 0, "request_total": 12, "request_count": 0, "failed_request_count_syserr": 0 }
From our earlier tutorial, we have learned that Elasticsearch is a fully distributed search and analytics engine. It’s also schemaless since it offers support for both structured and unstructured data. With Open Distro, it is now possible for users who are already comfortable with SQL to interact with an Elasticsearch cluster and to perform similar operations as they would with the DSL.
In this tutorial, we have mainly covered the SQL feature, which affords an easy way to integrate with other SQL-compliant systems. But this is just the tip of the iceberg; to learn about other capabilities of Open Distro for ES not covered in this tutorial, do not hesitate to check out the awesome documentation.
Finally, the repository for this tutorial is available here on GitHub.