Alexander Nnakwue: Software engineer. React, Node.js, Python, and other developer tools and libraries.

A practical guide to working with Elasticdump


Introduction

Generally speaking, databases will have a mechanism for migrating, copying/backing up, or, better still, transferring stored data to either a different database or to a file in supported formats. As its name implies, Elasticdump is a tool for importing and exporting data stored in an Elasticsearch index or cluster.

Therefore, for cases where we intend to manage data transfer between Elasticsearch (ES) indices, Elasticdump is an awesome tool for the job. It works by sending an input to an output, thereby allowing us to export saved data from one ES server, acting as the source, directly to another, acting as the destination.

Additionally, it allows us to export a group of datasets (as well as the mappings) from an ES index/cluster to a file in JSON format, or even gzipped. It also supports exporting multiple indices at the same time to a supported destination.

Getting started with Elasticdump

With Elasticdump, we can export indices into/out of JSON files, or from one cluster to another. In this article, we are going to explore how to use this awesome tool to do just that — to serve as a point of reference for those who intend to do this (and also for my future self).

As an exercise, we will create an Elasticsearch index with some dummy data, then export the same index to JSON. Also, we will show how to move or dump some dummy data from one ES server/cluster to another.

Note: Elasticdump is open-source (Apache-2.0 licensed) and actively maintained. In recent versions, performance updates on the “dump/upload” algorithm have resulted in increased parallel processing speed. This change, however, comes at a cost, as records or datasets are no longer processed in a sequential order.

Prerequisites

To follow along with this tutorial, it is advisable to have a basic knowledge of how Elasticsearch works. Also, readers should be sure to have Elasticsearch installed locally on their machines. Instructions to do so can be found here.

Alternatively, we can choose to make use of a cloud-hosted Elasticsearch provider. To learn about how to set it up, we can reference this earlier article on working with Elasticsearch.

It should be noted that whichever method we choose for interacting with our Elasticsearch cluster, Elasticdump works the same way against a local development environment and a cloud-hosted version.

Installation

To begin with, we should have Elasticdump installed on our local machines since we intend to work with it locally. Here, we can either install it per project or globally. To do so globally, we can run the following command:

npm install elasticdump -g

On a per-project basis, we can run:

npm install elasticdump --save

Note: There are other available means of installing and running this tool, such as via Docker, and also via the non-standard install.
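
For instance, running the tool via Docker might look like the minimal sketch below. The image name elasticdump/elasticsearch-dump is the one listed in the project README at the time of writing, and host.docker.internal (the Docker Desktop alias for the host machine) plus the mounted ./data directory are assumptions, so adjust them for your setup:

# Sketch: dump sample_index to ./data/sample_index.json on the host via the mounted volume
docker run --rm -ti -v "$(pwd)/data:/tmp" elasticdump/elasticsearch-dump \
  --input=http://host.docker.internal:9200/sample_index \
  --output=/tmp/sample_index.json \
  --type=data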

Usage of Elasticdump

The usage of this tool is shown below:

elasticdump --input SOURCE --output DESTINATION [OPTIONS]

As we can see from the command above, we have both an input source and an output destination. The options property is used to specify extra parameters needed for the command to run.

Additionally, as we have also mentioned previously, Elasticdump works by sending an input to an output, where each of the input and the output can be either an Elasticsearch URL or a file.

The format for an Elasticsearch URL is shown below:

{protocol}://{host}:{port}/{index}

For example, this is a valid Elasticsearch URL:

http://localhost:9200/sample_index

Alternatively, an example file format is shown below:

/Users/retina/Desktop/sample_file.json

Then, we can use the Elasticdump tool as shown below to transfer a backup of the data in our sample index to a file:

elasticdump \
    --input=http://localhost:9200/sample_index \
    --output=/Users/retina/Desktop/sample_file.json \
    --type=data

As we can see from the command above, we are making use of the elasticdump command with the appropriate option flags specifying the --input and --output sources. We are specifying the type with a --type options flag as well. We can also run the same command for our mappings or schema, too:

elasticdump \
    --input=http://localhost:9200/sample_index \
    --output=/Users/retina/Desktop/sample_mapping.json \
    --type=mapping

The command above copies the mapping of the index at the input Elasticsearch URL to an output file, sample_mapping.json. We can run other commands, too. To transfer data from one Elasticsearch server/cluster to another, for example, we can run the following commands:

elasticdump \
  --input=http://sample-es-url/sample_index \
  --output=http://localhost:9200/sample_index \
  --type=analyzer


elasticdump \
  --input=http://sample-es-url/sample_index \
  --output=http://localhost:9200/sample_index \
  --type=mapping


elasticdump \
  --input=http://sample-es-url/sample_index \
  --output=http://localhost:9200/sample_index \
  --type=data

The above commands would copy the analyzer, mappings, and data of the specified index. Note that we can also run other commands, which include:

  • gzip data in an ES index and perform a backup to a suitable destination
  • Back up the results of an Elasticsearch query to a file
  • Import data from an S3 bucket into Elasticsearch, making use of the S3 bucket URL. Note that we can also export data from an ES cluster to an S3 bucket via the URL
  • Back up aliases and templates to a file and import same to Elasticsearch
  • Split files into multiple parts based on the --fileSize options flag, and so on

More details about the signature for the above operations and other operations we can run with the help of Elasticdump can be found in the readme file on GitHub.
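
For instance, backing up aliases or producing a gzipped data dump (two of the operations listed above) might look like the minimal sketches below; they assume the --type=alias and --fsCompress options documented in the README, and the index name and file paths are placeholders:

# Sketch: back up the aliases of sample_index to a file
elasticdump \
  --input=http://localhost:9200/sample_index \
  --output=/Users/retina/Desktop/sample_alias.json \
  --type=alias

# Sketch: export data as a gzipped file
elasticdump \
  --input=http://localhost:9200/sample_index \
  --output=/Users/retina/Desktop/sample_index.json.gz \
  --fsCompress \
  --type=data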

Note: For cases where we need to create a dump with basic authentication, we can either add basic auth on the URL or we can use a file that contains the auth credentials. More details can be found in this wiki.
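
A minimal sketch of the URL-based approach is shown below, with the username and password as placeholders; the file-based alternative described in the wiki passes the credentials through a separate options flag instead of embedding them in the URL:

# Sketch: basic auth embedded directly in the input URL
elasticdump \
  --input=http://elastic:<password>@localhost:9200/sample_index \
  --output=/Users/retina/Desktop/sample_file.json \
  --type=data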

Notes on the options parameters

For the options parameter we pass to the dump command, only the --input and --output flags are required. The reason for this is obvious: we need a source for the data we are trying to migrate and also a destination. Other options include:

  • --input-index – we can pass the source index and type (default: all)
  • --output-index – we can pass the destination index and type (default: all)
  • --overwrite – we can pass this optional flag to overwrite the output file if it exists (default: false)
  • --limit – we can also pass a limit flag to specify the number of objects we intend to move in batches per operation (default: 100)
  • --size – we can also pass this flag to specify how many objects to retrieve (default: -1, meaning no limit)
  • --debug – we can use this flag to display the Elasticsearch command being used (default: false)
  • --searchBody – this flag helps us perform a partial extract based on search results. Note that we can only use this flag when Elasticsearch is our input data source
  • --transform – this flag is useful when we intend to modify documents on the fly before writing them to our destination. Details about this tool’s internals can be found here
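
As a quick illustration of a few of these flags in combination, the hedged sketch below dumps an index in batches of 500, overwrites any existing output file, and applies an inline transform; the transform expression follows the doc._source style shown in the README, and the category field is purely hypothetical:

# Sketch: batch size of 500, overwrite the output file, and tag each document on the fly
elasticdump \
  --input=http://localhost:9200/sample_index \
  --output=/Users/retina/Desktop/sample_file.json \
  --type=data \
  --limit=500 \
  --overwrite \
  --transform="doc._source.category = 'dumped'"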

Details about other flags we can pass as options to the elasticdump command, including --headers, --params, --ignore-errors, --timeout, --awsUrlRegex, and so on, can be found here in the docs.

Version improvements worthy of note

  • Because Elasticdump relies on Elasticsearch, this tool requires Elasticsearch version 1.0.0 or higher
  • Elasticdump has dropped support for Node v8. Node ≥v10 is now required for the tool to work properly
  • Elasticdump now supports specifying a comma-separated list of fields that should be checked for bigint
  • As earlier mentioned, there is also an upgrade in the dump algorithm to make it process datasets in parallel, leading to improved performance.

More details about version changes can be found in this section of the readme document. Also, for gotchas or things to note while using this tool, we can reference this section of the same document.

Using Elasticdump with real-world data

In this section, we are going to demo how to use this tool to dump data from one index to another, and also to a file. To do so, we would need two separate ES clusters. We will be following the steps outlined in this tutorial to provision a cloud-hosted version of Elasticsearch.

Note that to copy or write sample data to our ES cluster or index, we can reference the script from the earlier article linked in the paragraph above. Also, the sample data can be found here.
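
If you just want a quick smoke test before running that script, a single document can also be indexed by hand. This is a minimal sketch, assuming Elasticsearch 7.x running locally; the field values are made up and only mirror the shape of the cars sample dataset:

# Sketch: index one hand-written document into the cars index
curl -X POST "http://localhost:9200/cars/_doc?pretty" \
  -H 'Content-Type: application/json' \
  -d '{"Name": "sample car", "Horsepower": 250, "Origin": "USA"}'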

Steps

  1. Since we are developing locally, we should ensure that our ES cluster is up and running
  2. After that, we can run the elasticdump command on the CLI
  3. Here, we have installed elasticdump globally by running npm install elasticdump -g
  4. When we are done with the setup, the result of running elasticdump on the terminal should be:
    Mon, 17 Aug 2020 22:39:24 GMT | Error Emitted => {"errors":["input is a required input","output is a required input"]}

Of course, the reason for this is that we have not included the required input and output fields as mentioned earlier. We can include them by running the following command:

elasticdump \
  --input=http://localhost:9200/cars \
  --output=/Users/retina/Desktop/my_index_mapping.json \
  --type=mapping

elasticdump \
  --input=http://localhost:9200/cars \
  --output=/Users/retina/Desktop/my_index.json \
  --type=data

This copies or dumps the data from our local ES cluster to a file in JSON format. Note that the file is created automatically at the specified path if it does not already exist, and the data is then written to it. The result of running the command is shown below:

Mon, 17 Aug 2020 22:42:59 GMT | starting dump
Mon, 17 Aug 2020 22:43:00 GMT | got 1 objects from source elasticsearch (offset: 0)
Mon, 17 Aug 2020 22:43:00 GMT | sent 1 objects to destination file, wrote 1
Mon, 17 Aug 2020 22:43:00 GMT | got 0 objects from source elasticsearch (offset: 1)
Mon, 17 Aug 2020 22:43:00 GMT | Total Writes: 1
Mon, 17 Aug 2020 22:43:00 GMT | dump complete
Mon, 17 Aug 2020 22:43:01 GMT | starting dump
Mon, 17 Aug 2020 22:43:02 GMT | got 100 objects from source elasticsearch (offset: 0)
Mon, 17 Aug 2020 22:43:02 GMT | sent 100 objects to destination file, wrote 100
Mon, 17 Aug 2020 22:43:02 GMT | got 100 objects from source elasticsearch (offset: 100)
Mon, 17 Aug 2020 22:43:02 GMT | sent 100 objects to destination file, wrote 100
Mon, 17 Aug 2020 22:43:02 GMT | got 100 objects from source elasticsearch (offset: 200)
Mon, 17 Aug 2020 22:43:02 GMT | sent 100 objects to destination file, wrote 100
Mon, 17 Aug 2020 22:43:02 GMT | got 100 objects from source elasticsearch (offset: 300)
Mon, 17 Aug 2020 22:43:02 GMT | sent 100 objects to destination file, wrote 100
Mon, 17 Aug 2020 22:43:02 GMT | got 6 objects from source elasticsearch (offset: 400)
Mon, 17 Aug 2020 22:43:02 GMT | sent 6 objects to destination file, wrote 6
Mon, 17 Aug 2020 22:43:02 GMT | got 0 objects from source elasticsearch (offset: 406)
Mon, 17 Aug 2020 22:43:02 GMT | Total Writes: 406
Mon, 17 Aug 2020 22:43:02 GMT | dump complete

Writing that dump creates the JSON files on the specified paths. In this case, the files were created on my desktop.

Screenshots: the my_index_mapping.json and my_index.json files created on the desktop.

Note: As we can see from the above, the file format generated by the dump tool is not valid JSON; however, each line is valid. As is, the dump file is a line-delimited JSON file. Note that this is done so that dump files can be streamed and appended easily.
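
Because the input and output are interchangeable, that same line-delimited file can later be loaded back into a cluster. A minimal sketch, using a hypothetical target index named cars_restored:

# Sketch: restore the dumped data from the file into a new local index
elasticdump \
  --input=/Users/retina/Desktop/my_index.json \
  --output=http://localhost:9200/cars_restored \
  --type=data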

Now, let’s attempt to back up data from our local ES cluster to a cluster recently provisioned on Elastic Cloud. Here, we specify the input as our local Elasticsearch and the destination or output as our Elastic cluster in the cloud (the password and cluster host in the commands below are shown as placeholders):

elasticdump \
  --input=http://localhost:9200/cars \
  --output=https://elastic:<password>@<deployment-id>.<region>.aws.cloud.es.io:9243/cars \
  --type=analyzer

elasticdump \
  --input=http://localhost:9200/cars \
  --output=https://elastic:<password>@<deployment-id>.<region>.aws.cloud.es.io:9243/cars \
  --type=mapping

elasticdump \
  --input=http://localhost:9200/cars \
  --output=https://elastic:<password>@<deployment-id>.<region>.aws.cloud.es.io:9243/cars \
  --type=data

The output is shown below:

Mon, 17 Aug 2020 23:10:26 GMT | starting dump
Mon, 17 Aug 2020 23:10:26 GMT | got 1 objects from source elasticsearch (offset: 0)
Mon, 17 Aug 2020 23:10:34 GMT | sent 1 objects to destination elasticsearch, wrote 1
Mon, 17 Aug 2020 23:10:34 GMT | got 0 objects from source elasticsearch (offset: 1)
Mon, 17 Aug 2020 23:10:34 GMT | Total Writes: 1
Mon, 17 Aug 2020 23:10:34 GMT | dump complete
Mon, 17 Aug 2020 23:10:35 GMT | starting dump
Mon, 17 Aug 2020 23:10:35 GMT | got 1 objects from source elasticsearch (offset: 0)
Mon, 17 Aug 2020 23:10:38 GMT | sent 1 objects to destination elasticsearch, wrote 1
Mon, 17 Aug 2020 23:10:38 GMT | got 0 objects from source elasticsearch (offset: 1)
Mon, 17 Aug 2020 23:10:38 GMT | Total Writes: 1
Mon, 17 Aug 2020 23:10:38 GMT | dump complete
Mon, 17 Aug 2020 23:10:38 GMT | starting dump
Mon, 17 Aug 2020 23:10:38 GMT | got 100 objects from source elasticsearch (offset: 0)
Mon, 17 Aug 2020 23:10:42 GMT | sent 100 objects to destination elasticsearch, wrote 100
Mon, 17 Aug 2020 23:10:43 GMT | got 100 objects from source elasticsearch (offset: 100)
Mon, 17 Aug 2020 23:10:46 GMT | sent 100 objects to destination elasticsearch, wrote 100
Mon, 17 Aug 2020 23:10:46 GMT | got 100 objects from source elasticsearch (offset: 200)
Mon, 17 Aug 2020 23:10:49 GMT | sent 100 objects to destination elasticsearch, wrote 100
Mon, 17 Aug 2020 23:10:49 GMT | got 100 objects from source elasticsearch (offset: 300)
Mon, 17 Aug 2020 23:10:52 GMT | sent 100 objects to destination elasticsearch, wrote 100
Mon, 17 Aug 2020 23:10:52 GMT | got 6 objects from source elasticsearch (offset: 400)
Mon, 17 Aug 2020 23:10:54 GMT | sent 6 objects to destination elasticsearch, wrote 6
Mon, 17 Aug 2020 23:10:54 GMT | got 0 objects from source elasticsearch (offset: 406)
Mon, 17 Aug 2020 23:10:54 GMT | Total Writes: 406
Mon, 17 Aug 2020 23:10:54 GMT | dump complete

With the dump completed, we can now proceed to check that the index is available in the Elasticsearch service we had initially provisioned.

When we visit the API console on the cloud-hosted version and perform a GET request on the cars index, we get our index displayed with the correct number of records copied, as seen in the screenshots below.



Screenshots: the backup index and its mappings on the cloud-hosted version.
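
As an alternative to the console, the document count can also be verified from a terminal with the _count API; in the sketch below, the password and cluster host are placeholders for the cloud deployment used above:

# Sketch: confirm the number of documents copied to the cloud cluster
curl -u elastic:<password> "https://<deployment-id>.<region>.aws.cloud.es.io:9243/cars/_count?pretty"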

Next, let’s look at this example of backing up the result of a query to a file. The command is shown below:

elasticdump \
  --input=http://localhost:9200/cars \
  --output=/Users/retina/Desktop/query.json \
  --searchBody="{\"query\":{\"range\":{\"Horsepower\":{\"gte\":201,\"lte\":300}}}}"

The output of running the above command is shown below:

Mon, 17 Aug 2020 23:42:46 GMT | starting dump
Mon, 17 Aug 2020 23:42:47 GMT | got 10 objects from source elasticsearch (offset: 0)
Mon, 17 Aug 2020 23:42:47 GMT | sent 10 objects to destination file, wrote 10
Mon, 17 Aug 2020 23:42:47 GMT | got 0 objects from source elasticsearch (offset: 10)
Mon, 17 Aug 2020 23:42:47 GMT | Total Writes: 10
Mon, 17 Aug 2020 23:42:47 GMT | dump complete

If we check the contents of the file, we can see our query results copied to the file:

Screenshot: the query results copied to query.json.

To recap, we ran a range query on the Horsepower field that returns documents with values between 201 and 300 inclusive, and that is exactly what we got!
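
To cross-check, the same range query can be run directly against the local cluster with curl before (or after) dumping it. A minimal sketch, assuming the cars index used throughout this section:

# Sketch: run the same range query against Elasticsearch directly
curl -X GET "http://localhost:9200/cars/_search?pretty" \
  -H 'Content-Type: application/json' \
  -d '{"query": {"range": {"Horsepower": {"gte": 201, "lte": 300}}}}'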

Finally, our last example is about splitting the dump into multiple files while backing up. To do so, we run the command below:

elasticdump \
  --input=http://localhost:9200/cars \
  --output=/Users/retina/Desktop/my_index2.json \
  --fileSize=10kb

We will get the output shown below:

Tue, 18 Aug 2020 00:05:01 GMT | starting dump
Tue, 18 Aug 2020 00:05:01 GMT | got 100 objects from source elasticsearch (offset: 0)
Tue, 18 Aug 2020 00:05:02 GMT | sent 100 objects to destination file, wrote 100
Tue, 18 Aug 2020 00:05:02 GMT | got 100 objects from source elasticsearch (offset: 100)
Tue, 18 Aug 2020 00:05:02 GMT | sent 100 objects to destination file, wrote 100
Tue, 18 Aug 2020 00:05:02 GMT | got 100 objects from source elasticsearch (offset: 200)
Tue, 18 Aug 2020 00:05:02 GMT | sent 100 objects to destination file, wrote 100
Tue, 18 Aug 2020 00:05:02 GMT | got 100 objects from source elasticsearch (offset: 300)
Tue, 18 Aug 2020 00:05:02 GMT | sent 100 objects to destination file, wrote 100
Tue, 18 Aug 2020 00:05:02 GMT | got 6 objects from source elasticsearch (offset: 400)
Tue, 18 Aug 2020 00:05:02 GMT | sent 6 objects to destination file, wrote 6
Tue, 18 Aug 2020 00:05:02 GMT | got 0 objects from source elasticsearch (offset: 406)
Tue, 18 Aug 2020 00:05:02 GMT | Total Writes: 406
Tue, 18 Aug 2020 00:05:02 GMT | dump complete

If we check the output path specified, we will discover that the dump has been split into eight separate files. A sample screenshot is shown below:

Screenshots: the output of splitting our files into multiple parts.

Note that if we check the output files above, we will notice that they are numbered from 1 to 8.

Finally, it should be pointed out that native Elasticsearch comes with snapshot and restore modules that can also help us back up a running ES cluster.
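
As a rough sketch of that native approach (the repository name, snapshot name, and filesystem location below are hypothetical, and the location must be whitelisted via path.repo in elasticsearch.yml):

# Register a filesystem snapshot repository
curl -X PUT "http://localhost:9200/_snapshot/my_backup" \
  -H 'Content-Type: application/json' \
  -d '{"type": "fs", "settings": {"location": "/mount/backups/my_backup"}}'

# Snapshot the cars index into that repository
curl -X PUT "http://localhost:9200/_snapshot/my_backup/snapshot_1?wait_for_completion=true" \
  -H 'Content-Type: application/json' \
  -d '{"indices": "cars"}'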

Conclusion

Elasticdump is a tool for moving and backing up ES indices. As we have seen in this tutorial, we used this awesome tool to work with 406 records in our ES cluster, and it was pretty fast.

As an exercise, we can also decide to try out a backup of a larger data dump to validate the performance. We could also decide to explore other things we can do, like performing a data dump on multiple Elasticsearch indices and other available commands, which we mentioned earlier.

Extra details about the usage of this tool can always be found in the readme file, and the source code is also available on GitHub.

Thanks again for coming this far, and I hope you have learned a thing or two about using this awesome tool to perform data migrations or dumps on an ES cluster.

Also, don’t hesitate to drop a comment if you have any questions, or you can alternatively reach me on Twitter.
