Query from deep storage

Query from deep storage is an experimental feature.

Druid can query segments that are stored only in deep storage. Running a query from deep storage is slower than querying segments loaded on Historical processes, but it's a useful option for data that you access infrequently or where the low-latency results that typical Druid queries provide aren't necessary. Queries from deep storage increase the surface area of data available to query without requiring you to scale your Historical processes to accommodate more segments.

Keep segments in deep storage only

Any data you ingest into Druid is already stored in deep storage, so no additional configuration is needed on that front. However, to take advantage of the cost savings that querying from deep storage provides, make sure that not all of your segments get loaded onto Historical processes.

To do this, configure load rules that determine which segments exist only in deep storage and which are also loaded onto Historical processes.

The easiest way to do this is to explicitly configure which segments don't get loaded onto Historical processes. Set tieredReplicants to an empty object and useDefaultTierForNull to false. For example, if you configure the following rule for a datasource:

[
  {
    "interval": "2016-06-27T00:00:00.000Z/2016-06-27T02:59:00.000Z",
    "tieredReplicants": {},
    "useDefaultTierForNull": false,
    "type": "loadByInterval"
  }
]

Any segment that falls within the specified interval exists only in deep storage. Segments outside this interval follow the default cluster load rules or any other load rules you configure.

To configure the load rules through the Druid console, go to Datasources > ... in the Actions column > Edit retention rules, then paste the provided JSON into the JSON tab.
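You can also set these rules programmatically through the Coordinator's retention rules API. The following is a minimal sketch, assuming the Coordinator listens on localhost:8081 and your datasource is named YOUR_DATASOURCE; note that this call sets the complete rule list for the datasource, so include any other rules you want to keep in the payload:

curl --location --request POST 'http://localhost:8081/druid/coordinator/v1/rules/YOUR_DATASOURCE' \
--header 'Content-Type: application/json' \
--data '[
  {
    "interval": "2016-06-27T00:00:00.000Z/2016-06-27T02:59:00.000Z",
    "tieredReplicants": {},
    "useDefaultTierForNull": false,
    "type": "loadByInterval"
  }
]'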

You can verify that a segment is not loaded on any Historical tiers by querying the Druid metadata table:

SELECT "segment_id", "replication_factor" FROM sys."segments" WHERE "replication_factor" = 0 AND "datasource" = YOUR_DATASOURCE

Segments with a replication_factor of 0 are not assigned to any Historical tiers. Queries against these segments are run directly against the segment in deep storage.

You can also confirm this through the Druid console. On the Segments page, see the Replication factor column.
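For example, here's a quick check from the command line, assuming the Router listens on localhost:8888 (this uses the regular /druid/v2/sql endpoint rather than /sql/statements, and YOUR_DATASOURCE is a placeholder for your datasource name):

curl --location 'http://localhost:8888/druid/v2/sql' \
--header 'Content-Type: application/json' \
--data '{
    "query": "SELECT \"segment_id\", \"replication_factor\" FROM sys.\"segments\" WHERE \"replication_factor\" = 0 AND \"datasource\" = '\''YOUR_DATASOURCE'\''"
}'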

Keep the following in mind when working with load rules to control what exists only in deep storage:

  • At least one segment in a datasource must be loaded onto a Historical process so that Druid can plan the query. It can be any segment from the datasource; it does not need to be a specific one. One way to verify that a datasource has at least one segment on a Historical process is to check whether the datasource is visible in the Druid console.
  • The actual number of replicas may differ from the replication factor temporarily as Druid processes your load rules.

Run a query from deep storage

Submit a query

You can query data from deep storage by submitting a query to the API using POST /sql/statements or the Druid console. Druid uses the multi-stage query (MSQ) task engine to perform the query.

To run a query from deep storage, send your query to the Router using the POST method:

POST https://ROUTER:8888/druid/v2/sql/statements

Submitting a query from deep storage uses the same syntax as any other Druid SQL query: the query is contained in the "query" field of the JSON object in the request payload. For example:

{"query" : "SELECT COUNT(*) FROM data_source WHERE foo = 'bar'"}

Generally, the request body fields are the same between the sql and sql/statements endpoints.

There are additional context parameters for sql/statements specifically:

  • executionMode (required) determines how query results are fetched. Set this to ASYNC.
  • selectDestination (optional): set to durableStorage to instruct Druid to write the results of SELECT queries to durable storage. Note that this requires durable storage for MSQ to be enabled.

The following sample query includes the two additional context parameters that querying from deep storage supports:

curl --location 'http://localhost:8888/druid/v2/sql/statements' \
--header 'Content-Type: application/json' \
--data '{
    "query": "SELECT * FROM \"YOUR_DATASOURCE\" WHERE \"__time\" > TIMESTAMP'\''2017-09-01'\'' AND \"__time\" <= TIMESTAMP'\''2017-09-02'\''",
    "context": {
        "executionMode": "ASYNC",
        "selectDestination": "durableStorage"
    }
}'

The response for submitting a query includes the query ID along with basic information, such as when you submitted the query and the schema of the results:

{
  "queryId": "query-ALPHANUMERIC-STRING",
  "state": "ACCEPTED",
  "createdAt": CREATION_TIMESTAMP,
  "schema": [
    {
      "name": COLUMN_NAME,
      "type": COLUMN_TYPE,
      "nativeType": COLUMN_TYPE
    },
    ...
  ],
  "durationMs": DURATION_IN_MS
}

Get query status

You can check the status of a query with the following API call:

GET https://ROUTER:8888/druid/v2/sql/statements/QUERYID

The call returns the status of the query, such as ACCEPTED or RUNNING. Before you attempt to get results, make sure the state is SUCCESS.
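For example, assuming the Router listens on localhost:8888 and QUERYID is the queryId value returned when you submitted the query:

curl --location 'http://localhost:8888/druid/v2/sql/statements/QUERYID'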

When you check the status of a successful query, the response includes useful information about your query results, including a sample record and details about how the results are organized into pages. The information for each page includes the following:

  • numRows: the number of rows in that page of results
  • sizeInBytes: the size of the page
  • id: the indexed page number that you can use to reference a specific page when you get query results

You can use page as a parameter to refine the results you retrieve.

The following snippet shows the structure of the result object:

{
  ...
  "result": {
    "numTotalRows": INTEGER,
    "totalSizeInBytes": INTEGER,
    "dataSource": "__query_select",
    "sampleRecords": [
      [
        RECORD_1,
        RECORD_2,
        ...
      ]
    ],
    "pages": [
      {
        "numRows": INTEGER,
        "sizeInBytes": INTEGER,
        "id": INTEGER_PAGE_NUMBER
      },
      ...
    ]
  }
}

Get query results

Only the user who submitted a query can retrieve the results for the query.

Use the following endpoint to retrieve results:

GET https://ROUTER:8888/druid/v2/sql/statements/QUERYID/results?page=PAGENUMBER&size=RESULT_SIZE&timeout=TIMEOUT_MS

Results are returned in JSON format.

You can use the optional page, size, and timeout parameters to refine your results. You can retrieve the page information for your results by fetching the status of the completed query.
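For example, to fetch the first page of results (page ids start at 0), assuming the same Router address and query ID as before:

curl --location 'http://localhost:8888/druid/v2/sql/statements/QUERYID/results?page=0'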

If you try to get results for a query that is still running, you may receive an error stating that the query is not yet complete. Wait until the query completes before you try again.
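To automate the wait, you can poll the status endpoint until the query reaches a terminal state. The following is a minimal shell sketch, assuming jq is installed, the same Router address as above, and QUERYID as a placeholder for your query ID:

# Poll until the query leaves the ACCEPTED/RUNNING states,
# then fetch the first page of results on success.
QUERY_ID="QUERYID"
BASE="http://localhost:8888/druid/v2/sql/statements"
while true; do
  STATE=$(curl -s "${BASE}/${QUERY_ID}" | jq -r '.state')
  if [ "$STATE" != "ACCEPTED" ] && [ "$STATE" != "RUNNING" ]; then
    break
  fi
  sleep 5
done
if [ "$STATE" = "SUCCESS" ]; then
  curl -s "${BASE}/${QUERY_ID}/results?page=0"
else
  echo "Query ended in state: $STATE"
fi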

Further reading

  • Query from deep storage tutorial
  • Query from deep storage API reference