Apache Druid
  • Technology
  • Use Cases
  • Powered By
  • Docs
  • Community
  • Apache
  • Download

›Hidden

Getting started

  • Introduction to Apache Druid
  • Quickstart (local)
  • Single server deployment
  • Clustered deployment

Tutorials

  • Load files using SQL
  • Load from Apache Kafka
  • Load from Apache Hadoop
  • Query data
  • Aggregate data with rollup
  • Theta sketches
  • Configure data retention
  • Update existing data
  • Compact segments
  • Deleting data
  • Write an ingestion spec
  • Transform input data
  • Convert ingestion spec to SQL
  • Run with Docker
  • Kerberized HDFS deep storage
  • Get to know Query view
  • Unnesting arrays
  • Query from deep storage
  • Jupyter Notebook tutorials
  • Docker for tutorials
  • JDBC connector

Design

  • Design
  • Segments
  • Processes and servers
  • Deep storage
  • Metadata storage
  • ZooKeeper

Ingestion

  • Overview
  • Ingestion concepts

    • Source input formats
    • Input sources
    • Schema model
    • Rollup
    • Partitioning
    • Task reference

    SQL-based batch

    • SQL-based ingestion
    • Key concepts
    • Security
    • Examples
    • Reference
    • Known issues

    Streaming

    • Apache Kafka ingestion
    • Apache Kafka supervisor
    • Apache Kafka operations
    • Amazon Kinesis

    Classic batch

    • JSON-based batch
    • Hadoop-based
  • Ingestion spec reference
  • Schema design tips
  • Troubleshooting FAQ

Data management

  • Overview
  • Data updates
  • Data deletion
  • Schema changes
  • Compaction
  • Automatic compaction

Querying

    Druid SQL

    • Overview and syntax
    • Query from deep storage
    • SQL data types
    • Operators
    • Scalar functions
    • Aggregation functions
    • Array functions
    • Multi-value string functions
    • JSON functions
    • All functions
    • SQL query context
    • SQL metadata tables
    • SQL query translation
  • Native queries
  • Query execution
  • Troubleshooting
  • Concepts

    • Datasources
    • Joins
    • Lookups
    • Multi-value dimensions
    • Nested columns
    • Multitenancy
    • Query caching
    • Using query caching
    • Query context

    Native query types

    • Timeseries
    • TopN
    • GroupBy
    • Scan
    • Search
    • TimeBoundary
    • SegmentMetadata
    • DatasourceMetadata

    Native query components

    • Filters
    • Granularities
    • Dimensions
    • Aggregations
    • Post-aggregations
    • Expressions
    • Having filters (groupBy)
    • Sorting and limiting (groupBy)
    • Sorting (topN)
    • String comparators
    • Virtual columns
    • Spatial filters

API reference

  • Overview
  • HTTP APIs

    • Druid SQL
    • SQL-based ingestion
    • JSON querying
    • Tasks
    • Supervisors
    • Retention rules
    • Data management
    • Automatic compaction
    • Lookups
    • Service status
    • Dynamic configuration
    • Legacy metadata

    Java APIs

    • SQL JDBC driver

Configuration

  • Configuration reference
  • Extensions
  • Logging

Operations

  • Web console
  • Java runtime
  • Durable storage
  • Security

    • Security overview
    • User authentication and authorization
    • LDAP auth
    • Password providers
    • Dynamic Config Providers
    • TLS support

    Performance tuning

    • Basic cluster tuning
    • Segment size optimization
    • Mixed workloads
    • HTTP compression
    • Automated metadata cleanup

    Monitoring

    • Request logging
    • Metrics
    • Alerts
  • High availability
  • Rolling updates
  • Using rules to drop and retain data
  • Migrate from firehose
  • Working with different versions of Apache Hadoop
  • Misc

    • dump-segment tool
    • reset-cluster tool
    • insert-segment-to-db tool
    • pull-deps tool
    • Deep storage migration
    • Export Metadata Tool
    • Metadata Migration
    • Content for build.sbt

Development

  • Developing on Druid
  • Creating extensions
  • JavaScript functionality
  • Build from source
  • Versioning
  • Contribute to Druid docs
  • Experimental features

Misc

  • Papers

Hidden

  • Apache Druid vs Elasticsearch
  • Apache Druid vs. Key/Value Stores (HBase/Cassandra/OpenTSDB)
  • Apache Druid vs Kudu
  • Apache Druid vs Redshift
  • Apache Druid vs Spark
  • Apache Druid vs SQL-on-Hadoop
  • Authentication and Authorization
  • Broker
  • Coordinator Process
  • Historical Process
  • Indexer Process
  • Indexing Service
  • MiddleManager Process
  • Overlord Process
  • Router Process
  • Peons
  • Approximate Histogram aggregators
  • Apache Avro
  • Microsoft Azure
  • Bloom Filter
  • DataSketches extension
  • DataSketches HLL Sketch module
  • DataSketches Quantiles Sketch module
  • DataSketches Theta Sketch module
  • DataSketches Tuple Sketch module
  • Basic Security
  • Kerberos
  • Cached Lookup Module
  • Apache Ranger Security
  • Google Cloud Storage
  • HDFS
  • Apache Kafka Lookups
  • Globally Cached Lookups
  • MySQL Metadata Store
  • ORC Extension
  • Druid pac4j based Security extension
  • Apache Parquet Extension
  • PostgreSQL Metadata Store
  • Protobuf
  • S3-compatible
  • Simple SSLContext Provider Module
  • Stats aggregator
  • Test Stats Aggregators
  • Druid AWS RDS Module
  • Kubernetes
  • Ambari Metrics Emitter
  • Apache Cassandra
  • Rackspace Cloud Files
  • DistinctCount Aggregator
  • Graphite Emitter
  • InfluxDB Line Protocol Parser
  • InfluxDB Emitter
  • Kafka Emitter
  • Materialized View
  • Moment Sketches for Approximate Quantiles module
  • Moving Average Query
  • OpenTSDB Emitter
  • Druid Redis Cache
  • Microsoft SQLServer
  • StatsD Emitter
  • T-Digest Quantiles Sketch module
  • Thrift
  • Timestamp Min/Max aggregators
  • GCE Extensions
  • Aliyun OSS
  • Prometheus Emitter
  • Firehose (deprecated)
  • JSON-based batch (simple)
  • Realtime Process
  • kubernetes
  • Cardinality/HyperUnique aggregators
  • Select
  • Load files natively
Edit

Test Stats Aggregators

This Apache Druid extension incorporates test statistics related aggregators, including z-score and p-value. Please refer to https://www.paypal-engineering.com/2017/06/29/democratizing-experimentation-data-for-product-innovations/ for math background and details.

Make sure to include druid-stats extension in order to use these aggregators.

Z-Score for two sample ztests post aggregator

Please refer to https://www.isixsigma.com/tools-templates/hypothesis-testing/making-sense-two-proportions-test/ and http://www.ucs.louisiana.edu/~jcb0773/Berry_statbook/Berry_statbook_chpt6.pdf for more details.

z = (p1 - p2) / S.E. (assuming null hypothesis is true)

Please see below for p1 and p2. Please note S.E. stands for standard error where

S.E. = sqrt{ p1 * ( 1 - p1 )/n1 + p2 * (1 - p2)/n2) }

(p1 – p2) is the observed difference between two sample proportions.

zscore2sample post aggregator

  • zscore2sample: calculate the z-score using two-sample z-test while converting binary variables (e.g. success or not) to continuous variables (e.g. conversion rate).
{
  "type": "zscore2sample",
  "name": "<output_name>",
  "successCount1": <post_aggregator> success count of sample 1,
  "sample1Size": <post_aggregaror> sample 1 size,
  "successCount2": <post_aggregator> success count of sample 2,
  "sample2Size" : <post_aggregator> sample 2 size
}

Please note the post aggregator will be converting binary variables to continuous variables for two population proportions. Specifically

p1 = (successCount1) / (sample size 1)

p2 = (successCount2) / (sample size 2)

pvalue2tailedZtest post aggregator

  • pvalue2tailedZtest: calculate p-value of two-sided z-test from zscore
    • pvalue2tailedZtest(zscore) - the input is a z-score which can be calculated using the zscore2sample post aggregator
{
  "type": "pvalue2tailedZtest",
  "name": "<output_name>",
  "zScore": <zscore post_aggregator>
}

Example Usage

In this example, we use zscore2sample post aggregator to calculate z-score, and then feed the z-score to pvalue2tailedZtest post aggregator to calculate p-value.

A JSON query example can be as follows:

{
  ...
    "postAggregations" : {
    "type"   : "pvalue2tailedZtest",
    "name"   : "pvalue",
    "zScore" :
    {
     "type"   : "zscore2sample",
     "name"   : "zscore",
     "successCount1" :
       { "type"   : "constant",
         "name"   : "successCountFromPopulation1Sample",
         "value"  : 300
       },
     "sample1Size" :
       { "type"   : "constant",
         "name"   : "sampleSizeOfPopulation1",
         "value"  : 500
       },
     "successCount2":
       { "type"   : "constant",
         "name"   : "successCountFromPopulation2Sample",
         "value"  : 450
       },
     "sample2Size" :
       { "type"   : "constant",
         "name"   : "sampleSizeOfPopulation2",
         "value"  : 600
       }
     }
    }
}

← Stats aggregatorDruid AWS RDS Module →
  • Z-Score for two sample ztests post aggregator
    • zscore2sample post aggregator
    • pvalue2tailedZtest post aggregator
  • Example Usage

Technology · Use Cases · Powered by Druid · Docs · Community · Download · FAQ

 ·  ·  · 
Copyright © 2022 Apache Software Foundation.
Except where otherwise noted, licensed under CC BY-SA 4.0.
Apache Druid, Druid, and the Druid logo are either registered trademarks or trademarks of The Apache Software Foundation in the United States and other countries.