- 10 years of Fortune 500 stocks
- 30M rows
- Single AWS node
- interactive visual query for stock prices …
What is HBase?
- Runs on HDFS
- K/V store: a distributed, sorted map
- Low-level APIs: get/put a value given a row key, and scan over a range given start and stop row keys
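That data model can be sketched as a toy Python stand-in (illustrative only, not the real HBase client API): a sorted map supporting get, put, and scan over a row-key range.

```python
import bisect

class SortedKV:
    """Toy stand-in for HBase's data model: a sorted map with get/put/scan."""
    def __init__(self):
        self._keys = []        # row keys, kept sorted
        self._data = {}

    def put(self, row_key, value):
        if row_key not in self._data:
            bisect.insort(self._keys, row_key)
        self._data[row_key] = value

    def get(self, row_key):
        return self._data.get(row_key)

    def scan(self, start_key, stop_key):
        """Yield (key, value) pairs with start_key <= key < stop_key."""
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, stop_key)
        for k in self._keys[lo:hi]:
            yield k, self._data[k]

kv = SortedKV()
kv.put("row3", "c"); kv.put("row1", "a"); kv.put("row2", "b")
print(kv.get("row1"))                 # -> a
print(list(kv.scan("row1", "row3")))  # -> [('row1', 'a'), ('row2', 'b')]
```

Keeping the keys sorted is what makes range scans cheap, and it is also why row-key design matters so much in HBase.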
Why use HBase?
- You have lots of data: HBase scales linearly and shards automatically; data is split and redistributed across region servers
- No transactions out of the box
- You need strict consistency: HBase reads and writes are strongly consistent
- HBase handles incremental data changes by building log-structured merge (LSM) trees
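The LSM idea can be sketched roughly (illustrative only, not HBase's actual memstore/HFile machinery): writes land in an in-memory buffer that is periodically flushed to immutable sorted runs, and reads consult the buffer first, then the runs from newest to oldest.

```python
class TinyLSM:
    """Illustrative log-structured merge store: memtable plus sorted runs."""
    def __init__(self, flush_threshold=2):
        self.memtable = {}
        self.runs = []                  # sorted (key, value) lists, newest last
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            # Flush: write the buffer out as an immutable sorted run.
            self.runs.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):  # newest run wins
            for k, v in run:
                if k == key:
                    return v
        return None

db = TinyLSM()
db.put("a", 1); db.put("b", 2)   # second put triggers a flush
db.put("a", 9)                   # newer value shadows the flushed one
print(db.get("a"))               # -> 9
print(db.get("b"))               # -> 2
```

Real LSM stores also compact overlapping runs in the background; this sketch omits that for brevity.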
Why use Phoenix?
Phoenix versus Hive performance
- Phoenix (key filter) >> Phoenix (full table scan) >> Hive over HBase
- Hive and Impala show similar performance; neither understands the structure of HBase row keys
- Archiving data from Oracle into HBase while maintaining the Salesforce developer API
- Platform monitoring
Why is it important?
- scalable, low latency app development starts with Phoenix
- Phoenix worries about physical scale and fast performance so that app developers don’t have to
- Looks, tastes, and feels like SOQL to a Force.com developer
Archival problem set
- Field history tracking grows unbounded: call centers log every individual change, and no data is deleted for 8.5 years
- Enterprise customers require long term storage of ‘cold’ data.
- Data retention policies can require data to be kept around
Archive pilot demonstration
- Insert data from Oracle into Phoenix
Monitoring use case
- security, compliance and audit
- product support and “Limits analysis”
- product usage and management (roll up into aggregates)
- Identity fraud
Phoenix under the hood
- Aggregation runs in a coprocessor on the server side, per region, parallelised
- Each region returns only the distinct, aggregated values
- This preserves network bandwidth
- The client does the final merge-sort
- The client returns the result as a JDBC ResultSet
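The flow above can be sketched as a toy model (not Phoenix internals): each region computes partial aggregates server-side and ships back only small (group, sum, count) tuples; the client merge-sorts the sorted per-region partials and finishes the aggregate.

```python
from collections import defaultdict
import heapq

def region_aggregate(rows):
    """Server-side coprocessor stand-in: partial (sum, count) per group."""
    partial = defaultdict(lambda: [0, 0])
    for group, value in rows:
        partial[group][0] += value
        partial[group][1] += 1
    return sorted(partial.items())    # each region returns sorted partials

def client_merge(partials_per_region):
    """Client-side final merge-sort of per-region partial aggregates."""
    merged = defaultdict(lambda: [0, 0])
    for group, (s, c) in heapq.merge(*partials_per_region):
        merged[group][0] += s
        merged[group][1] += c
    return {g: s / c for g, (s, c) in merged.items()}   # AVG per group

region1 = [("host1", 10), ("host2", 30)]
region2 = [("host1", 20)]
avgs = client_merge([region_aggregate(region1), region_aggregate(region2)])
print(avgs)   # -> {'host1': 15.0, 'host2': 30.0}
```

Shipping (sum, count) rather than raw rows is what saves the network bandwidth: the client can still compute an exact AVG from the partials.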
Roadmap (Nov, 2013)
- Apache incubator
- multi-tenant tables
- monitoring and management
- cost-based query optimizer
- SQL OLAP extensions
(WINDOW, PARTITION OVER, RANK)
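These OLAP extensions were still roadmap items at the time; the intended semantics can be illustrated with a toy RANK-over-partition computation in plain Python (not Phoenix):

```python
from itertools import groupby

def rank_over_partition(rows, partition_key, order_key):
    """Toy RANK() OVER (PARTITION BY ... ORDER BY ... DESC): ties share a rank."""
    out = []
    rows = sorted(rows, key=lambda r: (r[partition_key], -r[order_key]))
    for _, part in groupby(rows, key=lambda r: r[partition_key]):
        rank, prev, seen = 0, None, 0
        for row in part:
            seen += 1
            if row[order_key] != prev:      # SQL RANK skips after ties
                rank, prev = seen, row[order_key]
            out.append({**row, "rank": rank})
    return out

rows = [{"sym": "A", "price": 5}, {"sym": "A", "price": 9}, {"sym": "B", "price": 3}]
for r in rank_over_partition(rows, "sym", "price"):
    print(r["sym"], r["price"], r["rank"])
# -> A 9 1 / A 5 2 / B 3 1
```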
State of the union (Jun 2015)
Join and subquery support
- Left/Right/Full outer Join; cross join
- additional: semi join; anti-join
- Algorithms: hash join; sort-merge join
- Predicate push-down
- FK-to-PK join optimisation
- Global index with missing data columns
- Correlated subquery rewrite
- TPC benchmarking suite
- HBase 1.0 support
- Functional indexes (since the Phoenix 4.3 release): index on an expression rather than a column. For example (the WHERE clause and index expression below are reconstructed for illustration; the original slide's exact columns were not captured):
SELECT AVG(response_time) FROM SERVER_METRICS
WHERE DAYOFMONTH(create_date) = 1
Adding the following index will turn the full scan into a range scan:
CREATE INDEX day_of_month_idx
ON SERVER_METRICS (DAYOFMONTH(create_date))
- User-defined functions: extension points into Phoenix. You define the function's metadata (input and output parameters).
CREATE FUNCTION WOEID_DISTANCE(INTEGER, INTEGER)
RETURNS INTEGER AS 'org.apache.geo.woeidDistance'
USING JAR '/lib/geo/geoloc.jar'
SELECT * FROM woeid a JOIN woeid b ON a.country = b.country
WHERE woeid_distance(a.ID, b.ID) < 5;
The query executes on the server side. UDFs are tenant-aware, so there is no chance of conflict between tenants defining a UDF with the same name.
Query Server + Thin Driver: offloads query planning and execution to a separate server (or servers), minimising client dependencies.
Connection conn = DriverManager.getConnection("jdbc:phoenix:thin:url=http://localhost:8765");
UNION ALL support: query multiple tables to get a single result set.
- Testing at scale with Pherf: define a scenario and deploy Pherf to the cluster; it generates the data and collects the stats. Enables testing with representative data sizes and functional data sets.
- MapReduce index build
- Spark integration: an RDD backed by a Phoenix table
- Date built-in functions
Transactions, using Tephra (http://tephra.io)
- Supports the REPEATABLE_READ isolation level
- Allows reading your own uncommitted data
- Optional – enable on a table by table basis
- no performance penalty when not used
- Optimistic concurrency control
- no deadlocks or lock escalation
- cost of conflict detection and possible rollback is higher
- good if conflicts are rare: short transactions, disjoint partitioning of work
- conflict detection is not necessary if the data is immutable, i.e. write-once/append-only data
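Optimistic conflict detection at commit time can be sketched as follows (illustrative only, not Tephra's implementation): each transaction buffers its writes; at commit, it conflicts if any key it wrote was already committed by a transaction that overlapped it in time.

```python
class OptimisticStore:
    """Toy optimistic concurrency control: detect write-write conflicts at commit."""
    def __init__(self):
        self.data = {}
        self.commit_log = []     # (commit_seq, write_set) of committed txns
        self.seq = 0

    def begin(self):
        self.seq += 1
        return {"start": self.seq, "writes": {}}

    def write(self, txn, key, value):
        txn["writes"][key] = value       # buffered until commit

    def commit(self, txn):
        # Conflict if a txn that committed after we started wrote the same key.
        for commit_seq, write_set in self.commit_log:
            if commit_seq > txn["start"] and write_set & txn["writes"].keys():
                return False             # rollback: buffered writes are dropped
        self.seq += 1
        self.data.update(txn["writes"])
        self.commit_log.append((self.seq, set(txn["writes"])))
        return True

store = OptimisticStore()
t1, t2 = store.begin(), store.begin()
store.write(t1, "k", 1)
store.write(t2, "k", 2)
ok1 = store.commit(t1)   # True: no prior overlapping commit
ok2 = store.commit(t2)   # False: write-write conflict with t1, rolled back
print(ok1, ok2, store.data)
```

No locks are taken, so there is nothing to deadlock on; the cost shows up only when a conflict forces a rollback, which is why the approach pays off when conflicts are rare.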
- Next steps for Phoenix
- Query parser, compiler and planner framework
- SQL-92 compliant
- Pluggable cost-based optimizer framework
- Interop with Calcite adaptors
- already used by Drill, Hive, Kylin, Samza
- Supports any JDBC source
- One Cost model to rule them all
Setting up a local Phoenix cluster using Docker
Download and install Docker beta for Mac.
We will set up a three-node Hadoop cluster, using this recipe as a template:
git clone git@github.com:kiwenlau/hadoop-cluster-docker.git
docker build .
Lots of things fly by …
Successfully built ed4ece2b19f2
Create hadoop network
sudo docker network create --driver=bridge hadoop