1 billion rides in SAP HANA on AWS

As we have an upcoming BrightTALK webinar on Wednesday, we needed a nice dataset to show what cloud native power looks like.

With our friends from Cloud9 Charts we selected a public dataset of NYC Yellow Taxi rides from 2009 to 2016. It is stored in Amazon S3, like so many public datasets.

Properties: 217 GB of raw CSV data with 1.231 billion rows.

A word on data oceans

I think it is no longer up for discussion that the persistence layer for data oceans is cloud object storage such as Amazon S3, Azure Blob Storage or Google Cloud Storage.

Standard price points are 2.4 to 3.0 cents / GB / month across the 3 providers.

Each of the 3 providers also has an equivalent tier for infrequently accessed objects, which comes in at 1.0 to 1.25 cents / GB / month.
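To put these price points in perspective for the taxi dataset, here is a quick back-of-the-envelope calculation. It is only a sketch: it assumes decimal gigabytes and storage charges only (no request or transfer fees), and the function name is my own.

```python
# Monthly storage cost of the 217 GB taxi dataset at the price points above.
# Assumptions: decimal GB, storage charges only (no request or transfer fees).

DATASET_GB = 217

# Price ranges in US cents per GB per month, as quoted in the text.
PRICE_RANGES_CENTS = {
    "standard": (2.4, 3.0),
    "infrequent access": (1.0, 1.25),
}


def monthly_cost_usd(size_gb, cents_per_gb):
    """Return the monthly storage cost in US dollars."""
    return size_gb * cents_per_gb / 100


costs = {
    tier: tuple(monthly_cost_usd(DATASET_GB, cents) for cents in cents_range)
    for tier, cents_range in PRICE_RANGES_CENTS.items()
}

for tier, (low, high) in costs.items():
    print(f"{tier}: {low:.2f} to {high:.2f} USD / month")
```

Under these assumptions, keeping the raw CSV files around costs only a few dollars per month in either tier.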

AWS and Google both promise a durability of 99.999999999 %. Azure has no published durability statement at this point.
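What eleven nines means in practice is easier to see with a small calculation. The sketch below assumes that the figure is the probability of a single object surviving one year and that object losses are independent. Both are simplifications of mine, not provider guarantees.

```python
from fractions import Fraction

# Assumption: 99.999999999 % is the probability that one object survives
# one year, and object losses are independent of each other.
ANNUAL_LOSS_PROBABILITY = 1 - Fraction("0.99999999999")


def expected_objects_lost_per_year(object_count):
    """Expected number of objects lost per year for a given object count."""
    return object_count * ANNUAL_LOSS_PROBABILITY


def years_per_lost_object(object_count):
    """Average number of years until one object is lost."""
    return 1 / expected_objects_lost_per_year(object_count)


# With ten million stored objects, one loss is expected every 10,000 years.
print(years_per_lost_object(10_000_000))
```

Exact fractions are used instead of floats because subtracting a number this close to 1 in floating point introduces visible rounding error.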

Don't think, though, that object storage is a commodity. Between the 3 providers there are massive differences in performance, security and best practices.

A word on data lakes

There might also be no better storage for your corporate datasets, including your most confidential ones. That is because of

Strong security, data protection and compliance capabilities
Durability and availability by design
Unlimited, immediate scaling

And the price points above.

Security options include locking down access in a number of ways and specifying strong encryption methods. However, as mentioned earlier, the 3 providers are not equal on security and data protection.

SAP HANA and data lakes and oceans

In the following I will describe a low-tech and a high-tech approach to loading datasets from the data ocean into SAP HANA.

First, let me describe some general rules that you should consider.

General Rules

1) Co-Location

You will want to put your SAP HANA system next to the dataset. This gives you maximum data load performance and minimum cost, because you avoid charges for outbound traffic from the dataset's region to your HANA system.

In the NYC Yellow Taxi case, we can deduce from the dataset URL that it is stored in Amazon S3 in us-east-1. So you will want to place your HANA system there.

2) Private Network Optimization

If your HANA system runs in a private subnet, you should configure an AWS VPC Endpoint for S3 to get high-performance access at minimum cost. Otherwise all S3 traffic will go through your NAT layer, which will certainly not bring you higher performance but will lead to higher cost.
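As an illustration, such an endpoint can be created with the AWS SDK for Python (boto3). This is a minimal sketch, not how we provision systems: the helper names and the VPC and route table IDs are hypothetical, and it assumes a gateway-type endpoint in us-east-1.

```python
def s3_gateway_endpoint_request(vpc_id, route_table_ids, region="us-east-1"):
    """Build the arguments for an EC2 CreateVpcEndpoint call for S3."""
    return {
        "VpcEndpointType": "Gateway",
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.s3",
        # Route tables of the private subnets that should reach S3 directly.
        "RouteTableIds": list(route_table_ids),
    }


def create_s3_gateway_endpoint(vpc_id, route_table_ids, region="us-east-1"):
    """Create the endpoint. Requires boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the request builder works without it

    ec2 = boto3.client("ec2", region_name=region)
    request = s3_gateway_endpoint_request(vpc_id, route_table_ids, region)
    response = ec2.create_vpc_endpoint(**request)
    return response["VpcEndpoint"]["VpcEndpointId"]


# Hypothetical IDs, for illustration only. Nothing is sent to AWS here.
request = s3_gateway_endpoint_request("vpc-0123example", ["rtb-0456example"])
print(request["ServiceName"])
```

Once the endpoint is attached to the route table of the private subnet, S3 requests from the HANA system in that region bypass the NAT layer.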

3) Use an instance type with 10 Gbit Networking for High Performance Data Loading

We will see later what Amazon S3 can provide in terms of throughput. If you care about data load speed then you should select one of

r3.8xlarge (244 GB RAM)
m4.10xlarge (160 GB RAM)
c3.8xlarge / c4.8xlarge (60 GB RAM)

Why am I not listing X1 with 2 TB RAM? We are still waiting for the high-performance network drivers for SUSE Linux that activate the 20 Gbit mode. Until then X1 will not reach the throughput of the instances above.

Manual Approach

After taking care of the above you can continue with

4) Copy the Dataset to the HANA Machine.

You should use the AWS Command Line Interface to achieve a decent storage throughput of 200 MB / second. I would also suggest storing the dataset on a different storage device than the one underneath /hana/data and /hana/log. Otherwise reading the dataset will compete with SAP HANA for storage bandwidth during the data load.
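A rough estimate of what this copy step costs in time, as a sketch. It assumes decimal units, a sustained 200 MB / second, and, for comparison, a fully saturated 10 Gbit link, which you will not reach in practice.

```python
DATASET_GB = 217           # raw CSV size of the taxi dataset
CLI_MB_PER_SECOND = 200    # throughput quoted above for the AWS CLI
LINK_GBIT_PER_SECOND = 10  # network speed of the recommended instance types


def transfer_minutes(size_gb, mb_per_second):
    """Minutes needed to move size_gb at a sustained rate (decimal units)."""
    return size_gb * 1000 / mb_per_second / 60


cli_minutes = transfer_minutes(DATASET_GB, CLI_MB_PER_SECOND)

# Theoretical ceiling of the link: 10 Gbit/s is 1250 MB/s.
link_mb_per_second = LINK_GBIT_PER_SECOND * 1000 / 8
line_rate_minutes = transfer_minutes(DATASET_GB, link_mb_per_second)

print(f"copy at {CLI_MB_PER_SECOND} MB/s: {cli_minutes:.1f} minutes")
print(f"copy at line rate:  {line_rate_minutes:.1f} minutes")
```

So the copy alone adds roughly 18 minutes before the actual load into SAP HANA can start.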

5) Load Data

Now you can load the data into SAP HANA, either using HANA Studio or the HANA Command Line. This blog post describes nicely how to do this.

Cloud Native Approach

A cloud native approach is to do the above in two clicks: one for provisioning the system and one for loading the data.

Ocean9 provides this to you and more.

At Ocean9 we are constantly seeking to get the maximum out of the cloud. We have implemented a direct path for data loading from S3 to SAP HANA without persistence in between.

CPU utilization stayed in the 80-90% range throughout. Disk I/O was not an issue at less than 170 MB / second.

The data load took exactly 60 minutes (and yes, there is further room for improvement).

On x1.32xlarge, data loading took 125 minutes (because of the driver status described above).
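Expressed as rates, the two load times look as follows. This is a sketch that assumes the rounded figures from above (1.231 billion rows, 217 GB of raw CSV in decimal units). The labels in the dictionary are my own.

```python
ROWS = 1_231_000_000  # rounded row count of the dataset
RAW_GB = 217          # raw CSV size

# Load durations in minutes, as reported above.
LOAD_MINUTES = {"60-minute run": 60, "x1.32xlarge": 125}


def load_rates(minutes):
    """Return (rows per second, raw CSV MB per second) for a load duration."""
    seconds = minutes * 60
    return ROWS / seconds, RAW_GB * 1000 / seconds


rates = {name: load_rates(minutes) for name, minutes in LOAD_MINUTES.items()}

for name, (rows_per_s, mb_per_s) in rates.items():
    print(f"{name}: {rows_per_s:,.0f} rows/s, {mb_per_s:.1f} MB/s of raw CSV")
```

The 60-minute load corresponds to roughly 342,000 rows per second, or about 60 MB / second of raw CSV.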

HANA Performance

I just ran one statement to get an impression.

select sum(total_amt) from nyc.yellow_taxi

The SQL statement needs to touch all "rows" and can theoretically use all vCPUs in parallel.

The statement completes in

r3.8xlarge : 1.0 second (32 vCPUs)
r3.4xlarge : 1.8 seconds (16 vCPUs)
x1.32xlarge : 0.4 seconds (128 vCPUs)
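To compare how well the statement scales, the timings can be converted into rows scanned per second and per vCPU. This is a sketch based on the rounded row count of 1.231 billion and the measured times above, so treat the results as approximate.

```python
ROWS = 1_231_000_000  # rounded row count of nyc.yellow_taxi

# instance type -> (elapsed seconds, vCPUs), as measured above
RUNS = {
    "r3.8xlarge": (1.0, 32),
    "r3.4xlarge": (1.8, 16),
    "x1.32xlarge": (0.4, 128),
}

# Rows scanned per second across the whole instance.
rows_per_second = {name: ROWS / secs for name, (secs, _) in RUNS.items()}

# Rows scanned per second and vCPU, a rough measure of scaling efficiency.
per_vcpu = {
    name: rows_per_second[name] / vcpus for name, (_, vcpus) in RUNS.items()
}

for name in RUNS:
    print(
        f"{name}: {rows_per_second[name] / 1e6:,.0f} M rows/s total, "
        f"{per_vcpu[name] / 1e6:.1f} M rows/s per vCPU"
    )
```

The two r3 instances land at roughly 38 to 43 million rows per second per vCPU, while x1.32xlarge comes in at about 24 million, so the aggregate does not scale linearly up to 128 vCPUs.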

Backup and Restore

It is always a good idea to perform a data backup to S3 after a data load. We do this with our Advanced Backup and Restore implementation for SAP HANA.

For this dataset, backup takes 10 minutes and restore takes 8 minutes.

Now we can create a brand new system including this dataset in 20 minutes, with the data loaded in memory!

For the BrightTALK webinar we will launch the system on Wednesday morning and terminate it before noon.

This is how the cloud should be used.

What's next?

See for yourself! Join the BrightTALK webinar on Wednesday to see this in action, combined with an analytics service from Cloud9 Charts!

Try it yourself! Use my HANA SQL schema for the NYC Taxi dataset.

I would love to reply to your comments and questions below!
