Wednesday, August 26, 2009

Traditional SaaS vs Cloud enabled SaaS

Inspired by Gilad's great summary of the cloud programming model, I try to summarize the differences that I observe between the traditional SaaS model and the "cloud-enabled SaaS model". Although cloud providers advocate that zero effort is needed to migrate existing applications into the cloud, it is my belief that this "strict-port" approach doesn't exploit the full power of cloud computing. The cloud differs from the traditional data center environment in a number of characteristics, and applications designed along these characteristics will take more advantage of the cloud.

I believe a cloud-enabled application should have the following characteristics in its fundamental design.

Latency Awareness

Traditional SaaS apps typically run within a single data center and assume low latency among server components. In a cloud environment that spans many distant geographic locations, the assumption of low latency no longer holds. We need to be “smarter” when choosing where to deploy, to avoid placing frequently communicating components in far-distant locations. A “cloud-enabled SaaS app” needs to be aware of latency differences and have built-in self-configuring and self-tuning mechanisms to cope with them.

Cost Awareness

Traditional SaaS apps typically run on already purchased hardware, where utilization efficiency is not a concern. With the “pay as you go” model, an application needs to pay more attention to its usage pattern and to how efficiently it uses the underlying resources, because both affect the operating cost. A cloud-enabled SaaS application needs to understand the cost model of the different resources it uses (for example, CPU cost may be very different from bandwidth cost) and adjust its usage strategy to minimize the operating cost.
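As a rough illustration of trading one resource against another, the sketch below compares sending data uncompressed with spending CPU time to compress it first. All prices, the compression ratio, and the function name are hypothetical values chosen for the example; they are not taken from any real provider's price list.

```python
def cheaper_strategy(data_gb, compression_ratio, cpu_hours_per_gb,
                     cpu_price_per_hour, bandwidth_price_per_gb):
    """Return (strategy, cost) for the cheaper way to ship data_gb of data."""
    # Option 1: pay bandwidth for every byte.
    raw_cost = data_gb * bandwidth_price_per_gb
    # Option 2: pay CPU to compress, then pay bandwidth for the smaller payload.
    cpu_cost = data_gb * cpu_hours_per_gb * cpu_price_per_hour
    compressed_cost = cpu_cost + (data_gb / compression_ratio) * bandwidth_price_per_gb
    if compressed_cost < raw_cost:
        return "compress", compressed_cost
    return "send_raw", raw_cost


# Hypothetical numbers: 100 GB, 4:1 compression, 0.01 CPU-hours per GB,
# 0.10 per CPU-hour and 0.10 per GB transferred.
strategy, cost = cheaper_strategy(100, 4.0, 0.01, 0.10, 0.10)
```

With these made-up prices, compression wins because bandwidth dominates the bill. With cheap bandwidth and expensive CPU the answer flips, which is why the application needs the cost model as an input rather than a hard-coded choice.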

Security Awareness

Traditional SaaS apps typically run in a fully trusted data center protected by perimeter security. In the hybrid cloud model, the perimeter is drawn very differently. An application needs to carefully select where to store its data so that sensitive information does not leak. This involves carefully choosing the storage provider or using encryption for protection.

Capitalize on Elasticity

Traditional SaaS apps are not used to large-scale growth and shrinkage of compute resources, and typically are not designed to handle how data gets distributed to newly joined machines (in a growth scenario) or redistributed among the remaining machines (in a shrink scenario). This ends up using network bandwidth very inefficiently and results in high cost and low performance. A “cloud-enabled SaaS app” needs a more sophisticated data distribution protocol that aligns with the growth and shrink dimensions.

Monday, August 17, 2009

Multi-tenancy in cloud computing

This is a follow-up to an interesting discussion in the Cloud Computing discussion group. What is a tenant? Is multi-tenancy an important feature of the cloud? Who are the participants in the cloud ecosystem, and what are their roles?

Participants in the cloud
In my model, a "SaaS provider" is the organization that provides a domain specific SaaS App to its users (e.g. SmugMug for photo sharing). In this case, the SaaS consumer is just any individual who has a SmugMug account. The SaaS provider may choose an infrastructure provider (e.g. Amazon) to host its SaaS App. In this example, SmugMug is a SaaS provider and Infrastructure consumer at the same time.

Definition of a Tenant
Now, who is the "tenant" in this picture? I think Amazon will consider SmugMug a tenant. But I doubt SmugMug will consider its individual users tenants.

But what if SmugMug offers a service to car manufacturers so they can store, organize and image-process their photos, which then show up on the car manufacturer's website? Will SmugMug consider BMW a tenant? I think the answer is "yes". So maybe the definition of a tenant is "my user who has her own users".

You can see that a value chain can be built up. So, except for the start and end points of this value chain, everyone is a "tenant" of its service provider.

Now that we have defined what a "tenant" is, what does "multi-tenancy" mean? In my opinion, "multi-tenancy" is for the benefit of the service provider, so it can manage resource utilization more efficiently, but multi-tenancy is not to the tenant's advantage at all. In the made-up example I gave above, would BMW prefer a multi-tenant environment from SmugMug? My guess is that BMW would in fact worry if its data were sitting together with its competitors' data in a shared infrastructure. I bet it would prefer an environment that is isolated as much as possible.

While "multi-tenancy" indicates that some infrastructure is shared, the layer at which things are shared can make a big difference. For example, Amazon AWS is multi-tenant at the hardware level in that its users may be sharing a physical machine. On the other hand, is multi-tenant at the DB level in that its users are sharing data in the same DB tables. And Amazon is relying on the hypervisor to provide the isolation between tenants while is relying on a query rewriter to do the same.

While "multi-tenancy" at the highest layer basically advocates a shared-DB approach, does it enable better collaboration or sharing between tenants? I don't think so. I think all we need is an authentication model in which spontaneous workgroups can be formed and membership can be identified. Then it is just a matter of a requesting tenant presenting its membership to another tenant when making a SaaS service call. What I mean is that tenants use an SOA approach to access data, rather than directly accessing a shared DB.

Sunday, August 9, 2009

Skinny Straw in the Cloud Shake

There is a recent article by Bernard Golden arguing that network constraints (bandwidth and latency), together with the associated bandwidth usage cost, continue to be one of the main obstacles in cloud computing.

There are two concerns here. One is about not meeting the application's performance goals (throughput and response time). The other is about the cost of running in the cloud (receiving the equivalent of a large phone bill from your cloud provider).

The goal is to reduce the total amount of data transfer. A number of cloud app design patterns can be used ...

How do you put the code and data together before the processing can start ?

Try to be as stateless as possible
There is zero data to be transferred if your component is stateless by nature. The following techniques assume that some unavoidable stateful components are involved.

Move your data creation process into the cloud first
Instead of uploading a huge volume of data from your data center into the cloud so that processing can start, can you move the data creation process into the cloud? Of course, you need to carefully evaluate the security implications here.

Distribute the architecture of your data creation
If the subsequent processing is based on a parallel execution architecture, why not distribute the data creation processing as well? This saves a data repartition step.

Move the code to the data
Code usually has a much smaller footprint than the data it processes. Therefore it is more economical to move processing logic to the data rather than downloading the data to process. Of course, we need to check to make sure the machine hosting the data has enough CPU power to execute the processing logic.

Do as much as possible along current partition
A typical parallel processing architecture partitions data along some dimensions, conducts the processing in parallel, then repartitions the data along other dimensions, conducts the next stage of processing, and so on ...

See if you can rearrange the order of processing such that you can do as much as possible within the current partition. The goal is to minimize the number of repartitions where a lot of data transfer is needed.

Minimize data redistribution at grow/shrink
How do you redistribute data to a newly joined VM such that the overall data transfer is minimized? For example, a "consistent hashing" algorithm can be used so that data redistribution only happens between the newly joined VM and its neighbors, rather than involving every existing VM.
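A minimal sketch of the consistent hashing idea follows. The class name, VM names and keys are hypothetical, and the ring places a single point per VM for brevity; a production ring would usually place several virtual points per VM to even out the load.

```python
import bisect
import hashlib


def _hash(key):
    # Stable hash so placement does not change between runs or machines.
    return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes=()):
        self._points = []   # sorted hash positions on the ring
        self._owner = {}    # hash position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node):
        point = _hash(node)
        bisect.insort(self._points, point)
        self._owner[point] = node

    def lookup(self, key):
        # The owner is the first node clockwise from the key, wrapping around.
        i = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[i]]


keys = [f"photo-{i}" for i in range(1000)]
ring = HashRing(["vm-a", "vm-b", "vm-c"])
before = {k: ring.lookup(k) for k in keys}

ring.add("vm-d")   # a new VM joins
after = {k: ring.lookup(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

Every key in `moved` now belongs to `vm-d`, and all of them came from the one VM that follows `vm-d` on the ring. Keys owned by the other VMs stay where they are.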

Conduct data redistribution in the background
Data redistribution may affect performance, but it should not affect accuracy. In other words, newly joined VMs should be able to serve requests immediately while data redistribution proceeds in the background. The data redistribution algorithm (which may take a longer time to finish) also needs to adapt to VMs that keep joining. In other words, data redistribution can simply be an ongoing performance improvement process in a highly dynamic workload environment.

Place components with bandwidth cost in mind
Other than the amount of data being transferred (which should be minimized anyway), it is equally important to look into bandwidth cost. Typically the cloud provider charges a substantial amount for bandwidth usage across the cloud boundary. Therefore, it is important to place components such that if data transfer does need to occur, it occurs within the cloud rather than across the cloud boundary. This requires a careful analysis of the communication pattern among application components, and grouping frequently communicating components so they are deployed within the same cloud.
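One simple way to compare candidate placements is to add up the traffic that would cross the cloud boundary under each of them. The sketch below assumes a hypothetical traffic matrix (GB per day between pairs of components) and made-up component and site names.

```python
def cross_boundary_traffic(traffic, placement):
    """Sum the traffic between components that are placed at different sites."""
    return sum(volume
               for (a, b), volume in traffic.items()
               if placement[a] != placement[b])


# Hypothetical measurements, in GB per day.
traffic = {("web", "app"): 50, ("app", "db"): 200}

# Placement A keeps the database on premises; placement B keeps the web tier there.
placement_a = {"web": "cloud", "app": "cloud", "db": "on_prem"}
placement_b = {"web": "on_prem", "app": "cloud", "db": "cloud"}

cost_a = cross_boundary_traffic(traffic, placement_a)   # app <-> db crosses the boundary
cost_b = cross_boundary_traffic(traffic, placement_b)   # web <-> app crosses the boundary
```

Placement B keeps the chattiest pair (app and db) on the same side, so only the lighter web-to-app traffic is billed as cross-boundary transfer.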

Migrate data as communication pattern changes
Communication patterns may change after the system is deployed. It is important to continuously monitor the actual communication patterns and determine whether a migration is needed to minimize the bandwidth cost. It is also important to weigh the gain against the cost of migration. Gain is estimated by multiplying the communication frequency by the time that the new communication pattern is expected to persist. Cost is estimated as the total amount of data redistribution traffic caused by the component migration. The migration takes place only when its cost is smaller than the gain.
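The gain-versus-cost rule can be written down directly. The sketch below assumes both sides are expressed in the same unit (bytes of cross-boundary traffic), and that the expected duration of the new pattern is a forecast supplied by the caller; the function and parameter names are hypothetical.

```python
def should_migrate(saved_bytes_per_sec, expected_duration_sec, migration_bytes):
    """Migrate only if the traffic saved outweighs the traffic the move itself causes."""
    # Gain: communication rate times how long the new pattern is expected to last.
    gain = saved_bytes_per_sec * expected_duration_sec
    # Cost: data redistribution traffic caused by moving the component.
    cost = migration_bytes
    return cost < gain


# Moving a 2 GB component to save 1 MB/s of cross-boundary traffic.
worth_it_for_an_hour = should_migrate(1_000_000, 3600, 2_000_000_000)
worth_it_for_ten_minutes = should_migrate(1_000_000, 600, 2_000_000_000)
```

If the new pattern lasts an hour, the 3.6 GB saved exceeds the 2 GB moved, so the migration pays off. If it lasts only ten minutes, it does not.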

Exploit Caching
Use a local cache to reduce the number of remote data accesses, especially if the data is relatively static.
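A minimal sketch of such a cache is shown below, assuming a time-to-live policy is acceptable for relatively static data. `TTLCache` and `fetch_from_remote` are hypothetical names, and the fetch function is a stand-in for whatever remote read the application actually performs.

```python
import time


class TTLCache:
    """Read-through cache: serve local copies until they are older than ttl_seconds."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch        # function that performs the remote read
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}         # key -> (value, time it was fetched)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[1] < self._ttl:
            return entry[0]        # fresh enough: no network traffic
        value = self._fetch(key)   # miss or stale: go to the remote store
        self._entries[key] = (value, now)
        return value


remote_reads = []


def fetch_from_remote(key):
    # Stand-in for a real remote call; records each access for illustration.
    remote_reads.append(key)
    return f"value-of-{key}"


cache = TTLCache(fetch_from_remote, ttl_seconds=300)
for _ in range(5):
    cache.get("catalog")
```

Five reads of the same key result in a single remote access. The remaining four are served locally until the entry expires.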

Allow direct access to data
This goes against the philosophy of SOA, where internal state should be encapsulated behind an API. In the SOA model, when a client wants to extract data, it first needs to make a request to the owning application, which then makes a request to the DB, gets the data, encodes it into the web service response, and passes the result back to the client. If network bandwidth is costly, it is much more efficient for the client to have direct access to the DB.

Expose latency information to the application
Provide a latency map so that applications can dynamically choose which partners they communicate with.
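As a sketch of how an application might use such a map, the function below picks the candidate partner with the lowest known latency from a given location. The location names and latency figures are invented for illustration, and the map format (a dictionary keyed by source and destination) is an assumption.

```python
def pick_partner(source, candidates, latency_ms):
    """Return the candidate with the lowest known latency from source."""
    def latency(candidate):
        # Pairs missing from the map are treated as infinitely far away.
        return latency_ms.get((source, candidate), float("inf"))
    return min(candidates, key=latency)


# Hypothetical latency map, in milliseconds.
latency_ms = {
    ("us-east", "us-west"): 70.0,
    ("us-east", "eu-west"): 85.0,
    ("us-east", "ap-south"): 190.0,
}

best = pick_partner("us-east", ["eu-west", "us-west", "ap-south"], latency_ms)
```

If the map is refreshed periodically, calling `pick_partner` again lets the application switch partners as conditions change.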