Most software out there relies so heavily on shared data that it does not scale easily.
The problem is that developers often don't realize that, most of the time, data can be allowed to be stale (though still relevant) - see how Amazon Dynamo's clients handle it. Nor do they realize that most data is accessed by only one process at a time.
In fact, when we look at systems of human interaction, we see that most of the data people rely on is already stale by the time they perform a task.
Here are the points I describe below:
- Allow data to be stale. There is very little reason to make sure everything is up to date all the time. Spread the data as much as possible
- Check the data access patterns: most data is accessed by only one process at a time
- Understand that needing data does not mean one process/thread has to hang on one resource
- If you need data to be up to date in real time, share the responsibility - aggregating data is faster and easier
On the freshness of data.
An example from most governmental institutions: laws and regulations have an activation date that is announced in advance. A lot of decisions are made at a low level without checking the central (national-level) legal repository every time. That is a case where data is stale, but still very much relevant.
An example from most businesses: business rules are much more dynamic than those of governmental institutions, but most of them still change very little. Sure, you can change a rule at once, but you would never expect any of your own employees to act on the changed rule until they get the message about the change.
An example from every society whose laws change or may be changed: when a law is passed, it does not instantly change some other law or make something legal or illegal. It has a date of coming into force. Up until then, every person can know the most important laws that are in effect and when the new law takes effect.
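To make the "stale but still relevant" idea concrete, here is a minimal Python sketch. It assumes each rule version is announced together with the date it comes into force, so a local copy can answer questions without asking the central repository. The `LocalRuleBook` class and the speed-limit rules are made up for illustration.

```python
from bisect import bisect_right
from datetime import date


class LocalRuleBook:
    """A local, possibly stale copy of rules that were announced in advance."""

    def __init__(self):
        self._effective_dates = []  # sorted dates on which a version takes effect
        self._versions = []         # rule text matching each date above

    def announce(self, effective_from, text):
        """Record a rule version ahead of the date it comes into force."""
        index = bisect_right(self._effective_dates, effective_from)
        self._effective_dates.insert(index, effective_from)
        self._versions.insert(index, text)

    def rule_on(self, day):
        """Return the version in force on `day`, using only local data."""
        index = bisect_right(self._effective_dates, day)
        if index == 0:
            return None  # nothing was in force yet
        return self._versions[index - 1]


# Hypothetical rules, both announced before they take effect.
book = LocalRuleBook()
book.announce(date(2020, 1, 1), "speed limit 50")
book.announce(date(2021, 7, 1), "speed limit 30")

print(book.rule_on(date(2021, 6, 30)))  # speed limit 50
print(book.rule_on(date(2021, 7, 1)))   # speed limit 30
```

The local copy only has to be refreshed before the next activation date, not before every decision.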
Then there is the access to data.
An example from banking: chances are that no more than ONE "process" will be using an account's data at a time. In fact, I bet that the absolute majority of operations are performed by one process accessing ONE account. The reality is that the only part the bank needs to be absolutely sure about is the "credit" operation (say goodbye to "credit->debit|undo transaction"). The debit operation can be reasonably fast in asynchronous mode and can send a notification on failure.
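One way to picture "one process per account" is a single owner thread that receives requests through a queue, so nobody else ever touches the balance and no lock is needed. The Python sketch below is only an illustration under my own assumptions: the `AccountOwner` class is hypothetical, changes are plain signed amounts, and the rule that the balance may not go negative is invented for the example. Senders never wait; a rejected change shows up later as a notification.

```python
import queue
import threading


class AccountOwner(threading.Thread):
    """The single thread that is allowed to touch one account's balance."""

    def __init__(self, balance):
        super().__init__()
        self.balance = balance
        self.inbox = queue.Queue()    # requested balance changes
        self.notices = queue.Queue()  # failure notifications for senders

    def submit(self, delta):
        """Queue a balance change and return immediately, without waiting."""
        self.inbox.put(delta)

    def run(self):
        while True:
            delta = self.inbox.get()
            if delta is None:  # shutdown signal
                return
            if self.balance + delta < 0:
                # Assumed rule: the balance may not go negative.
                self.notices.put(f"rejected change of {delta}")
            else:
                self.balance += delta


owner = AccountOwner(balance=100)
owner.start()
owner.submit(-30)   # accepted, balance becomes 70
owner.submit(-500)  # rejected, the sender is notified later
owner.submit(50)    # accepted, balance becomes 120
owner.submit(None)  # ask the owner thread to stop
owner.join()

notice = owner.notices.get_nowait()
print(owner.balance)  # 120
print(notice)         # rejected change of -500
```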
An example from real life: if you don't find your personal mug in the cupboard, you don't just stand at the cupboard until someone returns it! You go and look around. The mug is the data here and the person is the process. Standing and waiting for it to be returned is the equivalent of locking a resource.
As this example shows, locking and waiting is not always the best option. The reality is that it's not the process that has taken the resource that locks; it's the waiting process that locks and waits.
My proposed strategy is to acquire ownership of the data and leave a note stating who knows its whereabouts.
So the moral is: don't lock, but take. Don't wait, but look around.
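Here is roughly what "take and leave a note" could look like in Python. This is a single-threaded sketch, not a finished design: the `Cupboard` class and the names in it are hypothetical, and a real concurrent system would need the "take" step to be an atomic compare-and-set.

```python
class Cupboard:
    """Tracks who holds each item instead of making anyone wait for it."""

    def __init__(self):
        self._notes = {}  # item -> name of whoever took it

    def take(self, item, who):
        """Try to take `item` without ever blocking.

        Returns the holder's name: `who` if the item was free and is now
        ours, otherwise whoever left their note first.
        """
        return self._notes.setdefault(item, who)

    def put_back(self, item, who):
        """Remove the note, but only if `who` is the one holding the item."""
        if self._notes.get(item) == who:
            del self._notes[item]


cupboard = Cupboard()
first = cupboard.take("mug", "alice")   # alice gets the mug
second = cupboard.take("mug", "bob")    # bob learns alice has it, no waiting
cupboard.put_back("mug", "alice")
third = cupboard.take("mug", "bob")     # now bob gets it

print(first, second, third)  # alice alice bob
```

The caller that misses out gets an answer right away and can go ask the holder or do something else in the meantime.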
Now, I can't really comment on systems where real-time data is essential. In medicine, for example, real time means the difference between life and death. But those systems are embarrassingly parallelizable.
Remember, our world works on the Eventually Consistent model, not the ironically, fatally named ACID one.
Just look around: the world is already massively parallel. Why create new ideas when Mother Nature has already provided us with a lot of the answers?
