Hello, everyone. Welcome or welcome back to the Postgres Dev Room. We have a great speaker again. Umair is going to talk about clustering in Postgres. Thank you.

Hello and good afternoon, everyone. Thank you for being here and not going out for lunch instead. We're going to be talking about clustering in Postgres, and we're going to walk through, at an abstracted level, the concepts behind why clustering is required, the various architectures that are typically used in production to make your database reliable, and the challenges that are associated with the concept.

A little bit of an introduction about myself. The name is Umair Shahid. I came all the way from Islamabad, Pakistan to talk to you about clustering. I've been working in the Postgres space for more than 20 years, and I am currently running a company by the name of Stormatics, which was founded about a year ago, so I'm working on a startup focused on professional services. My past has been associated with various other Postgres organizations, including EDB, 2ndQuadrant, OpenSCG and Percona. Two of these companies do not exist anymore: OpenSCG was acquired by Amazon and 2ndQuadrant was acquired by EDB. Stormatics is focused on providing professional services for Postgres, but we don't want to talk about that a whole lot.

So, on to the topic. In order to understand why clustering is required, it's important to understand what high availability is and why you need a highly available database. You want your database to remain operational even in the face of failure of hardware, network or your application. You want to minimize the downtime so that the users of your application do not experience any interruption in their experience. And it's absolutely essential for mission-critical applications that need to be running 24/7. If you go back maybe 10-odd years, it was okay to, let's say, have your credit card declined or just not working on a machine because the network was not available, or somehow the connection was broken, or there was some problem with the communication. In this day and age, what you expect to be able to do is just tap your phone and instantaneously get the transaction through. You never expect it to be dropped or an error to pop up, unless, of course, you run out of credit; that's a different story. But it just works. Everything just works. And the only reason everything just works is because the entire infrastructure is highly available. It's always available. That availability is measured in 9s, and I'm going to explain in just a second what those 9s are.

We're going to start with a very basic level, and that's 90% availability. When I say 90%, it sounds like a lot, but that's one 9 of availability. And if you span that over the course of an entire year, that means that the potential of the database to be down, while maintaining 90% availability, is 36.5 days. So in a given year, if your database is 90% available, it can be down for 36.5 days. Anybody find that acceptable? I really hope not. Now, you go to two 9s of availability. That's 99%, and that's a better number; that's an order of magnitude higher in terms of availability, and the downtime goes down from 36 days to 3.65 days. Again, in a given year, if your database is 99% available, it is going to be down for almost 3.5 days, and for most mission-critical applications, that is still not acceptable. 99.9% is three 9s of availability, with an allowance of 8.7 hours of downtime.
You go up to four 9s, 99.99%, and that's 52.6 minutes per year. And five 9s of availability translate into 5.26 minutes per year. So that's just to make sure that we understand what availability is and how it is calculated.

Now, the database runs on the cloud, so you don't care, right? How many people agree with this? Oh, I'm so glad that nobody agrees with that. We've got a bunch of experts over here. Yeah, so just because your database is on the cloud does not mean that it is always available. Here I've just copied and pasted the service level agreement that Amazon has up on their website for highly available RDS clusters that are in a multi-availability-zone configuration. So this is the highest form of availability that you can get with RDS, right? They talk about MySQL, MariaDB, Oracle, and Postgres, and they specifically say that this SLA is only for multi-availability-zone configurations, and that they're going to make commercially reasonable efforts to make this available. So if they're losing money, they're not going to do it. And what they promise is three and a half 9s. Not four, not five: three and a half 9s, 99.95%. And if they're unable to give you those three and a half 9s, what they say is that affected customers will be eligible to receive a service credit, i.e. the service that had gone down on you, you get more of it. Three and a half 9s translate into 4.38 hours of downtime per year. So if you're running an RDS cluster that is spread over multiple availability zones, you can still expect your database to be down for almost four and a half hours every year.

Now, what do you do if you want better availability? That's one of the reasons why you have clustering. And I'm going to run through a few basic architectures of how clustering works with Postgres. This is probably true for databases in general as well, but we are in the Postgres Dev Room, so we are going to be talking about Postgres. In this very simple, basic cluster, you've got one primary database and two standby nodes. The way this cluster is structured is that the internals of the architecture are invisible to the application. The application simply talks to the cluster; it doesn't care which of the nodes it's talking to. It reads and writes to the cluster, and the two standbys are essentially read replicas, so there's redundancy of data in each of these replicas. And in case the primary goes down for whatever reason, whether there's a hardware error, a software error, a network communication error, whatever it is, one of the standbys can take over and the whole cluster will just continue working.

And diagrammatically, just so that we're able to visualize the whole sequence, this is the sequence of events. If the primary goes down, step one, standby one (well, it could be standby two as well, but just for illustration purposes) takes over as the primary. The previous primary is retired from the cluster. A new replication path is set up from standby one to standby two. Standby one is labeled as the primary, standby two becomes the replica of this new primary, and a new node is spun up, or the old primary is recovered and included as part of the cluster. So this is, at a very abstracted and high level, diagrammatically, how an auto-failover procedure works. Now, there are other forms or variations of clusters that you can set up with Postgres, with various different intents.
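For the failover to stay invisible, something has to know at any given moment which node is actually the primary. A minimal sketch of that check from the client or monitoring side, assuming psycopg2 and hypothetical node addresses and credentials; pg_is_in_recovery() returns true on a standby and false on a primary:

```python
# Minimal sketch: poll each node and report which one is currently the primary.
# Hostnames and the "monitor" user are hypothetical.
import psycopg2

NODES = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # hypothetical node addresses

def find_primary(nodes):
    for host in nodes:
        try:
            conn = psycopg2.connect(host=host, dbname="postgres",
                                    user="monitor", connect_timeout=3)
            with conn, conn.cursor() as cur:
                cur.execute("SELECT pg_is_in_recovery();")
                in_recovery = cur.fetchone()[0]
            conn.close()
            if not in_recovery:
                return host  # this node is accepting writes
        except psycopg2.OperationalError:
            continue  # node unreachable; try the next one
    return None

print("Current primary:", find_primary(NODES))
```

In a real deployment this kind of check usually lives inside the cluster manager or a proxy in front of the cluster, not in the application itself.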
This illustration shows a cluster that has load balancing, and what load balancing does is focus the write operations on one node, which is the primary. Those writes, that data, are replicated over to the two standbys, and the application can actually read from one or both of the standbys. The idea over here is to not let the primary node get hogged by read operations. It can focus on the writes, and the reads can be served from the two standbys. This is load balancing. And again, the auto-failover, etc., goes on as previously discussed.

Same cluster, now with a backup process in place for disaster recovery purposes. Notice that the backup is taken outside of the cluster. It's an off-site backup, and for good reason: if the entire cluster goes down for some reason, if there's an earthquake, there's a fire, a whole data center goes down, you don't want your backups to also go down with it. So the backup is taken at a different location. These backups are taken with requirements in mind that mostly fall under two concepts, RTO and RPO. These stand for recovery time objective and recovery point objective. Anybody over here who is hearing these two terms for the very first time, RTO and RPO? Okay, okay, so I'll take a moment to explain this a little bit. Recovery time objective means that in case your cluster crashes or goes down for whatever reason, how much time is acceptable for you to be able to recover the entire database? Depending on the criticality of your application and the criticality of your cluster, that time could be very, very small, or you could allow a few hours, a few minutes, maybe a couple of days for the recovery. Recovery point objective is how much data you can afford to lose in case the cluster crashes, i.e. from what point is it acceptable to be able to recover the cluster? Now, again, for critical clusters, the RPO might be very, very close to zero, i.e. you don't want to lose any data, but of course there are implications to that. There are efficiency, storage space, and financial implications to trying to achieve both an RTO and an RPO that are close to zero. So you keep that in mind as you design your architecture and your disaster recovery strategy. Point-in-time recovery is something that is aligned with that; it's about what point you can recover your database from. You want to go back in time and recover your database to that point in time, and that's the PITR concept. Also, it's a footnote over here, but just a piece of advice: it is extremely important to make sure that you're periodically testing your backups, because if the restore does not work, the backup is absolutely useless. And you will only discover that in the case of a disaster, and then, well, that's a double disaster.

Another form of cluster that you can have is a multi-node cluster with an active-active configuration. In the previous configuration, we had a single active node with two standbys. In this configuration, you've got multiple actives, where your application can both read and write on all of the nodes in the cluster. Now, this is a little tricky, and the topic also tends to be a little thorny when you're discussing this with enthusiasts. And the key point over here is that you have to have your conflict resolution at the application level.
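Before going further into conflict resolution, here is a minimal sketch of the read/write split that the load-balanced cluster above implies at the application level. The hostnames and pooling strategy are hypothetical, and in practice you would usually let a middleware such as pgpool or a proxy do this routing rather than hand-rolling it:

```python
# Minimal sketch of application-level read/write splitting.
# Hostnames, database name, and user are hypothetical.
import itertools
import psycopg2

PRIMARY = "pg-primary.example.internal"
REPLICAS = itertools.cycle(["pg-replica1.example.internal",
                            "pg-replica2.example.internal"])

def connect(host):
    return psycopg2.connect(host=host, dbname="appdb", user="app")

def run_write(sql, params=None):
    # Writes always go to the primary.
    conn = connect(PRIMARY)
    try:
        with conn, conn.cursor() as cur:  # commits on success, rolls back on error
            cur.execute(sql, params)
    finally:
        conn.close()

def run_read(sql, params=None):
    # Reads are spread round-robin across the standbys.
    conn = connect(next(REPLICAS))
    try:
        with conn, conn.cursor() as cur:
            cur.execute(sql, params)
            return cur.fetchall()
    finally:
        conn.close()
```

Keep in mind the caveat discussed later: reads served from a standby can lag slightly behind the primary.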
The database, at least the way open source Postgres works, does not have the capability to resolve those conflicts for you. So in case the application writes on active one and does an update of the same data on active two, there is a conflict as active one and active two try to replicate that data to each other, and the database will not be able to resolve that conflict. It's the application that needs to be active-active aware. This is asynchronous replication between nodes, and this architecture is shared-everything, which means that all of the data is written to all of the nodes; the data is replicated to all of the nodes.

And then you've got another kind of cluster, which is multi-node with data sharding and horizontal scaling. This architecture is shared-nothing, which means that no data is shared between the nodes across which the data is being sharded. So data is distributed, and you can scale this cluster out horizontally, well, theoretically at least, as much as you want to. There is a requirement of having a coordinator node up there, which decides which node to route the query to and which node to route the data to. You can set up automatic sharding, and you can also have the read and write operations automatically directed to the relevant nodes.

And then the last of the architectures that I'm going to discuss in this conversation is globally distributed clusters. Now, theoretically speaking, the last two clusters that I described, with the active-active configuration and the sharding, could be globally distributed as well. But I have a separate slide for this, primarily because of one reason, and that is the specific requirements that different regulations can have about geofencing of your data. Many jurisdictions of the world are increasingly enforcing that their residents' data does not leave the country that they reside in. And you want to make sure that you've got local data being stored locally and read locally. With geographically distributed clusters and with the right configurations in place, you can implement that geofencing. That, of course, also has the side effect of better performance, because you're reading and writing locally instead of somewhere that's 10,000 miles away.

Now, talking about replication, it primarily divides into two technologies, synchronous and asynchronous replication, and I was just trying to explain over here a little bit about the differences between the two. Anyone over here who has not come across the concepts of synchronous and asynchronous replication and has no idea what these two terms mean? Everybody already knows; I could have just skipped this slide. That's fine. So very quickly, walking through some of these points: in synchronous replication, data is transferred immediately, and it is not committed until all nodes in the cluster acknowledge that they have the data and they can commit it. In the case of asynchronous replication, the primary does not wait for that acknowledgement, that handshake. It will just commit the data locally and will assume that the replicas will commit that data in due time. What that means is that with synchronous replication there is a performance hit, because you need to wait for all of the nodes to agree that the data has been committed. With asynchronous replication, you achieve much better efficiency, but there is also that chance of data inconsistency.
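One concrete knob worth knowing about here: once synchronous replication is configured on the primary (via synchronous_standby_names), Postgres lets you choose per transaction how long a commit waits. A minimal sketch, assuming psycopg2, PostgreSQL 9.6 or later for remote_apply, and hypothetical connection details and table names:

```python
# Minimal sketch: choosing durability vs. latency per transaction.
# Assumes synchronous_standby_names is already set on the primary.
# Hostname, user, and the payments/page_views tables are hypothetical.
import psycopg2

conn = psycopg2.connect(host="pg-primary.example.internal",
                        dbname="appdb", user="app")

with conn, conn.cursor() as cur:
    # Critical write: wait until the synchronous standby has applied it.
    cur.execute("SET LOCAL synchronous_commit = remote_apply;")
    cur.execute("INSERT INTO payments (amount) VALUES (%s);", (100,))

with conn, conn.cursor() as cur:
    # Low-value write: don't wait for the standby (or even the local WAL flush).
    cur.execute("SET LOCAL synchronous_commit = off;")
    cur.execute("INSERT INTO page_views (url) VALUES (%s);", ("/home",))

conn.close()
```

SET LOCAL only lasts for the transaction it runs in, so critical and non-critical writes can coexist on the same connection.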
Asynchronous replication is faster, it's more scalable, but there is that little bit of a data inconsistency problem. So in case it is absolutely critical for your application to have all of the data consistent on all nodes of the cluster, synchronous replication is the way to go, and you will need to take that performance hit. Any questions so far before we move on? Yes?

In asynchronous replication, you are saying data may be inconsistent. Does that mean that some data may be lost on one of the replicas, so if you have to recover your data, some of it may be missing? And how do you find that out in that case?

Okay, so the question is: in the case of asynchronous replication, if I say over here that the data may be inconsistent, does that mean the data gets lost, and if it does get lost, how do we recover it? That's the question, right? Okay, thank you, very good. So the idea over here is that as the data is being shipped from the primary to the replica, there is a certain time lag. It could be in microseconds, but there is a certain time lag where the data exists on the primary and does not yet exist on the replica. And in that fraction of a second, if there is a query that runs across both of those nodes, it will return different datasets. That is a risk that you take. Now, in case the primary goes down during that lag, during that time, there are chances that the replica will never get that data, and hence that data can be considered lost. There are different ways to protect yourself against that kind of an eventuality. That includes being able to replay the write-ahead logs. That includes making sure that any data that is written is actually sent across, so even if the primary node goes down or crashes, the data is still in transit and the standby is going to eventually commit it. But yes, there is a slight risk there of data loss.

Even if you don't have that kind of disaster happening in the meantime, is there still a possibility that, because, as you're saying, the commits are not waited on, something goes wrong and one commit just goes missing? If nothing goes wrong, is consistency guaranteed?

So at the database level, because Postgres is ACID compliant, it is going to be consistent. Within the cluster, however, there is a lag. We're going to discuss the replication lag in just a little bit; it's one of the challenges in setting up clusters like this. But you're right. When we talked about the load-balanced cluster just a couple of slides ago, one of the things to keep in mind when you have a load-balanced cluster and you're reading from the replica instead of the primary node is the fact that there is a lag between the primary and the replica, and when you are reading data from the replica, there's a possibility that some of the data has not yet been written there. Does that help? Yes.

What's the maximum network latency? I'm sorry, I can barely hear you. What's the maximum network latency to build up the cluster, for synchronous and asynchronous?

So the question is: what's the maximum network latency that you can have and still build up the replica? I think that's a fairly open-ended question, and I'm afraid I may not be able to give you a very precise answer. There's a lot of variables involved in designing that kind of an architecture.
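Since replication lag keeps coming up: from the primary's side, Postgres exposes it through the pg_stat_replication view. A minimal monitoring sketch, assuming psycopg2, hypothetical connection details, and PostgreSQL 10 or later for the lag columns:

```python
# Minimal sketch: report per-standby replication lag as seen from the primary.
# Hostname and user are hypothetical; pg_stat_replication is a standard view.
import psycopg2

conn = psycopg2.connect(host="pg-primary.example.internal",
                        dbname="postgres", user="monitor")
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT application_name, client_addr, state, sync_state,
               write_lag, flush_lag, replay_lag
        FROM pg_stat_replication;
    """)
    for row in cur.fetchall():
        print(row)
conn.close()
```

Feeding these numbers into your monitoring and alerting is one practical way to know how stale a read from a standby can be.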
Network latency, and again this is something that we're going to be discussing in a moment, depends on a lot of factors, and not all of those factors are directly related to your database. It is related to your hardware, it is related to the network connectivity, it is related to the medium of the connections that you've established between the two nodes, and how far apart the two nodes are spread. So there's a lot of variables involved. As you design the cluster, you need to recognize those variables, design the cluster based on what you have, and make allowances for some of that lag and some of those nuances of the network that you have. All right. Okay, let's move forward. Actually, this slide has absolutely nothing to do with my presentation. I just put it up there because, well, I don't want it to be too dry, right?

Okay, so now we come to the part about the challenges that you face in clustering, as you set up clusters of Postgres, and there are four. This is in no way a comprehensive list of challenges, and as we go into each of these challenges, I will also not be able to cover all aspects of these four points, but this is just to give you an overview of the kinds of variables and the kinds of points that a DBA would typically need to keep in mind as they go about designing a cluster and making sure that it is highly available.

The first point that we're going to discuss is split brain. Now, anybody over here who, and I'm going to ask that question in a different way, has never heard of split brain and does not know what that is? Okay, a few hands went up. Good. So the next few slides are not wasted. Okay, so what is a split brain? It's a situation where two or more nodes in a cluster start to think that they are the primary. There can be different reasons for that, but for whatever reason, if two or more nodes start to think that they are the primary, it will lead to a situation that can cause data inconsistency, and inconsistency that can cause data loss. That scenario is called split brain. It could be caused by connectivity problems. It could be caused by latency. It could be caused by a server locking up because of, I don't know, a long-running query. There are many different things that could cause it, but whatever causes it, it's a difficult situation to be in and it's a difficult situation to resolve.

Now, a few ways to prevent a split-brain scenario. The first one is to use a reliable cluster manager. Doing it manually, writing scripts, et cetera, will still leave a few holes that can cause the problem to recur. There are cluster managers, there are tools out there that can help you; we'll talk about them a little later in this presentation as well. What they do is implement algorithms and heartbeat mechanisms to monitor and automate the whole process of cluster management. And because these tools are designed to make the decisions for auto-failover, they will help you prevent a split-brain situation. Another thing to keep in mind is to do what's called quorum-based decision making. Essentially, what that means is that a majority of the nodes need to agree on which node is the primary.
This also means that there's a requirement for an odd number of nodes: a cluster should be made of an odd number of nodes instead of an even number, because if you want to rely on some voting, some quorum-based process, you need to have an odd number of nodes that can vote. Let's say in a particular case you've got a primary that is operating as per what it thinks is normal, and one of the standbys loses contact with that primary and begins to think that it needs to take over as the primary. Now you've got the original primary and one other node both acting as primary. You need to have a tiebreaker in place that will say, hey, standby one, you're wrong. The primary is still working. You just lost connection with it. So you need to stand down. That's what quorum-based decision making is.

Now, in some cases, and this is something that we sometimes work through with our customers, there are requirements from the customer that say, well, we can only have two nodes, we cannot have more than that, or we can only have an even number of nodes, not an odd number, for whatever reason. In that case, we implement a witness node, which does not hold data but can be a voter in the quorum process, in order to act as a tiebreaker. That's what the witness server does.

And, in order to prevent a split-brain scenario, you want to make sure that your network is reliable and that you have redundancy in the network, so if one path goes down for whatever reason, the traffic can take a different path. You want to minimize the risk of partitions in the network, and you want to make sure that you've got reliable connectivity between data centers if your nodes happen to be split across data centers. And then there are a few miscellaneous housekeeping items: make sure that you've got a good monitoring and alerting mechanism in place, so in case your cluster is approaching a situation where the resources are running out, or the network is getting congested, or the CPU is being maxed out, or whatever, you get alerted in time so that you can act and take preventive measures. Regularly test your cluster; you can simulate situations where connectivity is lost to see how your cluster behaves in that case. And you need to have very precise and clear documentation, because if, let's say, I'm the one who is implementing this cluster and I take a few decisions as to what thresholds to set and what configurations to program into my cluster, a person coming in two or three years later may not know what the decision making was and why it was done a certain way. You want to make sure that you have very clear and precise documentation, coupled with training for the new people who come on board and help maintain and manage your cluster.

Now, in case a split brain does occur, what are the recommended best practices to recover from it? You get into a situation where two nodes now think that they are the primary, they're ready to take data in, and they want to establish themselves as the publishers of the data and expect standby nodes to become the subscribers. What do you do? The first thing, of course, is to actually identify that that has happened. You won't be able to do anything if you don't know that the split brain has occurred.
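A minimal sketch of one way to spot that situation from the outside: ask every node whether it believes it is a primary, and raise an alarm if more than one says yes. The node addresses and monitoring user are hypothetical; pg_is_in_recovery() is a standard Postgres function:

```python
# Minimal sketch: flag a potential split brain by counting nodes that
# claim to be a primary. Hostnames and credentials are hypothetical.
import psycopg2

NODES = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # hypothetical node addresses

claimed_primaries = []
for host in NODES:
    try:
        conn = psycopg2.connect(host=host, dbname="postgres",
                                user="monitor", connect_timeout=3)
        with conn, conn.cursor() as cur:
            cur.execute("SELECT NOT pg_is_in_recovery();")
            if cur.fetchone()[0]:
                claimed_primaries.append(host)
        conn.close()
    except psycopg2.OperationalError:
        pass  # an unreachable node can't claim anything

if len(claimed_primaries) > 1:
    print("ALERT: possible split brain, multiple primaries:", claimed_primaries)
```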
So in order to identify that kind of a situation, again, monitoring and alerting are crucial elements. You need to have a good monitoring plan in place. Then stop all traffic from your application and stop all replication between the nodes. This will mean that your application goes down, but your application stopping is a lot better than your application feeding in or reading the wrong data. So just stop the application. Now, this part is all manual. I am not aware of a tool that will do this in an entirely automated fashion; this is something that a DBA, an expert, will need to do. Determine which node is the most up to date. Two nodes are competing to be the primary, and it's now you who decides which one is the actual primary. Or maybe you're unable to decide that, because some transactions got committed on one primary and some transactions got committed on the second primary. What do you do then? You want to replay the transactions that are missing and make one primary the de facto leader of the cluster. You want to make sure that the nodes are isolated from each other until you've rectified the situation. And then you reapply, either through backups or through the write-ahead logs, the transactions that are missing on the primary that you've decided on, and then you reconfigure replication. So let's say you decide that the standby that took over actually has more transactions, so you make it the new primary. Now you need to reconfigure replication such that the other nodes are actually taking data, replicating data, from this new primary. You had a question? Yes.

You did mention the write-ahead log twice already. I think it would be helpful if you could also decipher why it is called the write-ahead log and what it is.

Okay, thank you for asking that question. I was running under the assumption that it's something everybody would know, so thank you. The question is, I referred to write-ahead logs, and what are they? The way Postgres works is that every transaction that is written to the database goes into what are called WAL buffers; WAL stands for write-ahead log. Transactions go into those buffers, and those buffers are then written to the WAL files on disk. It's those logs that record the transactions being committed to the database, and as incremental transactions come in, the write-ahead logs keep track of them. And it's those logs that are used for replication: they are transferred to the replica, to the standby, and they are replayed on the replica in order to get the replica into the same state as the primary. So these are files on disk that contain all of the transactional data that the database is handling. Does that help? Yeah. Thank you for pointing it out.

Now, once you've confirmed the integrity of your cluster is when you can start re-enabling the traffic coming into the cluster. But before you allow traffic in, it might be a good idea to run that cluster in read-only mode for a bit, so that you can cross-check and double-check and re-verify that everything is working to your expectation before you allow write operations. And then make sure that you run a retrospective, because a split-brain scenario is scary. It's difficult to recover from.
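On the "determine which node is the most up to date" step, a rough first check is to compare WAL positions on the two nodes that both claim to be primary. A minimal sketch, assuming psycopg2, hypothetical hostnames, and PostgreSQL 10 or later for the function names; this does not replace careful inspection of transactions that may exist on only one side:

```python
# Minimal sketch: compare WAL positions of two nodes that both claim to be
# primary, after application traffic and replication have been stopped.
# Hostnames and user are hypothetical.
import psycopg2

def current_lsn(host):
    conn = psycopg2.connect(host=host, dbname="postgres", user="monitor")
    with conn, conn.cursor() as cur:
        cur.execute("SELECT pg_current_wal_lsn();")
        lsn = cur.fetchone()[0]
    conn.close()
    return lsn

lsn_a = current_lsn("node-a.example.internal")
lsn_b = current_lsn("node-b.example.internal")

conn = psycopg2.connect(host="node-a.example.internal",
                        dbname="postgres", user="monitor")
with conn, conn.cursor() as cur:
    # A positive result means node A's WAL position is ahead of node B's.
    cur.execute("SELECT pg_wal_lsn_diff(%s, %s);", (lsn_a, lsn_b))
    print("A ahead of B by (bytes):", cur.fetchone()[0])
conn.close()
```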
You don't want a split brain happening every other day, right? Yes?

Don't you just have a fencing mechanism that shuts down the second primary, and then have it fail over?

I'm not sure I understand what... so you're referring to shooting the node in the head, right? I think... I'm not sure if I can shoot the node in the head. No, I think it's... oh, the offending node in the head. Yes, that's what it is. So, yeah, there is a mechanism. I haven't talked about it in these slides, but in case there's an offending node that you can't really rectify, you shoot it in the head, right? You just kill it and then you rebuild a new standby. So, yeah, that's what you're referring to, right? Or is it something else?

Why would you need this complicated rectification if you could just immediately stop the offending node and then fail over?

Because before you do this, you don't know which of the primaries is actually farther along in the write-ahead logs, or whether there are transactions that are in one and not in the other, right? So you want to establish that fact first and then recover from there. This is all in order to make sure that you don't lose transactions. Right? Okay.

So, yeah, running a retrospective is extremely important. Make sure that it doesn't happen again. We're going to go through some of the other challenges. I think split brain is the most important one, but the other ones are kind of like variations that can cause split brain. Network latency is one of them, and a question about it was asked a little while earlier. What network latency means is the time delay between when data starts off from one location and when it reaches the destination. Any delay that it encounters going from one place to the other is called latency. The challenge it causes is that delayed replication could possibly cause data loss because, as we discussed, in case of disaster the primary is going to shut down and there is possible data loss in there. And also, more lag, more latency, can lead one of the standbys to believe that the primary has gone down, and that can trigger a false failover.

Causes of latency: the network could be getting choked. Low-quality network hardware, and hardware is easy to get wrong, especially when the good hardware is costly. The distance between the two nodes: at best, data travels at the speed of light, and it takes a finite amount of time to go from one place to the other, so the longer the distance between the two nodes, the longer it takes for data to replicate from one to the other. If you have a virtualization setup, it can cause overheads. There can be bandwidth limitations. Security policies can force inspection of all of the data packets, causing further delay. And the transmission medium will also cause some latency; for example, fiber optics are going to be much faster than something based on copper. That's plain physics.

And there are ways that you can prevent false positives resulting from latency. You want to make sure that all of the monitoring and alerting mechanisms that you set up during the design of your cluster are fine-tuned: you adjust the heartbeat, you adjust the timeout settings, and you make sure that your cluster does not read latency as a trigger for failover. Some of the best practices include making sure that you're testing your cluster periodically.
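To make the heartbeat and timeout tuning point concrete, here is a minimal conceptual sketch: do not treat a single missed heartbeat as a dead primary; require several consecutive misses with a generous timeout so that a latency spike alone does not trigger failover. The thresholds, hostname and user are hypothetical, and real cluster managers implement far more robust versions of this:

```python
# Minimal sketch: a heartbeat loop that tolerates transient latency.
# Thresholds and connection details are hypothetical.
import time
import psycopg2

PRIMARY = "pg-primary.example.internal"
CONNECT_TIMEOUT = 5     # seconds allowed per attempt
FAILURES_REQUIRED = 3   # consecutive misses before raising the alarm
INTERVAL = 10           # seconds between heartbeats

def heartbeat_ok():
    try:
        conn = psycopg2.connect(host=PRIMARY, dbname="postgres",
                                user="monitor", connect_timeout=CONNECT_TIMEOUT)
        with conn, conn.cursor() as cur:
            cur.execute("SELECT 1;")
        conn.close()
        return True
    except psycopg2.OperationalError:
        return False

misses = 0
while True:
    if heartbeat_ok():
        misses = 0
    else:
        misses += 1
        if misses >= FAILURES_REQUIRED:
            print("Primary unreachable; escalate or consider failover")
            break
    time.sleep(INTERVAL)
```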
When you test, there are different workloads that you would want to run on your cluster to simulate different environments. You want to know what kind of time pressures your cluster is going to encounter with different kinds of workloads applied to it, and you want to configure and tune your timeouts and heartbeat accordingly. And of course, documentation and training are ever important.

The third challenge is false alarms. We talked about network latency as one of the things that can cause a false alarm. A false alarm essentially means that an issue is reported when an issue does not actually exist. And again, when an issue is reported, it can trigger a failover when a failover is not really needed. A failover is an expensive operation; you don't want to do it needlessly. It impacts performance. As for the causes of false alarms: network issues, of course, are there. The configuration and the way your cluster has been set up can cause false alarms if your thresholds are too low. You might want your failover to happen instantaneously the moment the cluster detects that the primary has gone down, but the primary might not have gone down; it might just be running a long-running query and be unresponsive. So you want to make sure that your configurations are correct. Resource constraints: the load is too high, the network traffic is too high, the CPU is maxed out. Or somebody had planned scheduled maintenance and not told you; something as simple as that could cause a false alarm where you think, well, okay, the network has gone down, we need to do something about it. You don't want to do that. And some long-running queries can create exclusive locks in the database, which can make the database appear to be non-responsive. Automated systems will not double- and triple-check by going into the locks and the stat tables to figure out which queries are running and whether the database is locked or simply unresponsive, and so they can raise a false alarm. Prevention techniques include making sure that your thresholds are optimized; testing and running simulations is the way to go in order to optimize those thresholds. You also want to make sure that your software and all components that are part of the cluster are up to date. You want the latest versions of your software, and you want them to be bug-free. And yeah, monitoring and alerting, comprehensive strategies, best practices, documentation, training your staff.

The last of the challenges to be discussed is data inconsistency. Now, even though we call it data inconsistency, it doesn't happen within the database, because, as we discussed, Postgres is ACID compliant, so the database itself will not be inconsistent. But within a cluster there is a chance of inconsistency if the nodes are not in sync with each other. And the challenge is, well, if you run the same query across different nodes of the cluster, there's a possibility that you get different results. You don't want that. As for the causes, one of them is replication lag; we've been talking about this over and over. In case data is written to the primary and is yet to be written to the replica and is being delayed for whatever reason, you will get inconsistent data between the two nodes. Network latency and high workloads can be a cause. And this can lead to loss of data in case a failover is triggered during that time. That's one of the risks with this.
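On that point about long-running queries and locks making a node merely look dead, here is a minimal sketch of the kind of double-check a human, or a smarter script, can run before concluding the node is down. pg_stat_activity and pg_locks are standard Postgres views; the connection details and the five-minute threshold are hypothetical. The remaining causes of inconsistency continue right after this:

```python
# Minimal sketch: check whether a "non-responsive" node is actually busy or
# blocked before treating it as dead. Hostname and user are hypothetical.
import psycopg2

conn = psycopg2.connect(host="pg-primary.example.internal",
                        dbname="postgres", user="monitor", connect_timeout=5)
with conn, conn.cursor() as cur:
    # Long-running statements (over 5 minutes) that might explain the stall.
    cur.execute("""
        SELECT pid, state, now() - query_start AS runtime, left(query, 60)
        FROM pg_stat_activity
        WHERE state <> 'idle' AND now() - query_start > interval '5 minutes'
        ORDER BY runtime DESC;
    """)
    print("Long-running queries:", cur.fetchall())

    # Sessions waiting on locks held by other sessions.
    cur.execute("SELECT count(*) FROM pg_locks WHERE NOT granted;")
    print("Ungranted lock requests:", cur.fetchone()[0])
conn.close()
```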
Split brain can cause data inconsistency as well, because if two nodes think they are the primary, they are both going to try to take writes, they're going to establish themselves as publishers of the data, and they're going to end up with different pieces of data. You don't want that to happen either. And any configuration that is not optimized for the functioning of your cluster, an incorrect configuration, can cause inconsistency of data.

How do you prevent it? You manage your asynchronous replication very closely. And notice that I did not say synchronous replication over here; I did not say just use synchronous replication, primarily because it has a huge impact on performance. To the extent possible, our advice typically is to avoid synchronous replication. Not only does it have an impact on performance, one of the downsides is that in case the primary is working and the replica goes down for whatever reason, the primary is going to keep waiting for an acknowledgement from the replica, and the replica has essentially taken the entire cluster down with it. So there are quite a few challenges involved with synchronous replication. Regularly check transaction IDs across the cluster. Monitor replication conflicts; there are statistics tables and views available within Postgres that allow you to monitor replication, so you can watch them, detect those conflicts and resolve them promptly. And make sure that you have regular maintenance done on your database. Vacuum: we had a talk just a little while back about why tables get bloated, why dead tuples are there, and why vacuum is needed in order to remove those dead tuples. You also want to make sure that ANALYZE runs frequently on your tables so that query planning stays optimized, and you want to prevent transaction ID wraparound, which is probably a whole talk in itself; we won't go into that during this conversation.

And yes, this all sounds really, really hard. It is next to impossible for a single human being to think about all of these variables, correctly configure clusters, and be mindful of everything involved over here, which is why we've got tooling around it. It does not automate the entire thing, but it takes care of the critical aspects of your cluster. I mention three tools over here; there are other tools available as well. All three are open source with reasonable licenses for usage. repmgr, at the top, is licensed under the GPL. It provides automatic failover and it can manage and monitor the replication for you. Pgpool has a license that's very similar to BSD and MIT, which means it's a very liberal license. It acts as a middleware between Postgres and client applications, and it provides functionality much beyond simply clustering, so it will give you connection pooling and load balancing and caching as well, along with automatic failover. Patroni is a name that just keeps coming up; it's wildly popular for setting up clusters with Postgres. The license is MIT, and it provides a template for highly available Postgres clusters, with the smallest cluster being three nodes. It can help you with cluster management, auto-failover and configuration management.

And that brings us to the end of our presentation. Two minutes to go. That's the QR code for my LinkedIn. Thank you. Thank you.

We actually have a question. The gentleman earlier alluded to network fencing and Kubernetes. You'll have to be louder.
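As a small illustration of the tooling point above: if you run Patroni, each node exposes a REST API (port 8008 by default) that load balancers and scripts can use to find the current leader. A minimal sketch using the requests library; the hostnames are hypothetical, and the endpoint behavior described here is as documented for Patroni (GET /primary returns HTTP 200 only on the leader):

```python
# Minimal sketch: ask each Patroni-managed node for its role via the REST API.
# Hostnames are hypothetical; 8008 is Patroni's default REST API port.
import requests

NODES = ["pg-node1.example.internal", "pg-node2.example.internal",
         "pg-node3.example.internal"]

for host in NODES:
    try:
        r = requests.get(f"http://{host}:8008/primary", timeout=3)
        role = "primary" if r.status_code == 200 else "replica (or not leader)"
    except requests.RequestException:
        role = "unreachable"
    print(f"{host}: {role}")
```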
The gentleman earlier referred to network fencing and shoot-the-node, which is only possible because of PVCs, right? Like persistent volumes; we're assuming Kubernetes, right? But the kicker is that more often than not, the volumes themselves, the PVCs, are the cause of those transient issues. What if we don't want to use persistent volumes? What if we want to use ephemeral NVMe? Is it currently possible with Postgres to manage a cluster without using persistent storage and defaulting to shooting the node?

So the thing is that when you're working with... they might have gone off, but let me try and answer you loudly over here. The thing is that when you're working with databases, you want persistent storage, right? A Kubernetes kind of cluster is designed for stateless applications, at least from the ground up, but for databases, you want persistent storage. In cases where you're working with a scenario that just does not use persistent storage at all, those are cases where I don't have expertise, so I won't be able to definitively tell you how to go about handling it. So Stormatics, those are like EC2, I imagine.