This morning, while waiting for Kathryn to arrive at the office, I decided to tackle the initial diagrams for both of the replication alerts. This made me realize I know very little about replication. I started my hunt for information by looking through the replication alerts for the last 60 days, to see what comments and resolutions the DBAs had made. Rather than answering any questions, this task actually raised more: What are trace tokens and how do they work? What does NFAR mean? How does replication actually work? And so began the googling. (On a side note, I also stumbled across job deadlock, so I made a note to look into what it is and how it can be resolved.)
The first question I decided to answer was simple: NFAR means No Further Action Required. I feel like I should have been able to work this out without Google, but there you have it, pretty self-explanatory.
I then got on to trace tokens. So, what exactly are they? I found this clear, succinct answer in another blog post:
Microsoft introduced the Trace Tokens feature in SQL Server 2005 to measure SQL Server transactional replication latency. A trace token is nothing but a small amount of data which is written to the transaction log of the publication database and marked for replication. Hence, trace tokens can help one measure the latency from Publisher to Distributor and from Distributor to Subscriber, thereby measuring the total latency from Publisher to Subscriber. However, it is expected that the Publisher, Distributor and Subscriber servers are up and running and able to connect to each other.
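To make this concrete, here is roughly what posting and checking a trace token looks like in T-SQL. This is a sketch based on my reading; the publication name 'TestPub' is made up:

```sql
-- Run at the publisher, in the publication database.
-- Assumes an existing transactional publication called 'TestPub'.
DECLARE @token_id int;
EXEC sys.sp_posttracetoken
    @publication = N'TestPub',
    @tracer_token_id = @token_id OUTPUT;

-- List the tokens posted for this publication...
EXEC sys.sp_helptracetokens @publication = N'TestPub';

-- ...and show the latency measured for one token: publisher-to-distributor
-- latency, distributor-to-subscriber latency, and overall latency.
EXEC sys.sp_helptracetokenhistory
    @publication = N'TestPub',
    @tracer_id = @token_id;
```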
Again, this just highlighted my scarce knowledge of replication and how it actually works. So upon further investigation, I uncovered the following:
What is replication
The word replication comes from the Latin word replicare which means to repeat. Replication describes the process of reproducing or duplicating.
Replication in SQL Server does exactly that; it reproduces or duplicates data. Any time you need to create a copy of your data, or to reproduce a change to that data, replication can be used. That copy can be created in the same database or at a remote location on a separate server. The copy can be continuously kept in sync with the source data, or synchronized at scheduled intervals.
SQL Server replication consists of three components: The Publisher, the Distributor and the Subscriber. These components act on articles that are defined within publications and subscriptions.
For each SQL Server object that should be replicated, a replication article needs to be defined. Each article corresponds to a single SQL Server object, or a subset of an object. The objects that are replicated most often are tables, views and stored procedures. For a complete list of objects that can be replicated, check out Publishing Data and Database Objects in Books Online. The properties of an article determine whether the article contains the entire object, or whether a filtered subset of the object makes up the replicated article. With some restrictions, multiple articles can be created on a single object.
A group of articles that logically belong together can be combined into a publication. The publication has options defined that apply to all the articles in that publication. The main option defined by a publication is the type of replication that is to be used.
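Putting those two concepts together, creating a publication and adding an article to it can be scripted roughly like this. The publication and table names here are invented for illustration:

```sql
-- Run at the publisher, in the database being published.
-- 'SalesPub' and dbo.Orders are made-up names.
EXEC sp_addpublication
    @publication = N'SalesPub',
    @repl_freq   = N'continuous',  -- continuous = transactional replication
    @status      = N'active';

-- Add one article to the publication; each article maps to one object.
EXEC sp_addarticle
    @publication   = N'SalesPub',
    @article       = N'Orders',
    @source_owner  = N'dbo',
    @source_object = N'Orders';
```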
The SQL Server instance that makes a publication available for replication is called the publisher. The publisher monitors all articles for changes, and makes information about those changes available to the distributor.
The distributor is the SQL Server instance that keeps track of all subscribers and all published changes, and makes sure that each subscriber gets notified of each change. Most of the changes are tracked in a distribution database. The distributor can be a separate SQL Server instance, but often the distribution service runs on the same machine as the publisher.
The subscriber is the SQL Server instance that receives all the published information through subscriptions.
A subscription is the counterpart of the publication. A subscription defines which server (subscriber) is to receive the updates published in a publication. Each subscription creates a link between one publication and one subscriber. There are two types of subscriptions: push subscriptions and pull subscriptions. In a push subscription, the distributor directly updates the data in the subscriber database. In a pull subscription, the subscriber asks the distributor regularly if any new changes are available, and then updates the data itself.
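As a sketch, both subscription types are created with the same stored procedure and differ mainly in the @subscription_type parameter. Server and database names here are made up:

```sql
-- Run at the publisher.
-- Push subscription: the distribution agent runs at the distributor
-- and pushes changes out to the subscriber.
EXEC sp_addsubscription
    @publication       = N'SalesPub',
    @subscriber        = N'SUBSRV01',
    @destination_db    = N'SalesCopy',
    @subscription_type = N'push';

-- Pull subscription: the distribution agent runs at the subscriber,
-- which regularly asks the distributor whether new changes are available.
EXEC sp_addsubscription
    @publication       = N'SalesPub',
    @subscriber        = N'SUBSRV02',
    @destination_db    = N'SalesCopy',
    @subscription_type = N'pull';
```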
There are three main types of replication available in SQL Server. They are snapshot replication, merge replication and transactional replication.
- Snapshot Replication
Snapshot replication creates a complete copy of the replicated objects and their data each time it runs. It uses SQL Server’s BCP utility to write the contents of each table into the snapshot folder. The snapshot folder is a shared folder location that has to be set up on the distributor when enabling replication. Each participant in a replication setup needs to have access to the snapshot folder.
Every time snapshot replication is run, everything is recopied from scratch, so it has high bandwidth and storage requirements. All other types of replication use – by default – a single replication snapshot to sync up all subscribers with the distributor only during the initial setup.
- Transactional Replication
Transactional replication works, as the name suggests, on a transaction basis. Every committed transaction gets scanned for changes applied to replication articles. Scanning of the changes is done by a log reader agent, which reads through the transaction log of the publisher database. If there are changes affecting a published object, those changes get logged on the distributor in the distribution database. From there they make their way to the subscribers.
Transactional replication allows for close to real time synchronization and leaves only a small footprint on the publisher. While there are several options to allow for bidirectional data movement, transactional replication was originally designed to work one way only.
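On the monitoring side, one quick way I found to see transactional replication throughput and latency from the publisher is sp_replcounters:

```sql
-- Run at the publisher. Returns one row per published database with
-- the number of replicated transactions still pending in the log,
-- the replication rate (transactions/sec) and the latency in seconds.
EXEC sp_replcounters;
```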
- Merge Replication
Merge replication was designed from the beginning to allow for changes to the data to be made on the publisher as well as the subscriber side. Merge replication also allows for disconnected scenarios, where a subscriber might not be connected during the day. That subscriber would synchronize after reconnecting in the evening. If a row gets updated in two different places at the same time, a conflict occurs. Merge replication comes with several built in options to resolve those conflicts.
So to summarize all of that:
Objects in a database on a SQL Server instance referred to as the publisher that are marked for replication are called articles. Articles are grouped together into publications. The subscriber gets updated with the changes that occur to the articles through a subscription. The data flows through the distribution database which resides on the distributor. Publisher, distributor and subscriber can be the same instance or separate instances on the same or on different machines. The source and the target database can be the same (if publisher and subscriber are in fact the same SQL Server instance), but the distribution database has to be separate.
That brings me to job deadlock. What exactly is a job deadlock, and how can it be resolved?
A deadlock is defined in the dictionary as “a standstill resulting from the action of equal and opposed forces,” and this turns out to be a reasonable description of a deadlock in SQL Server: two or more sessions inside of the database engine end up waiting for access to locked resources held by each other. In a deadlock situation, none of the sessions can continue to execute until one of those sessions releases its locks, allowing the other session(s) access to the locked resource. Multiple processes persistently blocking each other, in an irresolvable state, will eventually result in a halt to processing inside the database engine.
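A minimal sketch of how two sessions can deadlock each other, with made-up table names:

```sql
-- Session 1:
BEGIN TRAN;
UPDATE dbo.TableA SET col = 1 WHERE id = 1;  -- takes a lock on TableA
-- ...meanwhile Session 2 runs:
--   BEGIN TRAN;
--   UPDATE dbo.TableB SET col = 1 WHERE id = 1;  -- takes a lock on TableB
--   UPDATE dbo.TableA SET col = 2 WHERE id = 1;  -- blocked by Session 1
UPDATE dbo.TableB SET col = 2 WHERE id = 1;      -- blocked by Session 2: deadlock
-- SQL Server's lock monitor detects the cycle and terminates one session
-- (the "deadlock victim") with error 1205, so the other can proceed.
```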
There are many ways to resolve deadlocking, all based on which type of deadlock has occurred. So the second part of my question about how can a deadlock be resolved is a bit trickier to answer. When troubleshooting any type of problem in SQL Server, you learn with experience how to recognize, from a distance, the particular varieties of problem that tend to crop up on a regular basis. The same is true of deadlocks; the same types of deadlock tend to appear with predictable regularity and, once you understand what patterns to look for, resolving the deadlock becomes much more straightforward. But as stated earlier, this is something which is taken on a case by case basis, and there is no one size fits all solution when it comes to deadlocking.
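One pattern-spotting aid I came across: SQL Server records every deadlock graph in the default system_health Extended Events session, so recent deadlocks can be pulled out for inspection with a query along these lines:

```sql
-- Read deadlock graphs from the system_health session's event files;
-- no extra tracing setup is required.
SELECT
    xed.event_node.value('@timestamp', 'datetime2')         AS event_time,
    xed.event_node.query('(data/value/deadlock)[1]')        AS deadlock_graph
FROM (
    SELECT CAST(event_data AS xml) AS raw_xml
    FROM sys.fn_xe_file_target_read_file('system_health*.xel', NULL, NULL, NULL)
) AS x
CROSS APPLY x.raw_xml.nodes('/event[@name="xml_deadlock_report"]') AS xed(event_node);
```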
After lunch, I decided to follow through a replication tutorial I found online. The first part of the tutorial took me through preparing a SQL Server instance to be a replication distributor. I also prepared another SQL Server instance to use the first one as its distributor. The second instance will be set up as the publisher in the next section of the tutorial.
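For reference, the same distributor preparation can be scripted rather than done through the wizard. A rough sketch, with all server names, the password and the share path invented:

```sql
-- Run at the instance that will act as distributor.
EXEC sp_adddistributor
    @distributor = N'DISTSRV01',       -- this instance's name
    @password    = N'StrongP@ssw0rd';  -- distributor_admin password

-- Create the distribution database on that instance.
EXEC sp_adddistributiondb
    @database = N'distribution';

-- Register a publisher that will use this distributor, including the
-- shared snapshot folder all replication participants must be able to reach.
EXEC sp_adddistpublisher
    @publisher         = N'PUBSRV01',
    @distribution_db   = N'distribution',
    @working_directory = N'\\DISTSRV01\ReplData';
```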
At this point there was not really a lot to see on the distributor. None of the agents are running yet and nothing else would show any activity. There is now a distribution database on the server but it is well hidden among the system databases.
The only really visible change is a change in the context menu of the replication folder in Object Explorer, which now contains the option “Distributor Properties…” instead of “Configure Distribution…” when you right click.
The next stage of the tutorial introduced the publisher. I learned who is allowed to create a publication, how to add articles to a publication and what article types there are. I discovered how to schedule snapshot generation and what permissions are required by the different agents involved in a transactional replication publication.
As with the distributor, after setting up the publisher there was not really a lot to see. But there are two signs now. One you can see by drilling down into Local Publications in Object Explorer, as seen below.
The other one is not directly visible but might become a problem. It is the fact that all changes to the publication articles are now recorded in the distribution database, but nobody is reading it from there. So the distribution database will grow, potentially a lot. To prevent this from happening, I need to set up a subscriber to this publication.
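Out of curiosity, here is a sketch of how that growth could be watched from the distributor. The cleanup itself is normally handled by the 'Distribution clean up: distribution' SQL Agent job, which calls an undocumented procedure roughly like this (retention values are in hours; treat this as illustrative only):

```sql
-- Run at the distributor: how big has the distribution database grown?
USE distribution;
EXEC sp_spaceused;

-- What the cleanup agent job runs, approximately: remove commands
-- already delivered to all subscribers, within the retention window.
EXEC dbo.sp_MSdistribution_cleanup
    @min_distretention = 0,
    @max_distretention = 72;
```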
This brings me to the next and final step in the tutorial, setting up a subscriber. In this step I learned how to connect to a publisher, how to select the subscription type and the subscription database. I discovered the differences between push and pull subscriptions and the importance of the retention periods. I also learned about the security requirements and the publication access list.
Now that all the parts of replication are in place, I can actually see something happening. After a few minutes the first synchronization had finished. When I connect to the subscription database now, I can see all the replicated objects have been created. Now that the first synchronization is complete, I can insert and update a row in the publication database, and I can see that change appear in the subscription database as it is replicated to the subscriber.
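The check itself is as simple as it sounds; something like this, with made-up database and table names:

```sql
-- At the publisher:
USE SalesDB;
INSERT INTO dbo.Orders (id, customer) VALUES (1001, N'Test');

-- A few moments later, at the subscriber:
USE SalesCopy;
SELECT * FROM dbo.Orders WHERE id = 1001;  -- the row appears once replicated
```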
After completing the tutorial, I checked in with Kathryn to see if she could still squeeze me into her busy schedule. Finding the time to meet with the DBAs seems to be my biggest challenge so far!
But on a more positive note, Rosie is back in the office this week and has replied to my email, so it is looking hopeful that I will be able to meet with her this week to go over how she handles the first two alerts.
Kathryn managed to find time to meet with me. I had the chance to show her my initial diagrams for a few of the alerts, which she said were looking very good and ‘logical’. As she discussed with me her process for resolving the alerts, she followed along with my diagram, checking off the tasks she did and making small amendments where necessary. She did make a couple of suggestions, not so much on the content of the diagrams as on the order in which she would do things, or things she wouldn’t do after hours, only during work hours. So my time spent with Kathryn was very valuable!
I spent the morning tidying up my diagrams while waiting for Rosie to come grab me to go over my questions. The time I spent with Rosie was again extremely helpful. By walking through how she dealt with the alerts in question, I uncovered more changes that needed to be made to the diagrams. Having the diagrams to follow through with proved invaluable again today; they are a fantastic tool, both as a guide and as a point of comparison. Minor differences were highlighted between what I had already documented and the process Rosie follows for the following alerts: Low disk space, Job Failed SSL Backup All Databases and CustCode Server Appears Down. These differences were noted down to be amended later in the day.
While with Rosie I also had the chance to ask her about the Database full alert which I had been struggling to create initially based solely on the SharePoint pages. Talking it through with Rosie and getting a brief insight into how she handles the alert really clarified some of the information provided in SharePoint and allowed me to continue with my initial attempt at creating a flowchart for that alert. So, thanks Rosie!
My afternoon was spent making amendments to my diagrams based on the information provided by Rosie. I also managed to almost complete the diagram for the Database Full alert which I had previously struggled with. I did still have a couple of further questions, which I organised to work on with Rosie again in the morning.
This morning I met with Rosie again to go over a few more questions and to review the changes I had made since last seeing her. I would have to say, this caused more confusion than it resolved. It made me realize I really need to clarify what level my diagrams need to be at, and therefore how much detail needs to be included.
If, for example, every time an alert is received I simply say ‘investigate and resolve’, all of the flows will end up being the same and not very useful at all. However, if I expand this and note every potential reason the error could have occurred, and then the resolution process for each individual cause, my diagrams become hugely convoluted and difficult to follow. Which makes me question whether I should be creating a high level diagram which then directs people to a further diagram based on the established cause of the error? These are things I need to discuss with Adam.
After lunch, I tried to work on the log full diagram, which led me down a rabbit hole of transaction log related research.
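Two of the more useful things I found down that rabbit hole, for checking how full each transaction log is and why it cannot be truncated:

```sql
-- How full is each database's transaction log?
DBCC SQLPERF(LOGSPACE);

-- Why can't a log be truncated? log_reuse_wait_desc shows the reason,
-- e.g. LOG_BACKUP, ACTIVE_TRANSACTION, or REPLICATION if the log reader
-- agent has not yet processed the marked records.
SELECT name, log_reuse_wait_desc
FROM sys.databases;
```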
Adam approached me in the afternoon, just to check how I was getting on. I mentioned my main issue was the level of detail I needed to include. He cleared this up for me by saying I need to mention what they need to investigate, but not how to resolve what it is they find. This will allow them to go through a sort of checklist of items to investigate, without providing the method of investigation or resolution. So this will definitely help. We agreed to meet again properly on Tuesday when he has time, in order to make sure I am on the right track and my diagrams are at the right level.
By this stage I have spent 135.5 hours at SQL Services.