If I cannot find a book or info in a library, I always go to a librarian for advice and I have never been disappointed. However, whenever I need to find a solution or information about a specific problem in database adminitratin domain, I always do a google search and then spend hours and hours reading, analyzing, comparing, testing and finally deciding whether what I find is useful or not.
This prompts me to think why companies with complex database administration needs do not create a database administration librarian position, and I consider this librarian position will assume the following responsibilities:
1. Collect and categorize the database administration related scripts, which can be t-sql, PowerShell, VBScript, C++, C# etc
2. Collect and build the issue case library, for example,the librarian will collect the issues / error messages other users encounter and report in various forum, and the responses/solution from other users. This may also include the KB articles MS have published addressing a specific issue etc.
3. Research / collect / analyze the info of the SQL Server MVPs (and other distinguished experts, though they may not have MVP title), such as who is an expert at what and who frequently answer question in which sql server community etc.
4. Research / collect and document other useful information related to database administration, such as tips, undocumented SP / function, third-party tools, book reviews, interview questions etc.
I believe that if a team has a libraian of this type, the other DBAs will work more efficiently and effectively.
Whenver we start a somewhat big IT project, it is natural for us to build a team dedicated to the project, and the team is usually made up of a project manager, a technical architect, a business analyst, developers, and QAs etc. (Of course, the team can also have some other members like, a data modeller, a DBA or a technical writer etc)
However, I seldom see any company which is big enough has paid enough attention to building a strong DBA team from a project perspective. These days, the database administration domain is so big that any individual can hardly be a person of all trades in the db administration world. As we know, a team's power (of course, if managed well) is much bigger than the sum of each individual in the team because of the synergy inside the team.
IMO, a DBA team should be somewhat similar to a natural project team because in essense, a project team is formed based on the assumption that there should be at least one person dedicated to a specific field in the project, and this assumption also applies to the database administration because of the complexity of this domain.
For DBAs, our ultimate goal is to keep a a highly efficient / available database system within the boundary of existing hardware / OS / system resources. Though this is a never-ending "project", I still think that to achieve the best result, it deserves to build a DBA team dedicated for this forever on-going project. To maximize the synergy , I come up with the following roles and their tasks in a DBA team
#
Role
Task
1
Database administration Manager (DBAM)
Resource allocation, task prioritization, stakeholder communication, and responsible for the whole database administration standards esp. in the context of Service Level Agreement
2
Database administration architect (DBAA)
Explore and research the new efficient ways to tackle current administration work, prototype innovative idea, introduce the new best practice, and ensure the administration quality / efficiency.(see my detail thoughts about this role at http://www.sqlservercentral.com/articles/Miscellaneous/2989/)
3
Database administration librarian (DBAL)
Build team knowledge base (including categorizing / version-controlling the resources) and set up a knowledge-sharing framework which everyone can contribute but what contributed will be QA'ed by this DBAL This person should be the key person for other DBA to look for resources (I will blog more details about this role in future)
4
Database administration developer (DBAD)
Develop a fully functional solution based on DBAA's prototype and also develop scripts / reports at DBAM's requests. (I will blog more thoughts about this role in future)
5
Database Administration tester (DBAT)
Set up the system benchmark by developing various test cases, test and tune the application components in a database. (for details of this, pls see my previous blog at http://blogs.sqlservercentral.com/jeffrey_yao/archive/2008/02/29/new-dba-type-test-dba.aspx)
6
Production DBA
Monitor and manage the sql server system, all the regular tasks a DBA should do, such as backup / restore, db creation, system setup, applying hotfixes / patches, and trouble-shooting any issues
Of course, in a team, one DBA can be assigned multiple roles depending on the workload involved.
For this "DBA team" concept (similar to Center of Excellence), I think this is mostly useful for consulting companies, who take the outsoucing contracts to take care of the clients' database administration work.
In short, I believe a strong DBA team is made up of people who are experts in their specific fields (necessary to tackle the tough and complicated db administration environment) and when getting together can generate a huge ROI for the consulting companies.
I have been working on sql server replication (on sql server 2K5) for more than 1 year now, starting from designing a replication deployment framework that will take care of multiple publishers to one subscriber and one publisher to multiple subscribers. Now reflecting on the work I have done, I put the following wish-list that I hope MS can address in future
1. Replication Monitor: I wish I can set up groups in the monitor so I can group diffent publications into different groups for easy management, this is similar to the "Server Group" in the SSMS
2. Replication Verification:
(1) sp_article_validation does not have any output parameter, and this makes the auto replication validation harder as BOL seems to ask DBA to do manual check because it says "Check the agent output for the result of the validation", the output is a message like "table 'xxxx' might be out of synchronization. RowCounts (Actual:nnnnnnn, Expected: mmmmmm)...."
I wish sp_artilce_validation has the same output parameters as sp_table_validation does, plus two more parameters @Actual_RowCount and @Actual_CheckSum
(2) The validation is only based on: "One subscriber has one and ONLY one publisher" assumption. In my case when one Subscriber has multiple Publishers, the publication validation is always "out-of-synchronization"
I wish MS can come up with a solution for this multiple-publisher to one subscriber validation
3. Optimize the system-generated update trigger on updatable-susscription table.
We have a case where we have one transactional replication with updatable subscription, and in the publication, we have one article whose columns are not entirely published, actually there are four columns we do not publish in this article. Since this is an updatable subscription, we can see some automatically-generated triggers on the subscription table. The problem with the update trigger is when we explicitly update the columns that are not involved in the replication at subscriber side, the update trigger is fired and updates the publication table on the publisher side with the existing values.
An example is: Say table T has column C1, C2 and C2 is not involved in the replication, when update T set C2='something' at the subscriber side, the update trigger will issue the following sql to the publication table (since this is an updatable subscription)
begin distributed tran
update T set C1 = inserted.C1
commit tran
Functionally, this does not hurt anything, but performance-wise, it is a waste of time.
What I wish is when I update explicitly the columns not involved in the replication on the subscriber side, the updatable trigger can recognize this column is not involved in the replication and thus simply returns without going to do a distributed transaction on the publisher side.
I have another wishlist item regarding the replication with updatable subscription, but this is more complex, and actually is a bug to me, and I will blog it in different post later.
"Culture", according to Merriam-Webster Dictionary http://www.merriam-webster.com/dictionary/culture, has one definition as
"the set of shared attitudes, values, goals, and practices that characterizes an institution or organization".
I am not sure how many CIOs or senior managers, who are responsible for a DBA team, have considered to build a healthy database administration culture (DAC hereafter) customized to fit their own institution or organization.
DAC itself is very abstract. It is similar to any cultures, like "American Culture" or "Chinese Culture". However, just as "American Culture" or "Chinese Culture" is composed of the elements from art, society, politics, religion and etc, DAC can be characterized by the following:
1. database administration daily operational practice2. database administration disaster recovery policy and practice3. database administration auditing / security policy and practice4. database administration documentation policy and practice5. database administration best practice research / implementation6. database administration team collaboration and knowledge-sharing policy and practice7. database administration quality and performance assurance practice / framework8. database administration achievement celebration, recognition and reward policy and practice(Of course, these items may have overlap and there are many others we can add to the list, )
There are two reasons that I see a healthy DAC can be an invaluable asset to a company
1. A healthy culture is a system-wide insurance to the quality of database administration work. (so the investment cost to build DAC can be considered as a premium )
2. Culture can be inherited, so long as there is one person who is fully cultivated in this DAC, even with all other peoples leaving the company, the person left can be a "culture seed" for new hires (of course, if everyone leaves, that is called culture extinction )
These days with "data" as a critical asset to almost any company for the normal business operation, building up and promoting a healthy DAC should be a priority for the management in the corporation world. As for how to implement a healthy and characteristic DAC, that's a topic as big as how to cultivate a good corporate culture. But one thing is for sure, to cultivate an internal culture, it should be an intended action from top to bottom instead of a grassroots spontaneous one.
I read Andy's post Changing Jobs - Should You? today with great interest. I think it is good topic in the sense that by fully exploring/researching this topic, both employers (usually represented by HR and senior management) and employees (like us IT people) will better understand each other, and thus avoid the intangible "opportunity expense" that may occur to both parties and from an IT industry perspective, this may increase the industry production, quality and efficiency.
I'd like to take myself as an example, in the past 10 years, I have changed five jobs, some changes are unavoidable. For example, I immigrated to Canada in 1999, so no doubt I had to quit my previous job. But the following two job changes were absolutely not voluntary, and each was associated with company acquisation. The first occured in the end of 2000, my original company (in Halifax, NS) was purchased by another big company in US, and I was a SQL Server DBA there supporting HR, accounting departments who were using Epicor products based on SQL Server, the new parent company was using Oracle Financial products, and there was no possibility that Epicor products would continue to be used after the acquisation and transition was done in about 6 months. I made a decision to jump on the first coming opportunity in Nov, 2000. In late 2003, my company (in Ottawa, ON) was purchased again by a bigger company, and I saw my colleagues being laid off every few weeks, and when the project I was in was over, I became "unbillable according to HR" as there were no projects available. This time, I was laid off, the first time in my career life. I was astonished and helpless at that time as I always thought I was very strong in technical skills / knowledge and thus would avoid being laid off.
From then on, I started my contractor career path as I do not want my fate being decided by a HR people who thought I am "unbillable". With this said, I tend to disagree with Andy's point that "if you survived you weren't in the bottom 10-15 percent", this is not true. To me, if you do not survive, it does not mean you are in the bottom 10-15 percent, you may still be in top 10, it is just because the company wants to maximize profit and cannot afford or does not want to keep you a little bit longer.
From my experience, I somehow come to the following conclusion (I may be wrong): In the first place, there is no loyalty from an employer to an employee, once the tough time comes, no matter how good you are, how loyal you have been to the employer, you will be let go. I guess because of this, there is no loyalty from the employee side as well, and that's the root cause of changing job as "loyalty" is not an "expensive" word here.
I may be off topic, but anyway I come up with a list of pros and cons for changing jobs from my own experience
Pros
Cons:
To further continue this "changing job - should you?", I think it may be interesting to explore the following scenario/questions:
You are a CEO of a IT consulting company, the company is in a bad financial shape due to the slow-down economy, what can you do to avoid lay-off, which no-doubt will impact the morale of the compay staff?
I recently got a request to do a "cold" backup of a small yet complex db environment, I say "small" because the sum of six db sizes is less than 60 GB, yet "complex" means the six dbs are on five physical servers. three of the which are clustered servers, the six dbs are involved in pretty complex replication framework, with multiple publishers to one susbscriber and mulitple subscribers to one publisher and also some subscriptions are updatable subscriptions (in the transactional replication).
Anyway, the environment is for QA team, and once the QA team is done with one test, they want the environment to be restored back to the state before the test starts. Because there is replication involved here, it will be pretty complex to restore the environment to a specific state. I finally come up with the following plan:
Before the environment is handed over to QA team, we will do a file level backup of all databases involved (including the distribution db), and we call this "cold" backup because we need to shutdown each sql server service, and then copy the db files to a backup folder.
It will be very inconvenient to go to each server to do the sql server service shutdown, so I explored the possiblity to shutdown each sql server instance remotely from a central place, and here is what I found
1. To shutdown no-clustered sql server services
sc \\<server_name> stop "SQLServerAgent"
sc \\<server_name> stop "MSSQLServer"
for name instance
sc \\<server_name> stop "SQLAgent$<instance_name>"
sc \\<server_name> stop "MSSQL$<instance_name>"
To start sql server service, just change the key word from stop to start in the above commands.
2. To shutdown clustered sql server service
cluster /cluster:<cluser name> res "sql server agent" /off
cluster /cluster:<cluser name> res "sql server" /off
(I have no clustered named instance so I cannot test the clustered named instance sql server service )
To start the clustered clustered sql server service, just change the key word from stop to start in the above commands.
Put these cmds in a batch file, together with a copy file command, then a cold backup will be done easily. When you need to restore, do the similar thing, stop the sql server services and then overwrite the current db files with the backup db files.
I was fighting an error that was very rare to appear. I originally put a post on a MS forum
http://www.microsoft.com/communities/newsgroups/en-us/default.aspx?dg=microsoft.public.sqlserver.replication&tid=57229124-0a5a-4ec3-9055-a06b7869b872&cat=&lang=&cr=&sloc=&p=1
The issue has delayed our project (with hundreds of people for the project) for one day, the pressure on me is intense, to say the least. Fortunately after 36 hrs (5 hrs sleep included ), the issue is finally solved.
Simple background introudction
Environment: SQL Server 2K5 EE (CU8 applied) + Win 2K3 SP1
Our replication is of transactional type, and one publication with 5 articles has one "push" subscription. One article, let's call it MyTable has 29 million records.
Publisher and Distributor is on the same box, while subscriber is on another box. The network share used in the replication is called \\MyServer\ReplData
After replication set up, I start the subscription re-init with a new snapshot generated. However I keep getting errors in the replication monitor when the distribution agent (referred to DA hereafter ) tries to replicate MyTable over to the subcriber. When DA tries to read the snapshot-generated files for MyTable, at some time, it always compains and gives out the following error msg.
The process could not bulk copy into table '"dbo"."MyTable"'. (Source: MSSQL_REPL, Error number: MSSQL_REPL20037) Get help: http://help/MSSQL_REPL20037 memory mapped file read failed To obtain an error file with details on the errors encountered when initializing the subscribing table, execute the bcp command that appears below. Consult the BOL for more information on the bcp utility and its supported options. (Source: MSSQLServer, Error number: 20253)
So initial thoughts / guesses were:
(1) the replication is not set up correctly ?
(2) the created file by snapshot agent is not right ?
(3) the folder \\MyServer\ReplData is corrupt ?
(4) the OS memory is corrupt (we have 32 GB on each server) ?
So I have tried various ways, including:
1. Delete MyTable in the publication, and then do a subscription re-init with a new snapshot, when the publication works fine, add back MyTable to the publication and then do a subscription re-init with a new snapshot again to address concern (1)
2. Delete everything in \\MyServer\ReplData, and then restart the snapshot for the publication to address concern (2)
3. Change the network shared folder from its original drive D: to a new drive E: to address concern (3)
4. Reboot the all servers involved in the replication to address the potential OS memory issue to address conern (4)
The error still comes out when DA is trying to replicate MyTable.
Ok, what can be wrong? Yes, I have some other candidates for concern
(5) MyTable corrupts and thus snapshot agent generates some bad files based on the corrupt MyTable ?
(6) Push subscription does not work for MyTable ?
(7) MyTable does not work in a publication when the publication has more than one article?
So let's do something to address these new concerns
5, Run dbcc checktable on MyTable, but there is no error msg reported
6, Drop the subscription, and then recreate a new "Pull" type subscription, Do a subscription re-init with a new snapshot
7. Remove MyTable from its original publication, and then create a new publication which contains one and only one article, i.e. MyTable. Do a subscription re-init with a new snapshot
However the error still appears when DA tries to replication MyTable.
However during these tests, I found that the error always occurs when DA tries to read a specific file (generated by snapshot agent in \\MyServer\ReplData), let's call this file MyTable_File_7.
Now I guess this MyTable_File_7 may contain some row data that DA cannot read after the file was loaded into memory, and thus "memory mapped file read failed"? My logic is that you can corrupt a text file by inserting a EOF in middle of the text file, so if MyTable_File_7 has some weird binary code in it, the file may be corrupted logically.
So I decide to output MyTable to a new table by running
select * into dbo.MyTable_2 from dbo.MyTable
then create a PK for MyTable_2 and then use this new table to replace MyTable in the publication. And the result is :
I Succeeded !!
How excited I am, now what I need to do is:
Truncate table dbo.MyTable
Insert into dbo.MyTable select * from dbo.MyTable2
and then remove MyTable_2 from the publication and replace it with MyTable, and do a subscription re-init. After another 30 min waiting, what? The error comes again? ! ! Come on !!
I cannot believe it. But the success of MyTable2 leads me to believe something must be wrong with the data in MyTable, and doing table truncate may actually only release the pages/extents that the table occupies, and then the insertion will probably re-use those occupied extents again. At this moment, I think the hard-disk may have some bad spots which MyTable may happen to use. So I did another way,
drop table dbo.MyTable -- this is to ensure all GAM, SGAM, PFS aer totally cleaned regarding this table
Create table dbo.MyTable (....)
Now add MyTable back to the publication and do a subscription re-init.
Finally, it is successful !
The possible reason: some bad tracks on the hard-disk.
Our system adimin group is now involved in checking whether there is anything wrong with the hard-disk drives.
Yesterday, I was fighting with an error I have never seen in the replication and not much info can be obtained via google either.
My replication environment is pretty much complex, with multiple pulishers to one single subscriber, and also one publisher to multiple subscribers. But they are all transactional replications (some with updatable subscriptions)
A few days ago, our databases got updated by the development team, new stored procedure, new tables, new columns etc were made to the existing dbs. After that, I set up the replication on the test servers, and soon I got the following errors in some of the publications:
Error messages:
The process could not bulk copy into table '"dbo"."Table_PUB"'. (Source: MSSQL_REPL, Error number: MSSQL_REPL20037)Get help: http://help/MSSQL_REPL20037
Field size too large
To obtain an error file with details on the errors encountered when initializing the subscribing table, execute the bcp command that appears below. Consult the BOL for more information on the bcp utility and its supported options. (Source: MSSQLServer, Error number: 20253)Get help: http://help/20253
bcp "DB_Report"."dbo"."Table_Pub" in "\\Server1\repldata\unc\Prod1_DB1_REPORTINGPUB5\20080602182491\Table_Pub_2.bcp" -e "errorfile" -t"\n<x$3>\n" -r"\n<,@g>\n" -m10000 -SProd2 -T -w (Source: MSSQLServer, Error number: 20253)Get help: http://help/20253
When I tried to run the following bcp command as indicated in the error msg,
bcp "DB_Report"."dbo"."Table_Pub" in "\\Server1\repldata\unc\Prod1_DB1_REPORTINGPUB5\20080602182491\Table_Pub_2.bcp" -e "errorfile" -t"\n<x$3>\n" -r"\n<,@g>\n" -m10000 -SProd2 -T -w
I did not go very far as I got the error msg, like
"unexpected EOF encountered in bcp data file
0 rows copied "
and if I check the errorfile, it has nothing as it is 0-byte size large.
After numerous different attempts, I finally figured out why.
In this case, we have two publishers (A and B) and one subscriber (S). The two publishers publish a same table, let's call it TABLE_PUB, but TABLE_PUB on A and B have different data sets, From A to S, once the snapshot starts, the replication will "drop existing one and create a new one" if TABLE_PUB already exists on S. For B to S, once the snapshot starts, the replication will "keep existing object unchanged" if TABLE_PUB already exists on S.
Recently our developer team add some new columns to TABLE_PUB on both A and B, however, the newly added column sequence for TABLE_PUB on A and B is different. For example, three new columns (C1, C2, C3) are added, but on A, the columns are added in sequence of C1, C2, C3 while on B, the column sequence is C3, C2, C1.
So during the snapshot initiated from B to S, when BCP tries to load \\Server1\repldata\unc\Prod1_DB1_REPORTINGPUB5\20080602182491\Table_Pub_2.bcp into Table_PUB on S, because Table_Pub_2.bcp is created based on Table_Pub on B, thus each column has its specific position, which cannot match the column positions of Table_Pub on S, (Table_Pub on S has same column sequence as that on A after snapshot initiated from A), this caused the error.
Hope this can be of help to you in future.
New findings:
After I posted this blog, two weeks later, I encountered this error again on another publication article, let's call it Table_Pub2. However, after compairng the Table_Pub2 on both publishers A and B (and also on the subscriber S), it seems the table schema is the same, in terms of column sequence, colum data types and sizes, across the servers, But the error persists even after I have run snapshots / subscription reinit multiple times, I even drop the publication / subscription and reset it up, the error simply refused to go. At the very end, I dropped the Table_Pub2 on publisher B, and then scripted out Table_Pub2 from A, and then recreate the Table_Pub2 on B using the scripts from A, and finally did a resnapshot on both A and B for this publication, and then everything works fine ever since.
So the lesson learned here is:
If you see this error and you cannot find the obvious column difference, try to manually recreate the table on one publisher with the script that creates the published table from another publisher.
This is an old topic and well documented, just google it and I believe you will get a lot,
However, there is one difference that seems not being mentioned and actually I find it out myself only two hours ago. So here is the point
If a table is published for replication, it cannot be truncated, while it can still be deleted.
Otherwise, you will get the following msg
Msg 4711, Level 16, State 1, Line 1
Cannot truncate table 'xxx' because it is published for replication.
My environment is SQL Server 2K5 SP2 (developer edition)
Next time, when you are asked this question in an interview, just give this answer to impress your interviewer.
In my last post, I talked about what prompted me to give a deep thought to documentation work. Documentation, to me, is no small work load esp. when the quality bar is set high.
A DBA usually will get involved in four types of documents as listed below
Basic Document Quality Requirements
Except for user requirement and communication-oriented documents, which can be varied hugely different depending on the real scenarios and target audience, the other two types of documents have something in common:
Document Value Formula
It is hard to precisely quantify the value of a document. But still we can make the following assumptions:
Document Value = (Document usage life) * (Audience Number) * (Audience Position Rank) / (Document Generating Time)
Where:
Of the four factors in the formula, as document authors, we usually cannot control the first three factors, but we definitely have the ability to impact the fourth one, i.e. Document Generating Time.
Document Evolution Path
Based on the value formula described above, we can explore some innovative ways to generate a DBA document.
My understanding is that most DBA's documents, which are technical ones (50% to 70% of a DBA's total documents), can be made similar like a financial report in the way that a financial report can be produced dynamically anytime as long as the data is available. A DBA's document can be produced in a concise, informative and fast way similarly too.
For example, if I want to generate a system configuration document, all I need to is to start a scanning application that will get the live data from the system and insert the data into a table with timestamp, and then I get these data from the table and fill the data into a pre-formatted Excel document. The document's title, purpose section, analysis rules can be stored in a table too to facilitate the document generation.
Another example, say I need to give my manager a weekly report regarding a project status, all I need to is to retrieve my project log table as follows:
Select Project_Item_Desc, Finished_Date
From ProjectLog
Where Status = ‘MileStone'
and here ProjectLog table is automatically updated daily using a predefined protocol. For example, if my scanning job finds a document called "User Requirments.doc" in a C:\Star_War_Project folder, I will
Update ProjectLog set Finished_Date = getdate() where item = 'user requriemnt doc done'
I believe documentation was, is and will still be the inseparable work load a DBA need to take for high-quality work. The issue here is how to maximize the document value to be aligned with business values and how to maximize the value of a DBA's time. I have seen documents piled to the ceiling, while many documents were less than 2 years old, and nobody cared to read as the system was already upgraded or replaced. I was feeling so sorry for the authors whose dedidcations and time were lost and wasted in nowhere.
I hope my proposed documentation formula can somehow serve as a rough guideline if you need to evaluate the ROI of any document you want to compose.
As a DBA, documentation work is unavoidable, we need to do documentation about things we have done, for example, what code we have implemented, what configuration value we have changed, what user rights we have changed, what xxxx. In a heavily-audited environment, documentation practice is what an auditor enthusiasitcally pursues
I once worked in a company where our default documentation standard is: "Document for Dummies", i.e. the document should be detailed enough to the extent that a computer-illiterate person (assuming the person knows how to click mouse and type keyboard) should be able to read the document and repeat what I, a senior DBA, has done as described in the document. This standard killed me. Just an example here, say I wanted to document how I set up log-shipping in SQL Server 2000. As a matter of fact, MS has a whitepaper about how to set up log-shipping (http://support.microsoft.com/support/sql/content/2000papers/LogShippingFinal.asp), but no, it did not meet the standard of my management as I was challenged by the question "Do you think Ms. xxx can do the log-shipping by following the instructions on this document?" (note. Ms. xxx is our data entry clerk with no knowledge of database at all). I thouht to myself, "Well, no, Ms. xxx definitely cannot do this with this document as she even does not know where to start SQL Server Enterprise Manager, let alone how to create a shared folder used in the logshipping".
So long story short, I completely reinvented the wheel based on the MS whitepaper, however, I had to add more details, such as pre-implementation guide, details including how to create a sharefolder, each step is companioned with a screen-shot, and how to start a SQL Server Enterprise Manger by saying, "Go to Start, click Program, and click SQL Server 2000 Program folder, and then click SQL Server Enterprise Manager and so and so".... After doing this for some time, I realized that I have more important things to do and my time can be better utilized than doing these so-called "docouments for dummies".
In the following months, I was approached by a friend who works for a staffing company, and I was recommended to a company who was looking for a senior DBA. I went to see the company's hiring manager, and at the end when asked why I left the current employer, I told the truth that I disliked the documentation practice there. Two days later, my friend in the staffing company gave me feedback, saying "Jeffrey, you are the front-runner technically speaking, but they are worried that you are incapable of doing documentation, can you give me something that I can address their concerns?". My friend's request prompted me to give a deep thought regarding documentation, and I try to answer the following questions:
What documentation DBA needs to prepare and who are the targeted readers?
What is the ROI of any documentation? Is there a foruma that we can use as a framework for this ROI calculation?
How can we produce the documentation efficiently? Is there any automatic way to facilitate documentation?
I'll try to address these questions in my next post.
In DBA world, usually two types of DBA are defined depending on what a DBA is focusing on, i.e. development DBA or production DBA, As for the responsibility of the two types, here is a good summary article http://www.craigsmullins.com/dbta_023.htm
However I believe now we should add another type, Test DBA, and I define the TEST DBA as the following:
A test DBA is dedicated DBA who is responsible to design test cases to stress test any target system, and also is reponsible to diagnose any complex issues with a well designed test framework, and last but not the least is responsible to test any off-shelf (or open sources) database tools.
In more details, I outline the following primary tasks for a test DBA
1. Hardware/system pre-deployment test: the test DBA needs to set up and test the performance baseline for a new server, such as raid configuration impact to performance, (my recent experience tells me that I cannot take anything for granted, for example, RAID 10 is always better than RAID 5, in theory, YES, but in reality, esp. in my environment, on a same server, with same disks, RAID 5 outperformed RAID 10 in both read and write performance, and the system adminstrators are still checking why); a test DBA also need to check whether the new hardware can meet the expected future business requirements.
2. Hardware/system evaluation: when we need to retire our current production servers or other hardware such as SAN disks, we usually have to evaluate among various vendor products to find the best candidate in terms of benefit and cost. A test DBA needs to provide the valid data to facilitate the decision.
3. Application database component test: In a complex environment, it is worthwhile to test those critical stored procedures / views / queries from DBA perspective by tuning indexes, configure server options, adding hints, playing around with temp tables and table variables. A test DBA will be able to read the query plan and I/O statistics and make corresponding changes or recommend proper suggestions to developement team.
4. Off-shelf / open-source database admin tools trial and test: this is to facilitate management / production DBAs to get the most suitable tools to monitor and manage the production enviroments.
5. Test case library build up: test case library can be treated as an invalualbe company asset. For example, previous test cases may be used to check against the code performance after the application is revamped, also previous cases can be used to check against any new hardware environments. A well-built test case library can greatly improve the efficiency of the whole application ecosystems.
Skills a Test DBA needs:
A test DBA needs to master various test tools, for example, SQLIO, SQLIOSim and IOMeter, SQL BPA
A test DBA needs to have deep knowledge of sql server tuning (index, isolation level, server configuration options, DMVs etc) to be able to design qualified test cases / environments.
A test DBA is better to master some reporting tools, because you need to present what you have found. I think SQL reporting service is a good skill for a test DBA to learn.
Why do I come to this "Test DBA" idea?
In the last two days, I was asked to do a stress test on a SAN box of RAID 10 type in a new server which will be used as a production server a few months later, I use SQLIOSIM to do the various tests with different configurations, and the interesting thing is oringially we had RAID 5 in the server, and I happened to test RAID 5 before it was reconfigured to RAID 10, after I stress-tested RAID 10 (memory was even upgraded to 16 GB from 8 GB when RAID 5 was there), I did a comparison of the results of RAID 10 and RAID 5, and was surprised to find RAID 5 beat RAID 10 in almost all areas, such as IO response time, drive level read / write time etc. Due to the lack of documented information for SQLIOSIM, it is a pain to do all the tests and then analyze the results. I originally even thought to start performance counters while doing the stress tests and then corelate the performance counters with the performance results in diffeernt test cases. I finally gave up this idea due to time constraints (forget to mention I am currently a production DBA that needs to respond to any requests ASAP), and this leads me to think we need a DBA who can be dedicated to this type of test work, and then document each test case, test trails, test environment, test result to build a test case library for future use.
Of course, there are overlaps among different DBA types, but I believe DBA work scope is broad enough these days that demands dedicated people with specialized skills in different sectors of DBA work spectrum.
Last week, in our production, we noticed some records are missing in the subscriber side while existing on the publisher side. The detailed issue description can be found at:
http://www.mydatabasesupport.com/forums/ms-sqlserver/362107-replicated-commands-not-executed-subscriber-side.html
MS online support suggests to apply the latest hotfix and then check again. I can understand online support's practice but I do not believe this should be a recommended approach. Because most of the time, the issue occurs in a production environment, which for whatever reason is hard to copy to the test environment, for example, production environment is EE edition, production is clustered and some other specific configurations for production etc, etc, all these factors make us hard / impossible to create a comparable test environment. On the other hand, it is not easy to apply the latest hotfix to production because there usually exists a whole test cycle to test the hotfix from the application perspective, esp. when MS cautions that users should weigh the risks of applying not-fully tested hotfix.
After almost one week's debugging and investigation with other team members, I finally come up with a test script that can repeat the issue and thus points out the conditions under which this issue can occur.
To repeat the issue, we need to first create the environment as the following (assume you have SQL 2K5 Developer):
1. Prepare two sql server instances DB1 and DB2, assume you install SQL 2K5 Developer with edition to 3161
2. On server DB1, create a database TestDB and then create a table dbo.T and then create a publication UpdatePub (using dbo.T as the article for UpdatePub) with immediate updatable subscription on DB2's TestDB database
The script to create table dbo.T is:
Use TestDB
go
create table dbo.T (a int primary key, b varchar(100))
3. Create a linked server on DB2 that points to DB1 using the following script
/****** Object: LinkedServer [DB1] Script Date: 01/15/2008 23:46:54 ******/
EXEC
GO
4. Now from SSMS, open a query window pointing to DB2 server, and then run the following script (call it script_A)
set xact_abort on
insert
commit
After running the script, and then we can do: select * from testdb.dbo.T on BOTH DB1 and DB2, we should see there is a record in dbo.T on both DB1 and DB2, this means the record inserted into the DB1.TestDB.T is replicated to DB2.TestDB.T, just as expected, although the insert transaction is initiated from the DB2 side.
Now we add one more sql statement in the above script and the new script, let's call it Script_B
USE TestDB
set
After running this script on DB2, we expect to see three records in DB2.TestDB.dbo.T as follows
the record with a=1 is there because of the script_A, the record with a=2 is a result of script_B, the direct insert sql statement, and the record with a=3 is what we expect because in script_B, we insert the record to the publisher table, and the newly inserted record should be replicate back to the subscriber just as it is done in script_A.
However, the result is: we DO NOT see the record with a=3 in DB2.TestDB.dbo.T, i.e. when we do the query on DB2
select * from TestDB.dbo.T
we can only see two records instead of three records, the real records are as follows:
From replication side, I do not see any error/warning reported, even if I turned on -HistoryVerboseLeve to 3 for distribution agent.
With all these said / tested, I believe SQL 2K5 has a "bug" which will repeat itself if the following minimum conditions are met:
In a SS2k5 publication with updatable subscription, if the following conditions are met,the replication will not work properly even though there is no error reported. 1. There is a distributed transaction started from the subscriber side2. The distributed transaction will contain two sql statement, one is to insert into a local replicated table (call it sub_table) which is the subscription table of a publication table (called it pub_table), another sql statement is to insert into the remote pub_table on the publisher side. When these two conditions are met, the insert sql for remote pub_table will succeed, but the record will not be replicated back from the pub_table to the sub_table as expected.
I welcome your comments / feedback if you have any different thoughts.
Note (2008-01-21) Adams Qu from MS has confirmed this is an issue with SQL 2K5 and it is good to know why, please see his comment here:
http://www.microsoft.com/communities/newsgroups/en-us/default.aspx?dg=microsoft.public.sqlserver.replication&tid=d19f17cd-2598-4c18-a2a3-bad151b0c827&p=1