Change Management is the process of planning and implementing changes to
the services and systems under your care.
It includes post-change follow-up: documentation, monitoring, analysis, and reporting.
Following proper change management procedures and documentation requirements
can yield an audit trail.
Any change to any service, including upgrades, hardware replacements, patches,
and configuration changes, results in a cut-over, which is
the process of switching users from the old service to the new one.
All changes can have unanticipated consequences (no matter how experienced you
are, or think you are).
This is the main reason why all cut-overs must be scheduled, with plenty of
notice for all potentially affected users.
A flash cut is when all users are switched
(or migrated) to the new service at the same time.
Generally this is a bad idea but may be unavoidable in some situations.
A gradual cut-over is when only a small number of users (e.g., beta-testers)
are migrated to the new service at a time.
Change management is related to configuration management, which means
the management of the configurations for all servers, workstations, and network
devices under your care.
It is also related to patch management, which is concerned with the
scheduling of patching operating systems and applications.
Planning and Testing
When possible use a test system/network to try the change
before any cut-over.
Not only will you be able to verify that the new service works, but you can
practice the change / cut-over procedure without affecting users on
a live or production system.
The safest way to cut-over a new service is to leave the old server
in place while installing/configuring the new service on a new server.
Then you can flash cut-over by switching
IP addresses, or gradually
cut-over by having a few users access the new service.
(This can be done with modern routers/switches that can re-direct traffic.)
If boot up scripts are modified, test that reboots work.
If new hardware isn't available for each new/updated service,
then have a test lab, which is a testing network and server(s).
The test machine should be as identical to the old server as possible.
If no spare hardware is available at all, then careful planning and testing are
even more important.
Sometimes a server is powerful enough to run two copies of some
service (say a web server).
Then you can test the new service (using a different port number)
before turning off the old one, even if no additional hardware is
available.
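As a sketch, the second (test) instance of a web server can listen on an
alternate port with its own configuration. The Apache-style fragment below is
illustrative only; the file names, paths, and port are invented:

```apache
# httpd-new.conf: configuration for a second, test instance.
# The old instance keeps port 80; all paths here are placeholders.
Listen 8080
DocumentRoot /srv/www-new
PidFile /var/run/httpd-new.pid
ErrorLog /var/log/httpd-new-error.log
```

Once the new instance tests clean on port 8080, the cut-over is just a matter
of swapping the port numbers (or IP addresses) in the two configurations.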
If a flash-cut must be used, make sure to make copies of all
modified configuration files before actually making changes.
Common techniques include file renaming (copying foo.config
to foo.config.working) and using a file versioning system.
These techniques are useful even when not using a flash-cut.
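A minimal sketch of that backup step in shell, using foo.config as the
stand-in file name from above:

```shell
#!/bin/sh
# Before a flash-cut, keep a known-good copy of each file you will
# touch, plus a timestamped copy so repeated changes never
# overwrite an earlier backup. foo.config is a stand-in name.
printf 'option = old\n' > foo.config       # sample file for the demo

cp foo.config foo.config.working
cp foo.config "foo.config.$(date +%Y%m%d-%H%M%S)"

ls foo.config.*
```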
File versioning systems are also called source code
revision control systems
or software configuration management systems.
Commonly used systems for Unix/Linux are
RCS, Subversion, and especially Git.
Such systems log all changes to a set of files, prevent multiple
updates at once corrupting each other, provide the ability to go back to
any previous version of some file, and may have other features too.
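A sketch of tracking configuration files with Git; here a scratch directory
stands in for a real config directory such as /etc, and the file name and
identity are invented:

```shell
#!/bin/sh
# Put configuration files under Git so every change is logged and
# any previous version can be restored.
D=conf-demo
mkdir -p "$D"
git -C "$D" init -q
git -C "$D" config user.email admin@example.com   # example identity
git -C "$D" config user.name  Admin

printf 'port = 80\n' > "$D/httpd.conf"
git -C "$D" add httpd.conf
git -C "$D" commit -qm 'known-good config'

printf 'port = 8080\n' > "$D/httpd.conf"          # an experimental change
git -C "$D" commit -qam 'try new port'

# Back-out: restore the file as it was in the previous commit.
git -C "$D" checkout -q HEAD~1 -- httpd.conf
cat "$D/httpd.conf"                               # port = 80
```

The same history also answers audit-trail questions: who changed what, when.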
Always test vendor-supplied patches, as they can often break
other software or services.
Never be the first to install some patch;
let someone else find the problems with it!
Wait at least a few days even for a critical or security patch.
Wait a few weeks before applying a kernel patch.
Make sure you have the correct software licenses for any patches
you plan on installing.
Configuration files are plain text but often have very fussy and
unforgiving syntax.
You can write simple shell or Perl scripts to validate
the syntax of configuration files before committing the change.
This can be combined with file versioning: have a script that
checks out some configuration file, puts you into an
editor, then when you quit the editor checks the syntax, and
checks the file back in only if the syntax is okay.
You can also use custom
editors (such as visudo for /etc/sudoers) to edit the files safely,
rather than using a plain text editor.
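The validate-before-commit idea can be sketched like this; sh -n (a
parse-only check) stands in for a real validator, and the file names are
invented:

```shell
#!/bin/sh
# Validate an edited copy of a config file and install it only if
# the syntax check passes. sh -n parses without executing, which
# suits shell-style config files; substitute the right checker for
# other formats (visudo -c, named-checkconf, apachectl configtest).
printf 'PORT=80\n' > foo.conf.new      # the freshly edited copy

if sh -n foo.conf.new; then
    cp foo.conf.new foo.conf           # check passed: commit the change
    echo "foo.conf updated"
else
    echo "syntax error: foo.conf left untouched" >&2
fi
```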
Always ask yourself what other services and/or applications may be
affected by some change.
(For example, if updating servers results in data or log file format
changes, the monitoring system may need to be upgraded and/or reconfigured
at the same time.)
Always practice unfamiliar procedures.
After practicing and testing some update procedure, estimate
how long the cut-over should take if things go well.
(This should include the time needed to test and verify the change
worked as expected, and that other services didn't break.)
Then estimate how long the back-out procedure will take (if
things don't go well).
Add these two values plus some
fudge factor to obtain
the estimated length of the disruption for the change.
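A worked example of the estimate; the minute counts are invented:

```shell
#!/bin/sh
# Disruption estimate = cut-over time + back-out time + fudge factor.
CUTOVER=45   # minutes to make, test, and verify the change
BACKOUT=20   # minutes to restore the old service if things go badly
FUDGE=15     # extra slack for surprises

WINDOW=$((CUTOVER + BACKOUT + FUDGE))
echo "announce a ${WINDOW}-minute service disruption"   # 80 minutes
```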
Design a back-out plan for when the change procedure fails.
This is a plan to restore the original service quickly.
Have a firm deadline for successfully completing the cut-over.
Once that deadline passes, activate the back-out plan. Don't give in to the temptation to try just one more thing.
Always have a back-out plan even if you think the change is trivial.
Sometimes the back-out plan can be simple: if attempting to cut-over to
new hardware, usually you make the change and reboot after changing the
new host's IP address to match the old host's address.
In this case the back-out plan is just don't change IP addresses.
Even this type of change needs planning in advance.
Hosts and remote DNS servers will cache IP
addresses for a long time, usually days.
This cache time is set on your DNS server;
it's called the time to live (TTL) value.
At least one TTL period before an IP change cut-over,
change the TTL to one day.
The day before the cut-over, change the TTL again to
5 minutes or whatever interval of time you deem appropriate for users
to start using the new service.
After the cut-over is deemed successful, change the TTL
back to the original value.
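In a BIND-style zone file the TTL looks like this; the name and addresses
are examples, and the fragment shows the before/during/after states rather
than one file:

```dns
; Normal operation: two-day TTL
$TTL 172800
www   IN  A   192.0.2.10   ; old server

; The day before the cut-over, lower the TTL:
$TTL 300                   ; 5 minutes
www   IN  A   192.0.2.10

; At cut-over, change the address; clients see it within 5 minutes:
$TTL 300
www   IN  A   192.0.2.20   ; new server

; Once the cut-over is deemed successful, restore $TTL 172800.
```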
Always schedule service changes.
(The exceptions are trivial changes and emergency fixes.)
Never stop or reconfigure any service without
notifying all potentially affected users of the scheduled disruption.
Always allow sufficient time for users to notify you of schedule
conflicts (e.g., “That's the day of the big demonstration!”).
Re-notify users just before starting a procedure that may result
in a service disruption.
Provide a follow up notice after the cut-over, even if you canceled
the cut-over or had to activate the back-out plan.
Provide additional notification to help desk personnel, so they can
prepare for the questions and problems that may arise.
Provide notice to appropriate management; even if they are not
directly affected, they may need to know about scheduled changes.
A maintenance window is a period when regular updates,
backups, and other routine maintenance are performed.
This is a scheduled interval when users should expect service
disruptions; users should not expect the system to be available during this period.
If some change isn't critical it is a good idea to wait for the
next maintenance window to do it.
Testing is also done during maintenance windows.
Testing means rebooting servers, routers, and switches, and checking
that everything comes back up correctly.
You can also test what happens when multiple pieces of equipment are
powered off and rebooted at the same time.
Often a changed configuration goes unnoticed until some equipment reboots.
Testing can convert unplanned (expensive and embarrassing) outages into
planned ones.
If the change can't be done within a single window consider doing the change
in two (or more) stages so no unscheduled disruptions are needed.
If this isn't possible try to avoid scheduling changes during likely
busy times: during trade-shows, end-of month/quarter/year/semester, or
when required personnel are unavailable (e.g., vacation times).
Coordinate changes with the backup schedule.
Perform backups at the start of a maintenance window, then do other changes.
This can obviously be important if things go badly,
but can be even more important if the change goes well.
(New versions of services often have log file or data format changes.
If the new service starts up immediately after a backup you avoid the
unpleasant situation of log files and/or data files with half old data
and half new data.)
If there are many changes to be made during a maintenance window then they
must be coordinated.
Make sure the necessary resources (software, hardware, passwords/license keys,
and personnel) will be available at the scheduled time.
One way to coordinate changes is to use change proposals,
which are formal documents outlining the proposed changes.
Change proposals should be required to be submitted
a week (or more or less, depending on the local policy)
in advance of the maintenance window it is planned for.
A good trouble-ticketing system can be used to record these,
along with trouble reports and requests for enhancements (RFEs).
(A method is needed for users (including management and System Admin staff)
to request new features, new services, changed configurations, and
to report problems that may require reconfiguration,
patches or upgrades.
Such a system is vital to the help/support desk operation and is
commonly known as a trouble-ticketing system.)
Most changes, especially those requiring budget and/or with user-visible
procedure changes, require management approval (or authorization).
The questions to answer in a typical change proposal are:
What changes are to be made?
Who needs to authorize the change?
(Record when authorized and by whom.)
What budget is needed?
(And who gets billed?)
Which network devices (if any) are affected?
What/whose permissions are needed to implement the change?
What notifications must be made, and by when?
Who is responsible for making the notifications?
(This must be done early enough to allow users to inform
you of schedule conflicts.
At some point however, you freeze the schedule
for the next maintenance window.)
What is the priority of making the change?
(This information can be used to decide which changes to make
now and which ones to put off, in the event there isn't time
to do every approved change.)
What are the other service dependencies and due dates (if any)?
What / who might be affected by the change?
What other services/scripts need to be updated (e.g.,
log file monitors, backup procedures, access controls)?
Who will perform the change?
(This is often referred to as assigning the change.)
How long should it take (before we activate the back-out procedure)?
What is the test procedure?
What is the back-out procedure? How long will it take?
What follow-up procedures must be done?
Mental Checklist Before Making Changes
(by Peter Baer Galvin, ;login: Apr 08 pp. 62–67)
Is the command the right one to make the change?
Is the syntax of the command correct?
Is there a better way to make the change?
Are the right options entered or selected?
Is today Friday or some other day on which it would be
exceptionally bad to break something (e.g., the day before leaving
on vacation or for a conference)?
What are the chances that executing this will break something?
If the change would break something, can I undo the action?
Is this a documented way to accomplish the task?
If this is a new way, have I documented it?
What effect might this change have on security or other
services or subsystems?
Before using a new tool:
Do I have a better tool for this?
Is this tool/command multiplatform, or a one-off solution?
Does it work or just cause more (or different) work?
Is the tool maintained?
Does it change too often (causing more work)?
How much does it cost, really?
Do I already know this tool or is it easy to learn?
Is it likely to break something?
Large-scale Web Services:
Cannot use traditional change management, as there is only one system and
it must never go down (at least not completely).
Examples include Facebook, Twitter, Google, etc.
All changes are made to the live system, using a process known as
continuous delivery (also called continuous deployment).
Changes made by developers are checked into a source code repository,
automatically built, and unit and other tests (including code style checks)
are also automatically performed.
If any of these tests fail, the change is rejected.
Otherwise the change is presented for peer code review and compliance auditing,
and any tests that cannot be easily automated.
If approved, the change is pushed into production (automatically).
Such changes include configuration switches (or
toggles) that can be used to disable the new feature quickly.
(Such switches can also be used for “A/B” testing.)
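A feature toggle can be as simple as a configuration variable that the code
consults; the variable and feature names here are invented:

```shell
#!/bin/sh
# A toggle guarding a new feature: flipping one setting disables
# the feature instantly, with no rollback or redeploy.
NEW_SEARCH="${NEW_SEARCH:-off}"   # set to "on" to enable the new code path

if [ "$NEW_SEARCH" = on ]; then
    BACKEND=new
else
    BACKEND=old
fi
echo "serving results from the $BACKEND search backend"
```

Routing only a fraction of users through the "on" path is what turns the same
switch into an A/B test.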
Any changes must include the changes required by operations
(such as new metrics to monitor and/or new log data to collect).
This way of working (continuous changes without maintenance windows
or official releases) requires understanding of both system administration
(operations) and development, and close cooperation and coordination
between the two groups.
This is called DevOps.
What is a fudge factor?
That is a term used to describe something extra.
In this case it means add extra time.
Say you think something will take one hour.
Don't tell users (or your boss) one hour, add a
fudge factor of, say, 15 minutes or more.
The fudge factor allows for the situation when something takes longer
than you estimated.
You don't want to go over the time you told
others it will take to complete some task.
As a new system administrator you generally add a big fudge factor (say
double your original estimate).
As you gain experience, you generally add smaller and smaller fudge factors.
The concept applies to everything, not just time.
If you have a budget request for new equipment for $122.00,
then you should add a fudge factor and ask for a budget of $130.00
(or even $150.00 if you can get away with that).
When the price at CompUSA changes on you, you won't need to request a
budget all over again as long as the new price is below your original request.
The concept applies to the maximum load of elevators (only in that case
they call it a safety factor), the amount of cable
you would need to purchase to wire a network, and so on.