Change Management is the process of planning and implementing changes to
the services and systems under your care.
It includes post-change follow-up: documentation, monitoring, analysis, and reporting.
Following proper change management procedures and documentation requirements
can yield an audit trail.
Any change to any service, including upgrades, hardware replacements, patches,
and configuration changes, results in a cut-over, which is
the process of switching users from the old service to the new one.
All changes can have unanticipated consequences (no matter how experienced you
are, or think you are).
This is the main reason why all cut-overs must be scheduled, with plenty of
notice for all potentially affected users.
A flash cut is when all users are switched
(or migrated) to the new service at the same time.
Generally this is a bad idea but may be unavoidable in some situations.
A gradual cut-over is when only a small number of users (e.g., beta-testers)
are migrated to the new service at a time.
Change management is related to configuration management, which means
the management of the configurations for all servers, workstations, and network
devices under your care.
It is also related to patch management, which is concerned with the
scheduling of patching operating systems and applications.
Planning and Testing
When possible use a test system/network to try the change
before any cut-over.
Not only will you be able to verify that the new service works, but you can
practice the change / cut-over procedure without affecting users on
a live or production system.
The safest way to cut-over a new service is to leave the old server
in place while installing/configuring the new service on a new server.
Then you can flash cut-over by switching
IP addresses, or gradually
cut-over by having a few users access the new service.
(This can be done with modern routers/switches that can re-direct traffic.)
If boot up scripts are modified, test that reboots work.
If new hardware isn't available for each new/updated service,
then have a test lab, which is a testing network and server(s).
The test machine should be as identical to the old server as possible.
If no spare hardware is available at all, then careful planning and testing are
even more important.
Sometimes a server is powerful enough to run two copies of some
service (say a web server).
Then you can test the new service (using a different port number)
before turning off the old one, even if no additional hardware is
available.
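As a sketch, the second (test) instance of a web server can listen on an
alternate port with its own configuration. The Apache-style fragment below is
illustrative only; the file names, paths, and port are invented:

```apache
# httpd-new.conf: configuration for a second, test instance.
# The old instance keeps port 80; all paths here are placeholders.
Listen 8080
DocumentRoot /srv/www-new
PidFile /var/run/httpd-new.pid
ErrorLog /var/log/httpd-new-error.log
```

Once the new instance tests clean on port 8080, the cut-over is just a matter
of swapping the port numbers (or IP addresses) in the two configurations.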
If a flash-cut must be used, make sure to make copies of all
modified configuration files before actually making changes.
Common techniques include file renaming (copying foo.config
to foo.config.working) and using a file versioning system.
These techniques are useful even when not using a flash-cut.
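A minimal sketch of that backup step in shell, using foo.config as the
stand-in file name from above:

```shell
#!/bin/sh
# Before a flash-cut, keep a known-good copy of each file you will
# touch, plus a timestamped copy so repeated changes never
# overwrite an earlier backup. foo.config is a stand-in name.
printf 'option = old\n' > foo.config       # sample file for the demo

cp foo.config foo.config.working
cp foo.config "foo.config.$(date +%Y%m%d-%H%M%S)"

ls foo.config.*
```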
File versioning systems are also called source code
revision control systems
or software configuration management systems.
Commonly used systems for Unix/Linux are
RCS, Subversion, and especially Git.
Such systems log all changes to a set of files, prevent multiple
updates at once corrupting each other, provide the ability to go back to
any previous version of some file, and may have other features too.
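A sketch of tracking configuration files with Git; here a scratch directory
stands in for a real config directory such as /etc, and the file name and
identity are invented:

```shell
#!/bin/sh
# Put configuration files under Git so every change is logged and
# any previous version can be restored.
D=conf-demo
mkdir -p "$D"
git -C "$D" init -q
git -C "$D" config user.email admin@example.com   # example identity
git -C "$D" config user.name  Admin

printf 'port = 80\n' > "$D/httpd.conf"
git -C "$D" add httpd.conf
git -C "$D" commit -qm 'known-good config'

printf 'port = 8080\n' > "$D/httpd.conf"          # an experimental change
git -C "$D" commit -qam 'try new port'

# Back-out: restore the file as it was in the previous commit.
git -C "$D" checkout -q HEAD~1 -- httpd.conf
cat "$D/httpd.conf"                               # port = 80
```

The same history also answers audit-trail questions: who changed what, when.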
Always test vendor-supplied patches, as they can often break
other software or services.
Never be the first to install some patch;
let someone else find the problems with it!
Wait at least a few days even for a critical or security patch.
Wait a few weeks before applying a kernel patch.
Make sure you have the correct software licenses for any patches
you plan on installing.
Configuration files are plain text but often have very fussy and
unforgiving syntax.
You can write simple shell or Perl scripts to validate
the syntax of configuration files before committing the change.
This can be combined with file versioning: have a script that
checks out some configuration file, puts you into an
editor, then when you quit the editor checks the syntax, and
checks the file back in only if the syntax is okay.
You can also use custom
editors (such as visudo for /etc/sudoers) to edit the files safely,
rather than using a plain text editor.
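The validate-before-commit idea can be sketched like this; sh -n (a
parse-only check) stands in for a real validator, and the file names are
invented:

```shell
#!/bin/sh
# Validate an edited copy of a config file and install it only if
# the syntax check passes. sh -n parses without executing, which
# suits shell-style config files; substitute the right checker for
# other formats (visudo -c, named-checkconf, apachectl configtest).
printf 'PORT=80\n' > foo.conf.new      # the freshly edited copy

if sh -n foo.conf.new; then
    cp foo.conf.new foo.conf           # check passed: commit the change
    echo "foo.conf updated"
else
    echo "syntax error: foo.conf left untouched" >&2
fi
```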
Always ask yourself what other services and/or applications may be
affected by some change.
(For example, if updating servers results in data or log file format
changes, the monitoring system may need to be upgraded and/or reconfigured
at the same time.)
Always practice unfamiliar procedures.
After practicing and testing some update procedure, estimate
how long the cut-over should take if things go well.
(This should include the time needed to test and verify the change
worked as expected, and that other services didn't break.)
Then estimate how long the back-out procedure will take (if
things don't go well).
Add these two values plus some
fudge factor to obtain
the estimated length of the disruption for the change.
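A worked example of the estimate; the minute counts are invented:

```shell
#!/bin/sh
# Disruption estimate = cut-over time + back-out time + fudge factor.
CUTOVER=45   # minutes to make, test, and verify the change
BACKOUT=20   # minutes to restore the old service if things go badly
FUDGE=15     # extra slack for surprises

WINDOW=$((CUTOVER + BACKOUT + FUDGE))
echo "announce a ${WINDOW}-minute service disruption"   # 80 minutes
```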
Design a back-out plan for when the change procedure fails.
This is a plan to restore the original service quickly.
Have a firm deadline for successfully completing the cut-over.
Once that deadline passes, activate the back-out plan. Don't give in to the temptation to try just one more thing.
Always have a back-out plan even if you think the change is trivial.
Sometimes the back-out plan can be simple: if attempting to cut-over to
new hardware, usually you make the change and reboot after changing the
new host's IP address to match the old host's address.
In this case the back-out plan is just don't change IP addresses.
Even this type of change needs planning in advance.
Hosts and remote DNS servers will cache IP
addresses for a long time, usually days.
This cache time is set on your DNS server;
it's called the time to live (TTL) value.
At least one TTL period before an IP change cut-over,
change the TTL to one day.
The day before the cut-over, change the TTL again to
5 minutes or whatever interval of time you deem appropriate for users
to start using the new service.
After the cut-over is deemed successful, change the TTL
back to the original value.
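In a BIND-style zone file the TTL looks like this; the name and addresses
are examples, and the fragment shows the before/during/after states rather
than one file:

```dns
; Normal operation: two-day TTL
$TTL 172800
www   IN  A   192.0.2.10   ; old server

; The day before the cut-over, lower the TTL:
$TTL 300                   ; 5 minutes
www   IN  A   192.0.2.10

; At cut-over, change the address; clients see it within 5 minutes:
$TTL 300
www   IN  A   192.0.2.20   ; new server

; Once the cut-over is deemed successful, restore $TTL 172800.
```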
Always schedule service changes.
(The exceptions are trivial changes and emergency fixes.)
Never stop or reconfigure any service without
notifying all potentially affected users of the scheduled disruption.
Always allow sufficient time for users to notify you of schedule
conflicts (e.g., “That's the day of the big demonstration!”).
Re-notify users just before starting a procedure that may result
in a service disruption.
Provide a follow up notice after the cut-over, even if you canceled
the cut-over or had to activate the back-out plan.
Provide additional notification to help desk personnel, so they can
prepare for the questions and problems that may arise.
Provide notice to appropriate management; even if they are not
directly affected, they may need to know about scheduled changes.
A maintenance window is a period when regular updates,
backups, and other routine maintenance are performed.
This is a scheduled interval when users should expect service
disruptions; users should not expect the system to be available during this period.
If some change isn't critical it is a good idea to wait for the
next maintenance window to do it.
Testing is also done during maintenance windows.
Testing means rebooting servers, routers, and switches, and checking
that everything comes back up correctly.
You can also test what happens when multiple pieces of equipment are
powered off and rebooted at the same time.
Often a changed configuration goes unnoticed until some equipment reboots.
Testing can convert unplanned (expensive and embarrassing) outages into
planned ones.
If the change can't be done within a single window consider doing the change
in two (or more) stages so no unscheduled disruptions are needed.
If this isn't possible try to avoid scheduling changes during likely
busy times: during trade-shows, end-of month/quarter/year/semester, or
when required personnel are unavailable (e.g., vacation times).
Coordinate changes with the backup schedule.
Perform backups at the start of a maintenance window, then do other changes.
This can obviously be important if things go badly,
but can be even more important if the change goes well.
(New versions of services often have log file or data format changes.
If the new service starts up immediately after a backup you avoid the
unpleasant situation of log files and/or data files with half old data
and half new data.)
If there are many changes to be made during a maintenance window then they
must be coordinated.
Make sure the necessary resources (software, hardware, passwords/license keys,
and personnel) will be available at the scheduled time.
One way to coordinate changes is to use change proposals,
which are formal documents outlining the proposed changes.
Change proposals should be required to be submitted
a week (or more or less, depending on the local policy)
in advance of the maintenance window it is planned for.
A good trouble-ticketing system can be used to record these,
along with trouble reports and requests for enhancements (RFEs).
(A method is needed for users (including management and System Admin staff)
to request new features, new services, changed configurations, and
to report problems that may require reconfiguration,
patches or upgrades.
Such a system is vital to the help/support desk operation and is
commonly known as a trouble-ticketing system.)
Most changes, especially those requiring budget and/or with user-visible
procedure changes, require management approval (or authorization).
The questions to answer in a typical change proposal are:
What changes are to be made?
Who needs to authorize the change?
(Record when authorized and by whom.)
What budget is needed?
(And who gets billed?)
Which network devices (if any) are affected?
What/whose permissions are needed to implement the change?
What notifications must be made, and by when?
Who is responsible for making the notifications?
(This must be done early enough to allow users to inform
you of schedule conflicts.
At some point however, you freeze the schedule
for the next maintenance window.)
What is the priority of making the change?
(This information can be used to decide which changes to make
now and which ones to put off, in the event there isn't time
to do every approved change.)
What are the other service dependencies and due dates (if any)?
What / who might be affected by the change?
What other services/scripts need to be updated (e.g.,
log file monitors, backup procedures, access controls)?
Who will perform the change?
(This is often referred to as assigning the change.)
How long should it take (before we activate the back-out procedure)?
What is the test procedure?
What is the back-out procedure? How long will it take?
What follow-up procedures must be done?
Mental Checklist Before Making Changes
(by Peter Baer Galvin, ;login: Apr 08 pp. 62–67)
Is the command the right one to make the change?
Is the syntax of the command correct?
Is there a better way to make the change?
Are the right options entered or selected?
Is today Friday or some other day on which it would be
exceptionally bad to break something (e.g., the day before leaving
on vacation or for a conference)?
What are the chances that executing this will break something?
If the change would break something, can I undo the action?
Is this a documented way to accomplish the task?
If this is a new way, have I documented it?
What effect might this change have on security or other
services or subsystems?
Before using a new tool:
Do I have a better tool for this?
Is this tool/command multiplatform, or a one-off solution?
Does it work or just cause more (or different) work?
Is the tool maintained?
Does it change too often (causing more work)?
How much does it cost, really?
Do I already know this tool or is it easy to learn?
Is it likely to break something?
Large-scale Web Services:
Cannot use traditional change management, as there is only one system and
it must never go down (at least not completely).
Examples include Facebook, Twitter, Google, etc.
All changes are made to the live system, using a process known as
continuous delivery (also called continuous deployment).
Changes made by developers are checked into a source code repository,
automatically built, and unit and other tests (including code style checks)
are also automatically performed.
If any of these tests fail, the change is rejected.
Otherwise the change is presented for peer code review and compliance auditing,
and any tests that cannot be easily automated.
If approved, the change is pushed into production (automatically).
Such changes include configuration switches (or
toggles) that can be used to disable the new feature quickly.
(Such switches can also be used for “A/B” testing.)
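A feature toggle can be as simple as a configuration variable that the code
consults; the variable and feature names here are invented:

```shell
#!/bin/sh
# A toggle guarding a new feature: flipping one setting disables
# the feature instantly, with no rollback or redeploy.
NEW_SEARCH="${NEW_SEARCH:-off}"   # set to "on" to enable the new code path

if [ "$NEW_SEARCH" = on ]; then
    BACKEND=new
else
    BACKEND=old
fi
echo "serving results from the $BACKEND search backend"
```

Routing only a fraction of users through the "on" path is what turns the same
switch into an A/B test.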
Any changes must include the changes required by operations
(such as new metrics to monitor and/or new log data to collect).
This way of working (continuous changes without maintenance windows
or official releases) requires understanding of both system administration
(operations) and development, and close cooperation and coordination
between the two groups.
This is called DevOps.
What is a fudge factor?
That is a term used to describe something extra.
In this case it means add extra time.
Say you think something will take one hour.
Don't tell users (or your boss) one hour, add a
fudge factor of, say, 15 minutes or more.
The fudge factor allows for the situation when something takes longer
than you estimated.
You don't want to go over the time you told
others it will take to complete some task.
As a new system administrator you generally add a big fudge factor (say
double your original estimate).
As you gain experience, you generally add smaller and smaller fudge factors.
The concept applies to everything, not just time.
If you have a budget request for new equipment for $122.00,
then you should add a fudge factor and ask for a budget of $130.00
(or even $150.00 if you can get away with that).
When the price at CompUSA changes on you, you won't need to request a
budget all over again as long as the new price is below your original request.
The concept applies to the maximum load of elevators (only in that case
they call it a safety factor), the amount of cable
you would need to purchase to wire a network, and so on.