Initiating data collection - AWS Prescriptive Guidance

Data collection is the process of gathering metadata from applications and infrastructure. The process is iterative throughout all stages of assessment, and in each stage the quantity and fidelity of the data increase. At this stage, the focus is on gathering general data that can help to establish an initial inventory. The inventory will be used to create a directional business case and to identify initial migration candidates.

After the current data sources have been identified, we recommend gathering information from as many systems as possible. For more information, see the data requirements for this stage.

This approach has the benefit of helping to update the current portfolio view and the organization's knowledge of their applications and services. It also helps with determining what is targeted to move. The recommended approach is to review existing data, such as configuration management database (CMDB) outputs and information technology service management (ITSM) systems. Then construct a list of assets targeted for data collection. If your organization has complete clarity of what is in scope and out of scope for the migration, you might restrict data collection to the systems that are in scope.
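Combining CMDB and ITSM outputs into a single target list can be sketched as follows. This is a minimal illustration, not a prescribed tool; the field names (`hostname`, `os`, `owner`) and the hostname-based deduplication key are assumptions, and real exports will need mapping to your own schemas.

```python
# Sketch: merge CMDB and ITSM exports into one deduplicated list of assets
# targeted for data collection. Field names are illustrative assumptions.

def merge_asset_lists(cmdb_assets, itsm_assets):
    """Combine two asset exports, deduplicating by hostname (case-insensitive)."""
    merged = {}
    for source, assets in (("cmdb", cmdb_assets), ("itsm", itsm_assets)):
        for asset in assets:
            key = asset["hostname"].strip().lower()
            record = merged.setdefault(key, {"hostname": key, "sources": []})
            record["sources"].append(source)
            # Keep any attribute that the other source lacked.
            for field, value in asset.items():
                record.setdefault(field, value)
    return list(merged.values())

cmdb = [{"hostname": "CRM-Prod-01", "os": "Linux"}]
itsm = [{"hostname": "crm-prod-01", "owner": "Sales IT"},
        {"hostname": "erp-dev-01", "owner": "Finance IT"}]

assets = merge_asset_lists(cmdb, itsm)
```

Records found in both sources keep a `sources` trail, which is useful later when performing the gap analysis between what each system knows about an asset.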

When building your portfolio, consider the applications and their environments or software release lifecycles. For example, instead of identifying a customer relationship management (CRM) application and specifying that it has test, dev, and prod environments, list three applications (for example, CRM-Test, CRM-Dev, CRM-Prod). Alternatively, use the CRM name but assign a unique ID to each environment and present them as separate records in your data repository. This will help with planning and tracking the migration of these environments individually. For example, you might want to migrate non-production environments first. By listing the instances of your application according to the environment, you can clearly manage and govern their transition.
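The per-environment record approach can be sketched in a few lines. The ID format, environment names, and the `migration_status` field are illustrative assumptions; the point is that each environment becomes its own record that can be filtered and scheduled independently.

```python
# Sketch: expand one application into per-environment portfolio records,
# each with a unique ID, so environments can be migrated and tracked
# individually. The ID format and field names are assumptions.

def expand_environments(app_name, environments):
    """Return one portfolio record per application environment."""
    return [
        {
            "app_id": f"{app_name.upper()}-{env.upper()}",
            "application": app_name,
            "environment": env,
            "migration_status": "not_started",
        }
        for env in environments
    ]

records = expand_environments("CRM", ["dev", "test", "prod"])

# Non-production environments can now be selected and scheduled first.
non_prod = [r for r in records if r["environment"] != "prod"]
```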

During data collection, there might be uncertainty about which applications or servers are in a given data center or source location. In these cases, obtaining bare-metal and hypervisor lists from existing management tools is helpful. For example, you can connect to a hypervisor to obtain lists of virtual machines to be targeted for data collection.
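Most hypervisor management tools can export a virtual machine inventory, often as CSV. The following sketch turns such an export into a list of data-collection targets. The column names, the `poweredOn` state value, and the filter itself are assumptions; adjust them to the export format of your hypervisor's tooling.

```python
# Sketch: parse a hypervisor VM inventory export (CSV) into a list of
# targets for data collection. Column names and the power-state filter
# are illustrative assumptions.
import csv
import io

def vms_from_export(csv_text):
    """Parse a VM inventory export and keep powered-on VMs as targets."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {"name": row["vm_name"], "host": row["host"]}
        for row in reader
        if row["power_state"] == "poweredOn"
    ]

export = """vm_name,host,power_state
crm-prod-01,host-a,poweredOn
crm-test-01,host-a,poweredOff
erp-dev-01,host-b,poweredOn
"""
targets = vms_from_export(export)
```

Powered-off VMs are excluded here only as an example of scoping; depending on your migration strategy, you might keep them in the inventory and flag them for review instead.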

Note that the initial output from combining existing data sources could be incomplete. The key is to perform a gap analysis between the data requirements for this stage and what can be obtained from existing sources. It's important to contrast the percentage of completeness with the level of data fidelity. High completeness levels from low-fidelity sources rest on assumptions that could lead to flawed analysis. Although this stage of assessment does not require the maximum data fidelity, we recommend that data sources are at least medium to medium-high fidelity. Weigh these numbers against your organization's tolerance for risk, including its tolerance for using assumptions to fill data gaps.
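A basic completeness score for the gap analysis can be computed as follows; the result should then be weighed against each source's fidelity rating. The required field names are illustrative assumptions, standing in for the data requirements defined for this stage.

```python
# Sketch: score completeness of required fields across asset records, as
# one input to the gap analysis. The required fields are illustrative
# assumptions; substitute the data requirements for this stage.

REQUIRED_FIELDS = ["hostname", "os", "cpu", "memory_gb", "owner"]

def completeness(assets, required=REQUIRED_FIELDS):
    """Percentage of required fields populated across all asset records."""
    total = len(assets) * len(required)
    filled = sum(1 for a in assets for f in required if a.get(f))
    return round(100 * filled / total, 1) if total else 0.0

assets = [
    {"hostname": "crm-prod-01", "os": "Linux", "cpu": 4, "memory_gb": 16},
    {"hostname": "erp-dev-01", "owner": "Finance IT"},
]
score = completeness(assets)
```

A score like this only measures how much data is present, not how trustworthy it is, which is why pairing it with the source's fidelity level matters before filling gaps with assumptions.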

The gap analysis helps you understand the quantity and quality of data you are working with. The analysis also helps you to establish the level of assumptions that must be made to create a directional business case and prioritize applications for migration. Discovery tooling can help to fill the gaps and collect high-fidelity data. To increase the confidence levels in data and accelerate migration outcomes, we recommend deploying discovery tooling as early as possible. Early action is also important because internal procurement, security, and implementation processes for new tools could require several weeks or months to complete.

We recommend establishing a communication plan or cadence and a scope-change control mechanism at this stage. This helps you to keep stakeholders informed so that they can plan ahead and mitigate risks. A key element for clear communications is to define a single source of truth for the application portfolio and associated infrastructure. Avoid keeping multiple systems of record and application and infrastructure lists. Keep data in one place (for example, a database, a tool, or a spreadsheet) that supports versioning and online collaboration, and assign an owner to it.