🔹Collecting and preparing the datasets

At this stage, the teams search for the datasets needed to implement the Guide and assemble the CIS. Official public data sources should be used as far as possible, but it may also be necessary to arrange the publication of datasets that have been identified but are still held by public authorities.

It is recommended that the advocacy team within the government manages the delivery of the datasets. The datasets should be stored in a shared repository so that other teams can view and review them, and the files should be tracked in a spreadsheet that is a copy of the one shared in this link.

After obtaining the datasets to be included in the CIS, the data analyst team will clean, process and aggregate the information in them so that the CIS is ready for public consultation. The link to the datasets contains a suggested computation for each indicator (e.g. for the indicator "Average number of pupils enrolled in the initial and pre-school levels of the initial modality in the public sector by educational units", the operational definition is the "quotient of the sum of enrolment and the number of educational units in the public sector", that is, total enrolment divided by the number of educational units).
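The operational definition quoted above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample rows and the column names `unit_id`, `sector` and `enrolment` are hypothetical and do not come from the Guide's datasets.

```python
import csv
import io

# Hypothetical sample data: one row per educational unit.
# The column names (unit_id, sector, enrolment) are assumptions for illustration.
SAMPLE_CSV = """unit_id,sector,enrolment
u001,public,120
u002,public,80
u003,private,200
u004,public,100
"""


def average_enrolment_per_unit(csv_text, sector="public"):
    """Return total enrolment divided by the number of units in one sector."""
    reader = csv.DictReader(io.StringIO(csv_text))
    enrolments = [int(row["enrolment"]) for row in reader if row["sector"] == sector]
    if not enrolments:
        return None  # no units in this sector, so the quotient is undefined
    return sum(enrolments) / len(enrolments)


indicator = average_enrolment_per_unit(SAMPLE_CSV)
print(indicator)  # 100.0, i.e. (120 + 80 + 100) / 3
```

Returning `None` when a sector has no units is one possible design choice; it avoids a division by zero and keeps the "no data" case distinct from an average of zero.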

Relevant data types

Within the Open Up Guide for the Care Sector, the list of indicators includes a column with the suggested data source. It describes, in abstract terms, a possible source from which the indicator could be constructed, and attempts to describe datasets that countries publish through their national statistical offices. For some indicators, however, the data will have to be completed from several sources, depending on how responsibilities are divided among the levels of government. For example, to identify all social programmes, and then those related to care, it may be necessary to locate both national and local databases.

Preparing the datasets

The datasets will need different levels of preparation and processing, depending on the state they are in. Ideally, they will have the characteristics described in the seminal paper Tidy Data by Hadley Wickham, author of widely used R packages such as tidyr and ggplot2. In summary, the datasets should have the following characteristics to be used in the CIS:

  • Published in .csv format.

  • Available in a tabular format.

  • Column names appear in the first row of the table.

  • Column names do not contain spaces or special characters. A snake_case naming convention is recommended: lower case, no accents, ñ or other special characters, and underscores in place of spaces.

  • Categorical data in a column must be normalised. For example, in a column of affirmative and negative answers, use a single representation for all responses, e.g. "Yes" and "No" instead of a mix of "Yes", "yes", "1", "true", etc.

  • Where territorial units are named, standardise the names in all cases. For example, if several datasets have a column with the names of municipalities, use a single master list of names so that the same place is not spelled differently across datasets. For "Buenos Aires", use exactly that name everywhere instead of "B.A.", "BsAs", etc.

  • Pay special attention to data fields representing null values. A null value marks a field that was not answered, and ideally it is represented in the same way in all CIS datasets. A value such as N/A (not applicable) is different from a null value (null, nil).
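The characteristics listed above can be illustrated with a minimal cleaning sketch in Python, using only the standard library. Everything specific in it is an assumption made for the example: the raw rows, the column names, the lookup tables `YES_NO` and `MUNICIPALITIES`, and the choice of `None` as the single representation of a null value. A real project would keep the master lists in a shared file.

```python
import csv
import io
import re
import unicodedata

# Hypothetical raw extract; all column names and values are illustrative.
RAW_CSV = """Municipality Name,Has Day-Care?,Niños inscritos
Buenos Aires,Yes,120
BsAs,yes,
B.A.,1,N/A
Rosario,No,45
"""

# Assumed lookup tables (keys are lower-cased variants, values are canonical forms).
YES_NO = {"yes": "Yes", "1": "Yes", "true": "Yes",
          "no": "No", "0": "No", "false": "No"}
MUNICIPALITIES = {"buenos aires": "Buenos Aires", "bsas": "Buenos Aires",
                  "b.a.": "Buenos Aires", "rosario": "Rosario"}
# Markers treated as "not answered"; "N/A" is deliberately NOT in this set.
NULL_MARKERS = {"", "null", "nil"}


def to_snake_case(name):
    """Lower-case, strip accents (ñ becomes n), join words with underscores."""
    ascii_name = (unicodedata.normalize("NFKD", name)
                  .encode("ascii", "ignore").decode("ascii"))
    return re.sub(r"[^a-z0-9]+", "_", ascii_name.lower()).strip("_")


def normalise(value, mapping):
    """Map a value to its canonical form; leave nulls and unknown values as-is."""
    if value is None:
        return None
    return mapping.get(value.lower(), value)


def clean_rows(csv_text):
    reader = csv.reader(io.StringIO(csv_text))
    header = [to_snake_case(col) for col in next(reader)]  # first row = column names
    cleaned = []
    for raw in reader:
        row = dict(zip(header, (value.strip() for value in raw)))
        # Represent every unanswered value the same way (None in this sketch).
        row = {k: (None if v.lower() in NULL_MARKERS else v) for k, v in row.items()}
        row["municipality_name"] = normalise(row["municipality_name"], MUNICIPALITIES)
        row["has_day_care"] = normalise(row["has_day_care"], YES_NO)
        cleaned.append(row)
    return cleaned


rows = clean_rows(RAW_CSV)
for row in rows:
    print(row)
```

In this sketch the three spellings of Buenos Aires collapse to one name, the three affirmative spellings collapse to "Yes", the empty cell becomes `None`, and "N/A" is left untouched because "not applicable" is not the same as "not answered".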
