# Collecting and preparing the datasets

At this stage, the teams search for the datasets needed to implement the Guide and assemble the CIS. Use official public data sources as much as possible. It may also be necessary to arrange the publication of datasets that have been identified but remain in the custody of public authorities.

It is recommended that the advocacy team within the government manage the delivery of the datasets. The datasets should be stored in a shared repository so that other teams can view and review them, and the files should be tracked in a spreadsheet copied from [this one](https://docs.google.com/spreadsheets/d/1V5c2BHxzSFGS09O9uZHu8obIFSsg-ErHiZUdPbN7BSI/edit#gid=0).

After obtaining the datasets to be included in the CIS, the data analyst team cleans, processes and aggregates the information so that the CIS is ready for public consultation. The linked spreadsheet contains a suggested computation for each indicator. For example, for the indicator "Average number of pupils enrolled in the initial and pre-school levels of the initial modality in the public sector by educational units", the operational definition is the "quotient of the sum of enrolment and the number of educational units in the public sector".
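
The operational definition quoted above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample rows and the column names `unit_id`, `sector` and `enrolment` are hypothetical and do not come from the Guide or from any real dataset.

```python
import csv
import io

# Hypothetical sample data: one row per educational unit.
# Column names (unit_id, sector, enrolment) are assumptions for illustration.
SAMPLE_CSV = """unit_id,sector,enrolment
u1,public,120
u2,public,80
u3,private,200
u4,public,100
"""


def average_enrolment_per_unit(rows, sector="public"):
    """Quotient of the sum of enrolment and the number of units in a sector."""
    enrolments = [int(r["enrolment"]) for r in rows if r["sector"] == sector]
    if not enrolments:
        return None  # no units in the sector: the indicator is undefined
    return sum(enrolments) / len(enrolments)


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
indicator = average_enrolment_per_unit(rows)
print(indicator)  # 100.0 (300 pupils across 3 public units)
```

Returning `None` when a sector has no units keeps an undefined indicator distinct from an indicator whose value is zero.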

## Relevant data types

Within the [Open Up Guide for the Care Sector](https://airtable.com/app32bSjtCkSedl3h/shrv2ZRLgPx06xkXm/tbl84YWZFWr1ZSDrA), the list of indicators includes a column with the suggested data source for each indicator. The column describes, in general terms, a data source from which the indicator could be constructed, modelled on the datasets that countries publish through their national statistical offices. For some indicators, however, it will be necessary to combine several sources, depending on how responsibilities are divided between levels of government. For example, identifying all social programmes, and then those related to care, may require locating both national and local databases.

## Preparing the datasets

Depending on the state they are in, the datasets will need different levels of preparation and processing. Ideally the datasets will have the characteristics described in the seminal paper [Tidy Data](https://vita.had.co.nz/papers/tidy-data.pdf) by Hadley Wickham, the author of widely used R packages such as ggplot2 and dplyr. In summary, the datasets should have the following characteristics to be used in the CIS:

* Published in .csv format.
* Available in a tabular format.
* The first row of the table contains the column names.
* Column names contain no spaces or special characters. A snake\_case nomenclature is recommended for columns (lower case; no special characters, accents, or ñ; underscores in place of spaces).
* Categorical data is normalised. For example, in a column of positive and negative responses, use a single representation for each value, e.g. "Yes" and "No" rather than a mix of "Yes", "yes", "1", "true", etc.
* Territorial names are standardised. For example, if several datasets have a column with the names of municipalities, use a single master list of names so that the same place is not named differently across datasets: always write "Buenos Aires" rather than "B.A.", "BsAs", etc.
* Null values are represented consistently. A null value is a value that was not answered, and ideally it is represented in the same way in all CIS datasets. A value such as N/A (not applicable) is different from a null value (null, nil).
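
Several of the characteristics above can be applied with small helper functions. The following is a minimal sketch in Python, not a prescribed implementation: the function names, the lookup tables (`YES_NO`, `PLACE_NAMES`, `NULL_MARKERS`) and the sample row are hypothetical, and a real project would maintain its own shared master lists.

```python
import re
import unicodedata


def to_snake_case(name):
    """Lower-case a column name, strip accents, and join words with underscores."""
    # Decompose accented characters (e.g. "ñ" -> "n" + combining tilde),
    # then drop everything that is not plain ASCII.
    ascii_name = (
        unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    )
    # Replace every run of non-alphanumeric characters with one underscore.
    return re.sub(r"[^a-z0-9]+", "_", ascii_name.lower()).strip("_")


# Hypothetical lookup tables, keyed by the lower-cased raw value.
YES_NO = {"yes": "Yes", "y": "Yes", "1": "Yes", "true": "Yes",
          "no": "No", "n": "No", "0": "No", "false": "No"}
PLACE_NAMES = {"b.a.": "Buenos Aires", "bsas": "Buenos Aires",
               "buenos aires": "Buenos Aires"}
# "N/A" is deliberately absent: "not applicable" is not the same as "not answered".
NULL_MARKERS = {"", "null", "nil"}


def clean_value(value, mapping=None):
    """Return None for null markers, the canonical form for mapped values,
    and the trimmed original value otherwise."""
    text = value.strip()
    if text.lower() in NULL_MARKERS:
        return None
    if mapping is not None:
        return mapping.get(text.lower(), text)
    return text


# Usage on one hypothetical raw row.
header = [to_snake_case(h) for h in ["Municipality Name", "Has Care Programme?"]]
row = [clean_value("BsAs", PLACE_NAMES), clean_value(" yes ", YES_NO)]
print(header)  # ['municipality_name', 'has_care_programme']
print(row)     # ['Buenos Aires', 'Yes']
```

Values that are not found in a lookup table are passed through unchanged rather than discarded, so unexpected spellings remain visible for review and can be added to the master list.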

