Amazon S3

Amazon S3 is an object storage service provided by Amazon Web Services (AWS). It can store files and structures in any format. Dataedo provides a native connector for documenting files stored in S3 in the following formats:

  • JSON
  • CSV
  • Apache Avro
  • Apache Parquet
  • Apache ORC
  • Delta Lake
  • Microsoft Excel
  • XML

Prerequisites

IAM User

To document objects stored in S3 with Dataedo, you need an IAM user with S3 read access. Dataedo uses this user to connect to the bucket. To create the user:

  1. Open the IAM service in the AWS Console,

  2. Open the Users tab,

  3. Click the Add Users button,

  4. Set the user name,

  5. In Select AWS Credential Type, check Access key – Programmatic access,

  6. In the Permissions section:

    • select Attach existing policies directly,
    • search for AmazonS3ReadOnlyAccess and check the policy,

  7. (Optional) Set tags,

  8. Review the options and, if everything is correct, click Create User,

  9. After creating the user, save the Access Key and the Secret Key. You will need them later to authenticate to S3 when connecting with Dataedo.
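
For reference, the kind of read access granted in step 6 can also be written down as an IAM policy document. The following is only a minimal sketch of a bucket-scoped, read-only policy, offered as a possible narrower alternative to the AmazonS3ReadOnlyAccess managed policy. The function name and the bucket name are hypothetical, and whether these two actions alone are sufficient for Dataedo is an assumption that should be verified before use.

```python
import json


def read_only_bucket_policy(bucket_name: str) -> dict:
    """Build an IAM policy document granting read-only access to one bucket.

    Assumption: listing the bucket and reading its objects is all that is
    needed; this has not been confirmed against Dataedo's requirements.
    """
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is granted on the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [bucket_arn],
            },
            {
                # Reading is granted on the objects inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"{bucket_arn}/*"],
            },
        ],
    }


if __name__ == "__main__":
    # "my-data-bucket" is a placeholder bucket name.
    print(json.dumps(read_only_bucket_policy("my-data-bucket"), indent=2))
```

The printed JSON could be pasted into the IAM policy editor when creating a custom policy instead of attaching the managed one.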

Amazon Resource Name - ARN

An Amazon Resource Name (ARN) is a unique identifier of an Amazon resource. Dataedo uses it to connect to the selected S3 bucket. To find the ARN:

  1. Open the S3 service in the AWS Console,

  2. Open the bucket that contains the file(s) you want to document,

  3. Open the Properties tab,

  4. Copy the ARN value.
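
An S3 bucket ARN has the form `arn:aws:s3:::bucket-name`, with the region and account fields left empty. As a sanity check before pasting the value into Dataedo, the copied ARN can be validated with a small helper. This is a sketch only: the function name `bucket_name_from_arn` and the example bucket name are hypothetical, and it is not part of Dataedo.

```python
def bucket_name_from_arn(arn: str) -> str:
    """Return the bucket name from an S3 bucket ARN, or raise ValueError."""
    # General ARN layout: arn:partition:service:region:account-id:resource
    parts = arn.strip().split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"not an ARN: {arn!r}")
    _, _partition, service, region, account, resource = parts
    if service != "s3":
        raise ValueError(f"not an S3 ARN: {arn!r}")
    # Bucket ARNs carry no region and no account ID.
    if region or account:
        raise ValueError(f"not an S3 bucket ARN: {arn!r}")
    # A slash in the resource part means the ARN points at an object.
    if not resource or "/" in resource:
        raise ValueError(f"ARN does not identify a bucket: {arn!r}")
    return resource


if __name__ == "__main__":
    # "my-data-bucket" is a placeholder bucket name.
    print(bucket_name_from_arn("arn:aws:s3:::my-data-bucket"))
```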

Connecting Dataedo to Amazon S3

Dataedo provides two ways to document files in an S3 bucket: you can either document an object stored in S3 as a structure in existing documentation, or add a new connection to the S3 bucket to create new documentation.

Document an object stored in S3 as a structure in existing documentation

Right-click Structures and select Add/Import File/Structure. In the window that opens, select Import from file.

Select the format of the file to import. If you select more than one file in the following steps, this format is used as the default, but you can still set the format for each file individually.

Select Amazon S3 as the provider.

In the Select file step, click the Connect button and provide the connection details for Amazon S3:

  • ARN - Amazon Resource Name that uniquely identifies the S3 bucket,
  • Access Key - access key ID assigned to the IAM user that Dataedo uses to connect to the S3 bucket,
  • Secret Key - secret access key paired with the Access Key; it acts as the password for the IAM user.

How to obtain these details is described in the Prerequisites section. Click Next.

In the next step, select a file or multiple files to import.

If you selected only one file, Dataedo tries to read it and, if it succeeds, opens a window with the schema and fields where you can provide details for the structure.

For multiple files, Dataedo tries to detect the format of each file. If detection fails, you will see an error and have to select the file type manually. You can also change the format if the detected one is wrong.
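
How Dataedo detects formats internally is not described here. Purely as an illustration of the idea, the sketch below guesses a format from the object key's file extension and returns `None` when it cannot decide, which corresponds to the case where the type has to be chosen manually. The mapping, the function name `guess_format`, and the sample keys are all hypothetical. Delta Lake is left out because it is a table layout spanning many files rather than a single file extension.

```python
from pathlib import PurePosixPath
from typing import Optional

# Hypothetical extension-to-format mapping; not taken from Dataedo.
EXTENSION_TO_FORMAT = {
    ".json": "JSON",
    ".csv": "CSV",
    ".avro": "Apache Avro",
    ".parquet": "Apache Parquet",
    ".orc": "Apache ORC",
    ".xlsx": "Microsoft Excel",
    ".xls": "Microsoft Excel",
    ".xml": "XML",
}


def guess_format(object_key: str) -> Optional[str]:
    """Guess a file format from an S3 object key, or return None if unknown."""
    # S3 keys use forward slashes, so PurePosixPath parses them on any OS.
    suffix = PurePosixPath(object_key).suffix.lower()
    return EXTENSION_TO_FORMAT.get(suffix)


if __name__ == "__main__":
    for key in ["sales/2023/orders.CSV", "events/part-0001.parquet", "logs/readme"]:
        print(key, "->", guess_format(key) or "select manually")
```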

Add new connection to S3 bucket

To connect to S3 and create new documentation, click Add documentation and choose Database connection.

In the Add documentation window, choose Amazon S3.

Provide the connection details for Amazon S3:

  • ARN - Amazon Resource Name that uniquely identifies the S3 bucket,
  • Access Key - access key ID assigned to the IAM user that Dataedo uses to connect to the S3 bucket,
  • Secret Key - secret access key paired with the Access Key; it acts as the password for the IAM user.

How to obtain these details is described in the Prerequisites section. Click Next.

The next screen lets you change the name under which the documentation will be visible in the Dataedo repository.

Select the format of the file to import. If you select more than one file in the following steps, this format is used as the default, but you can still set the format for each file individually.

In the next step, select a file or multiple files to import.

If you selected only one file, Dataedo tries to read it and, if it succeeds, opens a window with the schema and fields where you can provide details for the structure.

For multiple files, Dataedo tries to detect the format of each file. If detection fails, you will see an error and have to select the file type manually. You can also change the format if the detected one is wrong.

Outcome

Your S3 objects have been imported to the repository.

Data profiling

Dataedo does not support profiling objects stored in Amazon S3.

Data lineage

Dataedo does not support data lineage in Amazon S3.