Large file uploads to a virtual data room API: the S3 presigned URL flow

How presigned-URL uploads work, why they exist, when to use them instead of multipart POST, and a complete worked example against the Papermark API, with chunked retry and multipart-upload support for files over 5 GB.

May 2, 2026·8 min read·By dataroom.dev

The Papermark document API, like most modern object-storage-backed APIs, accepts two distinct upload styles:

  1. Multipart POST through the API itself. Convenient for small files, but throughput is capped by the API gateway's request-body limits and the application server's bandwidth.
  2. S3 presigned URL flow: the API hands you a one-shot signed URL, you PUT the bytes directly to S3 (or whichever object store backs the API), then you confirm the upload back to the API. No bytes pass through the application tier at all.

For anything over about 5 MB (investor decks with embedded video, large financial models with image-rendered charts, legal document packets with high-resolution scans, image-heavy product datasheets), the presigned flow is the right choice. This article walks through it end-to-end, including chunked retry logic for flaky networks, S3 multipart upload for files over 5 GB, and the subtle gotchas that bite teams implementing this for the first time.

Why presigned URLs exist

The straightforward multipart/form-data POST through your API works for small files but has four meaningful costs as file size grows:

  1. The application tier becomes a proxy. Every byte of every upload passes through it on the way to object storage. For a 500 MB file, that's 500 MB of compute and bandwidth your application server is responsible for. At scale, this is wasted infrastructure cost, typically 2-4x more expensive than the storage itself.
  2. Throughput is capped by the API gateway's request-body limit and the application server's connection limits. Most production-grade API gateways cap request bodies at 100 MB or less. Many cap at 10 MB by default.
  3. Latency is doubled. The byte path is client → API → S3 instead of client → S3 directly. For users on long-distance connections (think: bidder in Singapore uploading to a US-region dataroom), the extra hop adds 200-600ms per chunk.
  4. Resume-on-failure is awkward. A network blip 80% of the way through a 500 MB upload restarts the whole thing. With presigned multipart, only the failed part has to be retried.

A presigned URL is a short-lived URL (typically valid for 15 minutes) that carries a cryptographic signature granting permission to PUT an object to one specific S3 key. The client uploads directly to S3, bypassing the application entirely. S3 handles the bytes; the application just hands out the signature and observes the finalization. This is the same pattern Stripe uses for receipts, Slack uses for file shares, Notion uses for image uploads, and AWS itself uses for its console-based S3 uploads.
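To make the mechanism concrete, here is a toy signer built on HMAC-SHA256 from node:crypto. It is not AWS Signature Version 4, which signs a canonical request with a derived key, and the function names, query parameter names, and example host are all made up for illustration. It only demonstrates the property that matters: the signature covers the method, the object key, and the expiry, so changing any of them invalidates the URL.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Toy scheme for illustration only. Real S3 URLs use AWS Signature Version 4.
function sign(secret: string, method: string, key: string, expires: number): string {
  return createHmac("sha256", secret)
    .update(`${method}\n${key}\n${expires}`)
    .digest("hex");
}

// Server side: issue a URL that allows one method on one key until `expires`.
function presign(secret: string, method: string, key: string, ttlSec: number, nowSec: number): string {
  const expires = nowSec + ttlSec;
  const sig = sign(secret, method, key, expires);
  return `https://storage.example/${encodeURIComponent(key)}?expires=${expires}&sig=${sig}`;
}

// Storage side: reject expired URLs, then recompute the signature and compare.
function verify(secret: string, method: string, url: string, nowSec: number): boolean {
  const u = new URL(url);
  const key = decodeURIComponent(u.pathname.slice(1));
  const expires = Number(u.searchParams.get("expires"));
  if (!Number.isFinite(expires) || nowSec > expires) return false;
  const given = Buffer.from(u.searchParams.get("sig") ?? "");
  const expected = Buffer.from(sign(secret, method, key, expires));
  // timingSafeEqual requires equal lengths, so check that first
  return given.length === expected.length && timingSafeEqual(given, expected);
}
```

The storage side never needs to look anything up per upload. It only needs the shared secret, which is why the application tier can hand out permissions without staying in the byte path.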

The three-step flow

1. POST  /v1/documents     { name, size, mime_type, upload: "presigned" }
         ──▶ returns       { document_id, upload_url, expires_in, headers }

2. PUT   <upload_url>      <bytes>
         (direct to S3 — no application tier involvement)

3. POST  /v1/documents/:id/finalize
         ──▶ returns       { document_id, status: "ready" }

Step 1 reserves a slot in the API's database and generates the signed S3 URL. Step 2 sends the bytes. Step 3 tells the API that bytes are in place, triggering the next stages of processing (virus scan, OCR for text extraction, preview generation, indexing).

A complete TypeScript implementation

import fs from "node:fs";
import { stat } from "node:fs/promises";
import { basename } from "node:path";
import mime from "mime-types";

const PM_API = "https://api.papermark.com/v1";
const TOKEN = process.env.PAPERMARK_TOKEN!;

async function uploadLargeDocument(path: string, dataroomId?: string) {
  const stats = await stat(path);
  // basename uses the host platform's path separator
  const filename = basename(path);
  const contentType = mime.lookup(filename) || "application/octet-stream";

  // Step 1 — reserve a slot and get a presigned URL
  const reserveRes = await fetch(`${PM_API}/documents`, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${TOKEN}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      name: filename,
      size: stats.size,
      mime_type: contentType,
      upload: "presigned",
      dataroom_id: dataroomId,
    }),
  });

  if (!reserveRes.ok) {
    throw new Error(`reserve failed: ${reserveRes.status} ${await reserveRes.text()}`);
  }
  const { data } = await reserveRes.json();
  // data = { document_id, upload_url, expires_in: 900, headers: {...} }

  // Step 2 — PUT the bytes directly to S3
  const stream = fs.createReadStream(path);
  const putRes = await fetch(data.upload_url, {
    method: "PUT",
    headers: {
      "Content-Type": contentType,
      "Content-Length": String(stats.size),
      // Don't add custom headers unless data.headers explicitly includes them
      // — they break the S3 signature
      ...(data.headers ?? {}),
    },
    // @ts-expect-error — node fetch accepts a readable stream
    body: stream,
    duplex: "half",
  });

  if (!putRes.ok) {
    throw new Error(`PUT to S3 failed: ${putRes.status} ${await putRes.text()}`);
  }

  // Step 3 — confirm
  const finalizeRes = await fetch(
    `${PM_API}/documents/${data.document_id}/finalize`,
    {
      method: "POST",
      headers: { Authorization: `Bearer ${TOKEN}` },
    },
  );

  if (!finalizeRes.ok) {
    throw new Error(`finalize failed: ${finalizeRes.status}`);
  }

  return finalizeRes.json();
}

const result = await uploadLargeDocument("./big-deck.pdf", "dr_pelican");
console.log(`Uploaded as ${result.data.document_id}`);

The whole thing is about 60 lines of code, no SDK required. The Papermark TypeScript SDK wraps this with progress callbacks, automatic retry, and stream-friendly APIs if you want them.

Adding retry on flaky networks

A 500 MB upload over a hotel WiFi connection, a moving train's onboard internet, or a developing-country mobile network will sometimes fail mid-PUT. Wrap step 2 in exponential-backoff retry with jitter:

// Marks failures that must not be retried (for example, a 403 signature mismatch)
class NonRetryableError extends Error {}

async function putWithRetry(
  url: string,
  path: string,
  contentType: string,
  size: number,
  extraHeaders: Record<string, string> = {},
  maxAttempts = 5,
): Promise<Response> {
  let lastErr: unknown;

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    let serverDelayMs: number | undefined;

    try {
      // Open a fresh stream on every attempt; a consumed stream cannot be replayed
      const stream = fs.createReadStream(path);
      const r = await fetch(url, {
        method: "PUT",
        headers: {
          "Content-Type": contentType,
          "Content-Length": String(size),
          ...extraHeaders,
        },
        // @ts-expect-error: Node's fetch accepts a readable stream
        body: stream,
        duplex: "half",
      });

      // 2xx is success
      if (r.ok) return r;

      // 5xx and specific 4xx codes are retryable
      const retryable =
        (r.status >= 500 && r.status < 600) ||
        r.status === 408 || // Request Timeout
        r.status === 429;   // Too Many Requests
      if (!retryable) {
        throw new NonRetryableError(`non-retryable HTTP ${r.status}: ${await r.text()}`);
      }

      // Respect Retry-After when it is given in seconds; other forms fall
      // through to the normal backoff
      const retryAfter = Number(r.headers.get("Retry-After"));
      if (Number.isFinite(retryAfter) && retryAfter > 0) {
        serverDelayMs = retryAfter * 1000;
      }

      throw new Error(`retryable HTTP ${r.status}`);
    } catch (e) {
      // Without this check, the catch would swallow and retry permanent failures
      if (e instanceof NonRetryableError) throw e;

      lastErr = e;
      if (attempt === maxAttempts) break;

      // Exponential backoff with jitter: 1s, 2s, 4s, 8s plus 0-1s jitter,
      // unless the server asked for a specific delay
      const wait = serverDelayMs ?? 2 ** (attempt - 1) * 1000 + Math.random() * 1000;
      await sleep(wait);
    }
  }

  throw new Error(`upload failed after ${maxAttempts} attempts: ${lastErr}`);
}

const sleep = (ms: number) => new Promise((r) => setTimeout(r, ms));

In production, this saves about 8-15% of large-file uploads from total failure. The cost is implementation complexity and slightly higher latency on the rare retry path.
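One detail worth handling carefully is the Retry-After header, which HTTP allows in two forms: a number of seconds, or an HTTP-date. The helper below is a small sketch that accepts both and returns a delay in milliseconds. The name parseRetryAfter and the 60-second cap are assumptions for illustration, not part of the Papermark API or SDK.

```typescript
// Convert a Retry-After header value to a delay in milliseconds.
// Returns null when the header is absent or unparseable, so the caller can
// fall back to its own backoff schedule.
function parseRetryAfter(
  value: string | null,
  nowMs: number = Date.now(),
  capMs: number = 60_000, // assumed upper bound so a bad header cannot stall the upload
): number | null {
  if (value === null) return null;
  const trimmed = value.trim();

  // Form 1: delay-seconds, a non-negative integer
  if (/^\d+$/.test(trimmed)) {
    return Math.min(Number(trimmed) * 1000, capMs);
  }

  // Form 2: HTTP-date, e.g. "Sat, 02 May 2026 12:00:30 GMT"
  const at = Date.parse(trimmed);
  if (Number.isNaN(at)) return null;

  // A date in the past means "retry now"
  return Math.min(Math.max(at - nowMs, 0), capMs);
}
```

A retry loop would typically call this with the response's Retry-After header and use the result in place of the computed backoff whenever it is not null.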

For files over 5 GB: S3 multipart upload

The simple presigned-URL flow above caps at S3's single-PUT limit (5 GB). For larger files (full-resolution video, multi-gigabyte image archives, large dataset exports), you want S3's multipart upload, where the file is split into parts (typically 50-500 MB each; S3 itself requires every part except the last to be at least 5 MiB and allows at most 10,000 parts per upload) and each part is uploaded with its own presigned URL. The API issues all the URLs at once, you upload in parallel, then you finalize with the list of completed-part ETags.
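Before requesting part URLs, it helps to work out how many parts a file needs and which byte range each one covers. The sketch below does that arithmetic and checks S3's 5 MiB minimum part size and 10,000-part maximum. The names planParts and PartRange are hypothetical helpers for this article, not part of any SDK.

```typescript
const MIB = 1024 ** 2;
const MIN_PART = 5 * MIB;  // S3 minimum for every part except the last
const MAX_PARTS = 10_000;  // S3 maximum number of parts per upload

interface PartRange {
  partNumber: number; // 1-based, as S3 expects
  start: number;      // first byte offset
  end: number;        // last byte offset, inclusive (matches fs.createReadStream)
}

function planParts(fileSize: number, partSize: number): PartRange[] {
  if (partSize < MIN_PART) {
    throw new RangeError("part size is below S3's 5 MiB minimum");
  }
  const count = Math.ceil(fileSize / partSize);
  if (count > MAX_PARTS) {
    throw new RangeError(`${count} parts exceeds S3's 10,000-part limit`);
  }
  return Array.from({ length: count }, (_, i) => {
    const start = i * partSize;
    // The final part stops at the end of the file, so it may be shorter
    const end = Math.min(start + partSize, fileSize) - 1;
    return { partNumber: i + 1, start, end };
  });
}
```

For a 12 GiB file with 100 MiB parts this yields 123 parts, the last of which is 88 MiB. Any code that assumes every part is exactly part_size long will send the wrong Content-Length for that final part.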

Request the multipart variant:

const reserveRes = await fetch(`${PM_API}/documents`, {
  method: "POST",
  headers: {
    Authorization: `Bearer ${TOKEN}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    name: "huge-dataset.zip",
    size: 12 * 1024 ** 3, // 12 GB
    upload: "multipart-presigned",
    part_size: 100 * 1024 ** 2, // 100 MB parts → 123 parts total (the last one is shorter)
  }),
});

// Response:
// data = {
//   document_id,
//   upload_id,         // S3 multipart upload ID
//   parts: [
//     { part_number: 1, upload_url, expires_in },
//     { part_number: 2, upload_url, expires_in },
//     ...
//     { part_number: 123, upload_url, expires_in }
//   ]
// }

Upload each part in parallel (bounded concurrency, 4 to 8 simultaneous parts is usually the right balance between throughput and not saturating the user's connection), collect the per-part ETags, then call the finalize-multipart endpoint:

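const PARALLEL = 6;

type Part = { part_number: number; upload_url: string; expires_in: number };

// PUT one byte range of the file. A fresh ranged stream is opened on every
// attempt, so a retry resends exactly this part and nothing else.
async function putPart(url: string, file: string, start: number, length: number, maxAttempts = 5) {
  for (let attempt = 1; ; attempt++) {
    let r: Response | undefined;
    try {
      r = await fetch(url, {
        method: "PUT",
        headers: { "Content-Type": "application/octet-stream", "Content-Length": String(length) },
        // @ts-expect-error: Node's fetch accepts a readable stream
        body: fs.createReadStream(file, { start, end: start + length - 1 }),
        duplex: "half",
      });
    } catch (e) {
      if (attempt >= maxAttempts) throw e; // network error, out of attempts
    }
    if (r?.ok) return r;
    const retryable = !r || r.status >= 500 || r.status === 408 || r.status === 429;
    if (!retryable || attempt >= maxAttempts) throw new Error(`part PUT failed: HTTP ${r?.status}`);
    await sleep(2 ** (attempt - 1) * 1000 + Math.random() * 1000);
  }
}

async function uploadParts(parts: Part[], file: string, partSize: number) {
  const { size } = await stat(file);
  const completed: Array<{ part_number: number; etag: string }> = [];
  const queue = [...parts];

  await Promise.all(
    Array.from({ length: PARALLEL }, async () => {
      while (queue.length > 0) {
        const part = queue.shift()!;
        const offset = (part.part_number - 1) * partSize;
        // The last part is usually shorter than partSize
        const length = Math.min(partSize, size - offset);

        const r = await putPart(part.upload_url, file, offset, length);
        const etag = r.headers.get("ETag")!.replace(/"/g, "");
        completed.push({ part_number: part.part_number, etag });

        // Progress update (parts finish out of order, so count completions)
        console.log(`uploaded ${completed.length}/${parts.length} parts`);
      }
    }),
  );

  // Parts must be sorted by part_number for the finalize call
  completed.sort((a, b) => a.part_number - b.part_number);
  return completed;
}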

const completed = await uploadParts(data.parts, "./huge-dataset.zip", 100 * 1024 ** 2);

const finalizeRes = await fetch(`${PM_API}/documents/${data.document_id}/finalize-multipart`, {
  method: "POST",
  headers: {
    Authorization: `Bearer ${TOKEN}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    upload_id: data.upload_id,
    parts: completed,
  }),
});

if (!finalizeRes.ok) {
  throw new Error(`finalize-multipart failed: ${finalizeRes.status}`);
}

For a 12 GB file on a fast connection (gigabit symmetric), parallel-6 multipart upload completes in roughly 100-180 seconds; line rate alone puts the floor near 103 seconds. Single-stream upload would take 6-12 minutes. The difference is bigger on slow or high-latency connections.

CLI shortcut

For one-off large uploads from a developer machine, the CLI handles the presigned and multipart flows transparently:

# Auto-detects large files and uses presigned URLs
papermark documents upload ./big-deck.pdf --dataroom dr_pelican

# Force the presigned path even on smaller files (useful for CI consistency)
papermark documents upload ./medium.pdf --dataroom dr_pelican --large

# Force multipart with explicit part size
papermark documents upload ./huge.zip --dataroom dr_pelican --multipart --part-size 200M

The CLI also handles automatic retry and shows a progress bar, which is the part you don't want to re-implement yourself for one-off uploads.

Seven subtle gotchas

In rough order of how often they bite teams new to this pattern:

  1. Don't set x-amz-* headers on the PUT unless the API specifically tells you to (via data.headers in the reserve response). Adding stray AWS-specific headers can cause signature mismatch errors that look generic ("signature does not match") and waste hours of debugging.
  2. Match Content-Type exactly to what was sent in step 1. The signature covers the Content-Type the API was told to expect, and S3 rejects a PUT whose headers don't match it. If you reserve with application/pdf and PUT with application/octet-stream, the signature check fails with a confusing 403.
  3. Don't finalize before the PUT completes. Finalization triggers virus scanning, OCR, preview rendering, and indexing, all of which read the object. Finalizing a partial object means you re-upload from scratch and lose the partial work.
  4. Presigned URLs expire fast (~15 min for Papermark, configurable). If you queue uploads, request the URL at upload time, not at queue time. A URL issued 20 minutes ago is dead even if the user just got around to clicking "upload."
  5. CORS matters in the browser. If you're uploading from a browser-based client, the S3 bucket needs CORS configured to allow PUT from your origin. Server-to-server uploads don't have this problem.
  6. Bandwidth caps still apply. Bypassing the API tier doesn't bypass the user's ISP or corporate firewall. A 5 GB upload on a hotel WiFi will still take a while.
  7. S3-compatible storage (Cloudflare R2, Wasabi, Backblaze B2) sometimes diverges from AWS S3 in subtle ways. Most APIs that say "S3 presigned" mean AWS S3 specifically. If you're self-hosting against a different backend, test the full flow including signature edge cases (URL-encoded special characters in filenames are a classic).
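The expiry gotcha (number 4) is the easiest one to guard against in code. A minimal sketch, assuming you record the time at which each URL was issued: check freshness with a safety margin right before the PUT, and reserve again if the URL is stale. The IssuedUrl shape, the 60-second margin, and the reserve callback are hypothetical; only expires_in comes from the reserve response shown earlier.

```typescript
interface IssuedUrl {
  uploadUrl: string;
  issuedAtMs: number;   // local clock time when the reserve call returned
  expiresInSec: number; // expires_in from the reserve response
}

// True if the URL stays valid for at least `marginSec` more seconds.
function isFresh(u: IssuedUrl, nowMs: number = Date.now(), marginSec: number = 60): boolean {
  const expiresAtMs = u.issuedAtMs + u.expiresInSec * 1000;
  return nowMs + marginSec * 1000 < expiresAtMs;
}

// Return a usable URL, calling `reserve` again when the cached one is stale.
async function freshUrl(
  cached: IssuedUrl | undefined,
  reserve: () => Promise<IssuedUrl>,
  nowMs: number = Date.now(),
): Promise<IssuedUrl> {
  return cached && isFresh(cached, nowMs) ? cached : reserve();
}
```

Calling freshUrl at the moment the upload actually starts, rather than when it is queued, keeps a long queue from handing out dead URLs.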
