Audit logs for virtual data rooms: querying view events, building dashboards, surviving compliance reviews

How to use a modern data room audit log API: event schema, query patterns, retention policy, common compliance reports, anomaly detection, and how to pipe events into your data warehouse. Implementation against the Papermark API.

April 28, 2026·8 min read·By dataroom.dev

A virtual data room without an audit log is just file-sharing with extra steps. The audit log is the evidentiary backbone of due diligence, securities-litigation discovery, regulatory inquiries, GDPR subject-access requests, internal governance review, and post-incident forensics. For regulated industries (healthcare under HIPAA, US public companies under SOX and its Section 404 internal-controls requirements, EU companies under GDPR), the audit log is not optional. It's a compliance artifact.

This article covers how the Papermark audit log is structured, how to query it efficiently, what real-world reports to build on top of it, how to detect suspicious access patterns, and how to pipe the events into your data warehouse for long-term analytics and BI integration. Worked examples use the Papermark API; the patterns generalize to any audit-log-API-equipped VDR.

The view event schema

Every visit to a Papermark link produces one view record. Here's the complete shape with every field annotated:

{
  "id": "vw_01HXY7P3K2NQR4",
  "link_id": "lnk_pelican_acme",
  "dataroom_id": "dr_pelican",
  "document_id": "doc_deck_v3",
  "document_name": "Series A Deck v3.pdf",
  "visitor": {
    "id": "vis_01HXY7Q8K2",
    "email": "alice@acme-pe.com",
    "email_verified": true,
    "ip": "203.0.113.42",
    "user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 14_6_1) AppleWebKit/605.1.15",
    "country": "US",
    "region": "California",
    "city": "San Francisco",
    "timezone": "America/Los_Angeles"
  },
  "viewed_at": "2026-04-22T14:11:08.123Z",
  "ended_at": "2026-04-22T14:41:48.456Z",
  "duration_seconds": 1840,
  "pages": [
    { "number": 1, "duration_seconds": 12,  "first_seen_at": "2026-04-22T14:11:08Z" },
    { "number": 2, "duration_seconds": 340, "first_seen_at": "2026-04-22T14:11:20Z" },
    { "number": 3, "duration_seconds": 88,  "first_seen_at": "2026-04-22T14:17:00Z" }
  ],
  "downloads": 0,
  "downloads_attempted": 2,
  "exit_page": 3,
  "watermark_text": "Acme PE · alice@acme-pe.com · 2026-04-22 14:11 UTC",
  "actions": [
    { "type": "right_click_blocked", "page": 2, "at": "2026-04-22T14:14:11Z" },
    { "type": "print_blocked", "page": 3, "at": "2026-04-22T14:17:30Z" }
  ]
}

Every field is queryable. Events are immutable once written, with one exception: the GDPR anonymization described under "Retention and deletion" nullifies PII fields in place. Events are retained indefinitely on the standard tier (configurable for self-hosted deployments, where an organizational retention policy may require deletion).

A few fields worth understanding in depth:

  1. duration_seconds is the time from viewed_at to ended_at, capturing the full session. This is not the same as the sum of per-page durations: page durations can overlap (multi-tab viewing) and include idle time on a single page. In the sample event, the three listed pages sum to 440 seconds against a session of 1,840.
  2. downloads_attempted vs downloads: a download attempt that was blocked by link policy (allow_download: false) still gets recorded. This is signal. Someone who attempted 2 downloads and was blocked is meaningfully different from someone who viewed without attempting.
  3. actions captures user behaviors the viewer detected and blocked or allowed: right-click attempts, print attempts, copy attempts, screenshot detection (where the browser supports it). These are attempt records, not completed actions.
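  4. exit_page is the last page the user reached. If the deck is 22 pages and exit_page is 7, the reader left on page 7 and never saw page 8. One such view means little; many views exiting on the same page point to a drop-off problem at that page.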
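
To make these distinctions concrete, here is a minimal sketch that types a subset of the event and condenses it into the numbers discussed above. The field names come from the sample event; the interface names and the summarizeView helper are illustrative assumptions, not part of any Papermark SDK.

```typescript
// Subset of the view event shown above. Field names match the sample JSON;
// the interface names themselves are illustrative, not SDK types.
interface PageView {
  number: number;
  duration_seconds: number;
  first_seen_at: string;
}

interface ViewAction {
  type: string;
  page: number;
  at: string;
}

interface ViewEvent {
  id: string;
  duration_seconds: number;
  pages: PageView[];
  downloads: number;
  downloads_attempted: number;
  exit_page: number;
  actions: ViewAction[];
}

interface ViewSummary {
  sessionSeconds: number;
  pageSecondsTotal: number;
  downloadsAttempted: number;
  blockedActions: number;
}

// Hypothetical helper: condenses one event into the numbers discussed above.
function summarizeView(v: ViewEvent): ViewSummary {
  let pageSecondsTotal = 0;
  for (const p of v.pages) pageSecondsTotal += p.duration_seconds;
  return {
    sessionSeconds: v.duration_seconds,
    pageSecondsTotal, // not expected to equal sessionSeconds
    downloadsAttempted: v.downloads_attempted,
    // Counts actions whose type ends in "_blocked", as in the sample event.
    blockedActions: v.actions.filter((a) => /_blocked$/.test(a.type)).length,
  };
}

const sample: ViewEvent = {
  id: "vw_01HXY7P3K2NQR4",
  duration_seconds: 1840,
  pages: [
    { number: 1, duration_seconds: 12, first_seen_at: "2026-04-22T14:11:08Z" },
    { number: 2, duration_seconds: 340, first_seen_at: "2026-04-22T14:11:20Z" },
    { number: 3, duration_seconds: 88, first_seen_at: "2026-04-22T14:17:00Z" },
  ],
  downloads: 0,
  downloads_attempted: 2,
  exit_page: 3,
  actions: [
    { type: "right_click_blocked", page: 2, at: "2026-04-22T14:14:11Z" },
    { type: "print_blocked", page: 3, at: "2026-04-22T14:17:30Z" },
  ],
};

console.log(summarizeView(sample));
```

For the sample event this reports 1,840 session seconds against 440 summed page seconds, two download attempts, and two blocked actions.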

The four common query shapes

1. Views for one link, paginated:

curl "https://api.papermark.com/v1/links/lnk_pelican_acme/views?since=2026-04-01&limit=100" \
  -H "Authorization: Bearer $PAPERMARK_TOKEN"

Returns up to 100 events at a time. The response includes meta.next_cursor for pagination on links with many views.
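
Following meta.next_cursor until it runs out might look like the sketch below. It deliberately takes the page fetcher as a callback, because the name of the query parameter that carries the cursor is not shown here; fetchPage, collectAll, and the maxPages safety cap are all hypothetical names, and the sketch assumes next_cursor is null or absent on the last page.

```typescript
interface Page<T> {
  data: T[];
  meta: { next_cursor: string | null };
}

// Hypothetical helper: drains a cursor-paginated endpoint into one array.
// `fetchPage` receives null for the first page, then each next_cursor in turn.
async function collectAll<T>(
  fetchPage: (cursor: string | null) => Promise<Page<T>>,
  maxPages: number = 1000, // safety cap against a cursor that never terminates
): Promise<T[]> {
  const out: T[] = [];
  let cursor: string | null = null;
  for (let i = 0; i < maxPages; i++) {
    const page: Page<T> = await fetchPage(cursor);
    for (const item of page.data) out.push(item);
    // Assumption: a null or missing next_cursor marks the last page.
    if (!page.meta.next_cursor) break;
    cursor = page.meta.next_cursor;
  }
  return out;
}
```

Passing the fetcher in keeps the loop testable with canned pages and independent of whichever HTTP client or SDK makes the request.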

2. Views for one visitor across all links they accessed:

curl "https://api.papermark.com/v1/visitors/vis_01HXY/views" \
  -H "Authorization: Bearer $PAPERMARK_TOKEN"

Useful for "show me everything Alice has ever looked at across our entire workspace."

3. Single view detail (page-by-page granularity):

curl "https://api.papermark.com/v1/views/vw_01HXY7P3K2NQR4" \
  -H "Authorization: Bearer $PAPERMARK_TOKEN"

Returns the full event including the per-page array and the actions array.

4. Aggregated analytics for a dataroom, link, or document:

curl "https://api.papermark.com/v1/datarooms/dr_pelican/analytics?from=2026-04-01&to=2026-05-31" \
  -H "Authorization: Bearer $PAPERMARK_TOKEN"

Returns engagement summaries (total views, unique visitors, total view-seconds, drop-off curves) rather than raw events. Use these for dashboards; use the raw events for forensics.

Patterns that come up in practice

Engagement leaderboard

For an M&A or fundraising process, you typically want a sorted table of bidders by total dwell time on high-signal documents. This is the report deal teams want every Monday morning:

import { Papermark } from "@papermark/sdk";

const pm = new Papermark();

const links = await pm.datarooms.listLinks("dr_pelican");
const board: Array<{
  bidder: string;
  visits: number;
  totalMinutes: number;
  lastViewed: string | null;
  deepestPage: number;
}> = [];

for (const link of links) {
  const analytics = await pm.links.analytics(link.id);
  board.push({
    bidder: link.watermark.split(" · ")[0],
    visits: analytics.view_count,
    totalMinutes: Math.round(analytics.total_duration_seconds / 60),
    lastViewed: analytics.last_view_at,
    deepestPage: analytics.max_page,
  });
}

board.sort((a, b) => b.totalMinutes - a.totalMinutes);
console.table(board);

Pipe this into a Slack channel weekly and the deal team has perpetual situational awareness on bidder engagement without anyone manually checking the dashboard.

Page-level drop-off curve

Where do bidders stop reading? This tells you which slide killed the deck or which document in the dataroom needs work:

const events = await pm.links.views.list("lnk_pelican_acme");
const byPage: Record<number, { count: number; totalSec: number }> = {};

for (const v of events) {
  for (const p of v.pages) {
    byPage[p.number] ||= { count: 0, totalSec: 0 };
    byPage[p.number].count += 1;
    byPage[p.number].totalSec += p.duration_seconds;
  }
}

const heatmap = Object.entries(byPage)
  .map(([n, x]) => ({
    page: +n,
    visitors: x.count,
    avgSeconds: Math.round(x.totalSec / x.count),
  }))
  .sort((a, b) => a.page - b.page);

console.table(heatmap);
// page  visitors  avgSeconds
// 1     47        12
// 2     47        38
// 3     45        84
// ...
// 14    23        4    ← drop-off cliff: page 13 has a problem

The pattern you're looking for: pages where visitors falls dramatically between consecutive numbers, or where avgSeconds is much lower than the neighboring pages. The first indicates abandonment; the second indicates a quick scan that didn't engage.
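
Both signals can be flagged mechanically. The sketch below takes rows shaped like the heatmap output and marks pages with a sharp fall in visitors or an unusually short dwell time relative to their neighbors. The function name and the two thresholds are illustrative assumptions; tune them against your own decks.

```typescript
interface PageStat {
  page: number;
  visitors: number;
  avgSeconds: number;
}

interface PageFlag {
  page: number;
  reason: "abandonment" | "skim";
}

// Hypothetical helper. dropRatio and skimRatio are illustrative defaults.
function flagPages(
  heatmap: PageStat[],
  dropRatio: number = 0.3,
  skimRatio: number = 0.25,
): PageFlag[] {
  const flags: PageFlag[] = [];
  for (let i = 1; i < heatmap.length; i++) {
    const prev = heatmap[i - 1];
    const cur = heatmap[i];

    // Abandonment: visitors fall by more than dropRatio from the previous page.
    if (prev.visitors > 0 && (prev.visitors - cur.visitors) / prev.visitors > dropRatio) {
      flags.push({ page: cur.page, reason: "abandonment" });
    }

    // Skim: dwell time far below the average of the neighboring pages
    // (only the previous page when this is the last row).
    const next = i + 1 < heatmap.length ? heatmap[i + 1] : undefined;
    const neighborAvg = next ? (prev.avgSeconds + next.avgSeconds) / 2 : prev.avgSeconds;
    if (neighborAvg > 0 && cur.avgSeconds < neighborAvg * skimRatio) {
      flags.push({ page: cur.page, reason: "skim" });
    }
  }
  return flags;
}
```

An abandonment flag lands on the first page that fewer readers reached, so the content that lost them is usually the page just before it.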

Compliance export

For an audit committee, regulatory inquiry, or securities-class-action discovery request, you need a structured export covering a defined time window. This is the report counsel will ask for, by email, with a 48-hour turnaround expectation:

papermark datarooms views dr_pelican \
  --since 2026-01-01 \
  --until 2026-06-30 \
  --json > pelican-audit-h1.json

# Convert to CSV for non-technical reviewers (lawyers, paralegals, regulators)
jq -r '.data[] | [
  .id,
  .viewed_at,
  .visitor.email,
  .visitor.ip,
  .visitor.country,
  .document_name,
  .duration_seconds,
  .downloads,
  .exit_page
] | @csv' pelican-audit-h1.json > pelican-audit-h1.csv

For a 90-day M&A process with 20 bidders, expect 200-800 view events. The CSV is typically 50-300 KB.
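
If the export runs inside a TypeScript job rather than a shell pipeline, the same conversion can be sketched as below, with a header row added for reviewers. The helper names are hypothetical; the quoting rule mirrors jq's @csv (strings double-quoted with embedded quotes doubled, numbers bare, null as an empty field).

```typescript
type Cell = string | number | null | undefined;

// Same quoting rule as jq's @csv for strings, numbers, and null.
function csvField(value: Cell): string {
  if (value === null || value === undefined) return "";
  if (typeof value === "number") return String(value);
  return '"' + value.replace(/"/g, '""') + '"';
}

// Hypothetical helper: header row first, then one line per data row.
function toCsv(header: string[], rows: Cell[][]): string {
  const lines = [header.map(csvField).join(",")];
  for (const row of rows) lines.push(row.map(csvField).join(","));
  return lines.join("\n") + "\n";
}

// Columns in the same order as the jq filter above.
const header = [
  "id", "viewed_at", "visitor_email", "visitor_ip", "visitor_country",
  "document_name", "duration_seconds", "downloads", "exit_page",
];
const rows: Cell[][] = [
  [
    "vw_01HXY7P3K2NQR4", "2026-04-22T14:11:08.123Z", "alice@acme-pe.com",
    "203.0.113.42", "US", "Series A Deck v3.pdf", 1840, 0, 3,
  ],
];
console.log(toCsv(header, rows));
```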

Suspicious access detection

The audit log makes anomaly detection straightforward. Patterns worth alerting on, with example detection logic:

  1. Geographic anomaly: a view from a country the visitor has never accessed from before. Significant in M&A contexts where bidder identity matters.
  2. Identity mismatch: a visitor opening a link they were not the original recipient of, identified by email-gate vs. link minting record divergence.
  3. High-frequency access: more than N views in a 1-hour window on a single link, suggesting either an attack or a bot.
  4. Bulk download attempt: a high downloads_attempted count on documents that don't allow download, especially across multiple documents in quick succession.
  5. Off-hours access: repeated views at times outside plausible business hours in the visitor's tracked timezone. A soft signal, but useful.
  6. Right-click / print spamming: many right_click_blocked or print_blocked actions in a single session, suggesting the visitor is actively trying to extract content beyond what the link permits.
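  7. Watermark stripping attempts: multiple very-short page views in sequence, characteristic of screenshot-each-page workflows aimed at producing un-watermarked copies (the watermark is server-rendered, so this doesn't work, but the attempt is telling).

// Hypothetical helper: ISO 8601 timestamp for "h hours ago".
// Assumes `since` accepts a timestamp in that form.
const hoursAgo = (h: number): string =>
  new Date(Date.now() - h * 3_600_000).toISOString();

const recent = await pm.links.views.list(linkId, { since: hoursAgo(1) });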

// Pattern 3: high-frequency
if (recent.length > 50) {
  await slack.alert(`⚠️ ${linkId} — ${recent.length} views in last hour (possible bot/scrape)`);
}

// Pattern 4: bulk download attempts
const downloadAttempts = recent.reduce((sum, v) => sum + v.downloads_attempted, 0);
if (downloadAttempts > 10) {
  await slack.alert(`⚠️ ${linkId} — ${downloadAttempts} download attempts blocked in last hour`);
}

// Pattern 6: right-click / print spamming within a single session
const blockedTypes = new Set(["right_click_blocked", "print_blocked"]);
const spamSessions = recent.filter(
  (v) => v.actions.filter((a) => blockedTypes.has(a.type)).length > 5,
);
if (spamSessions.length > 0) {
  await slack.alert(
    `⚠️ Possible extraction attempt on ${linkId}: ${spamSessions.length} sessions with repeated blocked right-click/print actions`,
  );
}
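
Patterns 1 and 7 need a little more state than a simple count. The sketch below shows one possible shape for each as pure functions over already-fetched events. The SessionLite type, both function names, and the two-second threshold are assumptions made for illustration.

```typescript
// Minimal projection of a view event; field selection is illustrative.
interface SessionLite {
  visitorEmail: string;
  country: string;
  pages: { number: number; duration_seconds: number }[];
}

// Pattern 1: recent views from a country this visitor has not used before.
function newCountryViews(history: SessionLite[], recent: SessionLite[]): SessionLite[] {
  const seen = new Map<string, Set<string>>();
  for (const v of history) {
    let countries = seen.get(v.visitorEmail);
    if (!countries) {
      countries = new Set<string>();
      seen.set(v.visitorEmail, countries);
    }
    countries.add(v.country);
  }
  return recent.filter((v) => {
    const known = seen.get(v.visitorEmail);
    // Visitors with no history are skipped: a first visit is not an anomaly.
    return known !== undefined && !known.has(v.country);
  });
}

// Pattern 7: longest run of consecutive very short page views in one session.
// maxSeconds is an illustrative threshold, not a documented value.
function longestFlipRun(
  pages: { duration_seconds: number }[],
  maxSeconds: number = 2,
): number {
  let best = 0;
  let run = 0;
  for (const p of pages) {
    run = p.duration_seconds <= maxSeconds ? run + 1 : 0;
    if (run > best) best = run;
  }
  return best;
}
```

A long flip run is only a hint: a reader skimming for a specific section produces the same trace, so it probably works best combined with the blocked-action counts rather than as a standalone alert.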

Piping to your data warehouse

Two options, neither strictly better:

Webhook-driven (preferred for low-latency dashboards): Subscribe to view.completed events and write directly to your warehouse. Each event lands within ~5 seconds of the view ending. Good for real-time alerting and engagement dashboards that refresh on read.

// /api/papermark-webhook/route.ts
export async function POST(req: Request) {
  const event = await verifiedPayload(req);
  if (event.type === "view.completed") {
    await bigquery.insert("view_events", flatten(event.data));
  }
  return new Response("ok");
}

Pull-based (preferred for backfills and reconciliation): Run a daily job that pages through /v1/views?since=<last_cursor> and writes the deltas. Reliable, easy to backfill historical data, and resilient to webhook delivery failures.

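let cursor: string | null = await readWatermark();
let totalInserted = 0;

while (true) {
  const page = await pm.views.list({ since: cursor, limit: 500 });
  if (page.data.length === 0) break;

  await bigquery.insertBatch("view_events", page.data.map(flatten));
  totalInserted += page.data.length;

  // Assumption: next_cursor is null on the final page. Stop there instead of
  // assigning null to cursor, which would restart the sync from the beginning
  // and overwrite the stored watermark with null.
  const next: string | null = page.meta.next_cursor;
  if (!next) break;
  cursor = next;
  await writeWatermark(cursor); // persist per page so a crash resumes here
}

// The final page may be re-read on the next run, so key inserts on view_id.
console.log(`Synced ${totalInserted} events through ${cursor}`);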

In production, most teams run both: webhooks for the live path, daily pull as belt-and-suspenders to catch any missed events.
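
Running both paths means the same view can arrive twice, once by webhook and once from the daily pull, so the load step should be idempotent. In the warehouse that is usually an upsert keyed on the view ID; the sketch below shows the same idea in process, with a hypothetical takeUnseen helper and an in-memory set standing in for whatever ID lookup you already have.

```typescript
interface HasId {
  id: string;
}

// Hypothetical helper: returns only events whose id has not been loaded yet,
// and records those ids in `seen` so a later batch skips them.
function takeUnseen<T extends HasId>(seen: Set<string>, incoming: T[]): T[] {
  const fresh: T[] = [];
  for (const e of incoming) {
    if (seen.has(e.id)) continue;
    seen.add(e.id);
    fresh.push(e);
  }
  return fresh;
}

// Usage: the webhook delivered vw_1 earlier; the daily pull returns vw_1 and vw_2.
const loaded = new Set<string>(["vw_1"]);
const toInsert = takeUnseen(loaded, [{ id: "vw_1" }, { id: "vw_2" }]);
console.log(toInsert.map((e) => e.id));
```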

Schema for the warehouse

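A reasonable flattened schema. The types below are BigQuery's (STRING, INT64, JSON); for Snowflake, Redshift, or Postgres, substitute the equivalents (for example TEXT, BIGINT, and JSONB in Postgres):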

CREATE TABLE view_events (
  view_id              STRING        NOT NULL,
  link_id              STRING        NOT NULL,
  dataroom_id          STRING        NOT NULL,
  document_id          STRING        NOT NULL,
  document_name        STRING,

  visitor_email        STRING,
  visitor_ip           STRING,
  visitor_country      STRING,
  visitor_city         STRING,

  viewed_at            TIMESTAMP     NOT NULL,
  ended_at             TIMESTAMP,
  duration_seconds     INT64,
  exit_page            INT64,
  pages_viewed         INT64,
  downloads            INT64,
  downloads_attempted  INT64,

  watermark_text       STRING,
  raw_event            JSON          -- the full event for ad-hoc query later
);

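-- CREATE INDEX as written is Postgres syntax. Standard tables in BigQuery,
-- Snowflake, and Redshift have no secondary indexes; use partitioning and
-- clustering (or sort keys) on the same columns instead.
CREATE INDEX idx_view_events_link ON view_events(link_id, viewed_at);
CREATE INDEX idx_view_events_visitor ON view_events(visitor_email, viewed_at);
CREATE INDEX idx_view_events_dataroom ON view_events(dataroom_id, viewed_at);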

The raw_event JSON column matters because audit-log-relevant questions are often unpredictable in advance ("did anyone view document X with IP from country Y between dates A and B?"). Keeping the full event lets you query against fields you didn't think to flatten.
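
The flatten helper used in the webhook and pull examples is not defined in this article. One way it might map the event onto the columns above is sketched here; the type names are illustrative, and treating pages_viewed as the number of distinct page numbers is an assumption.

```typescript
// Input: the parts of the view event that the table uses.
interface RawView {
  id: string;
  link_id: string;
  dataroom_id: string;
  document_id: string;
  document_name: string | null;
  visitor: {
    email: string | null;
    ip: string | null;
    country: string | null;
    city: string | null;
  };
  viewed_at: string;
  ended_at: string | null;
  duration_seconds: number | null;
  pages: { number: number }[];
  downloads: number;
  downloads_attempted: number;
  exit_page: number | null;
  watermark_text: string | null;
}

// Output: one row per view, column names matching the CREATE TABLE above.
interface ViewRow {
  view_id: string;
  link_id: string;
  dataroom_id: string;
  document_id: string;
  document_name: string | null;
  visitor_email: string | null;
  visitor_ip: string | null;
  visitor_country: string | null;
  visitor_city: string | null;
  viewed_at: string;
  ended_at: string | null;
  duration_seconds: number | null;
  exit_page: number | null;
  pages_viewed: number;
  downloads: number;
  downloads_attempted: number;
  watermark_text: string | null;
  raw_event: string;
}

function flatten(e: RawView): ViewRow {
  // Assumption: pages_viewed means distinct page numbers seen in the session.
  const distinctPages = new Set<number>();
  for (const p of e.pages) distinctPages.add(p.number);
  return {
    view_id: e.id,
    link_id: e.link_id,
    dataroom_id: e.dataroom_id,
    document_id: e.document_id,
    document_name: e.document_name,
    visitor_email: e.visitor.email,
    visitor_ip: e.visitor.ip,
    visitor_country: e.visitor.country,
    visitor_city: e.visitor.city,
    viewed_at: e.viewed_at,
    ended_at: e.ended_at,
    duration_seconds: e.duration_seconds,
    exit_page: e.exit_page,
    pages_viewed: distinctPages.size,
    downloads: e.downloads,
    downloads_attempted: e.downloads_attempted,
    watermark_text: e.watermark_text,
    raw_event: JSON.stringify(e), // full event kept for ad-hoc queries
  };
}
```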

Retention and deletion

Audit events are retained indefinitely on the Papermark standard tier. For GDPR right-to-erasure compliance, you can anonymize a visitor's record across all historical events:

papermark visitors delete vis_01HXY --confirm

This nullifies the email, IP, and other PII fields in all historical events, preserving the integrity of the engagement statistics (durations, page numbers, view counts) without identifying the person. The structural audit trail remains intact for governance purposes.
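
If you have piped events into your own warehouse, those copies hold the same PII and need the same treatment. A minimal sketch of an equivalent row-level anonymization follows. The row shape matches the flattened schema; the helper name and the choice of which columns count as PII are assumptions to settle with counsel, not a description of what Papermark does internally.

```typescript
interface WarehouseRow {
  view_id: string;
  visitor_email: string | null;
  visitor_ip: string | null;
  visitor_country: string | null;
  visitor_city: string | null;
  duration_seconds: number | null;
  exit_page: number | null;
  watermark_text: string | null;
  raw_event: string | null;
}

// Hypothetical helper: returns a copy with identifying fields cleared and
// engagement statistics untouched.
function anonymizeRow(row: WarehouseRow): WarehouseRow {
  return {
    ...row,
    visitor_email: null,
    visitor_ip: null,
    visitor_city: null,
    // Assumption: country is kept as a coarse, non-identifying statistic.
    watermark_text: null, // the sample watermark embeds the visitor's email
    raw_event: null, // the raw JSON repeats every PII field
  };
}
```

The watermark text and the raw event column are easy to overlook: in the sample event the watermark contains the visitor's email, and the raw JSON carries a full copy of the visitor object.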

For broader retention policy management (e.g., delete all events older than 7 years to comply with a corporate retention policy), self-hosted deployments can configure automated deletion. The hosted service requires explicit deletion via API.

What you can't do (current limitations worth knowing)

A few things on the roadmap but not in the current API:

  1. Real-time streaming of in-progress views. Currently you see the view after it ends. Sub-second streaming is on the 2026 H2 roadmap.
  2. Sub-page region tracking (where on a page the reader scrolled and lingered). View durations are per-page only. Useful, but coarser than heatmap-style tracking.
  3. Reading SLAs or comparative baselines (e.g., "this is the 80th percentile dwell time across all decks in your workspace"). Compute these yourself in your warehouse. The data is there; the API just doesn't pre-aggregate it.
  4. Cross-link visitor identity stitching beyond email. If a visitor opens two links with different verified emails, they're tracked as different visitors. Email is the identity key.
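
For the comparative baselines in item 3, the computation is small once the durations are in hand. Below is a sketch using the nearest-rank definition of a percentile, which is one of several common definitions; in a warehouse you would more likely reach for a built-in percentile function. Both function names are illustrative.

```typescript
// Nearest-rank percentile: the smallest value such that at least p percent
// of the values are at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("percentile of an empty list");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1];
}

// Percentage of values at or below x: where one view sits against the rest.
function percentileRank(values: number[], x: number): number {
  if (values.length === 0) throw new Error("percentile rank of an empty list");
  let atOrBelow = 0;
  for (const v of values) if (v <= x) atOrBelow += 1;
  return (atOrBelow * 100) / values.length;
}

// Usage with made-up dwell times in seconds.
const dwell = [50, 10, 40, 20, 30];
console.log(percentile(dwell, 80), percentileRank(dwell, 40));
```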
