Expressing Information Flow Properties

 September 6, 2023 at 12:06 pm

Abstract

Industries and governments are increasingly compelled by regulations and public pressure to handle sensitive information responsibly. Regulatory requirements and user expectations may be complex and have subtle implications for the use of data. Information flow properties can express complex restrictions on data usage by specifying how sensitive data (and data derived from sensitive data) may flow throughout computation. Controlling these flows of information according to the appropriate specification can prevent both leakage of confidential information to adversaries and corruption of critical data by adversaries. There is a rich literature expressing information flow properties to describe the complex restrictions on data usage required by today’s digital society. This monograph summarizes how the expressiveness of information flow properties has evolved over the last four decades to handle different threat models, computational models, and conditions that determine whether flows are allowed. In addition to highlighting the significant advances of this area, we identify some remaining problems worthy of further investigation.

Considerations

Theoretical Foundations

  • Lattice theory
  • Noninterference (see the sketch below)
  • Logic models (e.g., temporal logic)
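A minimal sketch (hypothetical code, not from the monograph) of the two kinds of flows noninterference rules out; `secret` is high (confidential) and `public` is low (adversary-observable):

```python
secret = 42
public = 0

# Explicit flow: high data is copied directly into a low variable.
public = secret

# Implicit flow: the low output depends on high data through control
# flow, even though no high value is copied directly.
if secret > 0:
    public = 1
else:
    public = 0
```

Noninterference demands that low-observable outputs stay the same when high inputs vary; both assignments above violate it.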

Threat Models

  • Termination: does (non)termination leak high-confidentiality information?
  • Time: differences in execution time can leak information (timing channels)
  • Interaction: flows through interactive input and output
  • Program code: what the adversary knows about, or can contribute to, the program itself

Computational Models

  • Nondeterminism: the program's behavior is not deterministic
  • Composition of systems: e.g., feedback, where one component's output becomes another's input; composition can itself introduce nondeterminism
  • Concurrency

Re(de)classification

  • Need to consider: what, where, when, and who
  • Delimited Release is introduced as a declassification mechanism (sketched below)
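A hedged sketch of the delimited-release idea (the `declassify` escape hatch is a hypothetical helper, not the formal calculus): secrets may reach low observers only through designated escape-hatch expressions.

```python
secret_salary = 90_000
secret_bonus = 10_000

def declassify(value):
    # Escape hatch: marks this expression's value as releasable.
    # A real enforcement mechanism checks that low outputs depend on
    # secrets only through expressions wrapped this way.
    return value

# Allowed: the policy declares the aggregate as an escape hatch.
public_total = declassify(secret_salary + secret_bonus)

# Not allowed under the same policy: the raw secret is not a
# declared escape hatch.
# public_leak = secret_salary
```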

Assisting Static Analysis with Large Language Models: A ChatGPT Experiment

 September 6, 2023 at 11:49 am

PDF Repo

Abstract

Recent advances in Large Language Models (LLMs), e.g., ChatGPT, have exhibited strong capabilities of comprehending and responding to questions across a variety of domains. Surprisingly, ChatGPT even possesses a strong understanding of program code. In this paper, we investigate where and how LLMs can assist static analysis by asking appropriate questions. In particular, we target a specific bug-finding tool, which produces many false positives from the static analysis. In our evaluation, we find that these false positives can be effectively pruned by asking carefully constructed questions about function-level behaviors or function summaries. Specifically, in a pilot study of 20 false positives, we successfully prune 8 out of 20 with GPT-3.5, whereas GPT-4 achieves a near-perfect result of 16 out of 20; the four failures involve cases not currently considered/supported by our questions, e.g., concurrency. Additionally, it identified one false negative case (a missed bug). We find LLMs a promising tool that can enable more effective and efficient program analysis.

Noticeable Points

Prompt Design

  • Chain of thought: elicit step-by-step reasoning before the final answer
  • Task decomposition: break a large task into smaller pieces
  • Progressive prompting: interactively feed information across turns (see the sketch below)
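A minimal sketch of how these strategies might be wired together; `query_llm`, the prompt texts, and the use-before-initialization framing are assumptions for illustration, not the paper's actual implementation:

```python
def query_llm(prompt: str) -> str:
    """Placeholder for a call to an LLM chat API."""
    raise NotImplementedError

def summarize_function(source: str) -> str:
    # Task decomposition: ask about one function at a time rather than
    # the whole warning trace. Chain of thought: request step-by-step
    # reasoning before the answer.
    return query_llm(
        "Explain step by step whether this function can leave its "
        "output parameter uninitialized:\n" + source
    )

def check_warning(path_functions: list[str]) -> str:
    # Progressive prompting: feed one function summary per turn so the
    # conversation accumulates context before the final verdict.
    for src in path_functions:
        query_llm("Remember this summary:\n" + summarize_function(src))
    return query_llm(
        "Given the summaries so far, is this use-before-initialization "
        "warning a true bug or a false positive? Reason step by step."
    )
```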

Challenges in traditional code analysis

  • Inherent Knowledge Boundaries.

Static analysis requires domain knowledge to model special functions that cannot be analyzed directly, e.g., assembly code, hardware behaviors, concurrency, and compiler built-ins.

  • Exhaustive Path Exploration.

Exploring every program path is infeasible in general (path explosion), so static analyzers must approximate; those approximations are a major source of the false positives the LLM is asked to prune.

GPU TEE: Potential Problems

 September 6, 2023 at 11:31 am

Partial GPU

Passing a partial GPU to a VM using vGPU and SR-IOV.

Sharing one physical GPU across tenants in this way can be dangerous for isolation.

Isolation inside the CVM

How exactly does the GPU access the CVM's memory? Part of the CVM's memory is marked as shared with the device, and the data staged there is encrypted before the GPU reads it (see the bounce-buffer sketch below).
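A rough sketch of the bounce-buffer pattern, assuming AES-GCM encryption at the driver level (the actual H100 session protocol and key exchange are more involved; the names here are illustrative):

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

session_key = AESGCM.generate_key(bit_length=256)  # negotiated with the GPU

def stage_for_gpu(plaintext: bytes) -> tuple[bytes, bytes]:
    """Encrypt data before placing it in the shared bounce buffer."""
    nonce = os.urandom(12)
    ciphertext = AESGCM(session_key).encrypt(nonce, plaintext, None)
    # Only `ciphertext` is written to the shared (host-visible) region
    # and DMA'd; the GPU decrypts with the session key inside its own
    # security boundary, so plaintext never appears in shared memory.
    return nonce, ciphertext
```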

NVIDIA H100: Trust Establishment

 August 21, 2023 at 3:33 pm

Resources from: https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/HCC-Whitepaper-v1.0.pdf

Host Side

  1. Secure Boot: the BIOS validates the certificates of the main OS (boot loader)
  2. The boot loader boots the kernel, and the kernel validates the firmware (since it has access to the firmware once loaded)
  3. Kernel drivers query the motherboard vendor and validate the stored certificates
  4. Conduct attestation of the hardware (e.g., firmware configurations); the chain-of-trust pattern is sketched below
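A toy sketch of the chain-of-trust pattern behind these steps (hash-and-compare measurements; purely illustrative, not NVIDIA's actual mechanism):

```python
import hashlib

# Known-good measurements, e.g., provisioned by the platform owner.
KNOWN_GOOD = {
    "bootloader": "<expected sha256 hex digest>",
    "kernel": "<expected sha256 hex digest>",
    "firmware": "<expected sha256 hex digest>",
}

def verify_stage(name: str, image: bytes) -> None:
    """Each stage verifies the next one before handing over control."""
    if hashlib.sha256(image).hexdigest() != KNOWN_GOOD[name]:
        raise RuntimeError(f"{name} failed verification; halting boot")

# The BIOS verifies the boot loader, the boot loader verifies the
# kernel, and the kernel verifies the firmware, mirroring steps 1-3.
```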

Guest Side

The CVM still needs to establish that the GPU is trusted, e.g., by verifying the GPU's attestation report.
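A hypothetical sketch of that guest-side flow, assuming the whitepaper's high-level design (a GPU attestation report checked against the vendor's certificate chain and reference measurements); every name below is a placeholder:

```python
import os

NVIDIA_ROOT_CA = b"<pinned vendor root certificate>"

def verify_certificate_chain(chain, trusted_root) -> bool:
    raise NotImplementedError  # real code: X.509 chain validation

def establish_gpu_trust(gpu) -> None:
    nonce = os.urandom(32)  # freshness: rules out replayed reports
    report = gpu.fetch_attestation_report(nonce)
    if not verify_certificate_chain(report.cert_chain, NVIDIA_ROOT_CA):
        raise RuntimeError("GPU attestation: untrusted certificate chain")
    # Firmware/configuration measurements must match reference values.
    if report.measurements != gpu.reference_measurements():
        raise RuntimeError("GPU attestation: unexpected measurements")
    gpu.set_ready()  # only now may the GPU touch CVM data
```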

Constant-Time Foundations for the New Spectre Era

 February 27, 2023 at 5:48 pm

Abstract

The constant-time discipline is a software-based countermeasure used for protecting high assurance cryptographic implementations against timing side-channel attacks. Constant-time is effective (it protects against many known attacks), rigorous (it can be formalized using program semantics), and amenable to automated verification. Yet, the advent of micro-architectural attacks makes constant-time as it exists today far less useful.

This paper lays foundations for constant-time programming in the presence of speculative and out-of-order execution. We present an operational semantics and a formal definition of constant-time programs in this extended setting. Our semantics eschews formalization of microarchitectural features (that are instead assumed under adversary control), and yields a notion of constant-time that retains the elegance and tractability of the usual notion. We demonstrate the relevance of our semantics in two ways: First, by contrasting existing Spectre-like attacks with our definition of constant-time. Second, by implementing a static analysis tool, Pitchfork, which detects violations of our extended constant-time property in real world cryptographic libraries.

Methodology

  • A model that incorporates speculative execution.
  • Abstract machine with three stages: fetch, execute, retire.
  • A reorder buffer enables out-of-order and speculative execution.
  • Directives model the detailed, attacker-controlled schedule of the reordered micro-ops (a toy version is sketched below).
  • Observations are modeled alongside to mimic the leakage visible to the attacker.
  • Modeled instructions: op, fence, load, store, br, call, ret
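A toy rendering of the directive/observation machine (drastically simplified; the instruction encoding and observation format are illustrative, not the paper's formal semantics):

```python
# An attacker-chosen schedule of directives drives fetch/execute/retire
# over a reorder buffer; each step may emit an observation, which models
# the leakage visible to the attacker.
program = [("load", "x"), ("br", "x > 0"), ("load", "table[x]")]

def run(directives):
    rob, pc, observations = [], 0, []
    for d in directives:
        if d == "fetch" and pc < len(program):
            rob.append(program[pc])  # may fetch past an unresolved
            pc += 1                  # branch, i.e., speculatively
        elif d.startswith("execute"):
            op, arg = rob[int(d.split()[1])]
            if op == "load":
                observations.append(("addr", arg))    # cache channel
            elif op == "br":
                observations.append(("branch", arg))  # branch direction
        elif d == "retire" and rob:
            rob.pop(0)  # in-order retirement (a real machine would also
                        # squash misspeculated entries here)
    return observations

# Speculative constant-time: for every directive schedule, the
# observations must depend only on public inputs.
print(run(["fetch", "fetch", "fetch", "execute 2", "execute 1"]))
```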

Artifact

  • Symbolic execution based on angr
  • Schedules the worst-case reorder buffer to maximize potential leakage
  • It can indeed find threats in real-world cryptographic libraries

Comments

  • Are the threats confirmed by the developers?
  • Is the model formalized in a proof checker?
  • Execution time of instructions is not modeled? (Or does it not need to be?)

PrivGuard: Privacy Regulation Compliance Made Easier

 March 1, 2023 at 4:58 pm

Paper

Abstract

Continuous compliance with privacy regulations, such as GDPR and CCPA, has become a costly burden for companies from small-sized start-ups to business giants. The culprit is the heavy reliance on human auditing in today's compliance process, which is expensive, slow, and error-prone. To address the issue, we propose PrivGuard, a novel system design that reduces the human participation required and improves the productivity of the compliance process. PrivGuard is mainly comprised of two components: (1) PrivAnalyzer, a static analyzer based on abstract interpretation for partly enforcing privacy regulations, and (2) a set of components providing strong security protection on the data throughout its life cycle. To validate the effectiveness of this approach, we prototype PrivGuard and integrate it into an industrial-level data governance platform. Our case studies and evaluation show that PrivGuard can correctly enforce the encoded privacy policies on real-world programs with reasonable performance overhead.

Methodology

Users can prescribe their privacy policies, and analysts can then leverage user data for data analysis tasks. The users' (possibly different) privacy policies are automatically enforced and satisfied by PrivGuard, which is executed inside a TEE.

The policy is prescribed in a formal language, and the data analysis program is statically analyzed by PrivAnalyzer to check compliance. PrivAnalyzer uses the Python interpreter as an abstract interpreter to check whether the program might violate the privacy policies. Since the Python program may use many third-party libraries, the authors propose function summaries for these library functions and over-approximate their results (see the sketch below).
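A simplified sketch of that idea (abstract values track which policy requirements remain unsatisfied; the attribute names and summary shapes are made up for illustration):

```python
class AbstractData:
    """Abstract value: the policy requirements not yet satisfied."""
    def __init__(self, pending):
        self.pending = set(pending)

# Hypothetical function summaries: a modeled library call may discharge
# a requirement; an un-modeled one over-approximates (changes nothing).
def summary_aggregate(d: AbstractData) -> AbstractData:
    return AbstractData(d.pending - {"AGGREGATED"})

def summary_unknown(d: AbstractData) -> AbstractData:
    return AbstractData(d.pending)  # conservative: nothing satisfied

policy = {"AGGREGATED"}         # e.g., only aggregate statistics may leave
data = AbstractData(policy)
data = summary_unknown(data)    # some un-modeled third-party call
data = summary_aggregate(data)  # a pandas-style mean/sum, per its summary

if data.pending:
    raise RuntimeError(f"policy violation: unmet requirements {data.pending}")
```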

Weakness

  • Intentional information leakage (analyst is assumed trusted)
  • Language level bypasses

Sidenote

Some of the papers cited in this work are also interesting, especially those on handling loops and branches in static analysis.

This paper may need to be checked again!

Privado: Practical and Secure DNN Inference with Enclaves

 March 1, 2023 at 4:57 pm

Abstract

Cloud providers are extending support for trusted hardware primitives such as Intel SGX. Simultaneously, the field of deep learning is seeing enormous innovation as well as an increase in adoption. In this paper, we ask a timely question: "Can third-party cloud services use Intel SGX enclaves to provide practical, yet secure DNN Inference-as-a-service?" We first demonstrate that DNN models executing inside enclaves are vulnerable to access pattern based attacks. We show that by simply observing access patterns, an attacker can classify encrypted inputs with 97% and 71% attack accuracy for MNIST and CIFAR10 datasets on models trained to achieve 99% and 79% original accuracy respectively. This motivates the need for PRIVADO, a system we have designed for secure, easy-to-use, and performance efficient inference-as-a-service. PRIVADO is input-oblivious: it transforms any deep learning framework that is written in C/C++ to be free of input-dependent access patterns thus eliminating the leakage. PRIVADO is fully-automated and has a low TCB: with zero developer effort, given an ONNX description of a model, it generates compact and enclave-compatible code which can be deployed on an SGX cloud platform. PRIVADO incurs low performance overhead: we use PRIVADO with Torch framework and show its overhead to be 17.18% on average on 11 different contemporary neural networks.

Model

  • Service: an in-enclave ML model; the model is not public
  • Users: data providers; inputs and outputs are secret

Attack

Infer the output label from a memory access trace collected while the user's input is being processed.

  • DNNs contain data-dependent branches
  • An ML model (linear regression) is trained to map memory access traces to output labels (sketched below)
  • It achieves high accuracy at inferring the output label from the memory trace
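A sketch of that pipeline on synthetic data (the note says linear regression; a logistic classifier is used here only to get a runnable multi-class example, and the features stand in for real per-page access counts):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_traces, trace_len, n_labels = 1000, 64, 10

# Stand-in data: in the paper, each trace is collected by observing the
# enclave's data-dependent accesses (ReLU, max pooling) on one input.
X = rng.poisson(3.0, size=(n_traces, trace_len)).astype(float)
y = rng.integers(0, n_labels, size=n_traces)

clf = LogisticRegression(max_iter=1000).fit(X, y)
# On real traces the paper reports up to 97% (MNIST) and 71% (CIFAR10)
# accuracy at recovering the output label.
```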

Defense

  • Data dependency usually occurs in activation functions (e.g., ReLU) and max pooling; other layers contain hardly any data-dependent memory accesses (see the branchless ReLU sketch after this list)
  • Eliminate the input/secret-dependency in the Torch library
  • End-to-end model compilation
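A tiny example of the transformation's flavor (illustrative only; PRIVADO operates on C/C++ Torch code, and Python-level code is not actually constant-time):

```python
def relu_leaky_trace(x: float) -> float:
    # Data-dependent branch: which path executes reveals sign(x)
    # through the instruction and memory-access trace.
    if x > 0:
        return x
    return 0.0

def relu_oblivious(x: float) -> float:
    # Branchless form: the same operations run for every input, so the
    # access pattern no longer depends on the (secret) value of x.
    return x * (x > 0)
```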

Proof Complexity vs. Code Complexity

 December 26, 2022 at 9:52 pm

Potential Threats of Memory Integrity on SEV(SNP), (Scalable) SGX2, and TDX

 December 6, 2022 at 12:09 am