Introduction

Nikolay Samokhvalov started recently a thread for collecting useful ideas, tools for dealing with PostgreSQL DATA CORRUPTION, and BUGS (see the thread).

One potential corruption could be a bit flip, let me share one utility that could help starting the investigation in such a case.

say you got:

 postgres=# select * from  bdt;
 WARNING:  page verification failed, calculated checksum 20317 but expected 51845
 ERROR:  invalid page in block 0 of relation base/13287/24877

then, copy the block:

 $ dd status=none bs=8192 count=1 if=/usr/local/pgsql11.8-last/data/base/13287/24877 skip=0 of=./for_bit_flip_investigation

launch the flip_bit_and_checksum.bin utility to look for the expected checksum (51845 in this example):

 $ ./flip_bit_and_checksum.bin
 ./flip_bit_and_checksum.bin: Flip one bit one by one and compute the checksum.
 ./flip_bit_and_checksum.bin: The bit that has been flipped is displayed if the computed checksum matches the one in argument.

 Usage:
 ./flip_bit_and_checksum.bin [OPTION] <block_path>
 -c, --checksum=CHECKSUM to look for

 $ ./flip_bit_and_checksum.bin ./for_bit_flip_investigation -c 51845
 Warning: Keep in mind that numbering starts from 0 for both bit and byte
 checksum ca85 (51845) found while flipping bit 1926 (bit 6 in byte 240)

So, by flipping bit 1926 the expected checksum is returned.
It’s an indication that the corruption might be due to a bit flip at that position, that’s a good start for deeper investigations.

Remarks

  • The flip_bit_and_checksum.bin utility can be found here.
  • Having found the expected cheksum while flipping a bit is not a guarantee that a flip bit actually happened and led to the corruption. But it’s a good way to start the investigations with.
  • The utility does not modify the original block.
  • There is only one bit different from the original block at any time.

Conclusion

Thanks to the utility we can look for a bit flip that could generate the expected checksum.