[Gluster-users] Weird file and file truncation problems

Ling Ho ling at slac.stanford.edu
Thu Jun 14 19:26:18 UTC 2012


I have a file on a brick with a weird permission mode. I thought the "T"
appears only on zero-length pointer files.

-r--r-s--T+ 2 psdatmgr ps-data 98780901596 Jan 18 15:06 e141-r0001-s02-c01.xtc
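
(For reference, I believe the way to check whether this is a DHT
link/pointer file is to dump the extended attributes on the brick and
look for a trusted.glusterfs.dht.linkto key, e.g.:

    getfattr -d -m . -e hex \
        /brick3/cxi/cxi43212/xtc/e141-r0001-s02-c01.xtc

In this case the file clearly has real data, so I assume it is not a
link file, but I am noting the check in case I am missing something.)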

lsof shows it is held open read/write by the glusterfs process for the
brick.

glusterfs 11336 root 55u REG 8,48 98780901596 4133 /brick3/cxi/cxi43212/xtc/e141-r0001-s02-c01.xtc


The file was written back in January, and I don't believe any client
process has opened the file in write mode.

And the glusterfsd process is consuming ~600% CPU (on 8 cores with
hyperthreading).

PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
11336 root      20   0 3603m 130m 2512 S 579.1  0.3   8340:01 glusterfsd


I started a rebalance job earlier yesterday, but later stopped it.
Does this have something to do with the rebalancing?

I would like to restart glusterd on this machine, but is there a way to
tell whether any of the files on this server are open? I ran gluster
volume top, but I can't tell whether the files it shows are currently
open. I don't see the e141-r0001-s02-c01.xtc file in the top output.
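
The only workaround I can think of is to look at the brick process's
open descriptors directly, along the lines of:

    ls -l /proc/11336/fd | grep /brick3
    # or
    lsof -p 11336 | grep /brick3

but I don't know whether that is a reliable way to decide it is safe to
restart glusterd, so suggestions are welcome.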


A possibly related, and more troubling, problem I am having is with
file truncation.

In the past 6 months running 3.2.5 and 3.2.6, we have seen cases of
file truncation after unplanned power outages to our storage devices. I
suspected the write cache on our RAID controllers, since in those cases
it took us 2-3 days to get power restored and bring the disks back up,
and the battery on the RAID controllers would not have lasted that long.

Unfortunately, in the last few days, after proper machine shutdowns and
reboots, we discovered file truncations again. We run a script that
records the sizes of our data files once a day, and the next day we
found that the sizes of some of the files had shrunk. Some of the
smaller files became zero length. These are files we write once and
never write again; they are only opened for reading.
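
(For context, the check is nothing fancy: essentially a nightly size
listing compared against the previous day's. A rough sketch, with
made-up paths, is:

    find /gluster/cxi -name '*.xtc' -type f -printf '%s\t%p\n' | sort -k2 \
        > /var/tmp/xtc-sizes.$(date +%F)
    diff /var/tmp/xtc-sizes.$(date -d yesterday +%F) \
         /var/tmp/xtc-sizes.$(date +%F)

A shrinking file shows up as a changed line in the diff.)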

The troubling thing is that these files were not freshly written when
the machines were rebooted; they were 2-3 days old. And yesterday I
found that one file from the same batch as the above-mentioned
e141-r0001-s02-c01.xtc, also written in January, had become truncated.

Yesterday I looked at about 10 truncated files from the same brick and
examined them using xfs_bmap. They all appeared to use a single extent.
So I looked at the original, untruncated files from the source, figured
out the correct length, and rebuilt new files from the location
returned by xfs_bmap and the correct length from the original files
(something like this: dd if=/dev/sdh of=/root/newfile bs=512
skip=39696059392 count=83889655). It turns out what I extracted was
identical to the original file. So the data was indeed written to disk,
and not merely stored in a cache somewhere; however, the file size had
mysteriously changed.

I know this could be an XFS problem. I would appreciate any suggestions
on how to reproduce it and what to look for. I have a test machine with
a similar setup but a smaller amount of disk space, and I have not been
able to reproduce the problem there.

I have upgraded to 3.3.0. The latest round of file truncation happened
after I had upgraded, after I stopped the gluster volume and rebooted
the machine cleanly.

I am running RHEL 6.2, kernel version 2.6.32-220.17.1.el6.x86_64.

Thanks,
...
ling