Skip to content

rclone using too much memory #2157

@calisro

Description

@calisro
Member

rclone v1.39-211-g572ee5ecβ

  • os/arch: linux/amd64
  • go version: go1.10

I recently upgraded from rclone.v1.38-235-g2a01fa9fβ to rclone v1.39-211-g572ee5ecβ

I'm using the following mount:
export RCLONE_CONFIG="/home/robert/.rclone.conf"
export RCLONE_BUFFER_SIZE=0M
export RCLONE_RETRIES=5
export RCLONE_STATS=0
export RCLONE_TIMEOUT=10m
export RCLONE_LOG_LEVEL=INFO
export RCLONE_DRIVE_USE_TRASH=false
/usr/sbin/rclone -vv --log-file /data/log/rmount-gs.log
mount robgs-cryptp:Media $GS_RCLONE
--allow-other
--default-permissions
--gid $gid --uid $uid
--max-read-ahead 1024k
--buffer-size 50M
--dir-cache-time=72h
--umask $UMASK 2>&1 > /data/log/debug &

This worked flawlessly before. In the most recent version, it is using WAY too much memory causing OOM issues on my linux server. The above uses upwards of 7GIGS of resident memory before my OOM killer terminates the mount when activity is being read/written to the mount. I used to be able to crank up the buffer-size to 150M without issues. There is NOTHING else different between the setup. I can restore my old version and re-run the batch process without issues and then replace it with the new version and it consistently terminates due to OOM after consuming everything left.

If I restore the old version and rerun the exact same process, rclone consistently uses no more than 1.8G.

I can provide logs but there is nothing them. They simply show transfers. Even in the fuse-debug there is nothing but normal activity. There seems to be either a leak or the use of memory has changed DRASTICALLY between the versions making things unusable. I've rolled back till this is sorted.

Activity

remusb

remusb commented on Mar 20, 2018

@remusb
Collaborator

Are you using cache? If yes and you're also using Plex, can you disable that feature and see if it does the same thing?

It will be enough to just remove the configs from the section and start again.

LE: Even easier will be to run simply with -v.

calisro

calisro commented on Mar 20, 2018

@calisro
MemberAuthor

no I'm not using cache and it is running with -v normally. The older version didn't use cache as that was not implemented yet I dont think. The new one is using the same remotes and setup as the old. The only change is the executable upgrade.

remusb

remusb commented on Mar 20, 2018

@remusb
Collaborator

Hmm, ok. There was another report in the forum and that person uses cache. I was trying to establish a corellation but it seems to be coming from somewhere else.

ncw

ncw commented on Mar 20, 2018

@ncw
Member

I'd like to replicate this if possible. Can you describe the activity that causes the problem more?

Or maybe you've got a script I could run?

Which provider are you using? And are you using crypt too?

added this to the v1.41 milestone on Mar 20, 2018
calisro

calisro commented on Mar 20, 2018

@calisro
MemberAuthor

So the batch script is actually just a set of rclone commands to sync stuff up against the mount points.

SETUP:
I have a local mount point called /data/Media1
I have the rclone mounted on /data/Media2

This is my mount:
I'm using the following mount:

export RCLONE_CONFIG="/home/robert/.rclone.conf"
export RCLONE_BUFFER_SIZE=0M
export RCLONE_RETRIES=5
export RCLONE_STATS=0
export RCLONE_TIMEOUT=10m
export RCLONE_LOG_LEVEL=INFO
export RCLONE_DRIVE_USE_TRASH=false
/usr/sbin/rclone -vv --log-file /data/log/rmount-gs.log 
mount robgs-cryptp:Media /data/Media2
--allow-other 
--default-permissions 
--gid $gid --uid $uid 
--max-read-ahead 1024k 
--buffer-size 50M 
--dir-cache-time=72h 
--umask $UMASK 2>&1 > /data/log/debug &

BATCH:
The batch script is simply copying new data from one to the other using this among other commands. But this is where most of hte activity starts and where the OOM starts to happen:

export RCLONE_STATS=5m
export RCLONE_CONFIG=/home/robert/.rclone.conf
export RCLONE_ONE_FILE_SYSTEM=true
export RCLONE_RETRIES=10
export RCLONE_BWLIMIT="07:00,5M 23:00,10M"
export RCLONE_TRANSFERS=8
export RCLONE_CHECKERS=16
export RCLONE_SKIP_LINKS=true
export RCLONE_DRIVE_USE_TRASH=false
rclone \
   -v \
   --modify-window 1ms \
   --exclude=**nfo \
 copy /data/Media1 /data/Media2 

I'm using google as a provider and it is using crypt as well. Media2 is simply synced with Media1 but I don't delete so it becomes a superset of what is copied to Media1. New files get put in Media1 and then replicated via this process.

danielloader

danielloader commented on Mar 20, 2018

@danielloader
Contributor

I am experiencing this too, only noticed as my IOWAIT shot up as I hit the swap and rclone pages were being pushed to the disk.
As to replication, can't seem to replicate it other than mount the remote with crypt/cache and try and play media on plex, after about 5 mins it thrashes the disk as memory shoots up from 200MB to the full 2GB on the VPS.

This wasn't the behaviour in the later betas of 1.39, anything changed in that respect?

ncw

ncw commented on Mar 20, 2018

@ncw
Member

I managed to replicate this...

First I made a directory with 1000 files in using

make_test_files -max-size 1000000 test-files

(this is one of the rclone tools)

I then mounted up

rclone mount /tmp/test1 /mnt/tmp/

and ran this little script to copy the files in and out of the mount

#!/bin/bash

for i in `seq 100`; do
    echo "round $i"
    rclone -v move /tmp/test-files/ /mnt/tmp/test-files
    rclone -v move /mnt/tmp/test-files /tmp/test-files/
    echo "waiting for 10s... (round $i)"
    sleep 10s
done

v1.39 uses a pretty constant 4MB, v1.40 goes up and up!

danielloader

danielloader commented on Mar 20, 2018

@danielloader
Contributor

https://i.imgur.com/Lvm01QB.png
Graph of when I replicate your steps, looks like a memory leak alright. Once RAM is exhausted it's just 100% IOWAIT stalled as it's trying to use swap.

ncw

ncw commented on Mar 20, 2018

@ncw
Member

OK here is the pprof memory usage - rclone was at about 2GB RSS when I took this.

So looks like it might be a bug in the fuse library... Though I'm not 100% sure about that.

profile001

(pprof) list allocMessage
Total: 1.83GB
ROUTINE ======================== github.com/ncw/rclone/vendor/bazil.org/fuse.allocMessage in /home/ncw/go/src/github.com/ncw/rclone/vendor/bazil.org/fuse/fuse.go
    1.76GB     1.76GB (flat, cum) 96.03% of Total
         .          .    419:var reqPool = sync.Pool{
         .          .    420:	New: allocMessage,
         .          .    421:}
         .          .    422:
         .          .    423:func allocMessage() interface{} {
    1.76GB     1.76GB    424:	m := &message{buf: make([]byte, bufSize)}
         .          .    425:	m.hdr = (*inHeader)(unsafe.Pointer(&m.buf[0]))
         .          .    426:	return m
         .          .    427:}
         .          .    428:
         .          .    429:func getMessage(c *Conn) *message {
remusb

remusb commented on Mar 20, 2018

@remusb
Collaborator

Was it updated just before the release? It's a bit weird no one noticed this in the last betas.

A simple list of the vendor shows that it wasn't.

calisro

calisro commented on Mar 20, 2018

@calisro
MemberAuthor

I have this issue on 1.39 though.

calisro

calisro commented on Mar 20, 2018

@calisro
MemberAuthor

rclone v1.39-211-g572ee5ecβ

ncw

ncw commented on Mar 20, 2018

@ncw
Member

I used the magic of git bisect to narrow it down to this commit fc32fee

commit fc32fee4add93e7a3bda180c7c1088f93c759d13
Author: Nick Craig-Wood <nick@craig-wood.com>
Date:   Fri Mar 2 16:39:42 2018 +0000

    mount, cmount: add --attr-timeout to control attribute caching in kernel
    
    This flag allows the attribute caching in the kernel to be controlled.
    The default is 0s - no caching - which is recommended for filesystems
    which can change outside the control of the kernel.
    
    Previously this was at the default meaning it was 60s for mount and 1s
    for cmount.  This showed strange effects when files changed on the
    remote not via the kernel.  For instance Caddy would serve corrupted
    files for a while when serving from an rclone mount when a file
    changed on the remote.

Which is both good and bad.

Good because there is a workaround --attr-timeout 1s, but bad because this is the second bug I've bisected to the same commit today

I tried --attr-timeout 1ms and that still leaked a bit of memory, whereas --attr-timeout 1s seems OK.

I suspect this is a bug in the fuse library as rclone cmount (which is based off libfuse) doesn't show the same behaviour - it works fine with --attr-timeout 0.

I'll report a bug upstream in a moment...

37 remaining items

Loading
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Relationships

    None yet

      Development

      No branches or pull requests

        Participants

        @hklcf@ncw@seuffert@danielloader@calisro

        Issue actions

          rclone using too much memory · Issue #2157 · rclone/rclone